
k-anonymity and l-diversity #307

Open
mattiamazzoli opened this issue Dec 10, 2024 · 1 comment

Comments


mattiamazzoli commented Dec 10, 2024

I'm using the benchmarks evaluation function to test some privacy metrics.

The docs state that k-anonymity and l-diversity are ratios of the respective metrics between the original and synthetic datasets. However, in the benchmarks evaluation function I find both k-anonymity.gt and k-anonymity.syn, i.e. k-anonymity computed separately for the original and synthetic datasets.
In my case I get k-anonymity = 59 for the original dataset and k = 58.5 for the synthetic one. Is this actually a ratio? And with respect to what?

Moreover, k-anonymity and l-diversity should report different information, yet for some reason they take the same values all the time. Is k-anonymity computed only on the sensitive attributes, without the quasi-identifiers?
It would be great to improve consistency between the documentation and the GitHub/README description of the metrics.
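For reference, here is a minimal sketch of the textbook definitions I would expect the two metrics to follow: k-anonymity as the smallest equivalence-class size over the quasi-identifiers, and l-diversity as the smallest number of distinct sensitive values within any class. This is not synthcity's internal implementation (the function and column names are hypothetical), just an illustration that the two quantities generally differ:

```python
from collections import defaultdict

def k_anonymity(records, qi_cols):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = defaultdict(int)
    for row in records:
        groups[tuple(row[c] for c in qi_cols)] += 1
    return min(groups.values())

def l_diversity(records, qi_cols, sensitive_col):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = defaultdict(set)
    for row in records:
        groups[tuple(row[c] for c in qi_cols)].add(row[sensitive_col])
    return min(len(vals) for vals in groups.values())

# Hypothetical toy data: QIs = age band + zip prefix, sensitive = diagnosis.
data = [
    {"age": "30-39", "zip": "123", "diag": "flu"},
    {"age": "30-39", "zip": "123", "diag": "cold"},
    {"age": "40-49", "zip": "456", "diag": "flu"},
    {"age": "40-49", "zip": "456", "diag": "flu"},
]
k = k_anonymity(data, ["age", "zip"])          # 2: both classes have 2 records
l = l_diversity(data, ["age", "zip"], "diag")  # 1: second class has only "flu"
```

Under these definitions k = 2 but l = 1, so identical values for every dataset would be surprising unless the metrics are being computed on the same columns.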

Screenshots

Three screenshots attached, dated 2024-12-10 (12:41:32, 12:42:06, 12:44:06).

System Information

  • AWS cloud environment
  • Language Version: Python 3.10.8, IPython: 8.29.0, jupyterlab: 1.2.21
  • Package Manager Version: conda 24.7.1
  • Browser: Firefox
@robsdavis
Contributor

Thanks for your help in cleaning up the documentation. The correct definition is the one in the README. I will update the documentation to match.

Why does l = k?

The values of k-anonymity and l-diversity can be identical in specific scenarios where the conditions for both metrics align perfectly. Here's how this can happen:

  1. Equivalence Classes with Exactly k Records:
  • Suppose the dataset is structured such that each equivalence class (group of records with identical QI values) contains exactly k records.
  • Implication: The dataset satisfies k-anonymity by definition, since each record is part of a group of size k.

  2. Distinct Sensitive Attribute Values within Each Equivalence Class:
  • Within each of these equivalence classes, if every record has a unique value for the sensitive attribute(s), then:
  • Number of distinct sensitive values (l) = number of records (k).
  • Implication: This satisfies l-diversity with l = k, as there are k distinct sensitive values within each equivalence class.

  3. Resulting Equality:
  • Since each equivalence class has exactly k records and l = k, both the k-anonymity and l-diversity metrics yield the same value for the dataset.
  • Example: If k = 3 and within every group of 3 records sharing the same QIs there are 3 distinct sensitive attribute values, then both k-anonymity and l-diversity will report a value of 3.
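The scenario in the steps above can be sketched directly (using plain textbook definitions and hypothetical column names, not synthcity's internals): every equivalence class holds exactly 3 records, each with a distinct sensitive value, so both metrics come out equal.

```python
from collections import defaultdict

def class_stats(records, qi_cols, sensitive_col):
    """Return (k-anonymity, l-diversity) under the textbook definitions:
    smallest class size, and smallest count of distinct sensitive values."""
    sizes = defaultdict(int)
    distinct = defaultdict(set)
    for row in records:
        key = tuple(row[c] for c in qi_cols)
        sizes[key] += 1
        distinct[key].add(row[sensitive_col])
    return min(sizes.values()), min(len(s) for s in distinct.values())

# Each zip-class has exactly 3 records and 3 distinct diagnoses -> l = k = 3.
data = [
    {"zip": "123", "diag": "flu"},
    {"zip": "123", "diag": "cold"},
    {"zip": "123", "diag": "asthma"},
    {"zip": "456", "diag": "flu"},
    {"zip": "456", "diag": "cold"},
    {"zip": "456", "diag": "asthma"},
]
k, l = class_stats(data, ["zip"], "diag")  # (3, 3)
```

If the real dataset has this structure (classes of size k with all-distinct sensitive values), equal metric values are expected rather than a bug.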

Do you think that this is the case for your data?
