
k-anonymity and l-diversity #307

Open
mattiamazzoli opened this issue Dec 10, 2024 · 1 comment

Comments


mattiamazzoli commented Dec 10, 2024

I'm using the benchmarks evaluation function to test some privacy metrics.

The docs state that k-anonymity and l-diversity are ratios of the respective metrics between the original and synthetic datasets. However, in the benchmarks evaluation function I find both k-anonymity.gt and k-anonymity.syn, i.e. k-anonymity computed separately for the original and synthetic datasets.
In my case I get k-anonymity = 59 for the original dataset and k = 58.5 for the synthetic one. Is this actually a ratio? And with respect to what?

Moreover, k-anonymity and l-diversity should report different information, yet for some reason they take the same values all the time. Is k-anonymity computed only on the sensitive attributes, without the quasi-identifiers?
It would be great to improve consistency between the documentation and the GitHub/README description of the metrics.
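For reference, here is a minimal sketch of the textbook definitions I would expect the two metrics to follow: k-anonymity as the smallest equivalence-class size over the quasi-identifiers, and l-diversity as the smallest number of distinct sensitive values within any class. This is not synthcity's internal implementation (the function and column names are hypothetical), just an illustration that the two quantities generally differ:

```python
from collections import defaultdict

def k_anonymity(records, qi_cols):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = defaultdict(int)
    for row in records:
        groups[tuple(row[c] for c in qi_cols)] += 1
    return min(groups.values())

def l_diversity(records, qi_cols, sensitive_col):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = defaultdict(set)
    for row in records:
        groups[tuple(row[c] for c in qi_cols)].add(row[sensitive_col])
    return min(len(vals) for vals in groups.values())

# Hypothetical toy data: QIs = age band + zip prefix, sensitive = diagnosis.
data = [
    {"age": "30-39", "zip": "123", "diag": "flu"},
    {"age": "30-39", "zip": "123", "diag": "cold"},
    {"age": "40-49", "zip": "456", "diag": "flu"},
    {"age": "40-49", "zip": "456", "diag": "flu"},
]
k = k_anonymity(data, ["age", "zip"])          # 2: both classes have 2 records
l = l_diversity(data, ["age", "zip"], "diag")  # 1: second class has only "flu"
```

Under these definitions k = 2 but l = 1, so identical values for every dataset would be surprising unless the metrics are being computed on the same columns.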

Screenshots

Three screenshots attached, dated 2024-12-10 (12:41:32, 12:42:06, 12:44:06).

System Information

  • AWS cloud environment
  • Language Version: Python 3.10.8, IPython: 8.29.0, jupyterlab: 1.2.21
  • Package Manager Version: conda 24.7.1
  • Browser: Firefox
@robsdavis
Contributor

Thanks for your help in cleaning up the documentation. The correct definition is the one in the README. I will update the documentation to match.

Why does l = k?

The values of k-anonymity and l-diversity can be identical in specific scenarios where the conditions for both metrics align perfectly. Here's how this can happen:

  1. Equivalence Classes with Exactly k Records:
  • Suppose the dataset is structured such that each equivalence class (group of records with identical QI values) contains exactly k records.
  • Implication: The dataset satisfies k-anonymity by definition, since each record is part of a group of size k.

  2. Distinct Sensitive Attribute Values within Each Equivalence Class:
  • Within each of these equivalence classes, if every record has a unique value for the sensitive attribute(s), then:
  • Number of distinct sensitive values (l) = number of records (k).
  • Implication: This satisfies l-diversity with l = k, as there are k distinct sensitive values within each equivalence class.

  3. Resulting Equality:
  • Since each equivalence class has exactly k records and l = k, both the k-anonymity and l-diversity metrics yield the same value for the dataset.
  • Example: If k = 3 and within every group of 3 records sharing the same QIs there are 3 distinct sensitive attribute values, then both k-anonymity and l-diversity will report a value of 3.
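The scenario in the steps above can be sketched directly (using plain textbook definitions and hypothetical column names, not synthcity's internals): every equivalence class holds exactly 3 records, each with a distinct sensitive value, so both metrics come out equal.

```python
from collections import defaultdict

def class_stats(records, qi_cols, sensitive_col):
    """Return (k-anonymity, l-diversity) under the textbook definitions:
    smallest class size, and smallest count of distinct sensitive values."""
    sizes = defaultdict(int)
    distinct = defaultdict(set)
    for row in records:
        key = tuple(row[c] for c in qi_cols)
        sizes[key] += 1
        distinct[key].add(row[sensitive_col])
    return min(sizes.values()), min(len(s) for s in distinct.values())

# Each zip-class has exactly 3 records and 3 distinct diagnoses -> l = k = 3.
data = [
    {"zip": "123", "diag": "flu"},
    {"zip": "123", "diag": "cold"},
    {"zip": "123", "diag": "asthma"},
    {"zip": "456", "diag": "flu"},
    {"zip": "456", "diag": "cold"},
    {"zip": "456", "diag": "asthma"},
]
k, l = class_stats(data, ["zip"], "diag")  # (3, 3)
```

If the real dataset has this structure (classes of size k with all-distinct sensitive values), equal metric values are expected rather than a bug.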

Do you think that this is the case for your data?
