k-anonymity

From Wikipedia, the free encyclopedia
Jump to: navigation, search

k-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity was first introduced by Latanya Sweeney and Pierangela Samarati in a paper published in 1998[1] as an attempt to solve the problem: "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful."[2][3][4] A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release. The various procedures and programs for generating anonymised data providing k-anonymity protection have been patented in the United States (Patent 7,269,578).[5]

Methods for k-anonymization[edit]

In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population. The following table is a nonanonymized database consisting of the patient records of some fictitious hospital in Kochi.

Name Age Gender State of domicile Religion Disease
Ramsha 29 Female Tamil Nadu Hindu Cancer
Yadu 24 Female Kerala Hindu Viral infection
Salima 28 Female Tamil Nadu Muslim TB
Sunny 27 Male Karnataka Parsi No illness
Joan 24 Female Kerala Christian Heart-related
Bahuksana 23 Male Karnataka Buddhist TB
Rambha 19 Male Kerala Hindu Cancer
Kishor 29 Male Karnataka Hindu Heart-related
Johnson 17 Male Kerala Christian Heart-related
John 19 Male Kerala Christian Viral infection

There are 6 attributes and 10 records in this data. There are two common methods for achieving k-anonymity for some value of k.

  1. Suppression: In this method, certain values of the attributes are replaced by an asterisk '*'. All or some values of a column may be replaced by '*'. In the anonymized table below, we have replaced all the values in the 'Name' attribute and all the values in the 'Religion' attribute with a '*'.
  2. Generalization: In this method, individual values of attributes are replaced by with a broader category. For example, the value '19' of the attribute 'Age' may be replaced by ' ≤ 20', the value '23' by '20 < Age ≤ 30' , etc.

The next table shows the anonymized database.

Name Age Gender State of domicile Religion Disease
* 20 < Age ≤ 30 Female Tamil Nadu * Cancer
* 20 < Age ≤ 30 Female Kerala * Viral infection
* 20 < Age ≤ 30 Female Tamil Nadu * TB
* 20 < Age ≤ 30 Male Karnataka * No illness
* 20 < Age ≤ 30 Female Kerala * Heart-related
* 20 < Age ≤ 30 Male Karnataka * TB
* Age ≤ 20 Male Kerala * Cancer
* 20 < Age ≤ 30 Male Karnataka * Heart-related
* Age ≤ 20 Male Kerala * Heart-related
* Age ≤ 20 Male Kerala * Viral infection

This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes. The attributes available to an adversary are called "quasi-identifiers". Each "quasi-identifier" tuple occurs in at least k records for a dataset with k-anonymity.[6]

Meyerson and Williams (2004) demonstrated that optimal k-anonymity is an NP-hard problem, however heuristic methods such as k-Optimize as given by Bayardo and Agrawal (2005) often yields effective results.[7][8] A practical approximation algorithm that enables solving the k-anonymization problem with an approximation guarantee of O(log k) was presented by Kenig and Tassa.[9]

Caveats[edit]

Because k-anonymization does not include any randomization, attackers can still make inferences about data sets that may harm individuals. For example, if the 19-year-old John from Kerala is known to be in the database above, then it can be reliably said that he has either cancer, a heart-related disease, or a viral infection.

K-anonymization is not a good method to anonymize high-dimensional datasets.[10] For example, researchers showed that, given 4 points, the unicity of mobile phone datasets (, k-anonymity when ) can be as high as 95%.[11]

It has also been shown that k-anonymity can skew the results of a data set if it disproportionately suppresses and generalizes data points with unrepresentative characteristics.[12] The suppression and generalization algorithms used to k-anonymize datasets can be altered, however, so that they do not have such a skewing effect.[13]

See also[edit]

References[edit]

  1. ^ Samarati, Pierangela; Sweeney, Latanya (1998). "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression" (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017. 
  2. ^ L. Sweeney. "Database Security: k-anonymity". Retrieved 19 January 2014. 
  3. ^ L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
  4. ^ P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering archive Volume 13 Issue 6, November 2001.
  5. ^ "Systems and methods for de-identifying entries in a data source". United States Patents and Trademarks Office. Retrieved 19 January 2014. 
  6. ^ Narayanan, Arvind; Shmatikov, Vitaly. "Robust De-anonymization of Large Sparse Datasets" (PDF). 
  7. ^ Roberto J. Bayardo; Rakesh Agrawal (2005). "Data Privacy through Optimal k-anonymization" (PDF). ICDE '05 Proceedings of the 21st International Conference on Data Engineering: 217–28. ISBN 0-7695-2285-8. ISSN 1084-4627. doi:10.1109/ICDE.2005.42. Data de-identification reconciles the demand for release of data for research purposes and the demand for privacy from individuals. This paper proposes and evaluates an optimization algorithm for the powerful de-identification procedure known as k-anonymization. A k-anonymized dataset has the property that each record is indistinguishable from at least k - 1 others. Even simple restrictions of optimized k-anonymity are NP-hard, leading to significant computational challenges. We present a new approach to exploring the space of possible anonymizations that tames the combinatorics of the problem, and develop data-management strategies to reduce reliance on expensive operations such as sorting. Through experiments on real census data, we show the resulting algorithm can find optimal k-anonymizations under two representative cost measures and a wide range of k. We also show that the algorithm can produce good anonymizations in circumstances where the input data or input parameters preclude finding an optimal solution in reasonable time. Finally, we use the algorithm to explore the effects of different coding approaches and problem variations on anonymization quality and performance. To our knowledge, this is the first result demonstrating optimal k-anonymization of a nontrivial dataset under a general model of the problem. 
  8. ^ Adam Meyerson; Ryan Williams (2004). "On the Complexity of Optimal K-Anonymity" (PDF). PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York, NY: ACM: 223–8. ISBN 158113858X. doi:10.1145/1055558.1055591. The technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are NP-hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. In particular, it is a O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is a O(k logm)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice. 
  9. ^ Kenig, Batya; Tassa, Tamir (2012). "A practical approximation algorithm for optimal k-anonymity". Data Mining and Knowledge Discovery. 25: 134–168. 
  10. ^ Aggarwal, Charu C. "On k-Anonymity and the Curse of Dimensionality". 
  11. ^ de Montjoye, Yves-Alexandre; César A. Hidalgo; Michel Verleysen; Vincent D. Blondel (March 25, 2013). "Unique in the Crowd: The privacy bounds of human mobility". Nature srep. doi:10.1038/srep01376. 
  12. ^ Angiuli, Olivia; Joe Blitzstein; Jim Waldo. "How to De-Identify Your Data". ACM Queue. ACM. 
  13. ^ Angiuli, Olivia; Jim Waldo (June 2016). "Statistical Tradeoffs between Generalization and Suppression in the De-Identification of Large-Scale Data Sets". IEEE Computer Society Intl Conference on Computers, Software, and Applications.