Data Re-Identification

From Wikipedia, the free encyclopedia

Data re-identification is the practice of matching anonymous data (also known as de-identified data) with publicly available information, or auxiliary data, in order to discover the individual to whom the data belongs. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after it has gone through the de-identification process.[1] The de-identification process involves masking, generalizing, or deleting both direct and indirect identifiers; there is, however, no universal definition of this process.[2] Even seemingly anonymized information in the public domain may thus be re-identified when combined with other available data and basic computer science techniques.[3] The Common Rule Agencies, a collection of multiple U.S. federal agencies and departments including the U.S. Department of Health and Human Services, speculate that re-identification is becoming gradually easier because of "big data": the abundance and constant collection and analysis of information as technologies evolve and algorithms advance.[4] However, others have claimed that de-identification is a safe and effective data liberation tool and do not view re-identification as a concern.[5]

A 2000 study found that 87 percent of the U.S. population can be identified using a combination of gender, birthdate, and ZIP code.[5] Others do not consider re-identification a serious threat and call it a "myth"; they argue that the full combination of ZIP code, date of birth, and gender is rarely available, since released data often include only the year and month of birth without the day, or the county name instead of the specific ZIP code, which reduces the risk of re-identification in many instances.[5]
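The linkage attack behind that 87-percent figure can be sketched in a few lines: join a de-identified data set to a named public record, such as a voter roll, on the shared quasi-identifiers. All records below are invented for illustration.

```python
# Sketch of a linkage attack on the quasi-identifiers from the 2000 study:
# gender, date of birth, and ZIP code. All data here is fictional.

def link_records(deidentified, public):
    """Match de-identified records to named public records when the
    quasi-identifier triple (gender, birthdate, zip) is unique."""
    matches = {}
    for record in deidentified:
        key = (record["gender"], record["birthdate"], record["zip"])
        candidates = [p for p in public
                      if (p["gender"], p["birthdate"], p["zip"]) == key]
        if len(candidates) == 1:  # a unique triple re-identifies the person
            matches[record["diagnosis"]] = candidates[0]["name"]
    return matches

# "Anonymized" health data: direct identifiers removed, quasi-identifiers kept.
health = [{"gender": "F", "birthdate": "1961-07-31", "zip": "02138",
           "diagnosis": "hypertension"}]
# Public voter roll: names listed alongside the same quasi-identifiers.
voters = [{"name": "J. Doe", "gender": "F", "birthdate": "1961-07-31",
           "zip": "02138"},
          {"name": "A. Roe", "gender": "M", "birthdate": "1958-01-02",
           "zip": "02139"}]

print(link_records(health, voters))  # {'hypertension': 'J. Doe'}
```

No technical sophistication is needed beyond the join itself; the attack succeeds whenever the quasi-identifier combination is unique in both data sets.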

Legal protections of data in the United States

Existing privacy regulations typically protect information that has been modified so that the data is deemed anonymized, or de-identified. For financial information, the Federal Trade Commission permits its circulation if it is de-identified and aggregated.[1] The Gramm-Leach-Bliley Act (GLBA), which mandates that financial institutions give consumers the opportunity to opt out of having their information shared with third parties, does not cover de-identified data if the information is aggregate and does not contain personal identifiers, since such data is not treated as personally identifiable information.[1]

Educational records

In terms of university records, authorities at both the state and federal level have shown an awareness of privacy issues in education and a distaste for institutions' disclosure of information. The U.S. Department of Education has provided guidance about data disclosure and identification, instructing educational institutions to be sensitive to the risk of re-identification of anonymous data by cross-referencing with auxiliary data, to minimize the amount of data in the public domain by decreasing publication of directory information about students and institutional personnel, and to be consistent in the processes of de-identification.[6]

Medical records

Medical information about patients is becoming increasingly available on the Internet, on free and publicly accessible platforms such as HealthData.gov and PatientsLikeMe, encouraged by government open data policies and data-sharing initiatives spearheaded by the private sector. While this level of accessibility yields many benefits, concerns regarding discrimination and privacy have been raised.[7] Protections on medical records and consumer data from pharmacies are stronger than those for other kinds of consumer data. The Health Insurance Portability and Accountability Act (HIPAA) protects the privacy of identifiable health data, but authorizes its release to third parties if it is de-identified. In addition, it mandates that patients receive breach notifications whenever there is more than a low probability that their information was inappropriately disclosed or used without sufficient mitigation of the harm.[8] The likelihood of re-identification is a factor in determining that probability. Commonly, pharmacies sell de-identified information to data-mining companies, which in turn sell it to pharmaceutical companies.[1]


State laws have been enacted to ban data mining of medical information, but federal courts in Maine and New Hampshire struck them down on First Amendment grounds. In another case, a federal court described concerns about patient privacy as "illusive" and did not recognize the risks of re-identification.[1]

Biospecimen

The Notice of Proposed Rulemaking, published by the Common Rule Agencies in September 2015, expanded the umbrella term "human subject" in research to include biospecimens, or materials taken from the human body, such as blood, urine, and tissue. This mandates that researchers using biospecimens follow the stricter requirements of research with human subjects; the rationale is the increased risk of re-identification of biospecimens.[4] The final revisions affirmed this regulation.[9]

Re-identification efforts

There have been a sizable number of successful re-identification attempts in different fields. Even if it is not easy for a layperson to break anonymity, once the steps to do so are disclosed and learned, no advanced knowledge is needed to access information in a database. Sometimes, technical expertise is not needed at all if a population has a unique combination of identifiers.[1]

Health records

In the mid-1990s, the Group Insurance Commission (GIC), a Massachusetts government agency that purchased health insurance for state employees, decided to release records of hospital visits to any researcher who requested the data, at no cost. GIC assured that patient privacy was not a concern because it had removed identifiers such as names, addresses, and Social Security numbers. However, information such as ZIP codes, birth dates, and sex remained untouched. The GIC assurance was reinforced by William Weld, then governor of Massachusetts. Latanya Sweeney, a graduate student at the time, set out to pick the governor's records out of the GIC data. By combining the GIC data with the voter database of the city of Cambridge, which she purchased for 20 dollars, she discovered Governor Weld's record with ease.[10]

In 1997, a researcher successfully de-anonymized medical records using voter databases.[1]

In 2001, Professor Latanya Sweeney again successfully matched anonymized hospital visit records in the state of Washington to individual persons using the state's voting records 43% of the time.[11]

There are existing algorithms used to re-identify patients using prescription drug information.[1]

Consumer habits and practices

Two researchers at the University of Texas, Arvind Narayanan and Professor Vitaly Shmatikov, were able to re-identify a portion of the anonymized Netflix movie-ranking data by matching it to individual consumers on the website. The data had been released by Netflix in 2006 after de-identification, which consisted of replacing individual names with random numbers and perturbing some personal details. The two researchers de-anonymized some of the data by comparing it with the non-anonymous movie ratings of IMDb (Internet Movie Database) users. Very little information from the database, it was found, was needed to identify a subscriber.[1] The resulting research paper contained startling revelations about how easily Netflix users can be re-identified. For example, knowing data about only two movies a user has reviewed, including the precise rating and the date of rating within a three-day margin, allows for 68 percent re-identification success.[10]
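The matching idea can be sketched as follows: treat the attacker's auxiliary knowledge as a set of (movie, rating, date) triples and accept a candidate subscriber only if every triple agrees on the rating exactly and on the date within a three-day window. The data and user IDs below are invented, and this is a simplification of the paper's actual scoring algorithm, which also tolerates partial matches.

```python
from datetime import date

def matches(known, candidate, day_slack=3):
    """known: attacker's auxiliary info, {movie: (rating, date)}.
    candidate: one subscriber's records from the anonymized data set.
    Returns True if every known rating matches exactly and every
    known date falls within day_slack days of the candidate's date."""
    for movie, (rating, when) in known.items():
        if movie not in candidate:
            return False
        c_rating, c_when = candidate[movie]
        if c_rating != rating or abs((c_when - when).days) > day_slack:
            return False
    return True

# Fictional anonymized data set: random user IDs instead of names.
anonymized = {
    "user_1042": {"MovieA": (4, date(2005, 3, 10)),
                  "MovieB": (2, date(2005, 6, 1))},
    "user_2311": {"MovieA": (5, date(2005, 3, 12))},
}
# Auxiliary knowledge, e.g. scraped from a public review profile.
aux = {"MovieA": (4, date(2005, 3, 11)),
       "MovieB": (2, date(2005, 5, 30))}

hits = [uid for uid, recs in anonymized.items() if matches(aux, recs)]
print(hits)  # ['user_1042']
```

When only one anonymized record survives the filter, the random user ID is effectively tied back to the named public profile.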

In 2006, after AOL published its users' search queries, which had been anonymized prior to public release, New York Times reporters successfully re-identified individuals by examining groups of searches made by anonymized users.[1] AOL had attempted to suppress identifying information, including usernames and IP addresses, but had replaced these with unique identification numbers to preserve the data's utility for researchers. After the release, bloggers pored over the data, either trying to identify specific users from this content or pointing out entertaining, depressing, or shocking search queries, examples of which include "how to kill your wife," "depression and medical leave," and "car crash photos." Two reporters, Michael Barbaro and Tom Zeller, were able to track down a 62-year-old widow named Thelma Arnold by recognizing clues to her identity in the anonymized user's search history. Arnold acknowledged that she was the author of the searches, confirming that re-identification is possible.[12]

Consequences

Individuals whose data is re-identified are also at risk of having their information, with their identity attached to it, sold to organizations they do not want possessing private information about their finances, health, or preferences. The release of this data may cause anxiety, shame, or embarrassment. Once an individual's privacy has been breached through re-identification, future breaches become much easier: once a link is made between a piece of data and a person's real identity, any association between the data and an anonymous identity breaks the person's anonymity.[1]

Re-identification may expose companies that have pledged to assure anonymity to increased liability in contract or tort, and cause them to violate their privacy policies by having released information to third parties that can identify users after re-identification. Beyond violating internal policies, institutions may also violate state and federal laws, such as laws concerning financial confidentiality or medical privacy.[1]

Remedies

To address the drawbacks of re-identification, there have been multiple proposals put forward:

  • Higher standards and a uniform definition of de-identification while retaining data utility: the definition of de-identification should balance privacy protections, which reduce re-identification risk, against companies' reluctance to delete data [2]
  • Heightened privacy protections of anonymized information [1]
  • Tighter security for databases that store anonymized information [1]
  • A strong ban on malicious re-identification, together with broader anti-discrimination and privacy legislation that ensures privacy protections and encourages participation in data-sharing projects, as well as the establishment of uniform data protection standards in academic communities, such as the scientific community, in order to minimize privacy violations [13]
  • A focus on the process of creating data-release policies: making sure de-identification rhetoric is accurate, drawing up contracts that prohibit re-identification attempts and dissemination of sensitive information, establishing data enclaves, and utilizing data-based strategies to match required protection standards to the level of risk.[14]
  • Differential Privacy
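Differential privacy, the last item above, replaces exact query answers with noisy ones so that no single individual's presence or absence in a data set can be reliably detected. A minimal sketch of its basic building block, the Laplace mechanism, applied to a counting query (the data and parameters below are illustrative):

```python
import math
import random

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise of scale 1/epsilon.
    The sensitivity of a count is 1: adding or removing one person
    changes the true answer by at most 1, so this scale suffices."""
    true_count = sum(1 for row in data if predicate(row))
    scale = 1.0 / epsilon
    # Sample Laplace(0, scale) by inverse-transform sampling.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ages = [23, 45, 67, 34, 52, 41, 29]
# How many people are over 40? True answer: 4; released answer is noisy.
noisy = laplace_count(ages, lambda a: a > 40, epsilon=0.5)
print(noisy)  # 4 plus Laplace noise of scale 2
```

Smaller values of epsilon add more noise and give stronger privacy; the noisy answers remain useful in aggregate because the noise averages out over many queries.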

While a complete ban on re-identification has been urged, its enforcement would be impossible. However, there are ways for lawmakers to combat and punish re-identification efforts, if and when they are exposed: pair a ban with harsher penalties and stronger enforcement by the Federal Trade Commission and the Federal Bureau of Investigation; grant victims of re-identification a right of action against those who re-identify them; and mandate software audit trails for people who use and analyze anonymized data. A small-scale re-identification ban may also be imposed on trusted recipients of particular databases, such as government data miners. This ban would be much easier to enforce and may discourage re-identification in other spheres and in the future.[10]

See also

References

  1. ^ a b c d e f g h i j k l m n Porter, Christine. 2008. "Constitutional and Regulatory: De-Identified Data and Third Party Data Mining: The Risk of Re-Identification of Personal Information." University of Washington Shidler Journal of Law, Commerce & Technology. Retrieved March 26, 2017.
  2. ^ a b Lagos, Yianni. 2014. "Symposium: Taking the Personal Out of Data: Making Sense of De-Identification." Indiana Law Review. Retrieved March 26, 2017.
  3. ^ McGeveran, William. 2011. "Privacy, Democracy, and Elections: Mrs. McIntyre's Persona: Bringing Privacy Theory to Election Law." William & Mary Bill of Rights Journal. Retrieved March 26, 2017.
  4. ^ a b Groden, Samantha, Summer Martin, and Rebecca Merrill. 2016. "Proposed Changes to the Common Rule: A Standoff Between Patient Rights and Scientific Advances?" Journal of Health & Life Sciences Law. Retrieved March 26, 2017.
  5. ^ a b c Richardson, Victor, Sallie Milam, and Denise Chrysler. 2015. "Is Sharing De-identified Data Legal? The State of Public Health Confidentiality Laws and Their Interplay with Statistical Disclosure Limitation Techniques." Journal of Law, Medicine & Ethics. Retrieved March 26, 2017.
  6. ^ Peltz, Richard. 2009. "Beyond the Final Frontier: A "Post-Racial" America?: The Responsibilities of Citizens: From the Ivory Tower to the Glass House: Access to "De-Identified" Public University Admission Records to Study Affirmative Action." Harvard Journal on Racial and Ethnic Justice. Retrieved March 26, 2017.
  7. ^ Hoffman, Sharona. 2015. "Citizen Science: The Law and Ethics of Public Access to Medical Big Data." Berkeley Technology Law Journal. Retrieved March 26, 2017.
  8. ^ Greenberg, Yelena. 2016. "Recent Case Developments: Increasing Recognition of "Risk of Harm" as an Injury Sufficient to Warrant Standing in Class Action Medical Data Breach Cases." American Journal of Law & Medicine. Retrieved March 26, 2017.
  9. ^ 24 C.F.R. § .104 2017.
  10. ^ a b c Ohm, Paul. 2010. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.” UCLA Law Review. Retrieved March 26, 2017.
  11. ^ Sweeney, Latanya. 2015. "Only You, Your Doctor and Many Others May Know." Technology Science. 2015092903. September 25, 2015.
  12. ^ ibid.
  13. ^ Ahn, Sejin. 2015. “COMMENT: Whose Genome Is It Anyway?: Re-identification and Privacy Protection in Public and Participatory Genomics.” San Diego Law Review. Retrieved March 26, 2017.
  14. ^ Rubinstein, Ira S, and Hartzog, Woodrow. 2016. “Anonymization and Risk” Washington Law Review. Retrieved March 26, 2017.