De-identification is the process used to prevent a person's identity from being connected with information. It is commonly used in human subject research to protect the privacy of research participants. Common strategies for de-identifying datasets include deleting or masking personal identifiers, such as name and social security number, and suppressing or generalizing quasi-identifiers, such as date of birth and zip code. The reverse process of defeating de-identification to identify individuals is known as re-identification. Several successful re-identification attempts have cast doubt on the effectiveness of de-identification in protecting individuals' privacy. However, a systematic review of the evidence found that published re-identification attacks were performed on data sets that had not been de-identified properly (that is, using recognized standards).
The United States President's Council of Advisors on Science and Technology and others have deemed de-identification "somewhat useful as an added safeguard" but not "a useful basis for policy" as "it is not robust against near‐term future re‐identification methods".
Consider a survey, such as a census, conducted to collect information about a group of people. To encourage participation and to protect the privacy of respondents, the researchers try to design the survey so that, when the results are published, no participant's individual response can be matched with any of the published data.
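One common way to approach this goal is to publish only aggregate counts and to withhold counts that describe very few respondents. The following is a minimal sketch of that idea, not a complete disclosure-control method: the function name `publish_counts`, the field name `occupation`, and the minimum cell size of 5 are all assumptions made for illustration.

```python
from collections import Counter

def publish_counts(responses, field, min_cell=5):
    """Tabulate one survey field, suppressing counts below min_cell.

    min_cell=5 is an illustrative threshold, not a value from any standard.
    Suppressed cells are reported as None.
    """
    counts = Counter(r[field] for r in responses)
    return {value: (n if n >= min_cell else None) for value, n in counts.items()}

# Hypothetical survey responses: six teachers and one astronaut.
responses = [{"occupation": "teacher"} for _ in range(6)]
responses.append({"occupation": "astronaut"})

published = publish_counts(responses, "occupation")
print(published)  # {'teacher': 6, 'astronaut': None}
```

The single astronaut would be easy to recognize in the published table, so that count is withheld. Suppression of small cells on its own does not guarantee that responses cannot be matched to individuals.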
Anonymization and de-identification
Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study, so that no future re-identification is possible, even by the study organizers and under any conditions. De-identification also severs a data set from the identity of the data contributor, but may preserve identifying information that can be re-linked only by a trusted party in certain situations. There is a debate in the technology community over whether data that can be re-linked, even by a trusted party, should ever be considered de-identified.
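The distinction can be illustrated with a small sketch. It assumes records are plain dictionaries with a single identifying field, `name`; the function names `deidentify_with_link` and `anonymize` are hypothetical. In the first function a link table maps random codes back to identities and would be held only by the trusted party; in the second no such table exists.

```python
import secrets

def deidentify_with_link(records, id_field="name"):
    """Replace the identifying field with a random code.

    Returns the coded records and a link table (code -> identity).
    Only a trusted party holding the link table can re-link the data.
    """
    link_table = {}
    coded = []
    for rec in records:
        code = secrets.token_hex(8)  # random code, unrelated to the identity
        link_table[code] = rec[id_field]
        out = {k: v for k, v in rec.items() if k != id_field}
        out["code"] = code
        coded.append(out)
    return coded, link_table

def anonymize(records, id_field="name"):
    """Drop the identifying field and keep no link table at all."""
    return [{k: v for k, v in rec.items() if k != id_field} for rec in records]

# Hypothetical study records.
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "flu"},
]

coded, link_table = deidentify_with_link(records)
anon = anonymize(records)

# The trusted party can recover an identity from a code; nobody can from `anon`.
print(link_table[coded[0]["code"]])  # Alice Example
```

This sketch only handles the direct identifier. The remaining fields could still act as quasi-identifiers, so dropping the name is not by itself sufficient for anonymization in the sense described above.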
Whenever a person participates in genetics research, the donation of a biological specimen often results in the creation of a large amount of personalized data. Such data is uniquely difficult to de-identify.
Anonymization of genetic data is particularly difficult because of the huge amount of genotypic information in biospecimens, the ties that specimens often have to medical history, and the advent of modern bioinformatics tools for data mining. There have been demonstrations that data for individuals in aggregate collections of genotypic data sets can be tied to the identities of the specimen donors.
Some researchers have suggested that it is not reasonable to ever promise participants in genetics research that they can retain their anonymity, but instead such participants should be taught the limits of using coded identifiers in a de-identification process.
De-identification laws in the United States of America
The HIPAA Privacy Rule provides mechanisms for using and disclosing health data responsibly without the need for patient consent. These mechanisms center on two HIPAA de-identification standards: Safe Harbor and the Expert Determination Method. Safe Harbor relies on the removal of specific patient identifiers (e.g. name, phone number, email address), while the Expert Determination Method requires knowledge of and experience with generally accepted statistical and scientific principles and methods to render information not individually identifiable.
The Safe Harbor method uses a list approach to de-identification and has two requirements:
- The removal or generalization of 18 elements from the data.
- That the Covered Entity or Business Associate does not have actual knowledge that the residual information in the data could be used alone, or in combination with other information, to identify an individual.
Safe Harbor is a highly prescriptive approach to de-identification. Under this method, all dates must be generalized to year and zip codes reduced to three digits. The same approach is used on the data regardless of the context. Even if the information is to be shared with a trusted researcher who wishes to analyze the data for seasonal variations in acute respiratory cases and, thus, requires the month of hospital admission, this information cannot be provided; only the year of admission would be retained.
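The generalization steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not a compliant implementation: `DIRECT_IDENTIFIERS` is a small illustrative subset rather than the full list of 18 elements, the record layout and field names are hypothetical, and the actual rule attaches further conditions that the sketch does not model.

```python
from datetime import date

# Illustrative subset only; the Safe Harbor list contains 18 elements.
DIRECT_IDENTIFIERS = {"name", "phone", "email"}

def safe_harbor_sketch(record):
    """Drop listed identifiers, reduce dates to year, and zip codes to 3 digits."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # removed entirely
        if isinstance(value, date):
            out[field + "_year"] = value.year  # month and day are discarded
        elif field == "zip":
            out["zip3"] = str(value)[:3]  # keep only the first three digits
        else:
            out[field] = value
    return out

# Hypothetical hospital admission record.
record = {
    "name": "Alice Example",
    "phone": "555-0100",
    "email": "alice@example.org",
    "admitted": date(2015, 2, 14),
    "zip": "02139",
    "diagnosis": "acute bronchitis",
}

print(safe_harbor_sketch(record))
# {'admitted_year': 2015, 'zip3': '021', 'diagnosis': 'acute bronchitis'}
```

The output shows why the seasonal analysis mentioned above is not possible on Safe Harbor data: the admission month is gone and only the year remains.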
Expert Determination takes a risk-based approach to de-identification that applies current standards and best practices from the research to determine the likelihood that a person could be identified from their protected health information. This method requires that a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods render the information not individually identifiable. It requires:
- That the risk is very small that the information could be used alone, or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information;
- That the methods and results of the analysis that justify such a determination are documented.
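As a rough illustration of what a risk-based assessment might measure, the sketch below groups records by their quasi-identifier values and treats the risk for each record as one divided by the size of its group. This is one simple measure chosen for the example, not the method prescribed by the rule. The function name `max_reidentification_risk`, the field names, and the threshold of 0.2 are hypothetical; the threshold is not a value taken from HIPAA.

```python
from collections import Counter

def max_reidentification_risk(records, quasi_identifiers):
    """Return the highest per-record risk, taken as 1 / (size of the record's group).

    Records sharing the same quasi-identifier values form one group.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    return max(1 / group_sizes[k] for k in keys)

# Hypothetical data set that has already been generalized.
records = [
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
]

THRESHOLD = 0.2  # hypothetical acceptance level chosen for this example

risk = max_reidentification_risk(records, ["birth_year", "zip3", "sex"])
print(risk)               # 0.5 (the smallest group has two records)
print(risk <= THRESHOLD)  # False
```

In this example the data would need further generalization or suppression before the measured risk fell under the chosen level. An actual determination would also consider what other information is reasonably available to the anticipated recipient.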
Research on decedents
The key law governing research on electronic health record data is the HIPAA Privacy Rule, which allows the electronic health records of deceased subjects to be used for research (section 164.512(i)(1)(iii)).
References
- Sweeney, L. (2000). "Simple Demographics Often Identify People Uniquely". Data Privacy Working Paper. 3.
- de Montjoye, Y.-A. (2013). "Unique in the crowd: The privacy bounds of human mobility". Scientific Reports. 3. doi:10.1038/srep01376.
- de Montjoye, Y.-A. (2015). "Unique in the shopping mall: On the reidentifiability of credit card metadata". Science. 347.
- Narayanan, A. (2006). "How to break anonymity of the Netflix Prize dataset".
- El Emam, Khaled (2011). "A Systematic Review of Re-Identification Attacks on Health Data". PLOS ONE. 10 (4).
- PCAST. "Report to the President - Big Data and Privacy: A technological perspective" (PDF). Retrieved 28 March 2016.
- Godard, B. A.; Schmidtke, J. R.; Cassiman, J. J.; Aymé, S. G. N. (2003). "Data storage and DNA banking for biomedical research: Informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective". European Journal of Human Genetics. 11: S88–122. doi:10.1038/sj.ejhg.5201114. PMID 14718939.
- Fullerton, S. M.; Anderson, N. R.; Guzauskas, G.; Freeman, D.; Fryer-Edwards, K. (2010). "Meeting the Governance Challenges of Next-Generation Biorepository Research". Science Translational Medicine. 2 (15): 15cm3. doi:10.1126/scitranslmed.3000361. PMID 20371468.
- McMurry, A. J.; Gilbert, C. A.; Reis, B. Y.; Chueh, H. C.; Kohane, I. S.; Mandl, K. D. (2007). "A self-scaling, distributed information architecture for public health, research, and clinical care". Journal of the American Medical Informatics Association. 14: 527–33. doi:10.1197/jamia.M2371. PMID 17460129.
- Nicholson, S.; Smith, C. A. (2006). "Using lessons from health care to protect the privacy of library users: Guidelines for the de-identification of library data based on HIPAA". Proceedings of the American Society for Information Science and Technology. 42: n/a. doi:10.1002/meet.1450420106.
- McGuire, A. L.; Gibbs, R. A. (2006). "GENETICS: No Longer De-Identified". Science. 312 (5772): 370–371. doi:10.1126/science.1125339. PMID 16627725.
- Thorisson, G. A.; Muilu, J.; Brookes, A. J. (2009). "Genotype–phenotype databases: Challenges and solutions for the post-genomic era". Nature Reviews Genetics. 10 (1): 9–18. doi:10.1038/nrg2483. PMID 19065136.
- Homer, N.; Szelinger, S.; Redman, M.; Duggan, D.; Tembe, W.; Muehling, J.; Pearson, J. V.; Stephan, D. A.; Nelson, S. F.; Craig, D. W. (2008). Visscher, Peter M., ed. "Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays". PLoS Genetics. 4 (8): e1000167. doi:10.1371/journal.pgen.1000167. PMID 18769715.
- "De-Identification 201". Privacy Analytics. 2015.
- 45 C.F.R. 164.512
- Garfinkel, Simson L. (2015-12-16). "NISTIR 8053, De-Identification of Personal Information" (PDF). NIST. Retrieved 2016-01-03.
- A training series on United States government de-identification standards
- Guidance Regarding Methods for De-identification of Protected Health Information
- Ohm, Paul (2010). "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization". UCLA Law Review: 1701. Retrieved 2016-09-30.