Pseudonymization

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. Using a single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while keeping it suitable for data analysis and data processing.

Pseudonymization can be one way to comply with the requirements of the European Union's General Data Protection Regulation (GDPR) for secure storage of personal information.[1] Pseudonymized data can be restored to its original state by adding information that allows individuals to be re-identified, whereas anonymized data can never be restored to its original state.[2]
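
As an illustration, the following minimal Python sketch replaces selected fields with random pseudonyms and keeps the mapping table that makes re-identification possible. The class name `Pseudonymizer`, its methods, and the sample field names are hypothetical, chosen for this example only; it assumes field values are hashable and ignores the (very unlikely) case of two random tokens colliding.

```python
import secrets


class Pseudonymizer:
    """Replace identifying field values with random pseudonyms.

    The mapping table is the "additional information" that allows
    re-identification, so it should be stored apart from the data.
    """

    def __init__(self, fields):
        self.fields = tuple(fields)
        self._forward = {}  # (field, original value) -> pseudonym
        self._reverse = {}  # pseudonym -> (field, original value)

    def _pseudonym_for(self, field, value):
        key = (field, value)
        if key not in self._forward:
            token = secrets.token_hex(8)  # random, carries no information
            self._forward[key] = token
            self._reverse[token] = key
        return self._forward[key]

    def pseudonymize(self, record):
        """Return a copy of the record with the chosen fields replaced."""
        out = dict(record)
        for field in self.fields:
            if field in out:
                out[field] = self._pseudonym_for(field, out[field])
        return out

    def reidentify(self, record):
        """Restore original values; only possible with the mapping table."""
        out = dict(record)
        for field in self.fields:
            if field in out and out[field] in self._reverse:
                out[field] = self._reverse[out[field]][1]
        return out
```

Because the same original value always maps to the same pseudonym, records belonging to one person can still be linked for analysis. Discarding the mapping table removes the direct route back to the original values, although the remaining fields may still allow re-identification.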

Data fields

The choice of which data fields to pseudonymize is partly subjective. Less selective fields, such as birth date or postal code, are often also included because they are usually available from other sources and therefore make a record easier to identify. Pseudonymizing these fields removes most of their analytic value, so it is normally accompanied by the introduction of new derived, less identifying forms, such as year of birth or a larger postal code region.
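
The derived forms mentioned above can be sketched as follows. This is only an illustration: the field names `birth_date` and `postal_code`, the function name `generalize`, and the choice of a three-character postal prefix as the "larger region" are assumptions, and real postal systems differ in how regions are encoded.

```python
from datetime import date


def generalize(record, postal_prefix_len=3):
    """Replace exact quasi-identifiers with coarser derived fields."""
    out = dict(record)

    # Exact birth date -> year of birth only.
    birth = out.pop("birth_date", None)
    if birth is not None:
        out["birth_year"] = birth.year

    # Full postal code -> leading characters, standing in for a larger region.
    postal = out.pop("postal_code", None)
    if postal is not None:
        out["postal_region"] = postal.replace(" ", "")[:postal_prefix_len]

    return out


example = generalize(
    {"birth_date": date(1980, 5, 17), "postal_code": "SW1A 1AA", "diagnosis": "x"}
)
```

The exact values are dropped from the output, so the coarser fields keep some analytic value while matching fewer external records.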

Data fields that are less identifying, such as date of attendance, are usually not pseudonymized. This is because pseudonymizing them would sacrifice too much statistical utility, not because the data cannot be used to identify someone. For example, given prior knowledge of a few attendance dates, it is easy to identify someone's data in a pseudonymized dataset by selecting only those people with that pattern of dates. This is an example of an inference attack.
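
The attendance-date attack described above can be sketched in a few lines of Python. The record layout (a `pseudonym` and an `attended` field per row), the function name, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def match_by_attendance(records, known_dates):
    """Return pseudonyms whose attendance dates include every known date."""
    dates_by_pseudonym = defaultdict(set)
    for rec in records:
        dates_by_pseudonym[rec["pseudonym"]].add(rec["attended"])

    known = set(known_dates)
    # A pseudonym is a candidate if the known dates are a subset of its dates.
    return sorted(p for p, dates in dates_by_pseudonym.items() if known <= dates)


# Toy pseudonymized dataset: names are hidden, attendance dates are not.
records = [
    {"pseudonym": "p1", "attended": "2018-01-03"},
    {"pseudonym": "p1", "attended": "2018-02-14"},
    {"pseudonym": "p2", "attended": "2018-01-03"},
    {"pseudonym": "p2", "attended": "2018-03-09"},
    {"pseudonym": "p3", "attended": "2018-02-14"},
]

candidates = match_by_attendance(records, ["2018-01-03", "2018-02-14"])
```

In this toy data only `p1` attended on both known dates, so two dates are enough to single out one person's records even though no name appears in the dataset.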

The vulnerability of pseudonymized data to inference attacks is commonly overlooked. A well-known example is the AOL search data scandal.

Protecting statistically useful pseudonymized data from re-identification requires:

  1. a sound information security base
  2. controlling the risk that the analysts, researchers or other data workers cause a privacy breach

The pseudonym allows data to be traced back to its origins, which distinguishes pseudonymization from anonymization,[3] where all person-related data that could allow backtracking has been purged. Pseudonymization is relevant, for example, to patient-related data that has to be passed securely between clinical centers.

The application of pseudonymization to e-health is intended to preserve the patient's privacy and data confidentiality. It allows primary use of medical records by authorized health care providers and privacy-preserving secondary use by researchers.[4] However, plain pseudonymization often reaches its limits when genetic data is involved (see also genetic privacy). Because genetic data is inherently identifying, depersonalization is often not sufficient to hide the corresponding person. One potential solution is to combine pseudonymization with fragmentation and encryption.[5]

One application of pseudonymization is the creation of datasets for de-identification research by replacing identifying words with words from the same category (e.g. replacing a name with a random name from a dictionary of names).[6][7][8] In this case, however, it is generally not possible to trace the data back to its origins.
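
A minimal sketch of this kind of same-category replacement is shown below. It assumes the positions of the names in the text have already been annotated; the function name, the tiny name dictionary, and the sample sentence are hypothetical and are not taken from the cited work.

```python
import random


def replace_names(tokens, name_positions, name_dictionary, rng=None):
    """Replace the tokens at the annotated positions with random names."""
    if rng is None:
        rng = random.Random()
    out = list(tokens)  # leave the input untouched
    for i in name_positions:
        # No mapping from surrogate to original is kept.
        out[i] = rng.choice(name_dictionary)
    return out


tokens = ["Patient", "John", "was", "seen", "by", "Dr", "Smith"]
surrogate = replace_names(tokens, [1, 6], ["Maria", "Chen", "Okafor"])
```

Because each surrogate is drawn at random and no mapping is stored, the output alone gives no way back to the original names, which matches the limitation noted above.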

References

  1. Data science under GDPR with pseudonymization in the data pipeline. Published by Dativa, 17 April 2018.
  2. Pseudonymization vs. Anonymization and How They Help With GDPR. Published January 2017. Retrieved 20 April 2018.
  3. Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management – A Consolidated Proposal for Terminology. http://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.31.pdf
  4. Neubauer T, Heurix J. A methodology for the pseudonymization of medical data. Int J Med Inform. 2011 Mar;80(3):190-204. doi:10.1016/j.ijmedinf.2010.10.016. PMID 21075676.
  5. Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption. http://www.xylem-technologies.com/2011/09/07/privacy-preserving-storage-and-access-of-medical-data-through-pseudonymization-and-encryption
  6. Neamatullah, Ishna; Douglass, Margaret M; Lehman, Li-wei H; Reisner, Andrew; Villarroel, Mauricio; Long, William J; Szolovits, Peter; Moody, George B; Mark, Roger G; Clifford, Gari D (2008). "Automated de-identification of free-text medical records". BMC Medical Informatics and Decision Making. 8: 32. doi:10.1186/1472-6947-8-32.
  7. org/physiotools/deid/doc/ishna-meng-thesis.pdf
  8. Deleger, L; et al. (2014). "Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research". J Biomed Inform. 50: 173–183. doi:10.1016/j.jbi.2014.01.014.