Statistical disclosure control

From Wikipedia, the free encyclopedia

Statistical disclosure control (SDC), also known as statistical disclosure limitation (SDL) or disclosure avoidance, is a technique used in data-driven research to ensure no person or organization is identifiable from the results of an analysis of survey or administrative data, or in the release of microdata. The purpose of SDC is to protect the confidentiality of the respondents and subjects of the research.[1]

SDC usually refers to 'output SDC': ensuring that, for example, a published table or graph does not disclose confidential information about respondents. SDC can also describe protection methods applied to the data itself: for example, removing names and addresses, limiting extreme values, or swapping problematic observations. This is sometimes referred to as 'input SDC', but is more commonly called anonymization, de-identification, or microdata protection.
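
The input-side methods mentioned above can be illustrated with a minimal Python sketch. It removes direct identifiers and limits an extreme value by top-coding. The records, the field names and the cap of 100,000 are invented for illustration and are not taken from any agency's practice.

```python
# Hypothetical microdata records; all names and values are invented.
records = [
    {"name": "A. Smith", "address": "1 High St", "age": 34, "income": 28_000},
    {"name": "B. Jones", "address": "2 Low Rd", "age": 51, "income": 41_000},
    {"name": "C. Patel", "address": "3 Mill Ln", "age": 47, "income": 950_000},
]

DIRECT_IDENTIFIERS = {"name", "address"}  # assumed list of identifying fields
INCOME_CAP = 100_000                      # assumed top-coding limit


def protect(record):
    """Drop direct identifiers and top-code income at INCOME_CAP."""
    kept = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    kept["income"] = min(kept["income"], INCOME_CAP)
    return kept


protected = [protect(r) for r in records]
print(protected[2])  # {'age': 47, 'income': 100000}
```

Swapping observations is not shown here. Dedicated tools such as those listed under Tools implement these and other methods in full.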

Textbooks (e.g. [2]) typically cover input SDC and tabular data protection, but not other parts of output SDC. This is because these two problems are of direct interest to the statistical agencies that supported the development of the field.[3] In analytical environments, output rules developed for statistical agencies were generally used until data managers began arguing for output SDC designed specifically for research.[4]

Necessity

Many kinds of social, economic and health research rely on potentially sensitive data, such as survey or census data, tax records, health records and educational information. Such information is usually given in confidence and, in the case of administrative data, not always for the purpose of research.

Researchers are not usually interested in information about one single person or business; they are looking for trends among larger groups of people.[5] However, the data they use is, in the first place, linked to individual people and businesses, and SDC ensures that these cannot be identified from published data, no matter how detailed or broad.[6]

It is possible that at the end of data analysis, the researcher somehow singles out one person or business through their research. For example, a researcher may identify the exceptionally good or bad service in a geriatric department within a hospital in a remote area, where only one hospital provides such care. In that case, the data analysis 'discloses' the identity of the hospital, even if the dataset used for analysis was properly anonymised or de-identified.

Statistical disclosure control will identify this disclosure risk and ensure the results of the analysis are altered to protect confidentiality.[7] It requires a balance between protecting confidentiality and ensuring the results of the data analysis are still useful for statistical research.[8]

Output SDC

There are two main approaches to output SDC: principles-based and rules-based.[9] In principles-based systems, disclosure control attempts to uphold a specific set of fundamental principles, for example "no person should be identifiable in released microdata".[10] Rules-based systems, in contrast, rely on a specific set of rules that the person performing disclosure control follows, after which the data are presumed to be safe to release. In general, official statistics are rules-based; research environments are more likely to be principles-based.

In research environments, the choice of output-checking regime can have significant operational implications.[11]

Rules-Based SDC

In rules-based SDC, a rigid set of rules is used to determine whether or not the results of data analysis can be released. The rules are applied consistently, which makes it obvious what kinds of output are acceptable. Rules-based systems are good for ensuring consistency across time, across data sources, and across production teams, which makes them appealing for statistical agencies.[11] Rules-based systems also work well for remote job servers such as microdata.no or LISSY.
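
As a rough illustration of how mechanical such a check can be, the following Python sketch applies a single hard rule, a minimum cell count, to a frequency table. The threshold of 10 and the table contents are assumptions made for this example. Real rule sets differ between organisations and usually contain several rules.

```python
MIN_CELL_COUNT = 10  # assumed threshold, chosen only for illustration


def failing_cells(cell_counts):
    """Return the labels of cells whose count is below the threshold."""
    return [label for label, n in cell_counts.items() if n < MIN_CELL_COUNT]


def release_decision(cell_counts):
    """Apply the rule rigidly: any failing cell blocks the whole output."""
    failures = failing_cells(cell_counts)
    if failures:
        return ("refuse", failures)
    return ("release", [])


# Hypothetical table of respondents by region.
table = {"north": 120, "south": 45, "remote": 3}
print(release_decision(table))  # ('refuse', ['remote'])
```

The decision depends only on the counts, which is what makes this kind of check easy to automate and to apply consistently.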

However, because the rules are inflexible, they may either let disclosive information slip through or be so restrictive that only results too broad for useful analysis can be published.[9] In practice, research environments running rules-based systems may have to bring in flexibility through 'ad hoc' arrangements.[11]

The Northern Ireland Statistics and Research Agency uses a rules-based approach to releasing statistics and research results.[12]

Principles-Based SDC

In principles-based SDC, both the researcher and the output checker are trained in SDC. They receive a set of rules, which are rules of thumb rather than the hard rules of rules-based SDC. This means that, in principle, any output may be approved or refused. The rules of thumb are a starting point for the researcher. A researcher may request outputs that breach the rules of thumb as long as (1) they are non-disclosive, (2) they are important, and (3) the request is exceptional.[13] It is up to the researcher to show that any 'unsafe' outputs are non-disclosive, but the checker has the final say. Since there are no hard rules, this requires knowledge of disclosure risks and judgment from both the researcher and the checker. It also requires training and an understanding of statistics and data analysis,[9] although it has been argued[11] that this can be used to make the process more efficient than a rules-based model.
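
The exception route described above can be sketched in Python as follows. This is a simplification under stated assumptions: the rule of thumb is a minimum cell count of 10 (an invented value), and the checker's judgment on the three conditions is reduced to booleans supplied by the caller. The sketch does not model the fact that a checker may also refuse an output that meets the rule of thumb.

```python
from dataclasses import dataclass
from typing import Optional

RULE_OF_THUMB_MIN_COUNT = 10  # assumed rule of thumb, for illustration only


@dataclass
class ExceptionCase:
    """The checker's view of a researcher's case for an exception."""
    non_disclosive: bool  # checker accepts the output discloses nothing
    important: bool       # the output matters for the research
    exceptional: bool     # such requests are not routine


def decide(min_cell_count: int, case: Optional[ExceptionCase] = None) -> str:
    """Return 'approve' or 'refuse' for an output with the given smallest cell."""
    if min_cell_count >= RULE_OF_THUMB_MIN_COUNT:
        return "approve"
    if case is None:
        # Rule of thumb breached and no argument offered by the researcher.
        return "refuse"
    if case.non_disclosive and case.important and case.exceptional:
        return "approve"
    return "refuse"


print(decide(25))                                   # approve
print(decide(4))                                    # refuse
print(decide(4, ExceptionCase(True, True, True)))   # approve
print(decide(4, ExceptionCase(True, False, True)))  # refuse
```

Unlike a rules-based check, the outcome for the same table can differ depending on the argument made and the checker's assessment of it.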

The UK Data Service employs a principles-based approach to statistical disclosure control for its Secure Data Service.[14]

Critiques

Many contemporary statistical disclosure control techniques, such as generalization and cell suppression, have been shown to be vulnerable to attack by a hypothetical data intruder. For example, Cox showed in 2009 that complementary cell suppression typically leads to "over-protected" solutions because of the need to suppress both primary and complementary cells, and even then can lead to the compromise of sensitive data when exact intervals are reported.[15]
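
The reason complementary cells are needed at all, and why some information still leaks, can be shown with a small Python sketch. It is a toy example with invented counts, not a reproduction of Cox's analysis: one table row with a published total, where an intruder uses subtraction to recover or bound the suppressed cells.

```python
# Hypothetical table row; the total is assumed to be published.
row = {"A": 40, "B": 3, "C": 25, "D": 12}
total = sum(row.values())  # 80


def suppress(cells, hidden):
    """Publish the row with the cells in `hidden` replaced by None."""
    return {k: (None if k in hidden else v) for k, v in cells.items()}


def hidden_sum(published, total):
    """What an intruder learns: the sum of all suppressed cells."""
    return total - sum(v for v in published.values() if v is not None)


def recover_single(published, total):
    """If exactly one cell is suppressed, recover it by subtraction."""
    hidden = [k for k, v in published.items() if v is None]
    if len(hidden) != 1:
        return None
    return hidden[0], hidden_sum(published, total)


# Primary suppression only: the small cell B is recovered exactly.
primary_only = suppress(row, {"B"})
print(recover_single(primary_only, total))  # ('B', 3)

# Complementary suppression of D prevents exact recovery, but the
# intruder still knows B + D = 15, so each hidden cell lies in [0, 15].
with_complementary = suppress(row, {"B", "D"})
print(recover_single(with_complementary, total))  # None
print(hidden_sum(with_complementary, total))      # 15
```

Suppressing D costs a cell that was not itself sensitive, which is the sense in which such solutions are over-protected, while the bound on B shows that the protection is still partial.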

A more substantive criticism is that the theoretical models used to explore control measures are not appropriate as guides for practical action.[16] Hafner et al. provide a practical example of how a change in perspective can generate substantially different results.[3]

Tools

mu-Argus and sdcMicro are open-source tools for input SDC.

tau-Argus and sdcTable are open-source tools for tabular data protection.

References

  1. ^ Skinner, Chris (2009). "Statistical Disclosure Control for Survey Data" (PDF). Handbook of Statistics Vol 29A: Sample Surveys: Design, Methods and Applications. Handbook of Statistics. 29: 381–396. doi:10.1016/S0169-7161(08)00015-1. ISBN 9780444531247. Retrieved March 2016.
  2. ^ "References", Statistical Disclosure Control, Chichester, UK: John Wiley & Sons, Ltd, pp. 261–277, 2012-07-05, doi:10.1002/9781118348239.refs, ISBN 9781118348239, retrieved 2021-11-08
  3. ^ a b Hafner, Hans-Peter; Lenz, Rainer; Ritchie, Felix (2019-01-01). "User-focused threat identification for anonymised microdata". Statistical Journal of the IAOS. 35 (4): 703–713. doi:10.3233/SJI-190506. ISSN 1874-7655. S2CID 55976703.
  4. ^ Ritchie, Felix (2007). Disclosure detection in research environments in practice. Paper presented at UNECE/Eurostat work session on statistical data confidentiality.
  5. ^ "ADRN » Safe results". adrn.ac.uk. Retrieved 2016-03-08.
  6. ^ "Government Statistical Services: Statistical Disclosure Control". Retrieved March 2016.
  7. ^ Templ, Matthias; et al. (2014). "International Household Survey Network" (PDF). IHSN Working Paper. Retrieved March 2016.
  8. ^ "Archived: ONS Statistical Disclosure Control". Office for National Statistics. Archived from the original on 2016-01-05. Retrieved March 2016.
  9. ^ a b c Ritchie, Felix, and Elliott, Mark (2015). "Principles- Versus Rules-Based Output Statistical Disclosure Control In Remote Access Environments" (PDF). IASSIST Quarterly. 39 (2): 5–13. doi:10.29173/iq778. Retrieved March 2016.
  10. ^ Ritchie, Felix (2009-01-01). "UK release practices for official microdata". Statistical Journal of the IAOS. 26 (3, 4): 103–111. doi:10.3233/SJI-2009-0706. ISSN 1874-7655.
  11. ^ a b c d Alves, Kyle; Ritchie, Felix (2020-11-25). "Runners, repeaters, strangers and aliens: Operationalising efficient output disclosure control". Statistical Journal of the IAOS. 36 (4): 1281–1293. doi:10.3233/SJI-200661. S2CID 209455141.
  12. ^ "Census 2001 - Methodology" (PDF). Northern Ireland Statistics and Research Agency. 2001. Retrieved March 2016.
  13. ^ Office for National Statistics. "Safe Researcher Training".
  14. ^ Afkhamai, Reza; et al. (2013). "Statistical Disclosure Control Practice in the Secure Access of the UK Data Service" (PDF). United Nations Economic Commission for Europe. Retrieved March 2016.
  15. ^ Lawrence H. Cox, Vulnerability of Complementary Cell Suppression to Intruder Attack, Journal of Privacy and Confidentiality (2009) 1, Number 2, pp. 235–251 http://repository.cmu.edu/jpc/vol1/iss2/8/
  16. ^ Ritchie, Felix; Hafner, Hans-Peter; Lenz, Rainer; Welpton, Richard (2018-10-18). "Evidence-based, default-open, risk-managed, user-centred data access".