Differential Privacy

From Wikipedia, the free encyclopedia
  (Redirected from Differential privacy)
Jump to navigation Jump to search

Differential privacy is a statistical technique that aims to provide means to maximize the accuracy of queries from statistical databases while measuring (and, thereby, hopefully minimizing) the privacy impact on individuals whose information is in the database. Differential privacy was developed by cryptographers and is thus often associated with cryptography, and it draws much of its language from cryptography. Although differential privacy is often discussed in the context of identifying individuals whose information may be a database, identification and reidentification attacks are not included within the original differential privacy framework.

Synopsis[edit]

Invented by Cynthia Dwork and Frank McSherry in 2005[1], Differential Privacy is a system for measuring the theoretical maximum privacy loss that an individual can experience when their private information is used to create a statistical data product. The term was coined in the Dwork McSherry patent application in 2005, and first appeared in print in 2006,[2]. Today the term "differential privacy" is used broadly to describe both the mathematical guarantees associated with 2005 invention, so-called relaxed definitions of privacy that can be derived from the 2005 definition, and the family of data query mechanisms that provide equivalent privacy guarantees.

Differential privacy mechanisms operate by introducing randomness into the results of queries on underlying confidential data. Because of this randomness, an observer of the queries faces ambiguity when trying to reconstruct what the confidential data must have been in order to produce the observed results. Mechanisms that do not employ randomness cannot exhibit differential privacy.

History[edit]

Official statistical organizations are charged with collecting information from individuals or establishments and publishing aggregate data to serve the public interest. For example, the 1790 United States Census collected information about individuals living in the United States and published tabulations based on sex, age, race, and condition of servitude. Statistical organizations have long collected information under a promise of confidentiality that the information provided will be used for statistical purposes, but that the publications will not produce information that can be traced back to a specific individual or establishment. To accomplish this goal, statistical organizations have long suppressed information in their publications. For example, in table presenting the sales of each business in a town grouped by business category, a cell that has information from only one company might be suppressed, in order to maintain the confidentiality of that company's specific sales.

The adoption of electronic information processing systems by statistical agencies in the 1950s and 1960s dramatically increased the number of tables that a statistical organization could produce and, in so doing, significantly increased the potential for an improper disclosure of confidential information. For example, if a business that had its sales numbers suppressed also had those numbers appear in the total sales of a region, then it might be possible to determine the suppressed value by subtracting the other sales from that total. But there might also be combinations of additions and subtractions that might cause the private information to be revealed. The number of combinations that needed to be checked increases exponentially with the number of publications, and it is potentially unbounded if data users are able to make queries of the statistical database using an interactive query system.

In 1977 Tore Dalenius formalized the mathematics of cell suppression.[3]

In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results.[4]. This and future research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called query privacy, with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.

In 2003 Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries---far fewer than was implied by previous work.[5]

In 2006 Dwork, McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so. [6]

Since then, subsequent research has shown that there are many ways that it possible to produce very accurate statistics from the database while still ensuring high levels of privacy.[7][8]

ε-Differential Privacy[edit]

The 2006 Dwork, McSherry, Nissim and Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. (Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)

The key discovery of the 2003 Nissim Dinum paper is that any release of data on a statistical database has the possibility of impacting the privacy of individuals whose data is in the database. For example, if a database contains individuals' ages and their heights, then releasing the average age and the average height reveals some information about all of the individuals. If enough statistics are released, it is possible to reconstruct the entire database.

The intuition for the 2006 definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. Therefore with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on their data of any one individual.

Of course, how much any individual contributes to the result of a database depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis."

The 2006 paper presents both a mathematical definition of differential privacy and a mechanism based on the addition of Laplace noise that satisfies the definition.

The definition of ε-differential privacy[edit]

Let ε be a positive real number and be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let denote the image of . The algorithm is said to provide -differential privacy if, for all datasets and that differ on a single element (i.e., the data of one person), and all subsets of :

where the probability is taken over the randomness used by the algorithm.[9]

The Laplace mechanism[edit]

The Laplace mechanism adds Laplace noise (i.e. noise from the Laplace distribution, which can be expressed by probability density function , which has mean zero and standard deviation ). Now in our case we define the output function of as a real valued function (called as the transcript output by ) as where and is the original real valued query/function we planned to execute on the database. Now clearly can be considered to be a continuous random variable, where



which is at most . We can consider to be the privacy factor . Thus follows a differentially private mechanism (as can be seen from the definition above). If we try to use this concept in our diabetes example then it follows from the above derived fact that in order to have as the -differential private algorithm we need to have . Though we have used Laplacian noise here, other forms of noise, such as the Gaussian Noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.[10]

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records where each record is a pair (Name, X), where is a Boolean denoting whether a person has diabetes or not. For example:

Name Has Diabetes (X)
Ross 1
Monica 1
Joey 0
Phoebe 0
Chandler 1

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query that returns the partial sum of the first rows of column in the database. In order to find Chandler's diabetes status the adversary executes and , then computes their difference. In this example, and , so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

Continuing this example, if we construct by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish from by computing for each dataset. If the adversary were required to receive the values via an -differentially private algorithm, for a sufficiently small , then he or she would be unable to distinguish between the two datasets.

Sensitivity[edit]

Let be a positive integer, be a collection of datasets, and be a function. The sensitivity [11] of a function, denoted , is defined by

where the maximum is over all pairs of datasets and in differing in at most one element and denotes the norm.

In the example of the medical database above, if we consider to be the function , then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.

There are techniques (which are described below) using which we can create a differentially private algorithm for functions with low sensitivity.

Trade-off between utility and privacy[edit]

There is a trade-off between the accuracy of the statistics estimated in a privacy-preserving manner, and the privacy parameter ε.[12][13][14][15] This tradeoff must also account the number of queries (done and expected in the future), by multiplying the epsilon parameter to the number of queries.

Composability[edit]

Sequential composition. If we query an ε-differential privacy mechanism times, and the randomization of the mechanism is independent for each query, then the result would be -differentially private. In the more general case, if there are independent mechanisms: , whose privacy guarantees are differential privacy, respectively, then any function of them: is -differentially private.[16]

Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database then the function would be -differentially private instead.[16]

Epsilon-Differential Privacy Mechanisms[edit]

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[17] and posterior sampling[18] sample from a problem-dependent family of distributions instead.

Randomized Response[edit]

A simple example, especially developed in the social sciences,[9] is to ask a person to answer the question "Do you own the attribute A?", according to the following procedure:

  1. Throw a coin.
  2. If head, then answer honestly.
  3. If tail, then throw the coin again and answer "Yes" if head, "No" if tail.

The confidentiality arises from the refutability of the individual responses.

But, overall, these data with many responses are significant, since positive responses are given to a quarter by people who do not have the attribute A and three-quarters by people who actually possess it. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1-p) + (3/4)p = (1/4) + p/2 positive responses. Hence it is possible to estimate p.

In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as the person has a probability of a "Yes" response, whatever it may be.

Although this example, inspired by randomized response, might be applicable to microdata (i.e., releasing datasets with each individual response), by definition differential privacy excludes microdata release and is only applicable to queries (i.e., aggregating individual responses into one result) as this would violate the requirements, more specifically the plausible deniability that a subject participated or not.[19][20]

Group privacy[edit]

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted his information. However this is also extendable if we want to protect databases differing in rows, which amounts to adversary with arbitrary auxiliary information can know if particular participants submitted their information. This can be achieved because if items change, the probability dilation is bounded by instead of ,[10] i.e., for D1 and D2 differing on items:

Thus setting ε instead to achieves the desired result (protection of items). In other words, instead of having each item ε-differentially private protected, now every group of items is ε-differentially private protected (and each item is -differentially private protected).

Stable transformations[edit]

A transformation is -stable if the hamming distance between and is at most -times the hamming distance between and for any two databases . Theorem 2 in [16] asserts that if there is a mechanism that is -differentially private, then the composite mechanism is -differentially private.

This could be generalized to group privacy, as the group size could be thought of as the hamming distance between and (where contains the group and doesn't). In this case is -differentially private.


Other notions of differential privacy[edit]

Since differential privacy is considered to be too strong for some applications, many weakened versions of privacy have been proposed. These include (ε, δ)-differential privacy,[21] randomised differential privacy,[22] and privacy under a metric.[23]


Adoption of differential privacy in real-world applications[edit]

Several uses of differential privacy in practice are known to date:

  • U.S. Census Bureau, for showing commuting patterns.[24]
  • Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings [25] (RAPPOR's open-source implementation).
  • Google, for sharing historical traffic statistics.[26]
  • On June 13, 2016 Apple announced its intention to use differential privacy in iOS 10 to improve its intelligent assistance and suggestions technology.[27]
  • Some initial research has been done into practical implementations of differential privacy in data mining models.[28]

Style Guide[edit]

"Differential Privacy" is capitalized when referring to:

  • The *invention* called Differential Privacy, invented by Dwork, McSherry, Nissim and Smith in 2006.

The phrase "differential privacy" is *not* capitalized when:

  • The word "differential" is being used to modify the word "privacy."
  • * The technology described in Microsoft Patent US 7698250B2, "Differential data privacy" 

See also[edit]

Further reading[edit]

  • Dinur, Irit and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems(PODS '03). ACM, New York, NY, USA, 202-210. DOI: 10.1145/773153.773173.
  • Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. in Halevi, S. & Rabin, T. (Eds.) Calibrating Noise to Sensitivity in Private Data Analysis Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings, Springer Berlin Heidelberg, 265-284, DOI: 10.1007/11681878_14.
  • Dwork, Cynthia. 2006. Differential Privacy, 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Springer Verlag, 4052, 1-12, ISBN 3-540-35907-9.
  • Dwork, Cynthia and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9, Nos. 3–4. 211–407, DOI: 10.1561/0400000042.
  • Machanavajjhala, Ashwin, Daniel Kifer, John M. Abowd , Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory Meets Practice on the Map, International Conference on Data Engineering (ICDE) 2008: 277-286, doi:10.1109/ICDE.2008.4497436
  • Dwork, Cynthia and Moni Naor. 2010. On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy, Journal of Privacy and Confidentiality: Vol. 2: Iss. 1, Article 8. Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.
  • Kifer, Daniel and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD '11). ACM, New York, NY, USA, 193-204. DOI:10.1145/1989323.1989345.
  • Erlingsson, Úlfar, Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 1054-1067. DOI:10.1145/2660267.2660348.
  • Abowd, John M. and Ian M. Schmutte. 2017 . Revisiting the economics of privacy: Population statistics and confidentiality protection as public goods. Labor Dynamics Institute, Cornell University, Labor Dynamics Institute, Cornell University, at https://digitalcommons.ilr.cornell.edu/ldi/37/

Abowd, John M. and Ian M. Schmutte. Forthcoming. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. American Economic Review, arXiv:1808.06303

  • Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin 2017. Collecting Telemetry Data Privately, NIPS 2017.

References[edit]

  1. ^ US 7698250, Cynthia Dwork & Frank McSherry, "Differential data privacy", published 2010-04-13, assigned to Microsoft Corp (original) and Microsoft Technology Licensing LLC (current) 
  2. ^ Dwork, Cynthia (2006). "Differential Privacy": 1–12.
  3. ^ Tore Dalenius (1977). "Towards a methodology for statistical disclosure control". Statistik Tidskrift. 15.
  4. ^ Dorothy E. Denning; Peter J. Denning; Mayer D. Schwartz (March 1978). "The Tracker: A Threat to Statistical Database Security" (PDF). 4 (1): 76–96.
  5. ^ Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202–210. DOI:10.1145/773153.773173
  6. ^ Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third conference on Theory of Cryptography (TCC'06), Shai Halevi and Tal Rabin (Eds.). Springer-Verlag, Berlin, Heidelberg, 265–284. DOI:10.1007/11681878_14
  7. ^ Hilton, Michael. "Differential Privacy: A Historical Survey" (PDF).
  8. ^ Dwork, Cynthia (2008-04-25). "Differential Privacy: A Survey of Results". In Agrawal, Manindra; Du, Dingzhu; Duan, Zhenhua; Li, Angsheng. Theory and Applications of Models of Computation. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–19. doi:10.1007/978-3-540-79228-4_1. ISBN 9783540792277.
  9. ^ a b The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth. Foundations and Trends in Theoretical Computer Science. Vol. 9, no. 3–4, pp. 211‐407, Aug. 2014. DOI:10.1561/0400000042
  10. ^ a b Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI:10.1007/11787006_1
  11. ^ Calibrating Noise to Sensitivity in Private Data Analysis by Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith In Theory of Cryptography Conference (TCC), Springer, 2006. DOI:10.1007/11681878_14
  12. ^ A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, pages 351–360. ACM New York, NY, USA, 2009.
  13. ^ H. Brenner and K. Nissim. Impossibility of Differentially Private Universally Optimal Mechanisms. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
  14. ^ R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong. Publishing set-valued data via differential privacy. The Proceedings of the VLDB Endowment (PVLDB), 4(11):1087–1098, August 2011. VLDB Endowment.
  15. ^ N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 493–501, San Diego, CA: ACM Press, August 2011.
  16. ^ a b c Privacy integrated queries: an extensible platform for privacy-preserving data analysis by Frank D. McSherry. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD), 2009. DOI:10.1145/1559845.1559850
  17. ^ F.McSherry and K.Talwar. Mechasim Design via Differential Privacy. Proceedings of the 48th Annual Symposium of Foundations of Computer Science, 2007.
  18. ^ Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, Benjamin Rubinstein. Robust and Private Bayesian Inference. Algorithmic Learning Theory 2014
  19. ^ Dwork, Cynthia. "A firm foundation for private data analysis." Communications of the ACM 54.1 (2011): 86–95, supra note 19, page 91.
  20. ^ Bambauer, Jane, Krishnamurty Muralidhar, and Rathindra Sarathy. "Fool's gold: an illustrated critique of differential privacy." Vand. J. Ent. & Tech. L. 16 (2013): 701.
  21. ^ Dwork, Cynthia, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. "Our data, ourselves: Privacy via distributed noise generation." In Advances in Cryptology-EUROCRYPT 2006, pp. 486–503. Springer Berlin Heidelberg, 2006.
  22. ^ Hall, Rob, Alessandro Rinaldo, and Larry Wasserman. "Random differential privacy." arXiv preprint arXiv:1112.2680 (2011).
  23. ^ Chatzikokolakis, Konstantinos, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. "Broadening the scope of Differential Privacy using metrics." In Privacy Enhancing Technologies, pp. 82–102. Springer Berlin Heidelberg, 2013.
  24. ^ Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. "Privacy: Theory meets Practice on the Map". In Proceedings of the 24th International Conference on Data Engineering, (ICDE) 2008.
  25. ^ Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova. "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response". In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS), 2014. arXiv:1407.6981
  26. ^ Tackling Urban Mobility with Technology by Andrew Eland. Google Policy Europe Blog, Nov 18, 2015.
  27. ^ "Apple - Press Info - Apple Previews iOS 10, the Biggest iOS Release Ever". Apple. Retrieved 16 June 2016.
  28. ^ Fletcher, Sam; Islam, Md Zahidul (July 2017). "Differentially private random decision forests using smooth sensitivity". Expert Systems with Applications. 78: 16–31. arXiv:1606.03572. doi:10.1016/j.eswa.2017.01.034.

External links[edit]