# Differential privacy

In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records.

## Motivation

Consider a trusted party that holds a dataset of sensitive private information (for example, medical records, movie viewing, or email usage) that would like to provide global, statistical information about the data. Such a system is called a statistical database. However, providing aggregate statistical information about the data may reveal some information about the individuals. In fact, various ad-hoc approaches to anonymizing public records have failed when researchers managed to identify personal information by linking two or more separately innocuous databases. Differential privacy is a framework for formalizing privacy in statistical databases introduced in order to protect against these kinds of deanonymization techniques.

### Netflix Prize

In 2007, Netflix offered a \$1 million prize for a 10% improvement in its recommendation system. Netflix also released a training dataset for the competing developers to train their systems. While releasing this dataset, they provided a disclaimer: To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids [sic] have been replaced by randomly assigned ids [sic].

Netflix is not the only movie-rating portal on the web; there are many others, including IMDb. On IMDb individuals can register and rate movies and they have the option of not keeping their details anonymous. Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, linked the Netflix anonymized training database with the IMDb database (using the date of rating by a user) to partially de-anonymize the Netflix training database, compromising the identity of some users.[1]

### Massachusetts Group Insurance Commission (GIC) medical encounter database

Latanya Sweeney from Carnegie Mellon University linked the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records, and was able to identify the medical record of the governor of Massachusetts.[2]

De Montjoye et al. from MIT introduced the notion of unicity (meaning uniqueness) and showed that 4 spatio-temporal points, approximate places and times, are enough to uniquely identify 95% of 1.5 million people in a mobility database.[3] The study further shows that these constraints hold even when the resolution of the dataset is low, meaning that even coarse or blurred mobility datasets and metadata provide little anonymity.

## Formal definition and example application

Let ${\displaystyle \epsilon }$ be a positive real number and ${\displaystyle {\mathcal {A}}}$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let ${\displaystyle {\textrm {im}}{\mathcal {A}}}$ denote the image of ${\displaystyle {\mathcal {A}}}$. The algorithm ${\displaystyle {\mathcal {A}}}$ is ${\displaystyle \epsilon }$-differentially private if for all datasets ${\displaystyle D_{1}}$ and ${\displaystyle D_{2}}$ that differ on a single element (i.e., the data of one person), and all subsets ${\displaystyle S}$ of ${\displaystyle {\textrm {im}}{\mathcal {A}}}$,

${\displaystyle \Pr[{\mathcal {A}}(D_{1})\in S]\leq e^{\epsilon }\times \Pr[{\mathcal {A}}(D_{2})\in S],}$

where the probability is taken over the randomness used by the algorithm.[4]

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records ${\displaystyle D_{1}}$ where each record is a pair (Name, X), where ${\displaystyle X}$ is a Boolean denoting whether a person has diabetes or not. For example:

Name Has Diabetes (X)
Ross 1
Monica 1
Joey 0
Phoebe 0
Chandler 1

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query ${\displaystyle Q_{i}}$ that returns the partial sum of the first ${\displaystyle i}$ rows of column ${\displaystyle X}$ in the database. In order to find Chandler's diabetes status the adversary executes ${\displaystyle Q_{5}(D_{1})}$ and ${\displaystyle Q_{4}(D_{1})}$, then computes their difference. In this example, ${\displaystyle Q_{5}(D_{1})=3}$ and ${\displaystyle Q_{4}(D_{1})=2}$, so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

Continuing this example, if we construct ${\displaystyle D_{2}}$ by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish ${\displaystyle D_{2}}$ from ${\displaystyle D_{1}}$ by computing ${\displaystyle Q_{5}-Q_{4}}$ for each dataset. If the adversary were required to receive the values ${\displaystyle Q_{i}}$ via an ${\displaystyle \epsilon }$-differentially private algorithm, for a sufficiently small ${\displaystyle \epsilon }$, then he or she would be unable to distinguish between the two datasets.

### Sensitivity

Let ${\displaystyle d}$ be a positive integer, ${\displaystyle {\mathcal {D}}}$ be a collection of datasets, and ${\displaystyle f\colon {\mathcal {D}}\rightarrow \mathbb {R} ^{d}}$ be a function. The sensitivity [5] of a function, denoted ${\displaystyle \Delta f}$, is defined by

${\displaystyle \Delta f=\max \lVert f(D_{1})-f(D_{2})\rVert _{1},}$

where the maximum is over all pairs of datasets ${\displaystyle D_{1}}$ and ${\displaystyle D_{2}}$ in ${\displaystyle {\mathcal {D}}}$ differing in at most one element and ${\displaystyle \lVert \cdot \rVert _{1}}$ denotes the ${\displaystyle \ell _{1}}$ norm.

In the example of the medical database above, if we consider ${\displaystyle f}$ to be the function ${\displaystyle Q_{i}}$, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.

There are techniques (which are described below) using which we can create a differentially private algorithm for functions with low sensitivity.

### Trade-off between utility and privacy

There is a trade-off between the accuracy of the statistics estimated in a privacy-preserving manner, and the privacy parameter ε.[6][7][8][9]

### Other notions of differential privacy

Since differential privacy is considered to be too strong for some applications, many weakened versions of privacy have been proposed. These include (ε, δ)-differential privacy,[10] randomised differential privacy,[11] and privacy under a metric.[12]

## Differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[13] and posterior sampling[14] sample from a problem-dependent family of distributions instead.

### The Laplace mechanism

Many differentially private methods add controlled noise to functions with low sensitivity.[5] The Laplace mechanism adds Laplace noise (i.e. noise from the Laplace distribution, which can be expressed by probability density function ${\displaystyle {\text{noise}}(y)\propto \exp(-|y|/\lambda )\,\!}$, which has mean zero and standard deviation ${\displaystyle \lambda \,\!}$). Now in our case we define the output function of ${\displaystyle {\mathcal {A}}\,\!}$ as a real valued function (called as the transcript output by ${\displaystyle {\mathcal {A}}\,\!}$) as ${\displaystyle {\mathcal {T}}_{\mathcal {A}}(x)=f(x)+Y\,\!}$ where ${\displaystyle Y\sim {\text{Lap}}(\lambda )\,\!\,\!}$ and ${\displaystyle f\,\!}$ is the original real valued query/function we planned to execute on the database. Now clearly ${\displaystyle {\mathcal {T}}_{\mathcal {A}}(x)\,\!}$ can be considered to be a continuous random variable, where

${\displaystyle {\frac {\mathrm {pdf} ({\mathcal {T}}_{{\mathcal {A}},D_{1}}(x)=t)}{\mathrm {pdf} ({\mathcal {T}}_{{\mathcal {A}},D_{2}}(x)=t)}}={\frac {{\text{noise}}(t-f(D_{1}))}{{\text{noise}}(t-f(D_{2}))}}\,\!}$

which is at most ${\displaystyle e^{\frac {|f(D_{1})-f(D_{2})|}{\lambda }}\leq e^{\frac {\Delta (f)}{\lambda }}\,\!}$. We can consider ${\displaystyle {\frac {\Delta (f)}{\lambda }}\,\!}$ to be the privacy factor ${\displaystyle \epsilon \,\!}$. Thus ${\displaystyle {\mathcal {T}}\,\!}$ follows a differentially private mechanism (as can be seen from the definition above). If we try to use this concept in our diabetes example then it follows from the above derived fact that in order to have ${\displaystyle {\mathcal {A}}\,\!}$ as the ${\displaystyle \epsilon \,\!}$-differential private algorithm we need to have ${\displaystyle \lambda =1/\epsilon \,\!}$. Though we have used Laplacian noise here, other forms of noise, such as the Gaussian Noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.[2]

## Composability

### Sequential composition

If we query an ε-differential privacy mechanism ${\displaystyle t}$ times, and the randomization of the mechanism is independent for each query, then the result would be ${\displaystyle \epsilon t}$-differentially private. In the more general case, if there are ${\displaystyle n}$ independent mechanisms: ${\displaystyle {\mathcal {M}}_{1},\dots ,{\mathcal {M}}_{n}}$, whose privacy guarantees are ${\displaystyle \epsilon _{1},\dots ,\epsilon _{n}}$ differential privacy, respectively, then any function ${\displaystyle g}$ of them: ${\displaystyle g({\mathcal {M}}_{1},\dots ,{\mathcal {M}}_{n})}$ is ${\displaystyle (\sum \limits _{i=1}^{n}\epsilon _{i})}$-differentially private.[15]

### Parallel composition

Furthermore, if the previous mechanisms are computed on disjoint subsets of the private database then the function ${\displaystyle g}$ would be ${\displaystyle (\max _{i}\epsilon _{i})}$-differentially private instead.[15]

## Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted his information. However this is also extendable if we want to protect databases differing in ${\displaystyle c}$ rows, which amounts to adversary with arbitrary auxiliary information can know if ${\displaystyle c}$ particular participants submitted their information. This can be achieved because if ${\displaystyle c}$ items change, the probability dilation is bounded by ${\displaystyle \exp(\epsilon c)}$ instead of ${\displaystyle \exp(\epsilon )}$,[2] i.e., for D1 and D2 differing on ${\displaystyle c}$ items:

${\displaystyle \Pr[{\mathcal {A}}(D_{1})\in S]\leq \exp(\epsilon c)\times \Pr[{\mathcal {A}}(D_{2})\in S]\,\!}$

Thus setting ε instead to ${\displaystyle \epsilon /c}$ achieves the desired result (protection of ${\displaystyle c}$ items). In other words, instead of having each item ε-differentially private protected, now every group of ${\displaystyle c}$ items is ε-differentially private protected (and each item is ${\displaystyle (\epsilon /c)}$-differentially private protected).

### Stable transformations

A transformation ${\displaystyle T}$ is ${\displaystyle c}$-stable if the hamming distance between ${\displaystyle T(A)}$ and ${\displaystyle T(B)}$ is at most ${\displaystyle c}$-times the hamming distance between ${\displaystyle A}$ and ${\displaystyle B}$ for any two databases ${\displaystyle A,B}$. Theorem 2 in [15] asserts that if there is a mechanism ${\displaystyle M}$ that is ${\displaystyle \epsilon }$-differentially private, then the composite mechanism ${\displaystyle M\circ T}$ is ${\displaystyle (\epsilon \times c)}$-differentially private.

This could be generalized to group privacy, as the group size could be thought of as the hamming distance ${\displaystyle h}$ between ${\displaystyle A}$ and ${\displaystyle B}$ (where ${\displaystyle A}$ contains the group and ${\displaystyle B}$ doesn't). In this case ${\displaystyle M\circ T}$ is ${\displaystyle (\epsilon \times c\times h)}$-differentially private.

## Adoption of differential privacy in real-world applications

Several uses of differential privacy in practice are known to date:

## References

1. ^ Arvind Narayanan, Vitaly Shmatikov (2008). Robust De-anonymization of Large Sparse Datasets (PDF). IEEE Symposium on Security and Privacy. pp. 111–125.
2. ^ a b c Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1
3. ^ de Montjoye, Yves-Alexandre; César A. Hidalgo; Michel Verleysen; Vincent D. Blondel (March 25, 2013). "Unique in the Crowd: The privacy bounds of human mobility". Nature srep. doi:10.1038/srep01376. Retrieved 12 April 2013.
4. ^ The Algorithmic Foundations of Differential Privacy by Cynthia Dwork and Aaron Roth. Foundations and Trends in Theoretical Computer Science. Vol. 9, no. 3–4, pp. 211‐407, Aug. 2014. DOI=10.1561/0400000042
5. ^ a b Calibrating Noise to Sensitivity in Private Data Analysis by Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith In Theory of Cryptography Conference (TCC), Springer, 2006. DOI=10.1007/11681878_14
6. ^ A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, pages 351–360. ACM New York, NY, USA, 2009.
7. ^ H. Brenner and K. Nissim. Impossibility of Differentially Private Universally Optimal Mechanisms. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
8. ^ R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong. Publishing set-valued data via differential privacy. The Proceedings of the VLDB Endowment (PVLDB), 4(11):1087-1098, August 2011. VLDB Endowment.
9. ^ N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 493-501, San Diego, CA: ACM Press, August 2011.
10. ^ Dwork, Cynthia, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. "Our data, ourselves: Privacy via distributed noise generation." In Advances in Cryptology-EUROCRYPT 2006, pp. 486-503. Springer Berlin Heidelberg, 2006.
11. ^ Hall, Rob, Alessandro Rinaldo, and Larry Wasserman. "Random differential privacy." arXiv preprint arXiv:1112.2680 (2011).
12. ^ Chatzikokolakis, Konstantinos, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. "Broadening the scope of Differential Privacy using metrics." In Privacy Enhancing Technologies, pp. 82-102. Springer Berlin Heidelberg, 2013.
13. ^ F.McSherry and K.Talwar. Mechasim Design via Differential Privacy. Proceedings of the 48th Annual Symposium of Foundations of Computer Science, 2007.
14. ^ Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, Benjamin Rubinstein. Robust and Private Bayesian Inference. Algorithmic Learning Theory 2014
15. ^ a b c Privacy integrated queries: an extensible platform for privacy-preserving data analysis by Frank D. McSherry. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD), 2009. DOI=10.1145/1559845.1559850
16. ^ Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. "Privacy: Theory meets Practice on the Map". In Proceedings of the 24th International Conference on Data Engineering, (ICDE) 2008.
17. ^ Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova. "RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response". In Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS), 2014.
18. ^ Tackling Urban Mobility with Technology by Andrew Eland. Google Policy Europe Blog, Nov 18, 2015.
19. ^ "Apple - Press Info - Apple Previews iOS 10, the Biggest iOS Release Ever". Apple. Retrieved 16 June 2016.