Mahalanobis distance

The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance (residual) from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936.[1] The Mahalanobis distance is used to identify and gauge the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant; in other words, it can be viewed as a multivariate effect size.

Definition

The Mahalanobis distance of a multivariate vector x = (x_1, x_2, x_3, \dots, x_N)^T from a group of values with mean \mu = (\mu_1, \mu_2, \mu_3, \dots, \mu_N)^T and covariance matrix S is defined as:

    D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}.   [2]

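For illustration, the following Python sketch evaluates this formula directly with NumPy and cross-checks the result against scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix as its third argument; the sample data and the test vector are invented for the example.

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    # Invented sample data: rows are observations, columns are variables.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.4, 0.0],
                                                 [0.0, 1.0, 0.3],
                                                 [0.0, 0.0, 1.0]])

    mu = data.mean(axis=0)            # group mean
    S = np.cov(data, rowvar=False)    # covariance matrix S
    S_inv = np.linalg.inv(S)

    x = np.array([1.0, -0.5, 2.0])    # test vector

    # Direct evaluation of D_M(x) = sqrt((x - mu)^T S^{-1} (x - mu))
    diff = x - mu
    D = np.sqrt(diff @ S_inv @ diff)

    # Same value via SciPy, which expects the inverse covariance matrix.
    assert np.isclose(D, mahalanobis(x, mu, S_inv))
    print(D)
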
Mahalanobis distance (or "generalized squared interpoint distance" for its squared value[3]) can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with the covariance matrix S:

    d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}.

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called a normalized Euclidean distance:

    d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2 / s_i^2},

where s_i is the standard deviation of the x_i and y_i over the sample set.

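A short sketch of these two special cases (the vectors and the per-coordinate standard deviations are invented for the example):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=4)
    y = rng.normal(size=4)
    s = np.array([0.5, 1.0, 2.0, 4.0])   # per-coordinate standard deviations

    # Diagonal covariance: Mahalanobis distance equals the normalized
    # (standardized) Euclidean distance sqrt(sum((x_i - y_i)^2 / s_i^2)).
    S_diag_inv = np.diag(1.0 / s**2)
    d_mahal = np.sqrt((x - y) @ S_diag_inv @ (x - y))
    d_norm_euclid = np.sqrt(np.sum(((x - y) / s) ** 2))
    assert np.isclose(d_mahal, d_norm_euclid)

    # Identity covariance: Mahalanobis distance reduces to the Euclidean distance.
    d_identity = np.sqrt((x - y) @ np.eye(4) @ (x - y))
    assert np.isclose(d_identity, np.linalg.norm(x - y))
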
Intuitive explanation

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the average or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be (x - \mu)/\sigma, that is, the distance from the center of mass divided by the estimated standard deviation. By plugging this into the normal distribution we can derive the probability of the test point belonging to the set.

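In one dimension, for instance, this amounts to computing a z-score and reading a tail probability off the normal distribution. The sketch below does exactly that; the numbers are illustrative and it assumes the set is roughly normally distributed.

    from scipy.stats import norm

    mu, sigma = 10.0, 2.0     # estimated center and spread of the set (invented)
    x = 13.0                  # test point

    d = abs(x - mu) / sigma   # normalized distance (x - mu) / sigma
    # Two-sided tail probability of landing at least this far from the center,
    # assuming the set is (approximately) normally distributed.
    p = 2 * norm.sf(d)
    print(d, p)
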
The drawback of the above approach was that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

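One way to make the "width of the ellipsoid in the direction of the test point" idea concrete: factor the covariance matrix as S = L L^T (Cholesky); the Mahalanobis distance is then the ordinary Euclidean distance after the data have been whitened by L^{-1}. A small sketch with invented data:

    import numpy as np

    rng = np.random.default_rng(2)
    samples = rng.multivariate_normal([0.0, 0.0],
                                      [[4.0, 1.5],
                                       [1.5, 1.0]], size=500)
    mu = samples.mean(axis=0)
    S = np.cov(samples, rowvar=False)

    x = np.array([3.0, 1.0])           # test point

    # Whiten: map the ellipsoidal cloud to a spherical one using S = L L^T.
    L = np.linalg.cholesky(S)
    z = np.linalg.solve(L, x - mu)     # z = L^{-1} (x - mu)

    D_whitened = np.linalg.norm(z)     # plain Euclidean distance after whitening
    D_direct = np.sqrt((x - mu) @ np.linalg.inv(S) @ (x - mu))
    assert np.isclose(D_whitened, D_direct)
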
Discussion

In general, given a normal (Gaussian) random variable X with variance S = 1 and mean \mu = 0, any other normal random variable R (with mean \mu_1 and variance S_1) can be defined in terms of X by the equation R = \mu_1 + \sqrt{S_1} X. Conversely, to recover a normalized random variable from any normal random variable, one can solve for X = (R - \mu_1)/\sqrt{S_1}. If we square both sides and take the square root, we get an equation for a metric that looks a lot like the Mahalanobis distance:

    D = \sqrt{X^2} = \sqrt{(R - \mu_1)^2 / S_1} = \sqrt{(R - \mu_1) S_1^{-1} (R - \mu_1)}.

The resulting magnitude is always non-negative and varies with the distance of the data from the mean, attributes that are convenient when trying to define a model for the data.

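This one-dimensional reduction can be checked directly; the numbers below are arbitrary.

    import numpy as np

    mu1, S1 = 5.0, 9.0     # mean and variance of R (arbitrary numbers)
    R = 11.0

    # Univariate form: D = |R - mu1| / sqrt(S1)
    D_scalar = abs(R - mu1) / np.sqrt(S1)

    # General form with a 1x1 "covariance matrix" S = [[S1]]
    D_general = np.sqrt((R - mu1) * (1.0 / S1) * (R - mu1))

    assert np.isclose(D_scalar, D_general)   # both equal 2.0 here
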
Relationship to leverage

Mahalanobis distance is closely related to the leverage statistic, h, but has a different scale:[4]

Squared Mahalanobis distance = (N − 1)(h − 1/N).

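This relationship can be checked numerically: with a design matrix that includes an intercept column, the diagonal of the hat matrix gives the leverages h, and (N − 1)(h − 1/N) reproduces the squared Mahalanobis distances computed from the sample mean and covariance. A sketch with synthetic data:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 50
    X = rng.normal(size=(N, 2))                      # predictor data (invented)

    # Squared Mahalanobis distance of each row from the sample mean.
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # sample covariance (ddof = 1)
    diff = X - mu
    D2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

    # Leverage: diagonal of the hat matrix for the design matrix with intercept.
    A = np.column_stack([np.ones(N), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T
    h = np.diag(H)

    assert np.allclose(D2, (N - 1) * (h - 1.0 / N))
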
Applications

Mahalanobis's discovery was prompted by the problem of identifying the similarities of skulls based on measurements in 1927.[5]

Mahalanobis distance is widely used in cluster analysis and classification techniques. It is closely related to Hotelling's T-square distribution, used for multivariate statistical testing, and to Fisher's Linear Discriminant Analysis, which is used for supervised classification.[6]

In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.

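A minimal sketch of such a classifier (class labels, data, and the per-class covariance estimates are all synthetic; in practice each class's mean and covariance would be estimated from its labelled training samples):

    import numpy as np

    def fit_class_stats(samples):
        """Estimate the mean and inverse covariance matrix of one class."""
        mu = samples.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(samples, rowvar=False))
        return mu, S_inv

    def classify(x, class_stats):
        """Assign x to the class with the smallest Mahalanobis distance."""
        def dist(stats):
            mu, S_inv = stats
            d = x - mu
            return np.sqrt(d @ S_inv @ d)
        return min(class_stats, key=lambda label: dist(class_stats[label]))

    rng = np.random.default_rng(4)
    class_stats = {
        "A": fit_class_stats(rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], 300)),
        "B": fit_class_stats(rng.multivariate_normal([3, 3], [[1.0, -0.3], [-0.3, 2.0]], 300)),
    }
    print(classify(np.array([2.5, 2.0]), class_stats))   # expected to print "B"
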
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers. Regression techniques can be used to determine if a specific case within a sample population is an outlier via the combination of two or more variable scores. A point can be a multivariate outlier even if it is not a univariate outlier on any variable (consider a probability density similar to a hollow cube in three dimensions, for example).

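As a sketch of multivariate outlier detection, one common recipe (it assumes the data are roughly multivariate normal, so that the squared Mahalanobis distance is approximately chi-squared with p degrees of freedom) flags points whose squared distance exceeds a high chi-squared quantile; the cutoff and data below are purely illustrative.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=1000)
    X[0] = [4.0, -4.0, 4.0]                  # plant an obvious multivariate outlier

    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    D2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared distances

    cutoff = chi2.ppf(0.999, df=X.shape[1])  # 99.9% quantile, 3 degrees of freedom
    outliers = np.flatnonzero(D2 > cutoff)
    print(outliers)                          # should include index 0
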
Mahalanobis distance has also been widely used in biology, for example to predict protein structural class,[7] membrane protein type,[8] and protein subcellular localization,[9] as well as many other attributes of proteins through their pseudo amino acid composition[10] (Chou's PseAAC),[11] based on Chou's invariance theorem, as done in several papers.[12][13]

References

  1. Mahalanobis, Prasanta Chandra (1936). "On the generalised distance in statistics" (PDF). Proceedings of the National Institute of Sciences of India 2 (1): 49–55. Retrieved 2012-05-03.
  2. De Maesschalck, Roy; Jouan-Rimbaud, Delphine; Massart, Désiré L. (2000). "The Mahalanobis distance". Chemometrics and Intelligent Laboratory Systems 50: 1–18.
  3. Gnanadesikan, Ramanathan; Kettenring, John R. (1972). "Robust estimates, residuals, and outlier detection with multiresponse data". Biometrics 28: 81–124.
  4. Schinka, John A.; Velicer, Wayne F.; Weiner, Irving B. (2003). Handbook of Psychology: Research Methods in Psychology. John Wiley and Sons.
  5. Mahalanobis, Prasanta Chandra (1927). "Analysis of race mixture in Bengal". Journal and Proceedings of the Asiatic Society of Bengal 23: 301–333.
  6. McLachlan, Geoffrey J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience. p. 12. ISBN 0-471-69115-1.
  7. Chou, Kuo-Chen (1995). "A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space". Proteins 21 (4): 319–44. doi:10.1002/prot.340210406. PMID 7567954.
  8. Chou, Kuo-Chen; Elrod, David W. (1999). "Prediction of membrane protein types and subcellular locations". Proteins: Structure, Function, and Genetics 34: 137–153.
  9. Chou, Kuo-Chen; Elrod, David W. (1999). "Protein subcellular location prediction". Protein Engineering 12: 107–118.
  10. Chou, Kuo-Chen (2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174.
  11. Lin, Sheng-Xiang; Lapointe, Jacques (2013). "Theoretical and experimental biology in one — A symposium in honour of Professor Kuo-Chen Chou's 50th anniversary and Professor Richard Giegé's 40th anniversary of their scientific careers". JBiSE 6: 435–442. doi:10.4236/jbise.2013.64054.
  12. Pan, Y. X.; Zhang, Z. Z.; Guo, Z. M.; Feng, G. Y.; Huang, Z. D.; He, L. (2003). "Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach". J. Protein Chem. 22 (4): 395–402. doi:10.1023/A:1025350409648. PMID 13678304.
  13. Zhou, Guo-Ping; Doctor, Kutbuddin (2003). "Subcellular location prediction of apoptosis proteins". Proteins 50 (1): 44–8. doi:10.1002/prot.10251. PMID 12471598.