Earth mover's distance: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
m revert - better encapsulated by the already-included Category:Statistical distance
Preimage (talk | contribs)
Non-Wasserstein cleanup - simplified intro - reverted 02:08, 14 July 2021 - original definition uses partial matching - partial functions not relevant - alternative is from Pele and Werman (e.g. it's an option in FastEMD)
Line 1: Line 1:
{{short description|Distance between probability distributions}}
{{short description|Distance between probability distributions}}
In computer science, the '''earth mover's distance''' ('''EMD''') is a [[divergence (statistics)|measure of distance]] between two [[frequency (statistics)|frequency distributions]], [[density (disambiguation)|densities]], or [[measure (mathematics)|measure]]s over a region ''D''.
In statistics, the '''earth mover's distance''' ('''EMD''') is a [[metric (mathematics)|measure of the distance]] between two [[probability distribution]]s over a region&nbsp;''D''. In mathematics, this is known as the [[Wasserstein metric]]. Informally, if the distributions are interpreted as two different ways of piling up a certain amount of [[Soil|earth]] (dirt) over the region ''D'', the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be the amount of dirt moved times the [[distance]] by which it is moved.<ref>[http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm Formal definition]</ref>
For [[probability distribution]]s and normalized [[histogram]]s, it reduces to the [[Wasserstein metric]].<ref>{{cite journal | author = Elizaveta Levina|author1-link= Elizaveta Levina |author2=Peter Bickel | title = The EarthMover's Distance is the Mallows Distance: Some Insights from Statistics | journal = Proceedings of ICCV 2001 | location = Vancouver, Canada | pages = 251–256 | year = 2001}}</ref><ref>{{cite journal | doi = 10.1214/aoms/1177692631 | author = C. L. Mallows | title = A note on asymptotic joint normality | journal = Annals of Mathematical Statistics | volume = 43 | issue = 2 | pages = 508–515 | year = 1972| doi-access = free }}</ref>

Informally, if the distributions are interpreted as two different ways of piling up [[Soil|earth]] (dirt) over the region D, the EMD captures the minimum cost of building the smaller pile using dirt taken from the larger, where cost is defined as the amount of dirt moved multiplied by the distance by which it is moved.<ref>[http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm Formal definition]</ref>
The above definition is valid only if the two distributions have the same integral (informally, if the two piles have the same amount of dirt), as in normalized [[histogram]]s or [[probability density function]]s. In that case, the EMD is equivalent to the 1st [[Wasserstein metric|Mallows distance or 1st Wasserstein distance]] between the two distributions.<ref>{{cite journal | author = Elizaveta Levina|author1-link= Elizaveta Levina |author2=Peter Bickel | title = The EarthMover's Distance is the Mallows Distance: Some Insights from Statistics | journal = Proceedings of ICCV 2001 | location = Vancouver, Canada | pages = 251–256 | year = 2001}}</ref><ref>{{cite journal | doi = 10.1214/aoms/1177692631 | author = C. L. Mallows | title = A note on asymptotic joint normality | journal = Annals of Mathematical Statistics | volume = 43 | issue = 2 | pages = 508–515 | year = 1972| doi-access = free }}</ref>


==Theory==
==Theory==
Line 25: Line 25:


:<math>\sum_{i=1}^m\sum_{j=1}^n f_{i,j} = \sum_{i=1}^m w_{pi} = \sum_{j=1}^n w_{q j}</math>
:<math>\sum_{i=1}^m\sum_{j=1}^n f_{i,j} = \sum_{i=1}^m w_{pi} = \sum_{j=1}^n w_{q j}</math>

:<math>\sum_{i=1}^m\sum_{j=1}^n f_{i,j} = \min \left\{ \ \sum_{i=1}^m w_{pi}, \quad \sum_{j=1}^n w_{q j} \ \right\}</math>


The optimal flow <math display="inline">F </math> is found by solving this linear optimization problem. The earth mover's distance is defined as the work normalized by the total flow:
The optimal flow <math display="inline">F </math> is found by solving this linear optimization problem. The earth mover's distance is defined as the work normalized by the total flow:
Line 44: Line 46:
that is, to supremum over all [[Lipschitz continuity|1-Lipschitz continuous]] functions, i.e. <math>\| \nabla f(x)\| \leq 1 \quad \forall x</math>.
that is, to supremum over all [[Lipschitz continuity|1-Lipschitz continuous]] functions, i.e. <math>\| \nabla f(x)\| \leq 1 \quad \forall x</math>.


==Variants and extensions==
==Extensions==
Some applications may require the comparison of distributions with different total masses. One approach is to allow for a ''[[partial function|partial match]]'', where dirt from the most massive distribution is rearranged to make the least massive, and any leftover "dirt" is discarded at no cost. Under this approach, the EMD is no longer a true distance between distributions.


Some applications may require the comparison of distributions with different total masses. The approach used in the original definition is to allow for ''partial matching'', where dirt from the most massive distribution is rearranged to make the least massive, and any leftover "dirt" is discarded at no cost. The original EMD is hence not a [[metric (mathematics)|true distance]] between distributions, as it does not satisfy the triangle inequality.
Another approach is to allow for mass to be created or destroyed, on a global and/or local level, as an alternative to transportation, but with a cost penalty. In that case one must specify a real parameter σ, the ratio between the cost of creating or destroying one unit of "dirt", and the cost of transporting it by a unit distance. This is equivalent to minimizing the sum of the earth moving cost plus σ times the ''L''<sup>1</sup> distance between the rearranged pile and the second distribution.


An alternative approach is to allow for mass to be created or destroyed, on a global and/or local level, as an alternative to transportation, but with a cost penalty. In that case one must specify a real parameter <math>\alpha</math>, the ratio between the cost of creating or destroying one unit of "dirt", and the cost of transporting it by a unit distance. This is equivalent to minimizing the sum of the earth moving cost plus <math>\alpha</math> times the ''L''<sup>1</sup> distance between the rearranged pile and the second distribution. The resulting measure <math>\widehat{EMD}_\alpha</math> is a true distance function.<ref name="PeleWerman2008">{{cite book | title = Lecture Notes in Computer Science | last1 = Pele | first1 = Ofir | last2 = Werman | first2 = Michael | chapter = A Linear Time Histogram Metric for Improved SIFT Matching | date = 2008 | pages = 495–508 | publisher = Springer Berlin Heidelberg | issn = 0302-9743 | eissn = 1611-3349 | doi = 10.1007/978-3-540-88690-7_37 | url = }}</ref>
Notationally, if <math>\pi:P\to Q</math> is a [[partial function]] which is a [[bijection]] on subsets <math>P'\subset P</math> and <math>Q'\subset Q</math>, then one is interested in the distance function
:<math>|P-Q|_\pi = |P\setminus P'| + |Q\setminus Q'|</math>
where <math>P\setminus P'</math> denotes [[set minus]]. Here, <math>P'</math> would be the portion of the earth that was moved; thus <math>P\setminus P'</math> would be the portion not moved, and <math>|P\setminus P'|</math> the size of the pile not moved. By symmetry, one contemplates <math>Q'\subset Q</math> as the pile at the destination that 'got there' from ''P'', as compared to the total ''Q'' that we ''want to have there''. Formally, this distance indicates how much an [[injective]] correspondence differs from an [[isomorphism]].


The EMD can be extended naturally to the case where more than two distributions are compared. In this case, the "distance" between the many distributions is defined as the optimal value of a linear program. This generalized EMD may be computed exactly using a greedy algorithm, and the resulting functional has been shown to be [[Minkowski addition|Minkowski additive]] and convex monotone.<ref>{{cite journal | journal = Discrete Applied Mathematics | title = Properties of the d-dimensional earth mover's problem | last1 = Kline | first1 = Jeffery | volume = 265 | year = 2019 | pages = 128–141 | doi = 10.1016/j.dam.2019.02.042}}</ref>
The EMD can be extended naturally to the case where more than two distributions are compared. In this case, the "distance" between the many distributions is defined as the optimal value of a linear program. This generalized EMD may be computed exactly using a greedy algorithm, and the resulting functional has been shown to be [[Minkowski addition|Minkowski additive]] and convex monotone.<ref>{{cite journal | journal = Discrete Applied Mathematics | title = Properties of the d-dimensional earth mover's problem | last1 = Kline | first1 = Jeffery | volume = 265 | year = 2019 | pages = 128–141 | doi = 10.1016/j.dam.2019.02.042}}</ref>

Revision as of 11:29, 3 April 2023

In computer science, the earth mover's distance (EMD) is a measure of distance between two frequency distributions, densities, or measures over a region D. For probability distributions and normalized histograms, it reduces to the Wasserstein metric.[1][2] Informally, if the distributions are interpreted as two different ways of piling up earth (dirt) over the region D, the EMD captures the minimum cost of building the smaller pile using dirt taken from the larger, where cost is defined as the amount of dirt moved multiplied by the distance by which it is moved.[3]

Theory

Assume that we have a set of points in (dimension ). Instead of assigning one distribution to the set of points, we can cluster them and represent the point set in terms of the clusters. Thus, each cluster is a single point in and the weight of the cluster is decided by the fraction of the distribution present in that cluster. This representation of a distribution by a set of clusters is called the signature. Two signatures can have different sizes, for example, a bimodal distribution has shorter signature (2 clusters) than complex ones. One cluster representation (mean or mode in ) can be thought of as a single feature in a signature. The distance between each of the features is called as ground distance.

The Earth Mover's Distance can be formulated and solved as a transportation problem. Suppose that several suppliers, each with a given amount of goods, are required to supply several consumers, each with a given limited capacity. For each supplier-consumer pair, the cost of transporting a single unit of goods is given. The transportation problem is then to find a least-expensive flow of goods from the suppliers to the consumers that satisfies the consumers' demand. Similarly, here the problem is transforming one signature() to another() with minimum work done.

Assume that signature has clusters with , where is the cluster representative and is the weight of the cluster. Similarly another signature has clusters. Let be the ground distance between clusters and .

We want to find a flow , with the flow between and , that minimizes the overall cost.

subject to the constraints:

The optimal flow is found by solving this linear optimization problem. The earth mover's distance is defined as the work normalized by the total flow:

Equivalent form

Equivalently, we can write:

where is the set of all joint distributions whose marginals are and .

Kantorovich-Rubinstein duality

By Kantorovich-Rubinstein duality, the following is an equivalent expression:

that is, to supremum over all 1-Lipschitz continuous functions, i.e. .

Variants and extensions

Some applications may require the comparison of distributions with different total masses. The approach used in the original definition is to allow for partial matching, where dirt from the most massive distribution is rearranged to make the least massive, and any leftover "dirt" is discarded at no cost. The original EMD is hence not a true distance between distributions, as it does not satisfy the triangle inequality.

An alternative approach is to allow for mass to be created or destroyed, on a global and/or local level, as an alternative to transportation, but with a cost penalty. In that case one must specify a real parameter , the ratio between the cost of creating or destroying one unit of "dirt", and the cost of transporting it by a unit distance. This is equivalent to minimizing the sum of the earth moving cost plus times the L1 distance between the rearranged pile and the second distribution. The resulting measure is a true distance function.[4]

The EMD can be extended naturally to the case where more than two distributions are compared. In this case, the "distance" between the many distributions is defined as the optimal value of a linear program. This generalized EMD may be computed exactly using a greedy algorithm, and the resulting functional has been shown to be Minkowski additive and convex monotone.[5]

Computing the EMD

The EMD can be computed by solving an instance of transportation problem, using any algorithm for minimum-cost flow problem, e.g. the network simplex algorithm.

The Hungarian algorithm can be used to get the solution if the domain D is the set {0, 1}. If the domain is integral, it can be translated for the same algorithm by representing integral bins as multiple binary bins.

As a special case, if D is a one-dimensional array of "bins" of size n, the EMD can be efficiently computed by scanning the array and keeping track of how much dirt needs to be transported between consecutive bins. Here the bins are zero-indexed:

EMD-based similarity analysis

EMD-based similarity analysis (EMDSA) is an important and effective tool in many multimedia information retrieval[6] and pattern recognition[7] applications. However, the computational cost of EMD is super-cubic to the number of the "bins" given an arbitrary "D". Efficient and scalable EMD computation techniques for large scale data have been investigated using MapReduce,[8][9] as well as bulk synchronous parallel and resilient distributed dataset.[10]

Applications

An early application of the EMD in computer science was to compare two grayscale images that may differ due to dithering, blurring, or local deformations.[11] In this case, the region is the image's domain, and the total amount of light (or ink) is the "dirt" to be rearranged.

The EMD is widely used in content-based image retrieval to compute distances between the color histograms of two digital images.[citation needed] In this case, the region is the RGB color cube, and each image pixel is a parcel of "dirt". The same technique can be used for any other quantitative pixel attribute, such as luminance, gradient, apparent motion in a video frame, etc..

More generally, the EMD is used in pattern recognition to compare generic summaries or surrogates of data records called signatures.[12] A typical signature consists of list of pairs ((x1,m1), ... (xn,mn)), where each xi is a certain "feature" (e.g., color in an image, letter in a text, etc.), and mi is "mass" (how many times that feature occurs in the record). Alternatively, xi may be the centroid of a data cluster, and mi the number of entities in that cluster. To compare two such signatures with the EMD, one must define a distance between features, which is interpreted as the cost of turning a unit mass of one feature into a unit mass of the other. The EMD between two signatures is then the minimum cost of turning one of them into the other.

EMD analysis has been used for quantitating multivariate changes in biomarkers measured by flow cytometry, with potential applications to other technologies that report distributions of measurements.[13]

History

The concept was first introduced by Gaspard Monge in 1781,[14] in the context of transportation theory. The use of the EMD as a distance measure for monochromatic images was described in 1989 by S. Peleg, M. Werman and H. Rom.[11] The name "earth movers' distance" was proposed by J. Stolfi in 1994,[15] and was used in print in 1998 by Y. Rubner, C. Tomasi and L. G. Guibas.[16]

References

  1. ^ Elizaveta Levina; Peter Bickel (2001). "The EarthMover's Distance is the Mallows Distance: Some Insights from Statistics". Proceedings of ICCV 2001. Vancouver, Canada: 251–256.
  2. ^ C. L. Mallows (1972). "A note on asymptotic joint normality". Annals of Mathematical Statistics. 43 (2): 508–515. doi:10.1214/aoms/1177692631.
  3. ^ Formal definition
  4. ^ Pele, Ofir; Werman, Michael (2008). "A Linear Time Histogram Metric for Improved SIFT Matching". Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 495–508. doi:10.1007/978-3-540-88690-7_37. eISSN 1611-3349. ISSN 0302-9743.
  5. ^ Kline, Jeffery (2019). "Properties of the d-dimensional earth mover's problem". Discrete Applied Mathematics. 265: 128–141. doi:10.1016/j.dam.2019.02.042.
  6. ^ Mark A. Ruzon; Carlo Tomasi (2001). "Edge, Junction, and Corner Detection Using Color Distributions". IEEE Transactions on Pattern Analysis and Machine Intelligence.
  7. ^ Kristen Grauman; Trevor Darrel (2004). "Fast contour matching using approximate earth mover's distance". Proceedings of CVPR 2004.
  8. ^ Jin Huang; Rui Zhang; Rajkumar Buyya; Jian Chen (2014). "MELODY-Join: Efficient Earth Mover's Distance Similarity Joins Using MapReduce". Proceedings of IEEE International Conference on Data Engineering.
  9. ^ Jia Xu; Bin Lei; Yu Gu; Winslett, M.; Ge Yu; Zhenjie Zhang (2015). "Efficient Similarity Join Based on Earth Mover's Distance Using MapReduce". IEEE Transactions on Knowledge and Data Engineering.
  10. ^ Jin Huang; Rui Zhang; Rajkumar Buyya; Jian Chen, M.; Yongwei Wu (2015). "Heads-Join: Efficient Earth Mover's Distance Join on Hadoop". IEEE Transactions on Parallel and Distributed Systems.
  11. ^ a b S. Peleg; M. Werman; H. Rom (1989). "A unified approach to the change of resolution: Space and gray-level". IEEE Transactions on Pattern Analysis and Machine Intelligence. 11 (7): 739–742. doi:10.1109/34.192468.
  12. ^ Rubner, Y.; Tomasi, C.; Guibas, L.J. "A metric for distributions with applications to image databases". Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271). Narosa Publishing House. doi:10.1109/iccv.1998.710701.
  13. ^ Orlova, DY; Zimmerman, N; Meehan, C; Meehan, S; Waters, J; Ghosn, EEB (23 March 2016). "Earth Mover's Distance (EMD): A True Metric for Comparing Biomarker Expression Levels in Cell Populations". PLOS One. 11 (3): e0151859. doi:10.1371/journal.pone.0151859. PMC 4805242. PMID 27008164.
  14. ^ "Mémoire sur la théorie des déblais et des remblais". Histoire de l'Académie Royale des Science, Année 1781, avec les Mémoires de Mathématique et de Physique. 1781.
  15. ^ J. Stolfi, personal communication to L. J. Guibas, 1994, as cited by Rubner, Yossi; Tomasi, Carlo; Guibas, Leonidas J. (2000). "The earth mover's distance as a metric for image retrieval" (PDF). International Journal of Computer Vision. 40 (2): 99–121. doi:10.1023/A:1026543900054. S2CID 14106275.
  16. ^ Yossi Rubner; Carlo Tomasi; Leonidas J. Guibas (1998). "A Metric for Distributions with Applications to Image Databases". Proceedings ICCV 1998: 59–66. doi:10.1109/ICCV.1998.710701. ISBN 81-7319-221-9. S2CID 18648233.

External links