# Sørensen–Dice coefficient

The Sørensen–Dice index, also known by other names (see Name, below), is a statistic used for comparing the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen[1] and Lee Raymond Dice,[2] who published in 1948 and 1945 respectively.

## Name

The index is known by several other names, most commonly the Sørensen index or Dice's coefficient. Both names are also seen with "similarity coefficient", "index", and other such variations. Common alternate spellings of Sørensen are Sorenson, Soerenson and Sörenson, and all three can also be seen with the –sen ending.

## Formula

Sørensen's original formula was intended to be applied to presence/absence data, and is

$QS = \frac{2C}{A + B} = \frac{2 |A \cap B|}{|A| + |B|}$

where, in the first form, A and B are the numbers of species in samples A and B, respectively, and C is the number of species shared by the two samples; in the second form, A and B denote the samples themselves as sets of species. QS is the quotient of similarity and ranges from 0 to 1. This expression is easily extended to abundance instead of presence/absence of species. This quantitative version of the Sørensen index is also known as the Czekanowski index. The Sørensen index is identical to Dice's coefficient,[3] which always lies in the range [0, 1]. The Sørensen index used as a distance measure, 1 − QS, is identical to Hellinger distance and Bray–Curtis dissimilarity[4] when applied to quantitative data.
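
As a minimal sketch of the abundance extension, the function below uses one common formulation, in which the shared count C is replaced by the sum of the smaller abundance of each species, and A and B by the total abundances of the two samples. The text above does not spell out this formula, so it is an assumption here, as are the function name `quantitative_sorensen` and the convention of returning 1.0 for two empty samples.

```python
def quantitative_sorensen(a, b):
    """Quantitative Sørensen similarity of two abundance vectors.

    a, b: non-negative abundances per species, listed in the same species order.
    Assumed formulation: 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).
    """
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    shared = sum(min(p, q) for p, q in zip(a, b))
    total = sum(a) + sum(b)
    if total == 0:
        # Assumption: two empty samples are treated as identical.
        return 1.0
    return 2 * shared / total


# With 0/1 entries this reduces to the presence/absence formula 2C / (A + B).
print(quantitative_sorensen([1, 1, 0], [1, 0, 1]))   # 2*1 / (2 + 2) = 0.5
print(quantitative_sorensen([6, 7, 4], [10, 0, 6]))  # 2*10 / (17 + 16)
```

Subtracting the result from 1 gives the corresponding dissimilarity under this formulation.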

The index can also be viewed as a similarity measure over sets:

$s = \frac{2 | X \cap Y |}{| X | + | Y |}$

The set form of the index is similar to the Jaccard index $J = \frac{| X \cap Y |}{| X \cup Y |}$, to which it is related by $s = \frac{2J}{1 + J}$, but it has some different properties.
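
The set form of the coefficient can be sketched in a few lines of Python. The function name `dice_coefficient` and the choice to return 1.0 for two empty sets (where the formula is 0/0) are assumptions made for illustration, not part of the definition.

```python
def dice_coefficient(x, y):
    """Sørensen–Dice similarity of two collections, treated as sets."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / total


# Two shared elements out of three in each set: 2*2 / (3 + 3)
print(dice_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```

Identical non-empty sets give 1.0 and disjoint sets give 0.0.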

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

$d = 1 - \frac{2 | X \cap Y |}{| X | + | Y |}$

is not a proper distance metric, as it does not satisfy the triangle inequality. The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, while the distance between the third and each of the others is one-third, so that 1 > 1/3 + 1/3.
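
The counterexample can be checked directly. This sketch uses exact fractions to avoid rounding; the helper name `dice_distance` is hypothetical, and it assumes at least one of the two sets is non-empty.

```python
from fractions import Fraction


def dice_distance(x, y):
    """1 minus the Sørensen–Dice coefficient, as an exact fraction.

    Assumes x and y are not both empty (the denominator would be zero).
    """
    x, y = set(x), set(y)
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}
direct = dice_distance(a, b)                          # distance {a} -> {b}
detour = dice_distance(a, ab) + dice_distance(ab, b)  # via {a, b}
violates_triangle = direct > detour

print(direct, detour, violates_triangle)  # 1 2/3 True
```

The direct distance exceeds the sum of the two legs through {a,b}, which a metric would not allow.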

Similarly to Jaccard, the set operations can be expressed in terms of vector operations over binary vectors A and B:

$s_v = \frac{2 | A \cdot B |}{| A |^2 + | B |^2}$

which gives the same result as the set formula over binary vectors and also extends to a more general similarity measure over arbitrary vectors.
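
A minimal sketch of the vector form, in plain Python with no external libraries. The function name `dice_vector` and the handling of two all-zero vectors are assumptions for illustration.

```python
def dice_vector(a, b):
    """Vector form: 2 * |a . b| / (|a|^2 + |b|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    norm_sq = sum(p * p for p in a) + sum(q * q for q in b)
    if norm_sq == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return 2 * abs(dot) / norm_sq


# Binary vectors encoding the sets {0, 1, 3} and {1, 2, 3} by position.
u = [1, 1, 0, 1]
v = [0, 1, 1, 1]
print(dice_vector(u, v))  # same value as the set formula: 2*2 / (3 + 3)
```

For binary vectors the dot product counts shared elements and each squared norm counts the elements of one set, which is why the two forms agree.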

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities.[5]

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:[6]

$s = \frac{2 n_t}{n_x + n_y}$

where $n_t$ is the number of character bigrams found in both strings, $n_x$ is the number of bigrams in string x and $n_y$ is the number of bigrams in string y. For example, to calculate the similarity between:

night
nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}
{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate $s = \frac{2 \cdot 1}{4 + 4} = 0.25$.
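
The worked example can be reproduced with a short sketch. It follows the text in using the *set* of bigrams in each string, so a bigram that occurs more than once is counted once; the function names and the handling of strings shorter than two characters are assumptions for illustration.

```python
def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_string(x, y):
    """Sørensen–Dice similarity of two strings over their bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: with no bigrams at all, fall back to exact equality.
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(dice_string("night", "nacht"))  # 0.25
```

A multiset of bigrams could be used instead if repeated bigrams should count separately; that variant would differ from the set-based calculation shown here for strings with repeated pairs.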

## Applications

The Sørensen–Dice coefficient is mainly useful for ecological community data (e.g. Looman & Campbell, 1960[7]). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two fuzzy sets[8]). As compared to Euclidean distance, Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers.[9]

## References

