The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms. This is part of a group of validity indices including the Davies–Bouldin index, in that it is an internal evaluation scheme, where the result is based on the clustered data itself. As do all other such indices, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering. One of the drawbacks of using this, is the computational cost as the number of clusters and dimensionality of the data increase.
There are many ways to define the size or diameter of a cluster. It could be the distance between the farthest two points inside a cluster, it could be the mean of all the pairwise distances between data points inside the cluster, or it could as well be the distance of each data point from the cluster centroid. Each of these formulations are mathematically shown below:
Let Ci be a cluster of vectors. Let x and y be any two n dimensional feature vectors assigned to the same cluster Ci.
- , which calculates the maximum distance.
- , which calculates the mean distance between all pairs.
- , calculates distance of all the points from the mean.
This can also be said about the intercluster distance, where similar formulations can be made, using either the closest two data points, one in each cluster, or the farthest two, or the distance between the centroids and so on. The definition of the index includes any such formulation, and the family of indices so formed are called Dunn-like Indices. Let
- be this intercluster distance metric, between clusters Ci and Cj.
With the above notation, if there are m clusters, then the Dunn Index for the set is defined as:
Being defined in this way, the DI depends on m, the number of clusters in the set. If the number of clusters is not known apriori, the m for which the DI is the highest can be chosen as the number of clusters. There is also some flexibility when it comes to the definition of d(x,y) where any of the well known metrics can be used, like Manhattan distance or Euclidean distance based on the geometry of the clustering problem. This formulation has a peculiar problem, in that if one of the clusters is badly behaved, where the others are tightly packed, since the denominator contains a 'max' term instead of an average term, the Dunn Index for that set of clusters will be uncharacteristically low. This is thus some sort of a worst case indicator, and has to be used keeping that in mind. There are ready implementation of Dunn index in some vector based programming languages like MATLAB, R (programming language) and Apache Mahout.
Notes and references
- Dunn, J. C. (1973). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". Journal of Cybernetics 3 (3): 32–57. doi:10.1080/01969727308546046.
- "MATLAB implementation of the Dunn Index". Retrieved 5 December 2011.
- Lukasz, Nieweglowski. "Package ‘clv’" (PDF). R project. CRAN. Retrieved 2 April 2013.
- "Apache Mahout". Apache Software Foundation. Retrieved 9 May 2013.