Correlation clustering

Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.[1]

Description of the problem

Main article: Cluster analysis

In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects. For example, given a weighted graph ${\displaystyle G=(V,E)}$ where the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (sum of positive edge weights within a cluster plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (absolute value of the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters). Unlike other clustering algorithms this does not require choosing the number of clusters ${\displaystyle k}$ in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.

It may not be possible to find a perfect clustering, where all similar items are in a cluster while all dissimilar ones are in different clusters. If the graph indeed admits a perfect clustering, then simply deleting all the negative edges and finding the connected components in the remaining graph will return the required clusters.

But, in general a graph may not have a perfect clustering. For example, given nodes a,b,c such that a,b and a,c are similar while b,c are dissimilar, a perfect clustering is not possible. In such cases, the task is to find a clustering that maximizes the number of agreements (number of + edges inside clusters minus the number of - edges between clusters) or minimizes the number of disagreements (the number of - edges inside clusters minus the number of + edges between clusters). This problem of maximizing the agreements is NP-complete (multiway cut problem reduces to maximizing weighted agreements and the problem of partitioning into triangles[2] can be reduced to the unweighted version).

Algorithms

Bansal et al.[3] discuss the NP-completeness proof and also present both a constant factor approximation algorithm and polynomial-time approximation scheme to find the clusters in this setting. Ailon et al.[4] propose a randomized 3-approximation algorithm for the same problem.

`CC-Pivot(G=(V,E+,E−))`

``````   Pick random pivot i ∈ V
Set ${\displaystyle C=\{i\}}$, V'=Ø
For all j ∈ V, j ≠ i;
If (i,j) ∈ E+ then
Add j to C
Else (If (i,j) ∈ E−)
Add j to V'
Let G' be the subgraph induced by V'
Return clustering C,CC-Pivot(G')
```
```

The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program.[5]

Karpinski and Schudy[6] proved existence of a polynomial time approximation scheme (PTAS) for that problem on complete graphs and fixed number of clusters.

Optimal number of clusters

In 2011, it was shown by Bagon and Galun[7] that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the underlying number of clusters. This analysis suggests the functional assumes a uniform prior over all possible partitions regardless of their number of clusters. Thus, a non-uniform prior over the number of clusters emerges.

Several discrete optimization algorithms are proposed in this work that scales gracefully with the number of elements (experiments show results with more than 100,000 variables). The work of Bagon and Galun also evaluated the effectiveness of the recovery of the underlying number of clusters in several applications.

Correlation clustering (data mining)

Correlation clustering also relates to a different task, where correlations among attributes of feature vectors in a high-dimensional space are assumed to exist guiding the clustering process. These correlations may be different in different clusters, thus a global decorrelation cannot reduce this to traditional (uncorrelated) clustering.

Correlations among subsets of attributes result in different spatial shapes of clusters. Hence, the similarity between cluster objects is defined by taking into account the local correlation patterns. With this notion, the term has been introduced in [8] simultaneously with the notion discussed above. Different methods for correlation clustering of this type are discussed in [9] and the relationship to different types of clustering is discussed in.[10] See also Clustering high-dimensional data.

Correlation clustering (according to this definition) can be shown to be closely related to biclustering. As in biclustering, the goal is to identify groups of objects that share a correlation in some of their attributes; where the correlation is usually typical for the individual clusters.

References

1. ^ Becker, Hila, "A Survey of Correlation Clustering", 5 May 2005
2. ^ Garey, M. and Johnson, D (W.H. Freeman and Company). (2000). Computers and Intractability: A Guide to the Theory of NP-Completeness.
3. ^ Bansal, N.; Blum, A.; Chawla, S. (2004). "Correlation Clustering". Machine Learning 56: 89. doi:10.1023/B:MACH.0000033116.57574.95.
4. ^ Ailon, N.; Charikar, M.; Newman, A. (2005). "Aggregating inconsistent information". Proceedings of the thirty-seventh annual ACM symposium on Theory of computing - STOC '05. p. 684. doi:10.1145/1060590.1060692. ISBN 1581139608.
5. ^ Chawla, Shuchi; Makarychev, Konstantin; Schramm, Tselil; Yaroslavtsev, Grigory. "Near Optimal LP Rounding Algorithm for CorrelationClustering on Complete and Complete k-partite Graphs". Proceedings of the 46th Annual ACM on Symposium on Theory of Computing.
6. ^ Karpinski, M.; Schudy, W. (2009). "Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems". Proceedings of the 41st annual ACM symposium on Symposium on theory of computing - STOC '09. p. 313. doi:10.1145/1536414.1536458. ISBN 9781605585062.
7. ^ Bagon, S.; Galun, M. (2011) "Large Scale Correlation Clustering Optimization" arXiv:1112.2903v1
8. ^ Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. p. 455. doi:10.1145/1007568.1007620. ISBN 1581138598.
9. ^ Zimek, A. (2008). "Correlation Clustering".
10. ^ Kriegel, H. P.; Kröger, P.; Zimek, A. (2009). "Clustering high-dimensional data". ACM Transactions on Knowledge Discovery from Data 3: 1. doi:10.1145/1497577.1497578.