Consensus clustering

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters.[1] Often similarity is assessed according to a distance measure. Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.

Consensus clustering has emerged as an important elaboration of the classical clustering problem. Consensus clustering, also called aggregation of clustering (or partitions), refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings.[2][3] Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete.[4] Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.[3]

Issues with existing clustering techniques[edit]

  • Current clustering techniques do not address all the requirements adequately.
  • Dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
  • Effectiveness of the method depends on the definition of "distance" (for distance based clustering)
  • If an obvious distance measure doesn’t exist we must "define" it, which is not always easy, especially in multidimensional spaces.
  • The result of the clustering algorithm (that in many cases can be arbitrary itself) can be interpreted in different ways.

Justification for using consensus clustering[edit]

There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive. Iterative descent clustering methods, such as the SOM and K-means clustering circumvent some of the shortcomings of Hierarchical clustering by providing for univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, they lack the intuitive and visual appeal of Hierarchical clustering dendrograms, and the number of clusters must be chosen a priori.

Over-interpretation potential of consensus clustering[edit]

PAC measure (proportion of ambiguous clustering) explained. Optimal K is the K with lowest PAC value.

Consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution. It has been shown that consensus clustering is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.[5][6] If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. To reduce the false positive potential in clustering samples (observations), Şenbabaoğlu et al [5] recommends (1) doing a formal test of cluster strength using simulated unimodal data with the same feature-space correlation structure as in the empirical data, (2) not relying solely on the consensus matrix heatmap to declare the existence of clusters, or to estimate optimal K, (3) applying the proportion of ambiguous clustering (PAC) as a simple yet powerful method to infer optimal K.

Inferred optimal K values of different methods on simulated datasets with known number of clusters and known degree of separation between clusters. Consensus clustering followed by PAC outperforms other methods.

PAC: In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represent those with ambiguous assignments in different clustering runs. The "proportion of ambiguous clustering" (PAC) measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. We can therefore infer the optimal number of clusters by the K value having the lowest PAC.

In simulated datasets with known number of clusters, consensus clustering+PAC has been shown to perform better than several other commonly used methods such as consensus clustering+Δ(K), CLEST, GAP, and silhouette width.[5][6]

Related work[edit]

1. Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.

2. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm and did not penalize for merging dissimilar nodes.

3. Fred and Jain: They proposed to use a single linkage algorithm to combine multiple runs of the k-means algorithm.

4. Dana Cristofor and Dan Simovici: They observed the connection between clustering aggregation and clustering of categorical data. They proposed information theoretic distance measures, and they propose genetic algorithms for finding the best aggregation solution.

5. Topchy et al.: They defined clustering aggregation as a maximum likelihood estimation problem, and they proposed an EM algorithm for finding the consensus clustering.

6. Abu-Jamous et al.: They proposed their binarization of consensus partition matrices (Bi-CoPaM) method to enhance ensemble clustering in two major aspects. The first is to consider clustering the same set of objects by various clustering methods as well as by considering their features measured in multiple datasets; this seems perfectly relevant in the context of microarray gene expression clustering, which is the context they initially proposed the method in. The second aspect is the format of the final result; based on the consistency of inclusion of a data object in the same cluster by the multiple single clustering results, they allowed any single data object to have any of the three eventualities; to be exclusively assigned to one and only one cluster, to be unassigned from all clusters, or to be simultaneously assigned to multiple clusters at the same time. They made it possible to produce, in a perfectly tunable way, wide overlapping clusters, tight specific clusters, as well as complementary clusters. Therefore, they proposed their work as a new paradigm of clustering rather than merely a new ensemble clustering method.[3][7]

Hard ensemble clustering[edit]

This approach by Strehl and Ghosh introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. They discuss three approaches towards solving this problem to obtain high quality consensus functions. Their techniques have low computational costs and this makes it feasible to evaluate each of the techniques discussed below and arrive at the best solution by comparing the results against the objective function.

Efficient consensus functions[edit]

1. Cluster-based similarity partitioning algorithm (CSPA)

In CSPA the similarity between two data-points is defined to be directly proportional to number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data-points are the higher is the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in n. The following two methods are computationally less expensive:

2. Hyper-graph partitioning algorithm (HGPA)

The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges. They make use of hMETIS which is a hypergraph partitioning package system.

3. Meta-clustering algorithm (MCLA)

The meta-cLustering algorithm (MCLA) is based on clustering clusters. First, it tries to solve the cluster correspondence problem and then uses voting to place data-points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using METIS and Spectral clustering.

Soft clustering ensembles[edit]

Punera and Ghosh extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the Kullback–Leibler (KL) divergence, which calculates the “distance” between two probability distributions.

1. sCSPA

sCSPA extends CSPA by calculating a similarity matrix. Each object is visualized as a point in dimensional space, with each dimension corresponding to probability of its belonging to a cluster. This technique first transforms the objects into a label-space and then interprets the dot product between the vectors representing the objects as their similarity.

2. sMCLA

sMCLA extends MCLA by accepting soft clusterings as input. sMCLA’s working can be divided into the following steps:

  • Construct Soft Meta-Graph of Clusters
  • Group the Clusters into Meta-Clusters
  • Collapse Meta-Clusters using Weighting
  • Compete for Objects

3. sHBGF

HBGF represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to.[8] This approach can be trivially adapted to consider soft ensembles since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has n + t vertices, where t is the total number of underlying clusters.

Tunable-tightness partitions[edit]

In this different form of clustering, each data object is allowed to be exclusively assigned to one and only one cluster, to be unassigned from all clusters, or to be simultaneously assigned to multiple clusters, in a completely tunable way.[3] In some applications like gene clustering, this matches the biological reality that many of the genes considered for clustering in a particular gene discovery study might be irrelevant to the case of study in hand and should be ideally not assigned to any of the output clusters, moreover, any single gene can be participating in multiple processes and would be useful to be included in multiple clusters simultaneously. This has been proposed in the recent method of the binarization of consensus partition matrices (Bi-CoPaM)[3] and is being used currently in the field of bioinformatics.[7]

  1. ^ Gibbons, F. D.; Johnson, GF; Yalow, RS (30 September 2002). "Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation". Genome Research 12 (10): 1574–1581. doi:10.1101/gr.397002. 
  2. ^ VEGA-PONS, SANDRO; RUIZ-SHULCLOPER, JOSÉ (1 May 2011). "A SURVEY OF CLUSTERING ENSEMBLE ALGORITHMS". International Journal of Pattern Recognition and Artificial Intelligence 25 (03): 337–372. doi:10.1142/S0218001411008683. 
  3. ^ a b c d e Abu-Jamous, Basel; Fa, Rui; Roberts, David J.; Nandi, Asoke K.; Peddada, Shyamal D. (11 February 2013). "Paradigm of Tunable Clustering Using Binarization of Consensus Partition Matrices (Bi-CoPaM) for Gene Discovery". PLoS ONE 8 (2): e56432. doi:10.1371/journal.pone.0056432. PMC 3569426. PMID 23409186. 
  4. ^ Filkov, Vladimir (2003). "Integrating microarray data by consensus clustering". In Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence.: 418–426. doi:10.1109/TAI.2003.1250220. 
  5. ^ a b c Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (2014). "Critical limitations of consensus clustering in class discovery". Nature Scientific Reports. doi:10.1038/srep06207. 
  6. ^ a b Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (Feb 2014). "A reassessment of consensus clustering for class discovery". bioRxiv. doi:10.1101/002642. 
  7. ^ a b Abu-Jamous, B.; Fa, R.; Roberts, D. J.; Nandi, A. K. (24 January 2013). "Yeast gene CMR1/YDL156W is consistently co-expressed with genes participating in DNA-metabolic processes in a variety of stringent clustering experiments". Journal of The Royal Society Interface 10 (81): 20120990–20120990. doi:10.1098/rsif.2012.0990. 
  8. ^ Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and Carla E. Brodley,Proceedings of the twenty-first international conference on Machine learning

References[edit]

Further reading[edit]