Biclustering

From Wikipedia, the free encyclopedia
Revision as of 11:58, 12 May 2009

Biclustering, co-clustering, or two-mode clustering[1] is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Mirkin[2] and more recently used by Cheng and Church[3] in gene expression analysis, although the technique itself was originally introduced much earlier[2] (i.e., by J. A. Hartigan[4]).

Given a set of m rows in n columns (i.e., an m×n matrix), a biclustering algorithm generates biclusters - subsets of rows which exhibit similar behavior across a subset of columns, or vice versa.

Complexity

The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However, most interesting variants of this problem are NP-complete, requiring either large computational effort or the use of lossy heuristics to short-circuit the calculation.

Types of biclusters

Different biclustering algorithms use different definitions of a bicluster. The main types are:

  1. Bicluster with constant values (a),
  2. Bicluster with constant values on rows or columns (b, c),
  3. Bicluster with coherent values (d, e).

[Figure: Bicluster.JPG - examples of bicluster types (a)-(e)]
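These definitions can be stated operationally: a constant bicluster has zero range, a constant-row bicluster has zero range within each row, and a coherent (additive) bicluster becomes constant once row and column means are removed. A minimal NumPy sketch, with illustrative function names and tolerance:

```python
import numpy as np

def is_constant_bicluster(B, tol=1e-9):
    """Type (a): every entry equals the same value."""
    return np.ptp(B) <= tol

def is_constant_rows(B, tol=1e-9):
    """Type (b): each row is constant (type (c), columns, is analogous)."""
    return all(np.ptp(row) <= tol for row in B)

def is_coherent_additive(B, tol=1e-9):
    """Types (d, e): b_ij = mu + a_i + c_j, so subtracting row means
    and column means (adding back the grand mean) leaves zero residual."""
    residual = (B - B.mean(axis=1, keepdims=True)
                  - B.mean(axis=0, keepdims=True) + B.mean())
    return np.ptp(residual) <= tol

# An additive-coherent bicluster: row offsets plus column offsets.
B = np.add.outer([0.0, 2.0, 5.0], [1.0, 3.0, 4.0])
print(is_constant_bicluster(B), is_constant_rows(B), is_coherent_additive(B))
# -> False False True
```

The multiplicative variant of coherence (b_ij = mu * a_i * c_j) can be checked the same way after taking logarithms of a positive matrix.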

Algorithms

There are many biclustering algorithms developed for bioinformatics, including: Block clustering, CTWC, ITWC, δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs, Gibbs, SAMBA, Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey[5], PRMs, DCC and LEB (Localize and Extract Biclusters). Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bidimensional clustering, and subspace clustering[6].

Some recent algorithms have attempted to extend biclustering of rectangular matrices with support for other datatypes. One such algorithm, cMonkey, has recently been developed and applied to several systems-biology datasets.

There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic, and care must be taken to determine the degree to which results represent stable minima. Because this is an unsupervised classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or super-majority voting among them deciding the best result. Another way is to analyse the quality of shifting and scaling patterns in biclusters[7].

More recently, biclustering has been used in the domain of text mining (or classification), where it is popularly known as co-clustering[8]. Text corpora are represented in vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. The matrix element Dij denotes the number of occurrences of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words (columns).
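The document-word matrix D can be built directly from a corpus; a small sketch with an illustrative three-document corpus:

```python
import numpy as np

docs = ["gene expression clustering",
        "matrix clustering algorithm",
        "gene expression data"]

# Dictionary: all distinct words, in a fixed (sorted) column order.
vocab = sorted({w for d in docs for w in d.split()})

# D[i, j] = number of occurrences of word j in document i.
D = np.array([[d.split().count(w) for w in vocab] for d in docs])
print(vocab)
print(D)
```

A co-clustering of this D would group rows 0 and 2 (both about gene expression) together with the columns for "gene" and "expression", forming one block.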

Several approaches have been proposed, based on the information content of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the matrices regenerated from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of both groups d1 and d2.
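The cut count defined above, viewing documents and words as the two sides of a bipartite graph, can be computed directly from the occurrence matrix; a minimal sketch with a toy matrix:

```python
import numpy as np

# Toy document-word occurrence matrix: rows = documents, cols = words.
D = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

def cut(D, d1, d2):
    """Number of words occurring in documents of both groups d1 and d2.

    Each such shared word is an edge crossing the partition in the
    bipartite document-word graph.
    """
    in_d1 = D[list(d1)].any(axis=0)   # words appearing in group d1
    in_d2 = D[list(d2)].any(axis=0)   # words appearing in group d2
    return int(np.count_nonzero(in_d1 & in_d2))

print(cut(D, {0, 1}, {2, 3}))  # only word 2 spans both groups -> 1
```

A graph-based co-clustering would prefer the partition {0, 1} | {2, 3} over, say, {0, 2} | {1, 3}, precisely because it crosses fewer words.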

More recently, Hussain and Bisson have proposed a new approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, which inherently take into account the documents in which the words occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also on the documents in which "similar" words occur. The idea is that two documents about the same topic do not necessarily use the same set of words to describe it, but rather a subset of those words together with other similar words characteristic of that topic. Taking higher-order similarities into account captures the latent semantic structure of the whole corpus, with the result of a better clustering of the documents and words.
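The cross-similarity idea can be illustrated by iterating two coupled similarity matrices: document similarities computed through word similarities, and vice versa. This is a simplified sketch of that principle only, with an assumed max-normalization; the published χ-Sim algorithm uses its own normalization scheme:

```python
import numpy as np

def cross_similarity_sketch(D, n_iter=4):
    """Illustrative cross-similarity iteration (simplified).

    SR: document-document similarity, SC: word-word similarity.
    Each update makes two documents similar when they contain
    similar (not necessarily identical) words, and conversely,
    capturing higher-order co-occurrences.
    """
    def normalize(S):
        m = np.abs(S).max()
        return S / m if m else S

    n_docs, n_words = D.shape
    SR, SC = np.eye(n_docs), np.eye(n_words)
    for _ in range(n_iter):
        SR = normalize(D @ SC @ D.T)
        SC = normalize(D.T @ SR @ D)
    return SR, SC

# Docs 0 and 1 share no word directly, but each shares a word with doc 2.
D = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0]])
SR, SC = cross_similarity_sketch(D)
print(SR[0, 1] > 0)  # similarity emerges via higher-order co-occurrence
```

With plain dot-product similarity (one iteration, SC fixed to the identity) documents 0 and 1 would score zero; the iteration propagates similarity through document 2's vocabulary, which is exactly the latent-structure effect described above.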

See also

References

  1. ^ Van Mechelen I, Bock HH, De Boeck P (2004). "Two-mode clustering methods: a structured overview". Statistical Methods in Medical Research. 13 (5): 363–94. doi:10.1191/0962280204sm373ra. PMID 15516031.
  2. ^ a b Mirkin, Boris (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers. ISBN 0792341597.
  3. ^ Cheng Y, Church GM (2000). "Biclustering of expression data". Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology: 93–103.
  4. ^ Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association. 67 (337): 123–9. doi:10.2307/2284710.
  5. ^ Reiss DJ, Baliga NS, Bonneau R (2006). "Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks". BMC Bioinformatics. 7: 280. doi:10.1186/1471-2105-7-280.
  6. ^ Madeira SC, Oliveira AL (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE Transactions on Computational Biology and Bioinformatics. 1 (1): 24–45. doi:10.1109/TCBB.2004.2.
  7. ^ Aguilar-Ruiz JS (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics. 21 (10): 3840–3845. doi:10.1093/bioinformatics/bti641. PMID 16144809.
  8. ^ Hussain F, Bisson G (2008). "Chi-Sim: A new similarity measure for the co-clustering task". Proceedings of ICMLA 2008: 211–217. doi:10.1109/ICMLA.2008.103.

Others

  • Tanay A, Sharan R, Shamir R (2004). "Biclustering Algorithms: A Survey". In Handbook of Computational Molecular Biology (Srinivas Aluru, ed.). Chapman.
  • Kluger Y, Basri R, Chang JT, Gerstein MB (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research. 13: 703–716.
  • Abdullah A, Hussain A (2006). "A new biclustering technique based on crossing minimization". Neurocomputing. 69: 1882–1896. http://linkinghub.elsevier.com/retrieve/pii/S0925231206001615