# Biclustering

Biclustering, co-clustering, or two-mode clustering[1] is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Mirkin,[2] although the technique was originally introduced much earlier[2] (i.e., by J.A. Hartigan[3]).

Given a set of $m$ rows in $n$ columns (i.e., an $m \times n$ matrix), the biclustering algorithm generates biclusters - a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.

## Development

The opinion of biclustering was originally introduced by J.A.Hartigan in 1972. The author mentioned early algorithm of biclustering. [4]And the term of biclustering was first introduced by Mirkin later. This algorithm was not generalized until 2000 when Y.Cheng and G.M.Church proposed the biclustering algorithm based on variance and apply it to gene-data of biology.[5]Till today, their paper is still the most important literature in the gene expression biclustering field.

In 2001 and 2003, I.S.Dhillon put forward two algorithms to apply to biclustering of files and word. One of them was based on bipartite spectral graph partitioning.[6] The other one was based on information theorem. These two theories become the bases of file biclustering algorithm mentioned in recent years. In his paper, I.S.Dhillon assumed the loss of mutual information during biclustering was equal to the KL(Kullback-Leibler)-distance between P and Q. P means the distribution of files and feature words before biclustering. Q means that distribution after biclustering. KL-distance is for measuring the difference between two random distributions. KL=0 when the two distributions are the same and KL increases as the difference increases.[7]Thus I.S.Dhillon set the aim of algorithm to find the minimum KL-distance between P and Q. Considered that KL-distance can only be used in special matrix. In 2004, A.Banerjee used weightedBregman distance instead of KL-distance to design a biclustering algorithm which was suitable for all kinds of matrix. [8]

To cluster more than two types of objects, in 2005, R.Bekkerman expand one pair of mutual information in I.S.Dhillon’s theorem into multiple pairs of mutual information. He designed more algorithms by making weighted summation of pairs of mutual information. There are also some other methods of biclustering such as these who are based on matrix decomposition.

## Complexity

The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However most interesting variants of this problem are NP-complete. NP-complete have two conditions. In the simple case that there is only element a_(i,j) either 0 or 1 in the binary matrix A, a bicluster is equal to a biclique in the corresponding bipartite graph. The maximum size bluster is equivalent to maximum edge biclique in bipartite graph. In the complex case, the element in matrix A is used to compute the quality of a given bicluster and solve the more restricted version of the problem.[9] It requires either large computational effort or the use of lossy heuristics to short-circuit the calculation.[10]

## Type of Bicluster

Different biclustering algorithms have different definitions of bicluster.[10]

They are:

1. Bicluster with constant values (a),
2. Bicluster with constant values on rows (b) or columns (c),
3. Bicluster with coherent values (d, e).

1.Bicluster with constant values

When a biclustering algorithm tries to find a constant bicluster, the normal way for it is to reorder the rows and columns of the matrix so it can group together similar rows/columns and find biclusters with similar values. This method is OK when the data is tidy. But as the data can be noisy most of the times, so it can’t satisfy us. More sophisticated methods should be used. A perfect constant bicluster is a matrix(I,J) where all values a(i,j) are equal to μ. In real data, a(i,j) can be seen as n(i,j) +μ where n(i,j) is the noise. According to Hartigan’s algorithm, by spliting the original data matrix into a set of biclusters. Variance is used to compute constant biclusters. So a perfect bicluster is a matrix with variance zero. Also, in order to prevent the partitioning of the data matrix into biclusters with only one row and one column. Hartigan assumes that there are K biclusters within the data matrix. When the data matrix is partitioned into K biclusters, the algorithm ends.

2.Biclusters with constant values on rows or columns

This kind of biclusters can’t be evaluated just by variance of its values. To finish the identification, the columns and the rows should be normalized at first. There are other algorithms, without normalization step, can find biclusters have rows and columns with different approaches.

3.Biclusters with coherent values

For biclusters with coherent values on rows and columns, an overall improvement over the algorithms for biclusters with constant values on rows or on columns should be considered. That means a sophisticated algorithm is needed. This algorithm may contain analysis of variance between groups, using co-variance between both rows and columns.In Cheng and Churchs’ theorem, a bicluster is defined as a subset of rows and columns with almost the same score.the similarity score is used to measure the coherence of rows and columns.

 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
 1 4 5 0 1.5 4 7 8 3 4.5 3 6 7 2 3.5 5 8 9 4 5.5 2 5 6 1 2.5
 1 0.5 2 0.2 0.8 2 1 4 0.4 1.6 3 1.5 6 0.6 2.4 4 2 8 0.8 3.2 5 2.5 10 1 4

The relationship between these cluster models and other types of clustering such as correlation clustering is discussed in.[11]

## Algorithms

There are many biclustering algorithms developed for bioinformatics, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrixes), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis),[12] Robust Biclustering Algorithm (RoBA), Crossing Minimization,[13] cMonkey,[14] PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) BIMAX, ISA, SAMBA and FABIA (Factor Analysis for Bicluster Acquisition).[15] Biclustering algorithms have also been proposed and used in other application fields under the names coclustering, bidimensional clustering, and subspace clustering.[10]

Given the known importance of discovering local patterns in time-series data, recent proposals have addressed the biclustering problem in the specific case of time series gene expression data. In this case, the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exhaustive enumeration algorithms such as CCC-Biclustering [16] and e-CCC-Biclustering.[17] The approximate patterns in CCC-Biclustering algorithms allow a given number of errors, per gene, relatively to an expression profile respresenting the expression pattern in the bicluster. The e-CCC-Biclustering algorithm uses approximate expressions to find and report all maximal CCC-Biclusters by a discretized matrix A and efficient string processing techniques.

These algorithms ﬁnd and report all maximal biclusters with coherent and contiguous columns with perfect/approximate expression patterns, in time linear/polynomial which is obtained by manipulating a discretized version of original expression matrix in the size of the time series gene expression matrix using eﬃcient string processing techniques based on suffix trees. These algorithms are also applied to solve problems and sketch the analysis of computational complexity.

Some recent algorithms have attempted to include additional support for biclustering rectangular matrices in the form of other datatypes, including cMonkey.

There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervised classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or super-majority voting amongst them deciding the best result. Another way is to analyse the quality of shifting and scaling patterns in biclusters.[18] Biclustering has been used in the domain of text mining (or classification) where it is popularly known as co-clustering .[19] Text corpora are represented in a vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements Dij denote occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words(columns).

Test clustering can solve the high-dimensional sparse problem, which means clustering text and words at the same time. When clustering text, we need to think about not only the words information, but also the information of words clusters that was composed by words. Then according to similarity of feature words in the text, will eventually cluster the feature words. This is called co-clustering. There are two advantages of co-clustering: one is clustering the test based on words clusters can extremely decrease the dimension of clustering, it can also appropriate to measure the distance between the tests. Second is mining more useful information and can get the corresponding information in test clusters and words clusters. This corresponding information can be used to describe the type of texts and words, at the same time, the result of words clustering can be also used to text mining and information retrival.

Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the regenerated matrices from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of groups d1 and d2.

More recently (Bisson and Hussain)[19] have proposed a new approach of using the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also the documents in which "similar" words occur. The idea here is that two documents about the same topic do not necessarily use the same set of words to describe it but a subset of the words and other similar words that are characteristic of that topic. This approach of taking higher-order similarities takes the latent semantic structure of the whole corpus into consideration with the result of generating a better clustering of the documents and words.

In text databases, for a document collection defined by a document by term D matrix (of size m by n, m: number of documents, n: number of terms) the cover-coefficient based clustering methodology[20] yields the same number of clusters both for documents and terms (words) using a double-stage probability experiment. According to the cover coefficient concept number of clusters can also be roughly estimated by the following formula $(m \times n) / t$ where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.

In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well understood model selection techniques like variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each bicluster to separate spurious biclusters from true biclusters.

## References

1. ^ Van Mechelen I, Bock HH, De Boeck P (2004). "Two-mode clustering methods:a structured overview". Statistical Methods in Medical Research 13 (5): 363–94. doi:10.1191/0962280204sm373ra. PMID 15516031.
2. ^ a b Mirkin, Boris (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers. ISBN 0-7923-4159-7.
3. ^ Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association (American Statistical Association) 67 (337): 123–9. doi:10.2307/2284710. JSTOR 2284710.
4. ^ Hartigan J A. Direct clustering of a data matrix[J. Journal of the american statistical association, 1972, 67(337): 123-129.]
5. ^ https://www.cs.princeton.edu/courses/archive/fall03/cs597F/Articles/biclustering_of_expression_data.pdf Cheng Y, Church G M. Biclustering of expression data[C]//Ismb. 2000, 8: 93-103.
6. ^ Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning[C//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 269-274.]
7. ^ Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98.]
8. ^ Banerjee A, Dhillon I, Ghosh J, et al. A generalized maximum entropy approach to bregman co-clustering and matrix approximation[C//Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004: 509-514.]
9. ^ Peeters R. The maximum edge biclique problem is NP-complete[J. Discrete Applied Mathematics, 2003, 131(3): 651-654.]
10. ^ a b c Madeira SC, Oliveira AL (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE Transactions on Computational Biology and Bioinformatics 1 (1): 24–45. doi:10.1109/TCBB.2004.2. PMID 17048406.
11. ^ Kriegel, H.-P.; Kröger, P.; Zimek, A. (March 2009). "Clustering High Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering". ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1): 1–58. doi:10.1145/1497577.1497578.
12. ^ Tanay A, Sharan R, Kupiec M and Shamir R (2004). "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data". Proc Natl Acad Sci USA 101 (9): 2981–2986. doi:10.1073/pnas.0308661100. PMC 365731. PMID 14973197.
13. ^ Abdullah, Ahsan; Hussain, Amir (2006). "A new biclustering technique based on crossing minimization". Neurocomputing, vol. 69 issue 16-18 69 (16–18): 1882–1896. doi:10.1016/j.neucom.2006.02.018.
14. ^ Reiss DJ, Baliga NS, Bonneau R (2006). "Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks". BMC Bioinformatics 2: 280–302. doi:10.1186/1471-2105-7-280. PMC 1502140. PMID 16749936.
15. ^ Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HWH, Shkedy Z, Clevert DA (2010). "FABIA: factor analysis for bicluster acquisition". Bioinformatics 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC 2881408. PMID 20418340.
16. ^ Madeira SC, Teixeira MC, Sá-Correia I, Oliveira AL (2010). "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm". IEEE Transactions on Computational Biology and Bioinformatics 1 (7): 153–165. doi:10.1109/TCBB.2008.34.
17. ^ Madeira SC, Oliveira AL (2009). "A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series". Algorithms for Molecular Biology 4 (8).
18. ^ Aguilar-Ruiz JS (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics 21 (10): 3840–3845. doi:10.1093/bioinformatics/bti641. PMID 16144809.
19. ^ a b Bisson G. and Hussain F. (2008). "Chi-Sim: A new similarity measure for the co-clustering task". ICMLA: 211–217. doi:10.1109/ICMLA.2008.103.
20. ^ Can, F., Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover coefficient based clustering methodology for text databases". ACM Transactions on Database Systems 15 (4): 483–517. doi:10.1145/99935.99938.