Gene chip analysis
Microarray technology is a powerful tool for genomic analysis, giving a global view of the genome in a single experiment. Data analysis is a vital part of any microarray experiment. Each microarray study comprises multiple microarrays, each yielding tens of thousands of data points. As microarrays grow larger, the volume of data grows exponentially and the analysis becomes more challenging: in general, the greater the volume of data, the more opportunities there are for erroneous results. Handling such large volumes of data requires high-end computational infrastructure and programs that can handle multiple data formats. Programs for microarray data analysis are already available on various platforms. However, because of rapid development, the diversity of microarray technologies, and differing data formats, there is a continuing need for more comprehensive and complete microarray data analysis tools.
- 1 Data processing and quality control
- 2 Replicates
- 3 Normalization
- 4 Quality control
- 5 Filtering of flagged data
- 6 Filtering of noisy replicates
- 7 Filtering of non-significant genes
- 8 Statistical analysis
- 9 Clustering
- 10 Hierarchical clustering
- 11 K-means clustering
- 12 Gene ontology studies
- 13 Pathway analysis
- 14 References
Data processing and quality control
Proper data processing and quality control are critical to the validity and interpretability of gene chip analysis.
Data processing includes normalization, flagging of the data, averaging of the intensity ratios across replicates, and clustering of similarly expressed genes. Data must be normalized before further analysis, because normalization removes non-biological variation between the samples. After normalization, the intensity ratio is calculated for each gene in each replicate, and the level of gene expression is determined from that ratio. Quality control can then be performed.
Various statistical analyses are performed for quality control. Each replicate is also examined for experimental artifacts and bias by computing parameters related to intensity, background, flags, and spot details.
Replicates
Replicates are essential in microarray experiments. As with any other quantitative measurement, repeated experiments make it possible to conduct confidence analysis and to identify differentially expressed genes at a given level of confidence. More replicates provide more confidence in determining differentially expressed genes. In practice, three to five replicates are considered ideal.
Normalization
Normalization is required to standardize data and focus on biologically relevant changes. Many sources of systematic variation in microarray experiments affect the measured gene expression levels, including dye bias, heat and light sensitivity, efficiency of dye incorporation, differences in the hybridization conditions of the labeled cDNA, scanning conditions, and unequal quantities of starting RNA. Normalization is an important step in adjusting the data set for this technical variation; it is the only point at which 1- and 2-color data analyses differ. The choice of normalization method depends on the data. The basic idea behind all normalization methods is that the expected mean intensity ratio between the two channels should be one. If the observed mean intensity ratio deviates from one, the data are rescaled so that the final mean intensity ratio becomes one. With the mean intensity ratio adjusted to one, the distribution of gene expression is centered, so that genuine differences in expression can be identified.
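The centering idea can be illustrated with a minimal sketch. It assumes two-channel data and works on log2 ratios, so "a mean ratio of one" becomes "a mean log ratio of zero" (that is, a geometric mean ratio of one). The function name `normalize_log_ratios` and the sample intensities are hypothetical, and real pipelines often use intensity-dependent methods rather than a single global shift.

```python
import math

def normalize_log_ratios(red, green):
    """Global normalization sketch: center log2(red/green) ratios on zero."""
    # Log ratio per spot; assumes all intensities are strictly positive.
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    # The mean log ratio estimates the systematic offset between channels.
    shift = sum(log_ratios) / len(log_ratios)
    # Subtracting the offset makes the mean log ratio zero.
    return [lr - shift for lr in log_ratios]

# Hypothetical intensities for four spots (red and green channels).
red = [400.0, 800.0, 100.0, 1600.0]
green = [100.0, 400.0, 100.0, 400.0]
normalized = normalize_log_ratios(red, green)
```

After the shift, spots with positive values are relatively over-represented in the red channel and spots with negative values in the green channel.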
Quality control
Before the data are analyzed for biological variation, QC steps must be performed to determine whether the data are fit for statistical testing, because statistical tests are sensitive to the nature of the input data.
Filtering of flagged data
Filtering out spots with bad intensity values is an important part of quality control. For example, the scanner has a measurement limit below which intensity values cannot be trusted. Typically, the lowest intensity value of reliable data is 100–200 for Affymetrix data and 100–1000 for cDNA microarray data. These cut-offs are likely to change as scanners become more precise. Values below the cut-off are usually removed (filtered) from the data because they are likely to be artifacts.
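As a rough illustration, an intensity cut-off can be applied as a simple filter. The gene names, intensities, and the default cut-off of 200 are illustrative assumptions; the appropriate value depends on the platform and scanner.

```python
def filter_low_intensity(spots, cutoff=200.0):
    """Keep only spots whose intensity is at or above the cut-off."""
    return {gene: value for gene, value in spots.items() if value >= cutoff}

# Hypothetical raw intensities keyed by gene name.
spots = {"geneA": 1500.0, "geneB": 80.0, "geneC": 250.0, "geneD": 199.0}
reliable = filter_low_intensity(spots)
```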
Filtering of noisy replicates
Filtering of noisy replicates is a crucial part of quality control. Experimental replicates should have similar values. Replicates with noise should be eliminated before analysis; this can be done using the ANOVA statistical method.
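The text names ANOVA for this step. The sketch below deliberately swaps in a simpler stand-in, the coefficient of variation (standard deviation divided by mean) across replicates, only to show the shape of such a filter. The threshold of 0.3, the function name, and the data are assumptions, and the measure requires positive mean intensities.

```python
import statistics

def filter_noisy_genes(replicates, max_cv=0.3):
    """Keep genes whose replicate values agree within a CV threshold."""
    kept = {}
    for gene, values in replicates.items():
        mean = statistics.mean(values)
        # Sample standard deviation relative to the mean (assumes mean > 0).
        cv = statistics.stdev(values) / mean
        if cv <= max_cv:
            kept[gene] = values
    return kept

# Hypothetical intensities from three replicate arrays.
replicates = {
    "geneA": [1000.0, 1050.0, 980.0],  # replicates agree closely
    "geneB": [500.0, 1500.0, 100.0],   # replicates disagree strongly
}
consistent = filter_noisy_genes(replicates)
```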
Filtering of non-significant genes
Non-significant genes are filtered out so that the analysis can focus on selected genes. They are removed by specifying a required relative change in expression with respect to the normal control; the cut-off values for over-expressed and under-expressed genes are set to 2 and −2 respectively. After filtering, only a few genes are retained. The remaining genes are then subjected to statistical analysis.
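One way to read the cut-offs of 2 and −2 is as a signed fold change, where a two-fold decrease is reported as −2 rather than 0.5. The sketch below follows that reading, which is an assumption about the intended convention; the function names and expression values are hypothetical.

```python
def signed_fold_change(treated, control):
    """Return the ratio if up-regulated, or the negative inverse ratio if down-regulated."""
    ratio = treated / control
    return ratio if ratio >= 1.0 else -1.0 / ratio

def filter_by_fold_change(expression, threshold=2.0):
    """Keep genes whose signed fold change is at least +threshold or at most -threshold."""
    selected = {}
    for gene, (treated, control) in expression.items():
        fc = signed_fold_change(treated, control)
        if abs(fc) >= threshold:
            selected[gene] = fc
    return selected

# Hypothetical (treated, control) intensities per gene.
expression = {
    "geneA": (800.0, 200.0),  # four-fold up
    "geneB": (100.0, 400.0),  # four-fold down
    "geneC": (300.0, 250.0),  # nearly unchanged
}
significant = filter_by_fold_change(expression)
```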
Statistical analysis
Statistical analysis plays a vital role in identifying genes whose expression differs at statistically significant levels.
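The text does not name a specific test. As one common possibility, the sketch below computes Welch's t-statistic and its approximate degrees of freedom for a single gene measured in two conditions. The function name and the log-scale values are hypothetical, and no p-value or multiple-testing correction is computed here, although both would normally follow.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    # Assumes at least one group has non-zero variance.
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical log2 expression values for one gene, four replicates each.
treated = [8.1, 8.4, 7.9, 8.3]
control = [6.0, 6.2, 5.9, 6.1]
t_stat, dof = welch_t(treated, control)
```

A large absolute t-statistic relative to the replicate variability suggests differential expression, which is why more replicates improve confidence.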
Hierarchical clustering
Hierarchical clustering is a statistical method for finding relatively homogeneous clusters. It consists of two separate phases. First, a distance matrix containing all the pairwise distances between the genes is calculated. Pearson’s correlation and Spearman’s correlation are often used as the basis for dissimilarity estimates, but other measures, such as Manhattan distance or Euclidean distance, can also be applied. Given the number of distance measures available and their influence on the clustering results, several studies have compared and evaluated different distance measures for the clustering of microarray data, considering their intrinsic properties and robustness to noise. Second, once the initial distance matrix has been calculated, the hierarchical clustering algorithm either (A) iteratively joins the two closest clusters, starting from single data points (the agglomerative, bottom-up approach, which is the more commonly used), or (B) iteratively partitions clusters, starting from the complete set (the divisive, top-down approach). After each step, the distances between the newly formed clusters and the other clusters are recalculated. Hierarchical cluster analysis methods include:
- Single linkage (minimum method, nearest neighbor)
- Average linkage (UPGMA)
- Complete linkage (maximum method, furthest neighbor)
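The two phases of the agglomerative approach can be sketched in a few lines. This is a simplified illustration rather than an efficient implementation: it assumes a dissimilarity of one minus Pearson's correlation and average linkage, stops at a requested number of clusters instead of building a full dendrogram, and uses hypothetical function names and profiles. Constant profiles are not handled, because their correlation is undefined.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation, so similar profiles are close."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

def average_linkage(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(profiles, n_clusters):
    """Bottom-up clustering; returns lists of profile indices."""
    n = len(profiles)
    # Phase 1: pairwise distance matrix between all profiles.
    dist = [[pearson_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    # Phase 2: start from singletons and repeatedly merge the closest pair.
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical expression profiles across four conditions.
profiles = [
    [1.0, 2.0, 3.0, 4.0],  # rising
    [2.0, 4.1, 6.0, 8.2],  # rising
    [4.0, 3.0, 2.0, 1.0],  # falling
    [8.0, 6.1, 4.0, 2.2],  # falling
]
clusters = agglomerate(profiles, 2)
```

Replacing `average_linkage` with the minimum or maximum of the pairwise distances would give single or complete linkage respectively.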
K-means clustering
K-means clustering is an algorithm for grouping genes or samples into K groups based on their expression patterns. Grouping is done by minimizing the sum of the squared distances between the data points and the corresponding cluster centroid. The purpose of K-means clustering is thus to classify data based on similar expression (www.biostat.ucsf.edu). The K-means algorithm and some of its variants (including k-medoids) have been shown to produce good results for gene expression data, at least better than hierarchical clustering methods. Empirical comparisons of k-means, k-medoids, hierarchical methods, and different distance measures can be found in the literature.
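The minimization described above is commonly carried out by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. The sketch below shows this under simplifying assumptions: squared Euclidean distance, and a naive deterministic start that uses the first K points as centroids (practical implementations usually use random or smarter initialization). The function names and data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100):
    """Return (cluster label per point, final centroids)."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical two-condition expression values for five genes.
points = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9], [1.2, 0.8]]
labels, centroids = k_means(points, 2)
```

Because the result depends on the starting centroids, K-means is often run several times with different initializations and the solution with the smallest total squared distance is kept.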
Gene ontology studies
Gene ontology studies give biologically meaningful information about a gene, including its cellular location, molecular function, and biological process. This information is analyzed for differences in regulation under disease or drug treatment, with respect to the normal control.
Pathway analysis
Pathway analysis gives specific information about the pathways affected in disease conditions, with respect to the normal control. It also allows identification of gene networks and of how genes are regulated.
References
- Gentleman, Robert; et al. (2005). Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer Science+Business Media. ISBN 978-0-387-29362-2.
- Jaskowiak, Pablo A.; Campello, Ricardo J.G.B.; Costa, Ivan G. "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis". IEEE/ACM Transactions on Computational Biology and Bioinformatics 10 (4): 845–857. doi:10.1109/TCBB.2013.9.
- Jaskowiak, Pablo A; Campello, Ricardo JGB; Costa, Ivan G. "On the selection of appropriate distances for gene expression data clustering". BMC Bioinformatics 15 (Suppl 2): S2. doi:10.1186/1471-2105-15-S2-S2.
- de Souto, Marcilio C. P.; Costa, Ivan G.; de Araujo, Daniel S. A.; Ludermir, Teresa B.; Schliep, Alexander. "Clustering cancer gene expression data: a comparative study". BMC Bioinformatics 9 (1): 497. doi:10.1186/1471-2105-9-497.
- GeneChip® Expression Analysis: Data Analysis Fundamentals (Affymetrix). http://mmjggl.caltech.edu/microarray/data_analysis_fundamentals_manual.pdf http://www.stat.duke.edu/~mw/ABS04/RefInfo/data_analysis_fundamentals_manual.pdf