Gene cluster

A gene cluster is a group of two or more genes that encode similar products, typically proteins, which collectively share a generalized function. Portions of the DNA sequence of each gene within a gene cluster are identical; however, the protein encoded by each gene is distinct from the proteins encoded by the other genes in the cluster. Genes in a gene cluster may lie near one another on the same chromosome or on different chromosomes. Because of the homology of the DNA sequences, the presence of a shared gene cluster in two species suggests a close evolutionary relationship between them. Gene clusters are therefore used to assess the evolutionary relationships between species. An example is the human β-globin gene cluster, whose genes encode subunits of hemoglobin.

Formation

Coordinated gene expression, or co-expression, is considered the most common mechanism driving the formation of gene clusters; however, coinheritance has also been proposed as a driving force.^[1] Eukaryotic genes were once thought to be randomly distributed in the genome and expressed independently of their neighbors; however, evidence indicates that eukaryotic genes are regulated not only at the individual level, by promoter sequences and transcription factors, but also by their location within the genome. As a result, genes that share similar expression levels are nonrandomly distributed within genomes.^[2] These co-expressed genes tend to be found in a gene cluster in which the individual genes generally share a similar function, such as participation in a metabolic pathway. Such a cluster is composed of genes specific to a particular metabolic pathway and spans a large portion of the genome.^[1]

Gene duplication may also result in the formation of gene clusters. Conserved gene clusters, such as the Hox cluster and the human β-globin gene cluster, may be formed by gene duplication and divergence. A gene is duplicated during cell division, so that its descendants have two end-to-end copies of the gene where there had been one, each initially coding for the same protein or otherwise having the same function. In the course of subsequent evolution the copies diverge, so that their products have different but related functions, while the genes remain adjacent on the chromosome.^[3]

It was theorized that the origin of new genes during evolution depended on gene duplication. If only a single copy of a gene existed in the genome of a species, the proteins encoded by this gene would be essential to the survival of that species, and such single-copy genes were therefore considered essential. Because there was only one copy, the gene could not accumulate mutations that might give rise to new genes; gene duplication, however, allows mutations to occur in the duplicated copy, which can ultimately give rise to new genes over the course of evolution.^[4] Mutations in the duplicated copy are tolerated because the original copy retains the genetic information for the essential gene's function. Thus, species that have gene clusters have a selective evolutionary advantage.^[1] Over a short span of time, the new genetic information in the duplicated copy confers no practical advantage; over a long evolutionary period, however, the duplicated copy may undergo additional and drastic mutations, so that its protein serves a different role from that of the original essential gene.^[4] Over this long period, the two similar genes diverge until the proteins they encode have distinct functions.

Figure (Genes hox.jpg): Hox gene clusters of various sizes are found in several phyla. Each colored box is indicative of one Hox gene.

When gene duplication produces a gene cluster, one or multiple genes may be duplicated at once. In the case of the Hox genes, a shared ancestral ProtoHox cluster was duplicated, resulting in the Hox cluster as well as the ParaHox cluster, an evolutionary sister complex of the Hox cluster.^[5] The exact number of genes contained in the duplicated ProtoHox cluster is unknown; however, models exist suggesting that it originally contained four, three, or two genes.^[6] When a gene cluster is duplicated, some genes may be lost. The loss of genes depends on the number of genes originally present in the gene cluster. In the four-gene model, the ProtoHox cluster contained four genes and gave rise to two twin clusters: the Hox cluster and the ParaHox cluster.^[5] As its name indicates, in the two-gene model the Hox cluster and the ParaHox cluster arose from a ProtoHox cluster that contained only two genes. The three-gene model was originally proposed in conjunction with the four-gene model;^[6] however, rather than arising from a cluster containing three genes, the Hox cluster and the ParaHox cluster in this model resulted from single-gene tandem duplication, in which identical genes are found adjacent to one another on the same chromosome.^[5] This was independent of duplication of the ancestral ProtoHox cluster.

Gene duplication may occur via cis-duplication or trans-duplication. Cis-duplication, or intrachromosomal duplication, entails the duplication of genes within the same chromosome, whereas trans-duplication, or interchromosomal duplication, consists of duplicating genes on neighboring but separate chromosomes.^[5] The Hox cluster and the ParaHox cluster were formed by intrachromosomal duplication, although their formation was initially thought to be interchromosomal.^[6]

Gene clusters vs. tandem arrays

Repeated genes can occur in two major patterns: gene clusters and tandem repeats, formerly called tandemly arrayed genes. Although similar, gene clusters and tandemly arrayed genes can be distinguished from one another.

Gene clusters

The genes of a gene cluster lie close to one another when they are observed on the same chromosome. They are dispersed randomly; however, they are normally within, at most, a few thousand bases of each other. The distance between the genes in a gene cluster can vary. The DNA found between the repeated genes in a gene cluster is non-conserved.^[7] Portions of the DNA sequence are identical among the genes contained in a gene cluster.^[4] Gene conversion is the only mechanism by which gene clusters may become homogenized. Although the size of a gene cluster may vary, it rarely comprises more than 50 genes, making clusters stable in number. Gene clusters change over a long evolutionary time period, which does not result in genetic complexity.^[7]

Tandem arrays

Tandem arrays are groups of genes with the same or similar function that are repeated consecutively, one directly after another. The genes are organized in the same orientation.^[7] Unlike gene clusters, tandemly arrayed genes consist of consecutive, identical repeats, separated only by a nontranscribed spacer region.^[8] While the genes contained in a gene cluster encode similar proteins, tandemly arrayed genes encode identical proteins or functional RNAs. Unequal recombination, which changes the number of repeats by placing duplicated genes next to the original gene, and gene conversion both allow tandemly arrayed genes to become homogenized. Also unlike gene clusters, tandemly arrayed genes change rapidly in response to the needs of the environment, causing an increase in genetic complexity.

Tandemly arrayed genes are essential to maintaining large gene families, such as the ribosomal RNA (rRNA) genes. In the eukaryotic genome, the rRNA genes are tandemly arrayed. Tandemly repeated rRNA genes are essential for maintaining the supply of the RNA transcript, because a single RNA gene may not be able to provide a sufficient amount of RNA; tandem repeats of the gene allow a sufficient amount to be produced. For example, human embryonic cells contain 5-10 million ribosomes and double in number within 24 hours. In order to provide a sufficient number of ribosomes, multiple RNA polymerases must consecutively transcribe multiple rRNA genes.^[8]

Tandemly arrayed genes may form into gene clusters.

Types

Prokaryotic gene clusters

Gene clusters may be similar to an operon, in which all genes are controlled by a single promoter and are transcribed as a polycistronic messenger RNA. Operon-like gene clusters are primarily, but not exclusively, formed by horizontal gene transfer in prokaryotes.^[9] Operon-like gene clusters have been observed in the nematode Caenorhabditis elegans^[1] as well as in the bacterium Escherichia coli.^[9] The lac operon of Escherichia coli is the best-studied operon-like gene cluster.^[10] However, gene clusters found in Caenorhabditis elegans and Ciona intestinalis are thought to exhibit the most characteristics of a true operon.^[9]

Eukaryotic gene clusters

Gene clusters have also been observed in eukaryotic organisms, such as yeast, fungi, insects, vertebrates, and plants. A variety of well-known gene clusters, such as the DAL and GAL clusters, are found in yeast.^[1] Filamentous fungal gene clusters play a key role in the biosynthesis of primary or secondary metabolites.^[9] The structure of metabolic pathway gene clusters differs vastly from that of operon-like gene clusters.^[1] In general, eukaryotic gene clusters differ greatly from prokaryotic gene clusters. While prokaryotic gene clusters are thought to form as a result of horizontal gene transfer, this mechanism is highly unlikely in eukaryotes, despite isolated observations of fungal gene clusters arising by horizontal gene transfer. In addition, each gene of a eukaryotic gene cluster is transcribed as an independent, or monocistronic, messenger RNA.^[9]

Detection

Gene expression is critical to understanding gene function and networks of genes. Furthermore, it aids in the study of diseases as well as their treatment. The essential first step in analyzing gene expression is detecting the presence of gene clusters.^[11] Bioinformatics tools and techniques can help identify gene clusters in organisms. A search of a genome (or a section of a genome) for gene clusters can be based on sequence similarity or functional similarity.^[12] Data from gene expression experiments are typically presented in a matrix in which the rows correspond to genes and the columns correspond to conditions or time points. Each entry of the matrix gives the level of expression of one gene under a specific condition or at a specific time.^[11]
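The matrix layout described above can be illustrated with a minimal Python sketch. The gene names, time points, and expression values are invented for illustration, and the helper names `expression_level` and `condition_profile` are hypothetical.

```python
# Hypothetical expression matrix: one row per gene, one column per condition.
conditions = ["0h", "2h", "4h"]
expression = {
    "geneA": [1.0, 2.0, 4.0],
    "geneB": [0.9, 2.1, 3.8],
    "geneC": [5.0, 2.5, 1.0],
}


def expression_level(gene, condition):
    """Return the matrix entry for one gene under one condition."""
    return expression[gene][conditions.index(condition)]


def condition_profile(condition):
    """Return one column of the matrix, keyed by gene name."""
    column = conditions.index(condition)
    return {gene: row[column] for gene, row in expression.items()}


print(expression_level("geneA", "4h"))
print(condition_profile("2h"))
```

Each row of such a matrix is the expression pattern of one gene, which is the vector that the clustering algorithms described later operate on.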

Microarrays

High-density microarrays allow researchers to view transcription levels and the expression of specific genes during various stages and under various conditions. Understanding the expression of particular genes during developmental stages allows a more thorough study of diseases as well as their response to treatment options. Currently, two types of microarrays produce gene expression data on a large scale. In each type, hybridizations between probes and targets are performed on DNA chips.^[11]

1. cDNA: Complementary DNA (cDNA) microarrays are composed of cDNA sequences immobilized on a solid substrate. Using a single matrix, cDNA specific for a particular gene is detected. Simultaneously, RNA tagged with fluorescent dye is used to probe the matrix. The intensity of the fluorescence at each spot, or gene, allows researchers to estimate the level of expression of each gene. Researchers may also analyze clustering of genes.^[11]

2. Oligonucleotide: Oligonucleotide microarrays are similar to cDNA microarrays in that the oligonucleotides are also immobilized on a solid substrate. While the cDNA sequences in cDNA microarrays are typically at least one hundred base pairs in length, the oligonucleotides in oligonucleotide microarrays are generally twenty-five base pairs in length. These oligonucleotides, which represent each gene, also hybridize to fluorescently tagged RNA. The same data can be obtained from oligonucleotide microarrays as from cDNA microarrays; however, oligonucleotide microarrays have been found to be more accurate.^[11]

Algorithms

Detecting gene clusters presents an algorithmic problem. A clustering problem consists of elements, which are commonly genes, and a characteristic vector for each element, which is the gene's pattern of expression. Similarity, whose definition depends on the problem, is measured between two vectors. The elements are separated into subsets, or clusters, that satisfy homogeneity and separation. Homogeneity means that all genes within a cluster are highly similar to one another. Separation, in contrast, means that genes found in different clusters exhibit low similarity to one another. Homogeneity and separation share an inverse relationship: the greater the homogeneity of the elements, the poorer their separation, and vice versa. A variety of software programs that implement clustering algorithms exist in the bioinformatics field and allow for easy analysis of gene clustering problems.^[11]
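As a rough illustration of the two criteria, the sketch below scores a clustering by the average similarity of same-cluster pairs (homogeneity) and the average similarity of different-cluster pairs (separation, where a lower value means better separated clusters). Measuring both as plain averages is an assumption made here for simplicity; the function name and the similarity values are hypothetical.

```python
from itertools import combinations


def homogeneity_and_separation(similarity, clusters):
    """Average similarity within clusters and between clusters.

    `similarity` maps frozenset({a, b}) to a similarity value;
    `clusters` is a list of sets that partition the elements.
    """
    label = {e: i for i, cluster in enumerate(clusters) for e in cluster}
    within, between = [], []
    for a, b in combinations(sorted(label), 2):
        value = similarity[frozenset((a, b))]
        (within if label[a] == label[b] else between).append(value)

    def average(values):
        return sum(values) / len(values) if values else float("nan")

    return average(within), average(between)


# Hypothetical pairwise similarities between four genes.
similarity = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g3", "g4")): 0.8,
    frozenset(("g1", "g3")): 0.1,
    frozenset(("g1", "g4")): 0.2,
    frozenset(("g2", "g3")): 0.0,
    frozenset(("g2", "g4")): 0.1,
}

good = homogeneity_and_separation(similarity, [{"g1", "g2"}, {"g3", "g4"}])
poor = homogeneity_and_separation(similarity, [{"g1", "g3"}, {"g2", "g4"}])
print(good, poor)
```

In this toy data the first partition groups the highly similar pairs together, so its within-cluster average is high and its between-cluster average is low; the second partition reverses that.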

Key algorithmic terminology

N = {e₁, ..., eₙ}: the set of n elements
eᵢ: an element of N
Cluster: a subset of N
C = {C₁, ..., C_l}: a partition of N; also called a clustering solution or clustering
Mates: elements that belong to the same cluster

Input data for a clustering problem may be entered in two forms:

1. Fingerprint data consist of a number (p) of measurements taken on each element, so that each element corresponds to a real-valued vector. For example, the expression levels of mRNA under various conditions constitute fingerprint data.

2. Similarity data give, for each pair of elements, the similarity of one element with respect to the other. These values may be calculated from fingerprint data.^[11]
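The step from fingerprint data to similarity data can be sketched as follows. The Pearson correlation coefficient is used here as the similarity measure, which is one common choice and an assumption of this example; the gene names and values are invented, and `similarity_data` is a hypothetical helper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def similarity_data(fingerprints):
    """Compute one similarity value for every unordered pair of elements."""
    names = sorted(fingerprints)
    return {
        (g, h): pearson(fingerprints[g], fingerprints[h])
        for i, g in enumerate(names)
        for h in names[i + 1:]
    }


# Hypothetical fingerprint data: p = 3 measurements per gene.
fingerprints = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.0],
    "g3": [3.0, 2.0, 1.0],
}
sim = similarity_data(fingerprints)
print(sim)
```

Here g1 and g2 rise together and are perfectly correlated, while g3 falls as g1 rises and is perfectly anti-correlated with it.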

Agglomerative hierarchical clustering

Agglomerative hierarchical clustering is the oldest and most popular algorithm used for detecting gene clusters. The result of a hierarchical clustering is typically represented as a dendrogram. Hierarchical methods work either top-down, repeatedly partitioning the elements, or bottom-up; in the agglomerative (bottom-up) approach, each element starts in its own cluster and clusters are repeatedly merged until all elements are found in the same cluster. Eisen et al. developed the software program Cluster based on this algorithm. Its viewing program is TreeView.^[11]
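The bottom-up merging can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the Cluster program: it assumes Euclidean distance between expression profiles and average linkage between clusters, and the profiles are invented.

```python
from itertools import combinations
from math import dist


def average_linkage(points, a, b):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Clusters are tuples of point indices. Returns the list of merges in
    the order they happened, which is the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(points, *pair),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Hypothetical two-condition expression profiles for five genes.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 0.0)]
merges = agglomerate(profiles)
for left, right in merges:
    print(left, "+", right)
```

Recomputing every pairwise linkage at each step keeps the sketch short but is slow for large data sets; practical implementations cache and update the distances instead.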

Self-organizing maps

Self-organizing maps (SOM) assume that the number of clusters is known. A two-dimensional grid is composed of nodes, each of which represents one cluster. Each node is associated with one reference vector. The reference vectors are moved according to the input vectors, which direct them toward dense areas of the input vector space. Tamayo et al. developed GeneCluster, a software program based on the SOM algorithm.^[11]
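The movement of reference vectors toward the input vectors can be sketched as below. This is a simplified illustration rather than the GeneCluster implementation: the linearly decaying learning rate, the square grid neighborhood, and the choice to move neighbors at half the winner's rate are all assumptions made for the example, as are the data values.

```python
from math import dist


def train_som(inputs, grid, epochs=20, rate=0.5, radius=1):
    """Train a small self-organizing map in place and return it.

    `grid` maps (row, column) node coordinates to reference vectors.
    """
    for epoch in range(epochs):
        step = rate * (1 - epoch / epochs)  # learning rate decays linearly
        for x in inputs:
            # The winner is the node whose reference vector is closest to x.
            winner = min(grid, key=lambda node: dist(grid[node], x))
            for node in list(grid):
                # Grid distance between this node and the winner.
                hops = max(abs(node[0] - winner[0]), abs(node[1] - winner[1]))
                if hops <= radius:
                    weight = step if node == winner else step / 2
                    grid[node] = [
                        r + weight * (xi - r) for r, xi in zip(grid[node], x)
                    ]
    return grid


# Hypothetical expression vectors forming two dense areas.
inputs = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [8.8, 9.2]]
grid = {(0, 0): [0.0, 0.0], (0, 1): [3.0, 3.0],
        (1, 0): [6.0, 6.0], (1, 1): [10.0, 10.0]}
trained = train_som(inputs, grid)
for node, reference in sorted(trained.items()):
    print(node, [round(value, 2) for value in reference])
```

After training, each input can be assigned to the node with the nearest reference vector, and the inputs sharing a node form one cluster.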

CLICK

Cluster Identification via Connectivity Kernels (CLICK) assumes that pairwise similarity values between mates are normally distributed. CLICK uses graph-theoretic techniques: a weighted similarity graph is generated from the input data. Before a partition occurs, the algorithm assesses whether a subset of the elements in the weighted similarity graph is a kernel; kernels are the basis for clusters. If a subset is a kernel, it is not partitioned further; if not, the data are partitioned further, yielding a list of kernels and a set of single vertices. The similarities between a single vertex's fingerprint and the fingerprint of a cluster are calculated. The two kernels that share the highest similarity are merged, but only if the similarity exceeds a predetermined threshold. Standard error bars and expression patterns are shown for each cluster upon visualization.^[11]

Diametrical clustering

The majority of clustering software clusters genes that exhibit a positive correlation among expression patterns; however, genes that are anti-correlated may still be functionally similar and thus belong to the same gene cluster. Genes found in a cellular pathway are anticipated to be positively correlated. Genes that repress other genes in the same pathway are expected to be anti-correlated, or negatively correlated. The diametrical clustering algorithm was developed specifically to group negatively correlated genes. Diametrical clustering repeatedly re-partitions the genes while repeatedly calculating the dominant singular vector of each cluster. Each dominant singular vector serves as a model of a diametric cluster.^[13]
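The alternation between re-partitioning and recomputing dominant singular vectors can be sketched as follows. This is an illustrative approximation, not the published implementation: it assumes each expression profile is centered and scaled to unit length, uses the squared dot product as the similarity so that correlated and anti-correlated genes score alike, finds the dominant singular vector by power iteration, and starts from an arbitrary round-robin partition. The gene names and profiles are invented.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Center a non-constant profile and scale it to unit length."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    norm = sqrt(dot(centered, centered))
    return [a / norm for a in centered]


def dominant_vector(vectors, iterations=100):
    """Power iteration for the dominant singular vector of a set of profiles."""
    v = list(vectors[0])
    for _ in range(iterations):
        w = [0.0] * len(v)
        for x in vectors:
            c = dot(x, v)
            w = [wi + c * xi for wi, xi in zip(w, x)]
        norm = sqrt(dot(w, w))
        v = [wi / norm for wi in w]
    return v


def diametrical_clustering(profiles, k, rounds=10):
    """Alternate between fitting one model vector per cluster and reassigning genes."""
    genes = sorted(profiles)
    data = {g: normalize(profiles[g]) for g in genes}
    labels = {g: i % k for i, g in enumerate(genes)}  # arbitrary starting partition
    for _ in range(rounds):
        models = {}
        for j in range(k):
            members = [data[g] for g in genes if labels[g] == j]
            if members:
                models[j] = dominant_vector(members)
        for g in genes:
            # The squared dot product ignores the sign of the correlation.
            labels[g] = max(models, key=lambda j: dot(data[g], models[j]) ** 2)
    return labels


# Hypothetical profiles: g3 mirrors g1, and g2 mirrors g5.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 1.0, 1.0, 2.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 2.5, 2.5, 4.0],
    "g5": [1.0, 2.0, 2.0, 1.0],
}
labels = diametrical_clustering(profiles, k=2)
print(labels)
```

In this toy data the rising profiles g1 and g4 end up in the same cluster as the falling profile g3, because a method that scores similarity by squared correlation treats a mirrored pattern as a match.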

TGICL

Expressed sequence tags (ESTs) are used to discover genes and analyze gene expression; however, encoded genes are typically not identified from ESTs. EST data sets are large and composed of repeated, partial transcript sequences that often exhibit chimaerism and vector or adaptor contamination. TIGR Gene Indices Clustering tools (TGICL) is a software system specifically designed to provide fast, efficient clustering of large EST data sets.^[14]

DAVID

Common genomic analysis tools like BLAST can be used to find similar sequences throughout the genome. A program called DAVID (Database for Annotation, Visualization, and Integrated Discovery) can be applied to find functionally similar genes across the genome, once a gene of interest has been identified.^[12]^[15]

References

[1] Yi, Gangman (2007). "Identifying clusters in functionally related genes in genomes". Bioinformatics. 23 (9): 1053–1060. doi:10.1093/bioinformatics/btl673.
[2] Michalak, P. (2008). "Coexpression, coregulation, and cofunctionality of neighboring genes in eukaryotic genomes". Genomics. 91 (3): 243–248. PMID 18082363.
[3] Ohno, Susumu (1970). Evolution by gene duplication. Springer-Verlag. ISBN 0-04-575015-7.
[4] Klug, William (2009). "Chromosome Mutations: Variation in chromosome number and arrangement". In Beth Wilbur (ed.). Concepts of Genetics (9 ed.). San Francisco, CA: Pearson Benjamin Cummings. pp. 213–214. ISBN 9780321540980.
[5] Garcia-Fernàndez, J. (2005). "Hox, ParaHox, ProtoHox: facts and guesses". Heredity. 94: 145–152. doi:10.1038/sj.hdy.6800621.
[6] Garcia-Fernàndez, Jordi (2005). "The genesis and evolution of homeobox gene clusters". Nature Reviews Genetics. 6: 881–892. doi:10.1038/nrg1723.
[7] Graham, Geoffrey (July 1995). Journal of Theoretical Biology. 175 (1): 71–87. doi:10.1006/jtbi.1995.0122.
[8] Lodish, Harvey (2013). "Genes, Genomics, and Chromosomes". In Beth McHenry (ed.). Molecular Cell Biology (7 ed.). New York: W.H. Freeman Company. pp. 227–230. ISBN 9781429234139.
[9] Boycheva, Svetlana (2014). "The rise of operon-like gene clusters in plants". Trends in Plant Science. doi:10.1016/j.tplants.2014.01.013.
[10] Ralston, A. (2008). "Operons and Prokaryotic Gene Regulation". Nature Education. 1 (1): 216.
[11] Sharan, R. (2002). "Cluster Analysis and its applications to gene expression". Ernst Schering Research Foundation Workshop. 38: 83–108. PMID 12061008.
[12] Huang, Da Wei (2009). "Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources". Nature Protocols. 4 (1): 44–57. PMID 19131956.
[13] Dhillon, Inderjit (2003). "Diametrical clustering for identifying anti-correlated gene clusters". Bioinformatics. 19 (13): 1612–1619. doi:10.1093/bioinformatics/btg209.
[14] Pertea, Geo (2003). "TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets". Bioinformatics. 19 (5): 651–652. doi:10.1093/bioinformatics/btg034.
[15] Huang, Da Wei (2009). "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists". Nucleic Acids Research. 37 (1): 1–13. doi:10.1093/nar/gkn923. PMID 19033363.
