Developer(s): Georgia Institute of Technology
Operating system: Linux, Solaris, AIX, and Mac OS X
License: Proprietary commercial software; free for non-profit academic or U.S. Government use
GeneMark is a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. First developed in 1993, GeneMark was used in 1995 for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and in 1996 for the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of a known type. The major step of the algorithm computes, for a given sequence fragment, the a posteriori probability that the fragment is protein-coding in one of six possible reading frames (three on the given strand and three on the complementary strand) or is non-coding.
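The posterior computation described above can be illustrated with a minimal sketch. It is not GeneMark's implementation: it assumes zero-order models (one nucleotide distribution per codon position) where GeneMark uses higher-order Markov chains, and the probabilities in `CODING` and `NONCODING`, the prior `prior_coding`, and all function names are made-up values chosen only for illustration.

```python
import math

# Toy zero-order parameters (illustrative assumptions, not trained values).
# CODING[k][b]: probability of base b at codon position k (k = 0, 1, 2).
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.35, "T": 0.17},
    {"A": 0.30, "C": 0.23, "G": 0.18, "T": 0.29},
    {"A": 0.18, "C": 0.30, "G": 0.30, "T": 0.22},
]
# Homogeneous non-coding model: the same distribution at every position.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood when seq[0] sits at codon position `frame` (0, 1 or 2)."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def posteriors(seq, prior_coding=0.5):
    """Posterior probability of each of the seven hypotheses via Bayes' rule."""
    rc = reverse_complement(seq)
    log_joint = {"noncoding": math.log(1 - prior_coding) + log_lik_noncoding(seq)}
    per_frame = math.log(prior_coding / 6)  # coding prior split over six frames
    for f in range(3):
        log_joint[f"+{f + 1}"] = per_frame + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = per_frame + log_lik_coding(rc, f)
    # Normalize in log space to avoid underflow on long fragments.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {h: math.exp(v - m) / total for h, v in log_joint.items()}
```

Calling `posteriors(fragment)` returns a dictionary with the keys `+1`, `+2`, `+3`, `-1`, `-2`, `-3`, and `noncoding`, whose values sum to 1.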
The GeneMark.hmm algorithm was designed to improve gene prediction quality by finding exact gene starts. The idea was to integrate the GeneMark models into a hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make the gene-start predictions more accurate. In evaluations by different groups, GeneMark.hmm was shown to be significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version GeneMarkS have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
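The idea of modeling gene boundaries as transitions between hidden states can be sketched with a small Viterbi decoder. This is a simplified illustration, not the GeneMark.hmm architecture: it assumes one strand, a single non-coding state `N` and three codon-position states `C1` to `C3`, no ribosome binding site model, and hypothetical transition and emission probabilities.

```python
import math

NEG_INF = float("-inf")
STATES = ["N", "C1", "C2", "C3"]

# Hypothetical probabilities. Gene boundaries are the transitions
# N -> C1 (gene start) and C3 -> N (gene end).
TRANS = {
    "N": {"N": 0.9, "C1": 0.1},
    "C1": {"C2": 1.0},
    "C2": {"C3": 1.0},
    "C3": {"C1": 0.95, "N": 0.05},
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C1": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    "C2": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "C3": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
}
START = {"N": 1.0}  # assume every sequence begins in non-coding DNA


def safe_log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(seq):
    """Return the most probable hidden-state path for a DNA string."""
    score = [{s: safe_log(START.get(s, 0.0)) + safe_log(EMIT[s][seq[0]])
              for s in STATES}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for p in STATES:
                cand = score[-1][p] + safe_log(TRANS[p].get(s, 0.0))
                if cand > best:
                    best_prev, best = p, cand
            row[s] = best + safe_log(EMIT[s][base])
            ptr[s] = best_prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

In the decoded path, the position where `N` is followed by `C1` is the predicted gene start, and the position where `C3` is followed by `N` is the predicted gene end.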
After the prokaryotic version of GeneMark.hmm was developed, the approach was extended to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presented a major challenge. The hidden Markov model architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site and termination site, as well as donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.
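The hidden states listed above can be arranged into a state graph. The sketch below is one plausible forward-strand topology inferred from the state names alone; the exact connectivity, the state labels, and the helper functions are assumptions for illustration, and details such as intron phase and the reverse-strand copies of each state are omitted.

```python
# Hypothetical forward-strand state graph built from the states named in the
# text. Each label stands for one segment of the sequence.
TRANSITIONS = {
    "intergenic": {"start_site"},
    "start_site": {"initial_exon", "single_exon"},
    "initial_exon": {"donor_site"},
    "donor_site": {"intron"},
    "intron": {"acceptor_site"},
    "acceptor_site": {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor_site"},
    "terminal_exon": {"stop_site"},
    "single_exon": {"stop_site"},
    "stop_site": {"intergenic"},
}


def is_valid_parse(path):
    """Check that every consecutive pair of segment labels is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(path, path[1:]))


def count_exons(path):
    """Count the exon segments in a parse."""
    return sum(1 for label in path if label.endswith("_exon"))


# A two-exon gene flanked by intergenic sequence.
two_exon_gene = [
    "intergenic", "start_site", "initial_exon", "donor_site", "intron",
    "acceptor_site", "terminal_exon", "stop_site", "intergenic",
]
```

Under this assumed topology, a multi-exon gene must pass through a donor site, an intron, and an acceptor site between any two exons, while a single-exon gene goes directly from the initiation site to the termination site.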
To accurately find genes in DNA sequences using computers, models of protein-coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence are required. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. This heuristic uses the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimation of the genome G+C content (a fragment longer than 400 nucleotides) is also sufficient for derivation of parameters of the Markov models used in GeneMark and GeneMark.hmm.
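As a minimal sketch of what "parameters as functions of G+C content" means, the code below estimates G+C content from a fragment and derives the simplest possible model from it: a zero-order background distribution in which G and C share the G+C fraction equally, and A and T share the remainder. This is an assumption made for illustration; the functions that the heuristic method uses for the protein-coding models are not reproduced here, and the names `gc_content` and `background_model` are hypothetical.

```python
MIN_LENGTH = 400  # the text's threshold: the fragment must be longer than this


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA string."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted


def background_model(seq):
    """Derive a zero-order nucleotide distribution from G+C content alone.

    Illustrative assumption: G and C are equally frequent, as are A and T.
    """
    if len(seq) <= MIN_LENGTH:
        raise ValueError("fragment should be longer than 400 nucleotides")
    gc = gc_content(seq)
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}
```

The point of the sketch is that a single number estimated from a short fragment determines every parameter of the model, which is what makes the approach usable when no training set is available.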
Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages, and plasmids. This method can also be used for highly inhomogeneous genomes, where the Markov models must be adjusted to account for local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force in the evolution of codon usage patterns.
Family of gene prediction programs
Bacteria, Archaea and metagenomes
Viruses, phages and plasmids
- Heuristic approach
EST and cDNA
- Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
- Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115. doi:10.1093/nar/26.4.1107
- Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920. doi:10.1093/nar/27.19.3911
- Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618. doi:10.1093/nar/29.12.2607
- Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055. doi:10.1093/nar/gkg878
- Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451–W454. doi:10.1093/nar/gki487
- Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506. doi:10.1093/nar/gki937
- Zhu W., Lomsadze A. and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132. doi:10.1093/nar/gkq275