GeneMark
Original author(s) | Bioinformatics group of Mark Borodovsky |
---|---|
Developer(s) | Georgia Institute of Technology |
Initial release | 1993 |
Operating system | Linux, Windows, and Mac OS |
License | Free for academic, non-profit or U.S. Government use |
Website | opal.biology.gatech.edu/GeneMark |
GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as the primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in both DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark (developed before the HMM era in bioinformatics) is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known in HMM theory, applied to an appropriately defined HMM.
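The posterior computation described above can be illustrated with a minimal Python sketch. It is not GeneMark's implementation: the frequency tables, the equal split of the coding prior across six frames, and the function names are all hypothetical, and order-0 position-specific frequencies stand in for the higher-order three-periodic Markov chains used by the real program.

```python
import math

# Hypothetical toy parameters, NOT trained GeneMark values.
# CODING[k][b] is the probability of base b at codon position k (0, 1, 2),
# i.e. an order-0 stand-in for a three-periodic Markov chain.
CODING = [
    {"A": 0.28, "C": 0.20, "G": 0.36, "T": 0.16},
    {"A": 0.32, "C": 0.23, "G": 0.17, "T": 0.28},
    {"A": 0.18, "C": 0.32, "G": 0.30, "T": 0.20},
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq when its first base is at codon position `frame`."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def posteriors(seq, prior_coding=0.5):
    """Posterior probability of each of the seven hypotheses for a fragment."""
    rc = reverse_complement(seq)
    frame_prior = math.log(prior_coding / 6)  # assumed: equal prior per frame
    log_joint = {}
    for f in range(3):
        log_joint[f"+{f + 1}"] = frame_prior + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = frame_prior + log_lik_coding(rc, f)
    log_joint["non-coding"] = math.log(1 - prior_coding) + log_lik_noncoding(seq)
    # Normalise with the log-sum-exp trick to avoid underflow.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {h: math.exp(v - top) / total for h, v in log_joint.items()}
```

Calling `posteriors("GAC" * 20)` returns a dictionary with keys `+1`, `+2`, `+3`, `-1`, `-2`, `-3` and `non-coding` whose values sum to one. In a scan over a genome, the same calculation would be repeated for successive fragments (windows) along the sequence.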
Prokaryotic gene prediction
The GeneMark.hmm algorithm (1998) was designed to improve gene prediction accuracy in finding short genes and gene starts. The idea was to integrate the Markov chain models used in GeneMark into a hidden Markov model framework, with transitions between coding and non-coding regions formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was used to improve the accuracy of gene start prediction. The next step was the development of the self-training gene prediction tool GeneMarkS (2001). GeneMarkS has been in active use by the genomics community for gene identification in new prokaryotic genomic sequences. GeneMarkS+, an extension of GeneMarkS that integrates information on homologous proteins into gene prediction, is used in the NCBI pipeline for prokaryotic genome annotation; the pipeline can annotate up to 2000 genomes daily (www.ncbi.nlm.nih.gov/genome/annotation_prok/process).
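The idea of treating coding and non-coding regions as hidden states can be sketched with a two-state HMM decoded by the Viterbi algorithm. This is a deliberately reduced illustration, not GeneMark.hmm itself: the probabilities below are hypothetical, each state emits single nucleotides independently, and there is no reading-frame, strand, state-duration, or ribosome binding site modeling.

```python
import math

STATES = ("coding", "non-coding")

# Hypothetical toy parameters chosen only to make the example readable.
START = {"coding": 0.5, "non-coding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "non-coding": 0.1},
    "non-coding": {"coding": 0.1, "non-coding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "non-coding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a DNA sequence."""
    if not seq:
        return []
    # Log-score of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters, a GC-rich stretch flanked by AT-rich sequence is labeled as a "coding" segment between two "non-coding" segments; the segment boundaries play the role of predicted gene starts and ends.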
Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes
Accurate identification of species-specific parameters of the GeneMark and GeneMark.hmm algorithms was the key condition for making accurate gene predictions. However, studies of viral genomes raised the question of how to define parameters for gene prediction in a rather short sequence that has no large genomic context. In 1999 this question was addressed by the development of a "heuristic method" that computes the parameters as functions of the sequence G+C content. Since 2004, models built by the heuristic approach have been used for finding genes in metagenomic sequences. Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method (implemented in MetaGeneMark) in 2010.
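The notion of "parameters as functions of G+C content" can be made concrete with a small Python sketch. It covers only the simplest case, an order-0 background model under an assumed strand symmetry (A = T, G = C); the function names are hypothetical, and the fitted relationships that the heuristic method uses for codon-position frequencies are not reproduced here.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


def background_frequencies(gc):
    """Order-0 nucleotide frequencies derived from G+C content alone.

    Assumes strand symmetry (A = T and G = C); this is a simplification
    made for illustration, not the published heuristic model.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}
```

For a short fragment, `background_frequencies(gc_content(fragment))` yields model parameters without any training set, which is the property that makes this kind of approach usable on viral and metagenomic sequences.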
Eukaryotic gene prediction
In eukaryotic genomes, modeling the borders of exons with introns and intergenic regions presents a major challenge, which is addressed by the use of HMMs. The HMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial eukaryotic GeneMark.hmm needed training sets for estimation of the algorithm parameters. In 2005 the first version of the self-training algorithm GeneMark-ES was developed. In 2008 the GeneMark-ES algorithm was extended to fungal genomes by developing a special intron model and a more complex self-training strategy. Then, in 2014, GeneMark-ET, an algorithm that augments self-training with information from unassembled RNA-Seq reads mapped to the genome, was added to the family. Gene prediction in eukaryotic transcripts can be done by the new algorithm GeneMarkS-T (2015).
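The hidden states listed above can be written down as a small state graph. The Python sketch below covers the forward strand only and checks whether a sequence of segment labels is a structurally valid parse; the names `State`, `ALLOWED` and `is_valid_parse` are hypothetical, and details of the real model such as intron phases, reverse-strand states, and state durations are omitted.

```python
from enum import Enum


class State(Enum):
    INTERGENIC = "intergenic region"
    SINGLE = "single-exon gene"
    INITIAL = "initial exon"
    INTERNAL = "internal exon"
    TERMINAL = "terminal exon"
    INTRON = "intron"


# Allowed segment-to-segment transitions on the forward strand (assumed
# from standard gene structure, not taken from the GeneMark.hmm source).
ALLOWED = {
    State.INTERGENIC: {State.INITIAL, State.SINGLE},
    State.SINGLE: {State.INTERGENIC},
    State.INITIAL: {State.INTRON},
    State.INTRON: {State.INTERNAL, State.TERMINAL},
    State.INTERNAL: {State.INTRON},
    State.TERMINAL: {State.INTERGENIC},
}


def is_valid_parse(path):
    """True if every consecutive pair of segment states is an allowed transition."""
    if not path:
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))
```

A three-exon gene corresponds to the path intergenic, initial exon, intron, internal exon, intron, terminal exon, intergenic, which `is_valid_parse` accepts; a path that jumps from an initial exon directly to a terminal exon is rejected.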
GeneMark Family of Gene Prediction Programs
Bacteria, Archaea
- GeneMark
- GeneMarkS
- GeneMarkS+
Metagenomes and Metatranscriptomes
- MetaGeneMark
Eukaryotes
- GeneMark
- GeneMark.hmm
- GeneMark-ES
- GeneMark-ET
Viruses, phages and plasmids
- Heuristic models
Transcripts assembled from RNA-Seq reads
- GeneMarkS-T
References
- Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
- Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115. doi:10.1093/nar/26.4.1107
- Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920. doi:10.1093/nar/27.19.3911
- Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618. doi:10.1093/nar/29.12.2607
- Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055. doi:10.1093/nar/gkg878
- Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451–W454. doi:10.1093/nar/gki487
- Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506. doi:10.1093/nar/gki937
- Zhu W., Lomsadze A. and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132. doi:10.1093/nar/gkq275