GeneMark

GeneMark
Original author(s)	Bioinformatics group of Mark Borodovsky
Developer(s)	Georgia Institute of Technology
Initial release	1993
Operating system	Linux, Windows, and Mac OS
License	Free for academic, non-profit or U.S. Government use
Website	opal.biology.gatech.edu/GeneMark

GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark (developed before the HMM era in bioinformatics) is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known in HMM theory, applied to an appropriately defined HMM.
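The Bayesian step described above can be illustrated with a minimal sketch. It is not GeneMark's implementation: it assumes a zero-order three-periodic model (one nucleotide distribution per codon position) where GeneMark uses higher-order Markov chains, and all probabilities, the prior, and the function names are hypothetical values chosen for illustration. The label "+1" here means that the first base of the fragment sits at codon position 1 on the given strand; "-1" to "-3" refer to the complementary strand.

```python
import math

# Hypothetical toy parameters, NOT GeneMark's trained values.
# CODING[k][b] is the probability of base b at codon position k (k = 0, 1, 2).
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if its first base is at codon position `frame`."""
    return sum(math.log(CODING[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def frame_posteriors(seq, prior_noncoding=0.4):
    """Posterior over seven hypotheses: six coding frames plus non-coding."""
    prior_coding = (1.0 - prior_noncoding) / 6.0  # assumed equal across frames
    rc = reverse_complement(seq)
    log_joint = {"noncoding": math.log(prior_noncoding) + log_lik_noncoding(seq)}
    for f in range(3):
        log_joint[f"+{f + 1}"] = math.log(prior_coding) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding) + log_lik_coding(rc, f)
    # Normalize in log space to avoid underflow on long fragments.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {h: math.exp(v - top) / total for h, v in log_joint.items()}
```

Calling `frame_posteriors` on a fragment returns a dictionary of seven probabilities that sum to one; the hypothesis with the largest value is the most probable reading frame (or "noncoding") under the toy model.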

Prokaryotic gene prediction

The GeneMark.hmm algorithm (1998) was designed to improve gene prediction accuracy in finding short genes and gene starts. The idea was to integrate the Markov chain models used in GeneMark into a hidden Markov model framework, with transitions between coding and non-coding regions formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was used to improve the accuracy of gene start prediction. The next step was the development of the self-training gene prediction tool GeneMarkS (2001). GeneMarkS has been in active use by the genomics community for gene identification in new prokaryotic genomic sequences. GeneMarkS+, an extension of GeneMarkS that integrates information on homologous proteins into gene prediction, is used in the NCBI pipeline for prokaryotic genome annotation; the pipeline can annotate up to 2000 genomes daily (www.ncbi.nlm.nih.gov/genome/annotation_prok/process).
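The idea of treating coding and non-coding regions as hidden states can be sketched with a plain two-state HMM decoded by the Viterbi algorithm. This is a simplified illustration, not the GeneMark.hmm model: it assumes single-nucleotide emissions and has no reading frames, length distributions, or ribosome binding site model. All probabilities and names below are hypothetical.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters chosen for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a non-empty sequence."""
    # Log-scores of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters, a G+C-rich stretch flanked by A+T-rich sequence is labeled "coding" only when it is long enough for the emission gain to outweigh the cost of two state switches, which is the trade-off an HMM framework makes explicit.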

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of species-specific parameters of the GeneMark and GeneMark.hmm algorithms was the key condition for making accurate gene predictions. However, studies of viral genomes raised the question of how to define parameters for gene prediction in a rather short sequence that has no large genomic context. In 1999 this question was addressed by the development of a "heuristic method" that computes the parameters as functions of the sequence G+C content. Since 2004, models built by the heuristic approach have been used to find genes in metagenomic sequences. Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method, implemented in MetaGeneMark in 2010.
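To make the notion of "parameters as functions of G+C content" concrete, the sketch below covers only the simplest possible case: a zero-order non-coding background model derived from the G+C content of a short sequence, under the assumption that A and T (and likewise G and C) are equally frequent. The published heuristic method also derives the protein-coding model, which is not reproduced here; the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a non-empty DNA sequence."""
    seq = seq.upper()
    return sum(1 for b in seq if b in "GC") / len(seq)


def background_from_gc(gc):
    """Zero-order nucleotide frequencies as a function of G+C content.

    Assumes A = T and G = C, a simplification made for this sketch.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("G+C content must lie between 0 and 1")
    at_each = (1.0 - gc) / 2.0
    gc_each = gc / 2.0
    return {"A": at_each, "C": gc_each, "G": gc_each, "T": at_each}
```

For a fragment too short to train a model on, `background_from_gc(gc_content(fragment))` yields a usable parameter set from a single observable quantity, which is the practical appeal of the heuristic approach.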

Eukaryotic gene prediction

In eukaryotic genomes, modeling of exon borders with introns and intergenic regions presents a major challenge, which is addressed by the use of HMMs. The HMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial eukaryotic GeneMark.hmm needed training sets for estimation of the algorithm parameters. In 2005 the first version of the self-training algorithm GeneMark-ES was developed. In 2008 the GeneMark-ES algorithm was extended to fungal genomes by developing a special intron model and a more complex self-training strategy. Then, in 2014, GeneMark-ET, an algorithm that augments self-training with information from unassembled RNA-Seq reads mapped to the genome, was added to the family. Gene prediction in eukaryotic transcripts can be done with the algorithm GeneMarkS-T (2015).
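The hidden states listed above can be pictured as a small transition graph. The sketch below covers one DNA strand only and treats each state as a whole segment (so no state follows itself). The state names come from the text, but the allowed transitions are an assumption based on ordinary gene structure rather than the published GeneMark.hmm topology, which also has to track details such as intron phase.

```python
# Assumed segment-level transitions for one strand; illustration only.
ALLOWED = {
    "intergenic": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"intron"},
    "intron": {"internal_exon", "terminal_exon"},
    "internal_exon": {"intron"},
    "terminal_exon": {"intergenic"},
    "single_exon_gene": {"intergenic"},
}


def is_valid_state_path(path):
    """Check that a sequence of segment labels follows the assumed topology."""
    if any(state not in ALLOWED for state in path):
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))
```

A two-exon gene corresponds to the path intergenic, initial_exon, intron, terminal_exon, intergenic; a path that jumps from an initial exon straight to intergenic sequence is rejected, which is how the HMM topology enforces consistent exon-intron structures.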

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

GeneMark
GeneMarkS
GeneMarkS+

Metagenomes and Metatranscriptomes

MetaGeneMark

Eukaryotes

GeneMark
GeneMark.hmm
GeneMark-ES
GeneMark-ET

Viruses, phages and plasmids

Heuristic models

Transcripts assembled from RNA-Seq reads

GeneMarkS-T

References

Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115. doi:10.1093/nar/26.4.1107
Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920. doi:10.1093/nar/27.19.3911
Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618. doi:10.1093/nar/29.12.2607
Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055. doi:10.1093/nar/gkg878
Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451–W454. doi:10.1093/nar/gki487
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506. doi:10.1093/nar/gki937
Zhu W., Lomsadze A. and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132. doi:10.1093/nar/gkq275

External links

Official website
