
GeneMark: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 03:24, 3 February 2014

GeneMark
Developer(s): Georgia Institute of Technology
Initial release: 1993
Operating system: Linux, Solaris, AIX, and Mac OS X
License: Proprietary commercial software; free for non-profit academic or U.S. Government use
Website: opal.biology.gatech.edu/GeneMark

GeneMark is a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. First developed in 1993, GeneMark was used in 1995 for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and in 1996 for the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of a known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment is protein-coding in one of six possible reading frames (including three frames in the complementary DNA strand) or is "non-coding".[clarification needed]
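
The posterior computation described above can be illustrated with a minimal sketch. It is not the GeneMark implementation: it assumes zero-order models (one base frequency table per codon position) where the real models are higher-order Markov chains, and the function names, frame labels, uniform priors and toy frequencies are all hypothetical values chosen for illustration.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read in the 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, pos_freqs, frame):
    """Log-likelihood under a 3-periodic (inhomogeneous) zero-order model.

    pos_freqs[k][b] is P(base b | codon position k); `frame` is the codon
    position assigned to seq[0].
    """
    return sum(math.log(pos_freqs[(i + frame) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq, bg_freqs):
    """Log-likelihood under a homogeneous zero-order model."""
    return sum(math.log(bg_freqs[b]) for b in seq)


def frame_posteriors(seq, pos_freqs, bg_freqs, priors=None):
    """Posterior over six coding frames plus the non-coding hypothesis."""
    rc = reverse_complement(seq)
    logs = {}
    for f in range(3):
        logs[f"+{f + 1}"] = log_lik_coding(seq, pos_freqs, f)   # direct strand
        logs[f"-{f + 1}"] = log_lik_coding(rc, pos_freqs, f)    # complementary strand
    logs["non-coding"] = log_lik_noncoding(seq, bg_freqs)
    if priors is None:
        priors = {h: 1.0 / len(logs) for h in logs}  # assumption: uniform priors
    # Bayes' rule in log space, normalised with the log-sum-exp trick.
    scores = {h: logs[h] + math.log(priors[h]) for h in logs}
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {h: math.exp(s - top) / norm for h, s in scores.items()}


# Toy parameters (made up for illustration, not trained on any genome).
TOY_POS_FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # codon position 1
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},  # codon position 2
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},  # codon position 3
]
TOY_BG = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

if __name__ == "__main__":
    for hyp, p in frame_posteriors("ACG" * 10, TOY_POS_FREQS, TOY_BG).items():
        print(f"{hyp:>10s}  {p:.3g}")
```

Working in log space avoids numerical underflow on long fragments, since the raw likelihoods shrink geometrically with sequence length.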

GeneMark.hmm

Prokaryotic

The GeneMark.hmm algorithm was designed to improve gene prediction quality by finding exact gene starts. The idea was to integrate the GeneMark models into a hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make the gene-start predictions more accurate. In evaluations by different groups,[by whom?] GeneMark.hmm was shown to be significantly more accurate than GeneMark in exact gene prediction.[citation needed] Since 1998, GeneMark.hmm and its self-training version GeneMarkS have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.[citation needed]
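
The idea of reading gene boundaries off transitions between hidden states can be sketched with a plain two-state Viterbi decoder. This is a toy illustration only: the published algorithm is more elaborate, and the two states, their emission tables and the transition probabilities below are hypothetical values rather than parameters of GeneMark.hmm.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observed symbols."""
    # v[s] holds the best log-probability of any path that ends in state s.
    v = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of s at step t + 1
    for sym in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            v[s] = prev[best] + log_trans[best][s] + log_emit[s][sym]
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Toy model (assumed values): "N" = non-coding, "C" = coding.
STATES = ("N", "C")
LOG_START = {"N": math.log(0.5), "C": math.log(0.5)}
LOG_TRANS = to_log({"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}})
LOG_EMIT = to_log({
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
})

if __name__ == "__main__":
    seq = "AT" * 4 + "GC" * 6 + "AT" * 4
    labels = "".join(viterbi(seq, STATES, LOG_START, LOG_TRANS, LOG_EMIT))
    print(seq)
    print(labels)  # positions where the label changes are the predicted boundaries
```

In this toy the predicted "gene" is simply the maximal run of "C" labels; a realistic model would score segments with the GeneMark coding and non-coding models instead of single-base emission tables.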

Eukaryotic

After the development of the prokaryotic version of GeneMark.hmm, the approach was extended to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presented a major challenge. The hidden Markov model architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal, and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site and termination site, as well as donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.[citation needed]
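
One way to picture this architecture is as a graph of allowed state-to-state moves. The sketch below covers a single strand and works at the level of whole segments; the state names and the exact set of transitions are an interpretation of the description above, not taken from the GeneMark.hmm source, and should be treated as hypothetical.

```python
# Allowed successors of each hidden state on one DNA strand (assumed layout).
TRANSITIONS = {
    "intergenic": {"initiation_site"},
    "initiation_site": {"initial_exon", "single_exon"},
    "initial_exon": {"donor_site"},
    "donor_site": {"intron"},
    "intron": {"acceptor_site"},
    "acceptor_site": {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor_site"},
    "terminal_exon": {"termination_site"},
    "single_exon": {"termination_site"},
    "termination_site": {"intergenic"},
}


def is_valid_parse(path):
    """Check that a sequence of segment labels follows the state graph.

    A parse must begin and end in an intergenic region, and every
    consecutive pair of labels must be an allowed transition.
    """
    if not path or path[0] != "intergenic" or path[-1] != "intergenic":
        return False
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    two_exon_gene = [
        "intergenic", "initiation_site", "initial_exon", "donor_site",
        "intron", "acceptor_site", "terminal_exon", "termination_site",
        "intergenic",
    ]
    print(is_valid_parse(two_exon_gene))
```

A model covering both strands would add a mirrored copy of the gene states that is traversed in the opposite order, sharing the intergenic state.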

Heuristic Models

To find genes accurately in DNA sequences by computer, models of protein-coding and non-coding regions are required, derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999.[further explanation needed] This heuristic uses the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content.[clarification needed] Therefore, a short DNA sequence sufficient for estimation of the genome G+C content (a fragment longer than 400 nucleotides) is also sufficient for derivation of parameters of the Markov models used in GeneMark and GeneMark.hmm.
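
As a rough illustration of deriving model parameters from G+C content alone, the sketch below estimates G+C content from a fragment and builds the simplest possible model from it: a zero-order non-coding background. The strand-symmetry assumption (P(G) = P(C), P(A) = P(T)) and all function names are assumptions made for this example; the functions that the heuristic method uses for the coding-region parameters are not reproduced here.

```python
MIN_FRAGMENT_LENGTH = 400  # the text requires a fragment longer than this


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of `seq`."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GC" for b in bases) / len(bases)


def background_from_gc(gc):
    """Zero-order base frequencies implied by G+C content.

    Assumes strand symmetry, so G and C share `gc` equally and A and T
    share the remainder equally.
    """
    if not 0.0 < gc < 1.0:
        raise ValueError("G+C content must lie strictly between 0 and 1")
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


def heuristic_background(fragment):
    """Derive the toy background model from one anonymous fragment."""
    if len(fragment) <= MIN_FRAGMENT_LENGTH:
        raise ValueError("fragment must be longer than 400 nucleotides")
    return background_from_gc(gc_content(fragment))


if __name__ == "__main__":
    fragment = "GC" * 150 + "AT" * 100  # 500 nt, 60% G+C
    print(heuristic_background(fragment))
```

The same pattern, a function from a single G+C value to a full parameter table, is what allows a model to be built for a fragment far too short to train on directly.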

Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. This method can also be used for highly inhomogeneous genomes, where the Markov models must be adjusted to account for local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force of the evolution of codon usage pattern.[citation needed]

Family of gene prediction programs

Bacteria, Archaea and metagenomes

  • GeneMark-P
  • GeneMark.hmm-P
  • GeneMarkS

Eukaryotes

  • GeneMark-E
  • GeneMark.hmm-E
  • GeneMark.hmm-ES

Viruses, phages and plasmids

  • Heuristic approach

EST and cDNA

  • GeneMark-E
