DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. Genes in a eukaryotic genome can be annotated using tools such as FINDER.
For DNA annotation, a previously unknown sequence representation of genetic material is enriched with information relating genomic position to intron-exon boundaries, regulatory sequences, repeats, gene names and protein products. This annotation is stored in genomic databases such as Mouse Genome Informatics, FlyBase, and WormBase. Educational materials on some aspects of biological annotation from the 2006 Gene Ontology annotation camp and similar events are available at the Gene Ontology website.
The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated annotation of database records based on the textual descriptions of those records.
As a more general approach, dcGO provides an automated procedure for statistically inferring associations between ontology terms and protein domains, or combinations of domains, from existing gene- and protein-level annotations.
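The idea of statistically inferring a domain-term association can be illustrated with a one-sided hypergeometric test, which asks whether proteins carrying a given domain are annotated with a given term more often than chance would predict. This is a minimal sketch of one standard test for this kind of question, not a description of dcGO's actual procedure; the function names, protein identifiers, domain name and term name are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total proteins, K: proteins annotated with the term,
    n: proteins containing the domain, k: proteins with both.
    """
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def association_pvalue(domain, term, protein_domains, protein_terms):
    """Test whether a domain is over-represented among proteins with a term."""
    proteins = set(protein_domains)
    with_domain = {p for p in proteins if domain in protein_domains[p]}
    with_term = {p for p in proteins if term in protein_terms.get(p, set())}
    return hypergeom_pvalue(
        len(with_domain & with_term), len(proteins), len(with_term), len(with_domain)
    )


# Hypothetical toy data: 10 proteins, 3 carry the domain, 4 carry the term.
protein_domains = {f"p{i}": set() for i in range(1, 11)}
for p in ("p1", "p2", "p3"):
    protein_domains[p].add("demo_domain")
protein_terms = {p: {"demo_term"} for p in ("p1", "p2", "p4", "p5")}

p_value = association_pvalue("demo_domain", "demo_term", protein_domains, protein_terms)
print(round(p_value, 4))
```

In practice such a test would be run for many domain-term pairs, so a correction for multiple testing would also be needed; that step is omitted here.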
Genome annotation consists of three main steps:
- identifying portions of the genome that do not code for proteins
- identifying elements on the genome, a process called gene prediction
- attaching biological information to these elements
Automatic annotation tools attempt to perform these steps by computer analysis, as opposed to manual annotation (also known as curation), which relies on human expertise. Ideally, the two approaches coexist and complement each other in the same annotation pipeline.
A simple method of gene annotation relies on homology-based search tools, such as BLAST, to search for homologous genes in specific databases; the resulting information is then used to annotate genes and genomes. However, as more information is added to the annotation platform, manual annotators are able to resolve discrepancies between genes that have been given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on curated data sources as well as a range of different software tools in their automated genome annotation pipeline.
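Homology-based annotation transfer can be sketched as follows, assuming BLAST results are available in the standard 12-column tabular format (`-outfmt 6`). The sketch keeps the highest-scoring hit per query and copies that hit's description onto the query gene. The E-value cutoff, the identifiers, and the `known_functions` lookup table are illustrative assumptions; a real pipeline would draw descriptions from a curated database and apply further checks such as alignment coverage.

```python
import csv
import io

# Standard 12 columns of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_tsv, max_evalue=1e-5):
    """Return the highest-bitscore hit per query, ignoring weak hits."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(blast_tsv), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    for row in reader:
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if evalue > max_evalue:
            continue  # hit too weak to support annotation transfer
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


def transfer_annotations(hits, known_functions):
    """Copy the description of each best hit onto the query gene."""
    return {
        query: known_functions.get(hit["subject"], "hypothetical protein")
        for query, hit in hits.items()
    }


# Hypothetical example: two hits for geneA, one weak hit for geneB.
example = (
    "geneA\tref_0001\t92.1\t300\t20\t1\t1\t300\t5\t304\t1e-150\t520\n"
    "geneA\tref_0002\t45.0\t280\t150\t4\t10\t290\t1\t281\t2e-40\t160\n"
    "geneB\tref_0003\t30.0\t60\t40\t2\t1\t60\t100\t159\t0.5\t28.1\n"
)
known_functions = {"ref_0001": "alcohol dehydrogenase", "ref_0002": "oxidoreductase"}
annotations = transfer_annotations(best_hits(example), known_functions)
print(annotations)
```

Here geneB receives no annotation because its only hit fails the E-value cutoff. Cases like this, and genes whose best hits disagree, are the kind that manual curation is meant to resolve.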
Structural annotation consists of the identification of genomic elements:
- ORFs and their localization
- gene structure
- coding regions
- location of regulatory motifs
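The first item in the list above, locating ORFs, can be illustrated with a naive six-frame scan. This minimal sketch assumes an unspliced DNA sequence containing only A, C, G and T, treats ATG as the only start codon, and uses the standard stop codons TAA, TAG and TGA. The function names and the `min_codons` parameter are hypothetical. Real gene predictors use statistical models and handle introns, which this sketch does not.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in all six reading frames.

    Returns (strand, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and always refer to the forward strand.
    min_codons counts the stop codon as part of the ORF.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, start, end))
                        else:
                            # map reverse-strand coordinates back to forward strand
                            orfs.append((strand, n - end, n - start))
                    start = None
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)  # [('+', 2, 14)]
```

The single ORF reported spans `ATGAAATTTTAG` on the forward strand; the reverse strand contains an ATG but no in-frame stop codon after it, so nothing is reported there.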
Functional annotation consists of attaching biological information to genomic elements:
- biochemical function
- biological function
- regulation and interactions involved
These steps may involve both biological experiments and in silico analysis. Proteogenomics-based approaches use information from expressed proteins, often derived from mass spectrometry, to improve genome annotations.
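The relationship between the two kinds of annotation can be illustrated with GFF3, a widely used nine-column, tab-separated format for genomic features: the structural annotation supplies the coordinates of a feature, and functional annotation adds attributes to it. This is a minimal sketch. The helper names and the example feature are hypothetical, `Ontology_term` is an attribute tag reserved by the GFF3 specification, and `product` is a commonly used but non-reserved tag assumed here for the protein description.

```python
def escape_gff3(value):
    """Percent-encode characters that are reserved in GFF3 attribute values.

    Commas are left untouched because GFF3 uses them to separate
    multiple values of one attribute.
    """
    for ch in "%;=&\t\n":  # '%' first, so later escapes are not re-encoded
        value = value.replace(ch, f"%{ord(ch):02X}")
    return value


def add_attributes(gff3_line, new_attrs):
    """Return a GFF3 feature line with extra attributes in column 9."""
    fields = gff3_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 tab-separated columns")
    attrs = {}
    if fields[8] not in ("", "."):
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attrs[key] = value
    for key, value in new_attrs.items():
        attrs[key] = escape_gff3(value)
    fields[8] = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(fields)


# Hypothetical structural annotation: a gene on the forward strand of chr1.
structural = "chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001"

# Functional annotation attached to the same feature (illustrative values).
annotated = add_attributes(
    structural,
    {"product": "DNA-binding protein", "Ontology_term": "GO:0003677"},
)
print(annotated)
```

Keeping coordinates and functional attributes in one record means a single line can answer both where an element lies and what it is thought to do.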
A variety of software tools have been developed to permit scientists to view and share genome annotations; for example, MAKER.
Genome annotation remains a major challenge for scientists investigating the human genome, now that the genome sequences of more than a thousand human individuals (The 100,000 Genomes Project, UK) and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together".
Genome annotation is an active area of investigation. It involves a number of organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of ongoing projects relevant to genome annotation:
- Encyclopedia of DNA Elements (ENCODE)
- Entrez Gene
- Gene Ontology Consortium
- Vertebrate Genome Annotation (Vega) project
At Wikipedia, genome annotation has started to become automated under the auspices of the Gene Wiki portal, which operates a bot that harvests gene data from research databases and creates gene stubs from that data.