Conserved sequence: Difference between revisions

From Wikipedia, the free encyclopedia
Biobeth (talk | contribs)
Started section for bioinfo methods
Conservation can occur in [[Coding region|coding]] and [[Noncoding DNA|non-coding]] nucleic acid sequences. The extent to which a sequence is conserved can be affected by its function and [[Robustness (evolution)|robustness]] to mutation, varying [[Evolutionary_pressure|selection pressures]], [[Population genetics|population size]] and [[Genetic drift|genetic drift]]. Many functional sequences are also [[Modularity (biology)|modular]], containing regions which may be subject to independent [[Evolutionary pressure|selection pressures]], such as [[Protein domain#Domains as evolutionary modules|protein domains]].


==Identifying conserved sequences==
Conserved sequences are typically identified by [[bioinformatics]] approaches based on [[sequence alignment]]. Advances in [[DNA sequencing#High-throughput methods|high-throughput DNA sequencing]] and [[protein mass spectrometry]] have substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.

===Homology search===

Conserved sequences may be identified by [[Homology (biology)|homology]] search, using tools such as [[BLAST]], [[HMMER]] and Infernal.<ref>{{cite journal|last1=Nawrocki|first1=E. P.|last2=Eddy|first2=S. R.|title=Infernal 1.1: 100-fold faster RNA homology searches|journal=Bioinformatics|date=4 September 2013|volume=29|issue=22|pages=2933–2935|doi=10.1093/bioinformatics/btt509}}</ref> Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as [[Hidden Markov model|profile-HMMs]], and RNA covariance models which also incorporate structural information,<ref>{{cite journal|last1=Eddy|first1=SR|last2=Durbin|first2=R|title=RNA sequence analysis using covariance models.|journal=Nucleic acids research|date=11 June 1994|volume=22|issue=11|pages=2079-88|pmid=8029015}}</ref> can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as [[PAM]] and [[BLOSUM]]. High-scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.
===Genome alignment===
Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and [[scalability]] of WGA tools remain limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes.<ref>{{cite journal|last1=Earl|first1=Dent|last2=Nguyen|first2=Ngan|last3=Hickey|first3=Glenn|last4=Harris|first4=Robert S.|last5=Fitzgerald|first5=Stephen|last6=Beal|first6=Kathryn|last7=Seledtsov|first7=Igor|last8=Molodtsov|first8=Vladimir|last9=Raney|first9=Brian J.|last10=Clawson|first10=Hiram|last11=Kim|first11=Jaebum|last12=Kemena|first12=Carsten|last13=Chang|first13=Jia-Ming|last14=Erb|first14=Ionas|last15=Poliakov|first15=Alexander|last16=Hou|first16=Minmei|last17=Herrero|first17=Javier|last18=Kent|first18=William James|last19=Solovyev|first19=Victor|last20=Darling|first20=Aaron E.|last21=Ma|first21=Jian|last22=Notredame|first22=Cedric|last23=Brudno|first23=Michael|last24=Dubchak|first24=Inna|last25=Haussler|first25=David|last26=Paten|first26=Benedict|title=Alignathon: a competitive assessment of whole-genome alignment methods|journal=Genome Research|date=December 2014|volume=24|issue=12|pages=2077–2089|doi=10.1101/gr.174920.114}}</ref> However, WGAs of 30 or more closely related bacteria are now increasingly feasible.<ref>{{cite journal|last1=Rouli|first1=L.|last2=Merhej|first2=V.|last3=Fournier|first3=P.-E.|last4=Raoult|first4=D.|title=The bacterial pangenome as a new tool for analysing pathogenic bacteria|journal=New Microbes and New Infections|date=September 2015|volume=7|pages=72–85|doi=10.1016/j.nmni.2015.06.005}}</ref><ref>{{cite journal|last1=Méric|first1=Guillaume|last2=Yahara|first2=Koji|last3=Mageiros|first3=Leonardos|last4=Pascoe|first4=Ben|last5=Maiden|first5=Martin C. J.|last6=Jolley|first6=Keith A.|last7=Sheppard|first7=Samuel K.|last8=Bereswill|first8=Stefan|title=A Reference Pan-Genome Approach to Comparative Bacterial Genomics: Identification of Novel Epidemiological Markers in Pathogenic Campylobacter|journal=PLoS ONE|date=27 March 2014|volume=9|issue=3|pages=e92798|doi=10.1371/journal.pone.0092798}}</ref>
==Nucleic acid and protein sequences==
Highly conserved DNA sequences are thought to have functional value. The role for many of these highly conserved non-coding DNA sequences is not understood.

Revision as of 03:23, 13 December 2017

A sequence alignment of mammalian histone proteins
Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[1]

In evolutionary biology, conserved sequences are similar or identical sequences in nucleic acids (DNA and RNA) or proteins across species (orthologous sequences) or within a genome (paralogous sequences). Conservation indicates that a sequence has been maintained by natural selection.

A highly conserved sequence is one that has remained relatively unchanged far back up the phylogenetic tree, and hence far back in geological time. Examples of highly conserved sequences include the RNA components of ribosomes present in all domains of life, the homeobox sequences widespread amongst eukaryotes, and the tmRNA in bacteria.

History

The discovery of the role of DNA in inheritance, and observations by Frederick Sanger of variation between animal insulins in 1949,[2] prompted early molecular biologists to study taxonomy from a molecular perspective.[3][4] Studies in the 1960s used DNA hybridization and protein cross-reactivity techniques to measure similarity between known orthologous proteins, such as hemoglobin[5] and cytochrome c.[6] In 1965, Émile Zuckerkandl and Linus Pauling introduced the concept of the molecular clock,[7] proposing that steady rates of mutation could be used to estimate the time since two organisms diverged. While initial phylogenies closely matched the fossil record, observations that some genes appeared to evolve at different rates led to the development of theories of molecular evolution.[8][9] Margaret Dayhoff's 1966 comparison of ferredoxin sequences showed that natural selection would act to conserve and optimise protein sequences essential to life.[10]

Mechanisms

Over many generations, nucleic acid sequences in the genome of an evolutionary lineage can gradually change or erode due to random mutations and deletions.[11][12] Sequences may also recombine or be deleted due to chromosomal rearrangements. Conserved sequences are sequences which persist in the genome despite such forces, and have slower rates of mutation than the background mutation rate.[13]

Conservation can occur in coding and non-coding nucleic acid sequences. The extent to which a sequence is conserved can be affected by its function and robustness to mutation, varying selection pressures, population size and genetic drift. Many functional sequences are also modular, containing regions which may be subject to independent selection pressures, such as protein domains.

Identifying conserved sequences

Conserved sequences are typically identified by bioinformatics approaches based on sequence alignment. Advances in high-throughput DNA sequencing and protein mass spectrometry have substantially increased the availability of protein sequences and whole genomes for comparison since the early 2000s.

Conserved sequences may be identified by homology search, using tools such as BLAST, HMMER and Infernal.[14] Homology search tools may take an individual nucleic acid or protein sequence as input, or use statistical models generated from multiple sequence alignments of known related sequences. Statistical models such as profile-HMMs, and RNA covariance models which also incorporate structural information,[15] can be helpful when searching for more distantly related sequences. Input sequences are then aligned against a database of sequences from related individuals or other species. The resulting alignments are then scored based on the number of matching amino acids or bases, and the number of gaps or deletions generated by the alignment. Acceptable conservative substitutions may be identified using substitution matrices such as PAM and BLOSUM. High-scoring alignments are assumed to be from homologous sequences. The conservation of a sequence may then be inferred by detection of highly similar homologs over a broad phylogenetic range.
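
The scoring step described above can be illustrated with a minimal Python sketch. It is not the scoring scheme of BLAST, HMMER or Infernal: the function name `score_alignment`, the numeric weights, and the optional `conservative` table (a stand-in for a real substitution matrix such as PAM or BLOSUM) are all assumptions chosen for illustration, and every gap column is given the same penalty.

```python
def score_alignment(a, b, match=2, mismatch=-1, gap=-2, conservative=None):
    """Score two already-aligned sequences of equal length ('-' marks a gap).

    All weights are illustrative assumptions, not values used by any real tool.
    `conservative` optionally maps frozenset({x, y}) to a score for a
    substitution treated as acceptable (a toy stand-in for PAM/BLOSUM).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    conservative = conservative or {}
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap              # column containing a gap
        elif x == y:
            score += match            # identical residues or bases
        else:
            # conservative substitutions score better than other mismatches
            score += conservative.get(frozenset((x, y)), mismatch)
    return score


# Hypothetical example: four matches, one mismatch, one gap column.
print(score_alignment("ACGT-A", "ACCTTA"))
```

A real homology search would also estimate the statistical significance of each score against the database, which this sketch leaves out.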

Genome alignment

Whole genome alignments (WGAs) may also be used to identify highly conserved regions across species. Currently the accuracy and scalability of WGA tools remain limited due to the computational complexity of dealing with rearrangements, repeat regions and the large size of many eukaryotic genomes.[16] However, WGAs of 30 or more closely related bacteria are now increasingly feasible.[17][18]

Nucleic acid and protein sequences

Highly conserved DNA sequences are thought to have functional value. The role of many of these highly conserved non-coding DNA sequences is not understood. A common notation to denote the level of sequence conservation is used by the Clustal alignment programs. Below a set of aligned sequences, residue columns are indicated as fully conserved (*), containing only conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[19]
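
As a rough illustration of this notation, the following Python sketch derives a conservation line for a small protein alignment. The function name `conservation_line` is hypothetical, and the "strong" and "weak" residue groups are assumed to match those used by the Clustal programs; they should be checked against the Clustal documentation before being relied upon. Sequences are assumed to be upper-case, of equal length, with '-' marking gaps.

```python
# Residue groups assumed to correspond to Clustal's conservative (':')
# and semi-conservative ('.') categories; verify against the Clustal FAQ.
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(alignment):
    """Return one symbol per column: '*', ':', '.' or ' '."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    symbols = []
    for column in zip(*alignment):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")       # columns with gaps are left unmarked
        elif len(residues) == 1:
            symbols.append("*")       # fully conserved column
        elif any(residues <= set(group) for group in STRONG):
            symbols.append(":")       # only conservative substitutions
        elif any(residues <= set(group) for group in WEAK):
            symbols.append(".")       # semi-conservative substitutions
        else:
            symbols.append(" ")       # non-conservative column
    return "".join(symbols)


# Hypothetical three-sequence alignment.
print(repr(conservation_line(["MKSA", "MRAA", "MKCW"])))
```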

Extreme conservation

Ultra-conserved elements

Ultra-conserved elements or UCEs are sequences that are highly similar or identical across multiple taxonomic groupings. These were first discovered in vertebrates,[20] and have subsequently been identified within widely-differing taxa.[21] While the origin and function of UCEs are poorly understood,[22] they have been used to investigate deep-time divergences in amniotes,[23] insects,[24] and between animals and plants.[25]

Universally conserved genes

The most highly conserved genes are those that can be found in all organisms. These mainly comprise the ncRNAs and proteins required for transcription and translation, which are assumed to have been conserved from the last universal common ancestor of all life.[26]

Genes or gene families that have been found to be universally conserved include GTP-binding elongation factors, methionine aminopeptidase 2, serine hydroxymethyltransferase, and ATP transporters.[27] Components of the transcription machinery, such as RNA polymerase and helicases, and of the translation machinery, such as ribosomal RNAs, tRNAs and ribosomal proteins, are also universally conserved.[28]

GERP scores

A GERP (Genomic Evolutionary Rate Profiling) score measures evolutionary conservation of genetic sequences across species.[29] A sequence's GERP score is related to the proportion of variant alleles within that sequence: as the GERP score increases, variation within the sequence becomes rarer. A higher GERP score signifies a more highly conserved sequence, in which alteration is harmful, so adverse variants would reduce the fitness of the organism and be selected against.
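
The article does not give the formula, but GERP scores are generally described as "rejected substitutions": the number of substitutions expected at a site under neutral evolution minus the number actually observed across the aligned species. The Python sketch below illustrates that arithmetic only. The function names, the per-site numbers and the threshold are hypothetical, and estimating the expected and observed counts from a phylogeny is outside its scope.

```python
def rejected_substitutions(expected, observed):
    """Expected (neutral) minus observed substitutions at one aligned site."""
    return expected - observed


def constrained_sites(expected, observed, threshold=2.0):
    """Indices of sites whose score exceeds a (hypothetical) threshold."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    scores = [rejected_substitutions(e, o) for e, o in zip(expected, observed)]
    return [i for i, s in enumerate(scores) if s > threshold]


# Made-up per-site values: site 0 shows far fewer substitutions than expected,
# so it scores highest and is flagged as constrained.
expected = [5.0, 5.0, 5.0]
observed = [1.0, 4.0, 6.0]
print(constrained_sites(expected, observed))
```

Under this definition a site evolving faster than the neutral expectation receives a negative score.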

Biological role

Sequences are only likely to be highly conserved through geological time if they are required for basic cellular functions (such as coding for vital enzymes), stability, embryonic development or reproduction. Sequence similarity is used as evidence of structural and functional conservation, and of evolutionary relationships between sequences. Consequently, functional elements are frequently identified by searching for conserved sequences in a genome.

Conservation of protein-coding sequences leads to the presence of identical amino acid residues at analogous regions of the protein structure and hence similar function. Conservative mutations alter amino acids to chemically similar residues and so may not affect the protein's function. Among the most highly conserved sequences are the active sites of enzymes and the binding sites of protein receptors.[citation needed]

Conserved non-coding sequences do not encode protein, but often harbour cis-regulatory elements, including the evo-devo gene toolkit. Some deletions of highly conserved sequences in humans (hCONDELs) and other organisms have been suggested to be a potential cause of the anatomical and behavioural differences between humans and other mammals.[30][31] The TATA promoter sequence is an example of a highly conserved DNA sequence found in most eukaryotes.[32]

Applications

Research on conserved genetic sequences has several applications. The detection of similar sequences across the genomes of diverse species can provide useful information regarding the evolutionary history of these species. The examination of conserved sequences can also aid medical research: by identifying rare alleles within conserved sequences, information can be compiled and used to assess risk of disease among humans. Genome-wide association studies (GWAS) compare various alleles across the human genome and their association with risk for particular diseases or ailments.[citation needed]

References

  1. ^ "Clustal FAQ #Symbols". Clustal. Retrieved 8 December 2014.
  2. ^ Sanger, F. (24 September 1949). "Species Differences in Insulins". Nature. 164 (4169): 529. doi:10.1038/164529a0.
  3. ^ Marmur, J; Falkow, S; Mandel, M (October 1963). "New Approaches to Bacterial Taxonomy". Annual Review of Microbiology. 17 (1): 329–372. doi:10.1146/annurev.mi.17.100163.001553.
  4. ^ Pace, N. R.; Sapp, J.; Goldenfeld, N. (17 January 2012). "Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life". Proceedings of the National Academy of Sciences. 109 (4): 1011–1018. doi:10.1073/pnas.1109716109.
  5. ^ Zuckerkandl, Emile; Pauling, Linus B. (1962). "Molecular disease, evolution, and genetic heterogeneity". Horizons in Biochemistry: 189–225.
  6. ^ Margoliash, E (October 1963). "PRIMARY STRUCTURE AND EVOLUTION OF CYTOCHROME C". Proc Natl Acad Sci U S A. 50 (4): 672–679.
  7. ^ Zuckerkandl, E; Pauling, LB (1965). "Evolutionary Divergence and Convergence in Proteins". Evolving Genes and Proteins: 96–166. doi:10.1016/B978-1-4832-2734-4.50017-6.
  8. ^ Marmur, J; Falkow, S; Mandel, M (October 1963). "New Approaches to Bacterial Taxonomy". Annual Review of Microbiology. 17 (1): 329–372. doi:10.1146/annurev.mi.17.100163.001553.
  9. ^ Pace, N. R.; Sapp, J.; Goldenfeld, N. (17 January 2012). "Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life". Proceedings of the National Academy of Sciences. 109 (4): 1011–1018. doi:10.1073/pnas.1109716109.
  10. ^ Eck, R. V.; Dayhoff, M. O. (15 April 1966). "Evolution of the Structure of Ferredoxin Based on Living Relics of Primitive Amino Acid Sequences". Science. 152 (3720): 363–366. doi:10.1126/science.152.3720.363.
  11. ^ Kimura, M (17 February 1968). "Evolutionary Rate at the Molecular Level". Nature. 217 (5129): 624–626. doi:10.1038/217624a0.
  12. ^ King, J. L.; Jukes, T. H. (16 May 1969). "Non-Darwinian Evolution". Science. 164 (3881): 788–798. doi:10.1126/science.164.3881.788.
  13. ^ Kimura, M; Ohta, T (1974). "On Some Principles Governing Molecular Evolution" (PDF). Proc Natl Acad Sci USA. 71 (7): 2848–2852. PMC 388569.
  14. ^ Nawrocki, E. P.; Eddy, S. R. (4 September 2013). "Infernal 1.1: 100-fold faster RNA homology searches". Bioinformatics. 29 (22): 2933–2935. doi:10.1093/bioinformatics/btt509.
  15. ^ Eddy, SR; Durbin, R (11 June 1994). "RNA sequence analysis using covariance models". Nucleic acids research. 22 (11): 2079–88. PMID 8029015.
  16. ^ Earl, Dent; Nguyen, Ngan; Hickey, Glenn; Harris, Robert S.; Fitzgerald, Stephen; Beal, Kathryn; Seledtsov, Igor; Molodtsov, Vladimir; Raney, Brian J.; Clawson, Hiram; Kim, Jaebum; Kemena, Carsten; Chang, Jia-Ming; Erb, Ionas; Poliakov, Alexander; Hou, Minmei; Herrero, Javier; Kent, William James; Solovyev, Victor; Darling, Aaron E.; Ma, Jian; Notredame, Cedric; Brudno, Michael; Dubchak, Inna; Haussler, David; Paten, Benedict (December 2014). "Alignathon: a competitive assessment of whole-genome alignment methods". Genome Research. 24 (12): 2077–2089. doi:10.1101/gr.174920.114.
  17. ^ Rouli, L.; Merhej, V.; Fournier, P.-E.; Raoult, D. (September 2015). "The bacterial pangenome as a new tool for analysing pathogenic bacteria". New Microbes and New Infections. 7: 72–85. doi:10.1016/j.nmni.2015.06.005.
  18. ^ Méric, Guillaume; Yahara, Koji; Mageiros, Leonardos; Pascoe, Ben; Maiden, Martin C. J.; Jolley, Keith A.; Sheppard, Samuel K.; Bereswill, Stefan (27 March 2014). "A Reference Pan-Genome Approach to Comparative Bacterial Genomics: Identification of Novel Epidemiological Markers in Pathogenic Campylobacter". PLoS ONE. 9 (3): e92798. doi:10.1371/journal.pone.0092798.
  19. ^ "Clustal FAQ #Symbols". Clustal. Retrieved 8 December 2014.
  20. ^ Bejerano, G. (28 May 2004). "Ultraconserved Elements in the Human Genome". Science. 304 (5675): 1321–1325. doi:10.1126/science.1098119.
  21. ^ Siepel, A. (1 August 2005). "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes". Genome Research. 15 (8): 1034–1050. doi:10.1101/gr.3715005.
  22. ^ Harmston, N.; Baresic, A.; Lenhard, B. (11 November 2013). "The mystery of extreme non-coding conservation". Philosophical Transactions of the Royal Society B: Biological Sciences. 368 (1632): 20130021. doi:10.1098/rstb.2013.0021.
  23. ^ Faircloth, B. C.; McCormack, J. E.; Crawford, N. G.; Harvey, M. G.; Brumfield, R. T.; Glenn, T. C. (9 January 2012). "Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple Evolutionary Timescales". Systematic Biology. 61 (5): 717–726. doi:10.1093/sysbio/sys004.
  24. ^ Faircloth, Brant C.; Branstetter, Michael G.; White, Noor D.; Brady, Seán G. (May 2015). "Target enrichment of ultraconserved elements from arthropods provides a genomic perspective on relationships among Hymenoptera". Molecular Ecology Resources. 15 (3): 489–501. doi:10.1111/1755-0998.12328.
  25. ^ Reneker, J.; Lyons, E.; Conant, G. C.; Pires, J. C.; Freeling, M.; Shyu, C.-R.; Korkin, D. (10 April 2012). "Long identical multispecies elements in plant and animal genomes". Proceedings of the National Academy of Sciences. 109 (19): E1183–E1191. doi:10.1073/pnas.1121356109.
  26. ^ Isenbarger, Thomas A.; Carr, Christopher E.; Johnson, Sarah Stewart; Finney, Michael; Church, George M.; Gilbert, Walter; Zuber, Maria T.; Ruvkun, Gary (14 October 2008). "The Most Conserved Genome Segments for Life Detection on Earth and Other Planets". Origins of Life and Evolution of Biospheres. 38 (6): 517–533. doi:10.1007/s11084-008-9148-z.
  27. ^ Harris, J. K. (12 February 2003). "The Genetic Core of the Universal Ancestor". Genome Research. 13 (3): 407–412. doi:10.1101/gr.652803.
  28. ^ Ban, Nenad; Beckmann, Roland; Cate, Jamie HD; Dinman, Jonathan D; Dragon, François; Ellis, Steven R; Lafontaine, Denis LJ; Lindahl, Lasse; Liljas, Anders; Lipton, Jeffrey M; McAlear, Michael A; Moore, Peter B; Noller, Harry F; Ortega, Joaquin; Panse, Vikram Govind; Ramakrishnan, V; Spahn, Christian MT; Steitz, Thomas A; Tchorzewski, Marek; Tollervey, David; Warren, Alan J; Williamson, James R; Wilson, Daniel; Yonath, Ada; Yusupov, Marat (February 2014). "A new system for naming ribosomal proteins". Current Opinion in Structural Biology. 24: 165–169. doi:10.1016/j.sbi.2014.01.002.
  29. ^ Genomic Evolutionary Rate Profiling at Sidow Lab
  30. ^ McLean, Cory Y.; et al. (10 March 2011). "Human-specific loss of regulatory DNA and the evolution of human-specific traits". Nature. 471 (7337): 216–219. doi:10.1038/nature09774. PMC 3071156. PMID 21390129.
  31. ^ Gross, Liza (September 2007). "Are "Ultraconserved" Genetic Elements Really Indispensable?". PLOS Biology. 5 (9): e253. doi:10.1371/journal.pbio.0050253. PMC 1964769. PMID 20076686.
  32. ^ Patikoglou, G. A.; Kim, J. L.; Sun, L.; Yang, S.-H.; Kodadek, T.; Burley, S. K. (15 December 1999). "TATA element recognition by the TATA box-binding protein has been conserved throughout evolution". Genes & Development. 13 (24): 3217–3230. doi:10.1101/gad.13.24.3217. PMC 317201. PMID 10617571.