In genomics and related disciplines, noncoding DNA sequences are components of an organism's DNA that do not encode protein sequences. Some noncoding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, ribosomal RNA, and regulatory RNAs), while other noncoding DNA is either not transcribed or gives rise to RNA transcripts of unknown function. The amount of noncoding DNA varies greatly among species. For example, over 98% of the human genome is noncoding DNA, while only about 2% of a typical bacterial genome is noncoding DNA.
Initially, a large proportion of noncoding DNA had no known biological function and was therefore sometimes referred to as "junk DNA", particularly in the lay press. However, it has been known for decades that many noncoding sequences are functional. These include genes for functional RNA molecules (see above) and sequences such as origins of replication, centromeres, and telomeres.
Some sequences may have no biological function for the organism, such as endogenous retroviruses. However, many types of noncoding DNA sequences do have important biological functions, including the transcriptional and translational regulation of protein-coding sequences, origins of DNA replication, centromeres, telomeres, scaffold attachment regions (SARs), genes for functional RNAs, and many others. Other noncoding sequences have likely, but as-yet undetermined, functions. (This is inferred from high levels of sequence similarity seen in different species.)
The Encyclopedia of DNA Elements (ENCODE) project suggested in September 2012 that over 80% of DNA in the human genome "serves some purpose, biochemically speaking". This conclusion, however, has been strongly criticized by other scientists.
- 1 Fraction of noncoding genomic DNA
- 2 Types of noncoding DNA sequences
- 3 Junk DNA
- 4 Functions of noncoding DNA
- 5 Uses of noncoding DNA
- 6 See also
- 7 References
- 8 Further reading
- 9 External links
Fraction of noncoding genomic DNA
The amount of total genomic DNA varies widely between organisms, and the proportion of coding and noncoding DNA within these genomes varies greatly as well. More than 98% of the human genome does not encode protein sequences, including most sequences within introns and most intergenic DNA.
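The proportion described above can be made concrete with a small calculation. The following is a minimal sketch, assuming coding regions are given as half-open (start, end) coordinate pairs; the function names and the toy coordinates are hypothetical and are not taken from any real genome annotation.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double-counting bases.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def noncoding_fraction(genome_length, coding_intervals):
    """Return the fraction of bases not covered by any coding interval."""
    coding_bases = sum(end - start for start, end in merge_intervals(coding_intervals))
    return 1 - coding_bases / genome_length


# Toy 1,000-base "genome" with three annotated coding segments (two overlap).
toy_coding = [(100, 110), (105, 115), (500, 505)]
toy_fraction = noncoding_fraction(1000, toy_coding)
print(f"{toy_fraction:.1%} noncoding")
```

Merging the intervals first matters because overlapping annotations would otherwise be counted twice and understate the noncoding share.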
While overall genome size, and by extension the amount of noncoding DNA, is correlated with organism complexity, there are many exceptions. For example, the genome of the unicellular Polychaos dubium (formerly known as Amoeba dubia) has been reported to contain more than 200 times the amount of DNA in humans. The pufferfish Takifugu rubripes genome is only about one eighth the size of the human genome, yet seems to have a comparable number of genes; approximately 90% of the Takifugu genome is noncoding DNA. In 2013, a new "record" for most efficient genome was reported: Utricularia gibba, a bladderwort plant, has only 3% noncoding DNA. The discovery led project co-lead Victor Albert to declare "At least for a plant, junk DNA really is just junk - it's not required." The extensive variation in nuclear genome size among eukaryotic species is known as the C-value enigma or C-value paradox. Most of the genome size difference appears to lie in the noncoding DNA.
Types of noncoding DNA sequences
Noncoding functional RNA
MicroRNAs are predicted to control the translational activity of approximately 30% of all protein-coding genes in mammals and may be vital components in the progression or treatment of various diseases, including cancer and cardiovascular disease, as well as in the immune system's response to infection.
Cis- and Trans-regulatory elements
Cis-regulatory elements are sequences that control the transcription of a nearby gene. Cis-elements may be located in 5' or 3' untranslated regions or within introns. Trans-regulatory elements control the transcription of a distant gene.
Promoters facilitate the transcription of a particular gene and are typically upstream of the coding region. Enhancer sequences may also exert very distant effects on the transcription levels of genes.
Introns are non-coding sections of a gene, transcribed into the precursor mRNA sequence, but ultimately removed by RNA splicing during the processing to mature messenger RNA. Many introns appear to be mobile genetic elements.
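The removal of introns from a precursor transcript can be illustrated with a short string operation. This is a minimal sketch, assuming intron positions are already known and given as half-open (start, end) index pairs; the function name and the toy sequence are hypothetical, and real splicing is carried out by cellular machinery that recognizes sequence signals rather than supplied coordinates.

```python
def splice(pre_mrna, introns):
    """Remove introns given as half-open (start, end) index pairs and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron itself
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)


# Toy precursor: exon + intron (begins with GU, ends with AG) + exon.
toy_pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "AAAUGA"
toy_mature = splice(toy_pre_mrna, [(6, 18)])
print(toy_mature)
```

The mature sequence contains only the two exons joined end to end; the 12-base intron is absent.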
Studies of group I introns from Tetrahymena protozoans indicate that some introns appear to be selfish genetic elements, neutral to the host because they remove themselves from flanking exons during RNA processing and do not produce an expression bias between alleles with and without the intron. Some introns appear to have significant biological function, possibly through ribozyme functionality that may regulate tRNA and rRNA activity as well as protein-coding gene expression, evident in hosts that have become dependent on such introns over long periods of time; for example, the trnL-intron is found in all green plants and appears to have been vertically inherited for several billions of years, including more than a billion years within chloroplasts and an additional 2–3 billion years prior in the cyanobacterial ancestors of chloroplasts.
Pseudogenes are DNA sequences, related to known genes, that have lost their protein-coding ability or are otherwise no longer expressed in the cell. Pseudogenes arise from retrotransposition or genomic duplication of functional genes, and become "genomic fossils" that are nonfunctional due to mutations that prevent the transcription of the gene, such as within the gene promoter region, or fatally alter the translation of the gene, such as premature stop codons or frameshifts. Pseudogenes resulting from the retrotransposition of an RNA intermediate are known as processed pseudogenes; pseudogenes that arise from the genomic remains of duplicated genes or residues of inactivated genes are nonprocessed pseudogenes.
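The two translation-disrupting defects mentioned above, premature stop codons and frameshifts, can be illustrated on toy sequences. This is a minimal sketch under simplifying assumptions: the input is a DNA coding sequence that starts at the first codon, a length that is not a multiple of three is treated as a frameshift, and the function names and example sequences are hypothetical. It is not a pseudogene detection method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons (DNA form)


def first_stop_codon(cds):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def classify(cds):
    """Label a sequence 'intact', 'premature stop', or 'frameshift' (toy rules)."""
    if len(cds) % 3 != 0:
        return "frameshift"  # assumption: length off-frame means an indel occurred
    stop_index = first_stop_codon(cds)
    last_codon = len(cds) // 3 - 1
    if stop_index is not None and stop_index < last_codon:
        return "premature stop"
    return "intact"


intact = "ATGGCCAAATGGTAA"        # ATG GCC AAA TGG TAA
nonsense = "ATGGCCTAATGGTAA"      # third codon mutated to TAA
shifted = "ATGGCAAATGGTAA"        # one base deleted from the second codon

for name, seq in [("intact", intact), ("nonsense", nonsense), ("shifted", shifted)]:
    print(name, classify(seq))
```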
While Dollo's Law suggests that the loss of function in pseudogenes is likely permanent, silenced genes may actually retain function for several million years and can be "reactivated" into protein-coding sequences. In addition, a substantial number of pseudogenes are actively transcribed. Because pseudogenes are presumed to change without evolutionary constraint, they can serve as a useful model of the type and frequencies of various spontaneous genetic mutations.
Transposons and retrotransposons are mobile genetic elements. Retrotransposon repeated sequences, which include long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), account for a large proportion of the genomic sequences in many species. Alu sequences, classified as a short interspersed nuclear element, are the most abundant mobile elements in the human genome. Some examples have been found of SINEs exerting transcriptional control of some protein-encoding genes.
Endogenous retrovirus sequences are the product of reverse transcription of retrovirus genomes into the genomes of germ cells. Mutation within these retro-transcribed sequences can inactivate the viral genome.
Over 8% of the human genome is made up of (mostly decayed) endogenous retrovirus sequences, as part of the over 42% fraction that is recognizably derived from retrotransposons, while another 3% can be identified as the remains of DNA transposons. Much of the remaining half of the genome that is currently without an explained origin is expected to have originated in transposable elements that were active so long ago (> 200 million years) that random mutations have rendered them unrecognizable. Genome size variation in at least two kinds of plants is mostly the result of retrotransposon sequences.
Junk DNA
The term "junk DNA" became popular in the 1960s. It was formalized in 1972 by Susumu Ohno, who noted that the mutational load from deleterious mutations placed an upper limit on the number of functional loci that could be expected given a typical mutation rate. Ohno predicted that mammal genomes could not have more than 30,000 loci under selection before the "cost" from the mutational load would cause an inescapable decline in fitness, and eventually extinction. This prediction remains robust, with the human genome containing approximately 20,000 genes. Another source for Ohno's theory was the observation that even closely related species can have widely (orders-of-magnitude) different genome sizes, which had been dubbed the C value paradox in 1971.
Junk DNA remains a label for the portions of a genome sequence for which no discernible function has been identified and which, through comparative genomics analysis, appear to be under no functional constraint, suggesting that the sequence itself has provided no adaptive advantage. Since the late 1970s it has become apparent that the majority of non-coding DNA in large genomes originates in the selfish amplification of transposable elements, of which W. Ford Doolittle and Carmen Sapienza wrote in 1980 in the journal Nature: "When a given DNA, or class of DNAs, of unproven phenotypic function can be shown to have evolved a strategy (such as transposition) which ensures its genomic survival, then no other explanation for its existence is necessary." The amount of junk DNA can be expected to depend on the rate of amplification of these elements and the rate at which non-functional DNA is lost. In the same issue of Nature, Leslie Orgel and Francis Crick wrote that junk DNA has "little specificity and conveys little or no selective advantage to the organism". The term is used mainly in popular science and colloquially in scientific publications, and it has occasionally been suggested that its connotations may have delayed interest in the biological functions of noncoding DNA.
Several lines of evidence indicate that some "junk DNA" sequences are likely to have unidentified functional activity and that the process of exaptation of fragments of originally selfish or non-functional DNA has been commonplace throughout evolution. In 2012, the ENCODE project, a research program supported by the National Human Genome Research Institute, reported that 76% of the human genome's noncoding DNA sequences were transcribed and that nearly half of the genome was in some way accessible to genetic regulatory proteins such as transcription factors. However, the suggestion by ENCODE that over 80% of the human genome may be functional has been sharply criticized by other scientists, including Dan Graur of the University of Houston, who argue that neither accessibility of segments of the genome to transcription factors nor their transcription guarantees that those segments have biochemical function and that their transcription is selectively advantageous.
Functions of noncoding DNA
Many noncoding DNA sequences have important biological functions, as indicated by comparative genomics studies that report some regions of noncoding DNA to be highly conserved, sometimes on time-scales representing hundreds of millions of years. This implies that these noncoding regions are under strong evolutionary pressure in the form of purifying (negative) selection, which removes changes to the sequence. For example, in the genomes of humans and mice, which diverged from a common ancestor 65–75 million years ago, protein-coding DNA sequences account for only about 20% of conserved DNA, with the remaining 80% of conserved DNA represented in noncoding regions. Linkage mapping often identifies chromosomal regions associated with a disease with no evidence of functional coding variants of genes within the region, suggesting that disease-causing genetic variants lie in the noncoding DNA. The significance of noncoding DNA mutations in cancer was explored in
According to a comparative study of over 300 prokaryotic and over 30 eukaryotic genomes, eukaryotes appear to require a minimum amount of non-coding DNA. This minimum amount can be predicted using a growth model for regulatory genetic networks, implying that it is required for regulatory purposes. In humans the predicted minimum is about 5% of the total genome.
Protection of the genome
Noncoding DNA separates genes from each other with long gaps, so a mutation in one gene or part of a chromosome, such as a deletion or insertion, does not cause a "frameshift mutation" across the whole chromosome. When genome complexity is relatively high, as in the human genome, gaps exist not only between different genes but also within a single gene, where introns separate the coding segments and so minimise the changes caused by mutation.
Some noncoding DNA sequences are genetic "switches" that regulate when and where genes are expressed.
Regulation of gene expression
Some noncoding DNA sequences determine the expression levels of various genes.
Transcription factor sites
Some noncoding DNA sequences determine where transcription factors attach. A transcription factor is a protein that binds to specific non-coding DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to mRNA. Transcription factors act at very different locations on the genomes of different people.
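The idea that a transcription factor attaches at specific sequences can be illustrated by scanning a DNA string for a short motif on both strands. This is a minimal sketch, assuming exact matching to a single motif; real transcription factors tolerate variation in their binding sites, and the motif, sequence, and function names here are purely illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif):
    """Return (position, strand) for every exact motif match on either strand."""
    hits = []
    rc = reverse_complement(motif)
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        if window == rc and rc != motif:  # avoid double-counting palindromic motifs
            hits.append((i, "-"))
    return hits


toy_sequence = "GGTATAAACCGTTTATACC"  # hypothetical stretch of noncoding DNA
toy_motif = "TATAAA"                  # illustrative motif
toy_hits = find_sites(toy_sequence, toy_motif)
print(toy_hits)
```

Checking the reverse complement matters because a binding site can lie on either strand of the double helix.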
An operator is a segment of DNA to which a repressor binds. A repressor is a DNA-binding protein that regulates the expression of one or more genes by binding to the operator and blocking the attachment of RNA polymerase to the promoter, thus preventing transcription of the genes. This blocking of expression is called repression.
An enhancer is a short region of DNA that can be bound by proteins (trans-acting factors, such as transcription factors) to enhance the transcription levels of genes in a gene cluster.
A silencer is a region of DNA that inactivates gene expression when bound by a regulatory protein. It functions in much the same way as an enhancer, differing only in that it inactivates genes.
A promoter is a region of DNA that facilitates transcription of a particular gene. Promoters are typically located near the genes they regulate.
A genetic insulator is a boundary element that plays two distinct roles in gene expression, either as an enhancer-blocking code, or rarely as a barrier against condensed chromatin. An insulator in a DNA sequence is comparable to a linguistic word divider such as a comma (,) in a sentence, because the insulator indicates where an enhanced or repressed sequence ends.
Uses of noncoding DNA
Noncoding DNA and evolution
Pseudogene sequences appear to accumulate mutations more rapidly than coding sequences due to a loss of selective pressure. This allows for the creation of mutant alleles that incorporate new functions that may be favored by natural selection; thus, pseudogenes can serve as raw material for evolution and can be considered "protogenes".
Long range correlations
A statistical distinction between coding and noncoding DNA sequences has been found. It has been observed that nucleotides in non-coding DNA sequences display long range power law correlations while coding sequences do not.
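One way such correlations have been measured is the "DNA walk" used in the Peng et al. (1992) paper listed in the references: each pyrimidine is a step up, each purine a step down, and the root-mean-square fluctuation F(l) of the walk over windows of length l is fitted to a power law F(l) ~ l^α. An exponent near 0.5 corresponds to an uncorrelated sequence, while a larger value indicates long range correlation. The following is a minimal sketch of that idea; the function names, window sizes, and random test sequence are assumptions for illustration, and it omits refinements used in later analyses.

```python
import math
import random


def dna_walk_steps(sequence):
    """Map pyrimidines (C, T) to +1 and every other symbol (A, G) to -1."""
    return [1 if base in "CT" else -1 for base in sequence]


def fluctuation(steps, window):
    """Root-mean-square fluctuation of the walk displacement over a window length."""
    prefix = [0]
    for step in steps:
        prefix.append(prefix[-1] + step)
    displacements = [prefix[i + window] - prefix[i]
                     for i in range(len(steps) - window + 1)]
    mean = sum(displacements) / len(displacements)
    variance = sum((d - mean) ** 2 for d in displacements) / len(displacements)
    return math.sqrt(variance)


def scaling_exponent(steps, windows):
    """Least-squares slope of log F(l) against log l."""
    xs = [math.log(w) for w in windows]
    ys = [math.log(fluctuation(steps, w)) for w in windows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Uncorrelated random sequence as a baseline (not real genomic data).
random.seed(0)
random_sequence = "".join(random.choice("ACGT") for _ in range(20000))
alpha_random = scaling_exponent(dna_walk_steps(random_sequence), [4, 8, 16, 32, 64])
print(f"alpha for random sequence: {alpha_random:.2f}")
```

Running the same functions on a real noncoding sequence and comparing its exponent with this random baseline is the kind of comparison the cited studies describe.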
"The current standard for forensic DNA testing relies on an analysis of the chromosomes located within the nucleus of all human cells. “The DNA material in chromosomes is composed of ‘coding’ and ‘noncoding’ regions. The coding regions are known as genes and contain the information necessary for a cell to make proteins. . . . Non-protein coding regions . . . are not related directly to making proteins, [and] have been referred to as ‘junk’ DNA.” The adjective “junk” may mislead the lay person, for in fact this is the DNA region used with near certainty to identify a person."
See also
- Conserved non-coding sequence
- Eukaryotic chromosome fine structure
- Gene-centered view of evolution
- Gene regulatory network
- Intergenic region
- Intragenomic conflict
- Phylogenetic footprinting
References
- "Worlds Record Breaking Plant: Deletes its Noncoding "Junk" DNA". Design & Trend. May 12, 2013. Retrieved 2013-06-04.
- Elgar G, Vavouri T (July 2008). "Tuning in to the signals: noncoding sequence conservation in vertebrate genomes". Trends Genet. 24 (7): 344–52. doi:10.1016/j.tig.2008.04.005. PMID 18514361.
- The ENCODE Project Consortium (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57–74. Bibcode:2012Natur.489...57T. doi:10.1038/nature11247. PMC 3439153. PMID 22955616.
- Pennisi, E. (Sep 2012). "Genomics. ENCODE project writes eulogy for junk DNA.". Science 337 (6099): 1159, 1161. doi:10.1126/science.337.6099.1159. PMID 22955811.
- Robin McKie (24 February 2013). "Scientists attacked over claim that 'junk DNA' is vital to life". The Observer.
- Dan Graur, Yichen Zheng, Nicholas Price, Ricardo B. R. Azevedo, Rebecca A. Zufall and Eran Elhaik (2013). "On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE". Genome Biology and Evolution. doi:10.1093/gbe/evt028. PMC 3622293. PMID 23431001.
- Gregory TR, Hebert PD (April 1999). "The modulation of DNA content: proximate causes and ultimate consequences". Genome Res. 9 (4): 317–24. doi:10.1101/gr.9.4.317. PMID 10207154.
- Wahls, W.P., et al. (1990). "Hypervariable minisatellite DNA is a hotspot for homologous recombination in human cells". Cell 60 (1): 95–103. doi:10.1016/0092-8674(90)90719-U. PMID 2295091.
- Pennisi, Elizabeth (2007). "DNA Study Forces Rethink of What It Means to Be a Gene". Science 316 (5831): 1556–7. doi:10.1126/science.316.5831.1556. PMID 17569836.
- Struhl, Kevin (2007). "Transcriptional noise and the fidelity of initiation by RNA polymerase II". Nature Structural & Molecular Biology 14 (2): 103–105. doi:10.1038/nsmb0207-103. PMID 17277804.
- Li M, Marin-Muller C, Bharadwaj U, Chow KH, Yao Q, Chen C (April 2009). "MicroRNAs: Control and Loss of Control in Human Physiology and Disease". World J Surg 33 (4): 667–84. doi:10.1007/s00268-008-9836-x. PMC 2933043. PMID 19030926.
- Visel A, Rubin EM, Pennacchio LA (September 2009). "Genomic Views of Distant-Acting Enhancers". Nature 461 (7261): 199–205. Bibcode:2009Natur.461..199V. doi:10.1038/nature08451. PMC 2923221. PMID 19741700.
- Nielsen H, Johansen SD (2009). "Group I introns: Moving in new directions". RNA Biol 6 (4): 375–83. doi:10.4161/rna.6.4.9334. PMID 19667762.
- Zheng D, Frankish A, Baertsch R, et al. (June 2007). "Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription, and evolution". Genome Res. 17 (6): 839–51. doi:10.1101/gr.5586307. PMC 1891343. PMID 17568002.
- Marshall CR, Raff EC, Raff RA (December 1994). "Dollo's law and the death and resurrection of genes". Proc. Natl. Acad. Sci. U.S.A. 91 (25): 12283–7. Bibcode:1994PNAS...9112283M. doi:10.1073/pnas.91.25.12283. PMC 45421. PMID 7991619.
- Tutar, Y. (2012). "Pseudogenes.". Comp Funct Genomics 2012: 424526. doi:10.1155/2012/424526. PMC 3352212. PMID 22611337.
- Petrov DA, Hartl DL (2000). "Pseudogene evolution and natural selection for a compact genome". J. Hered. 91 (3): 221–7. doi:10.1093/jhered/91.3.221. PMID 10833048.
- Ponicsan SL, Kugel JF, Goodrich JA (February 2010). "Genomic gems: SINE RNAs regulate mRNA production". Current Opinion in Genetics & Development 20 (2): 149–55. doi:10.1016/j.gde.2010.01.004. PMC 2859989. PMID 20176473.
- Häsler J, Samuelsson T, Strub K (July 2007). "Useful 'junk': Alu RNAs in the human transcriptome". Cell. Mol. Life Sci. 64 (14): 1793–800. doi:10.1007/s00018-007-7084-0. PMID 17514354.
- Walters RD, Kugel JF, Goodrich JA (Aug 2009). "InvAluable junk: the cellular impact and function of Alu and B2 RNAs". IUBMB Life 61 (8): 831–7. doi:10.1002/iub.227. PMID 19621349.
- Nelson, PN.; Hooley, P.; Roden, D.; Davari Ejtehadi, H.; Rylance, P.; Warren, P.; Martin, J.; Murray, PG. (Oct 2004). "Human endogenous retroviruses: transposable elements with potential?". Clin Exp Immunol 138 (1): 1–9. doi:10.1111/j.1365-2249.2004.02592.x. PMID 15373898.
- International Human Genome Sequencing Consortium (February 2001). "Initial sequencing and analysis of the human genome". Nature 409 (6822): 879–888. doi:10.1038/35057062. PMID 11237011.
- Piegu, B.; Guyot, R.; Picault, N.; Roulin, A.; Sanyal, A.; Kim, H.; Collura, K. et al. (Oct 2006). "Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice.". Genome Res 16 (10): 1262–9. doi:10.1101/gr.5290206. PMID 16963705.
- Hawkins, JS.; Kim, H.; Nason, JD.; Wing, RA.; Wendel, JF. (Oct 2006). "Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium.". Genome Res 16 (10): 1252–61. doi:10.1101/gr.5282906. PMID 16954538.
- Ehret CF, De Haller G (1963). "Origin, development, and maturation of organelles and organelle systems of the cell surface in Paramecium". Journal of Ultrastructure Research. 9 Supplement 1: 1, 3–42. doi:10.1016/S0022-5320(63)80088-X. PMID 14073743.
- Dan Graur, The Origin of Junk DNA: A Historical Whodunnit
- Ohno, Susumu (1972). H. H. Smith, ed. So Much "Junk" DNA in Our Genome. Gordon and Breach, New York. pp. 366–370. Retrieved 2013-05-15.
- Sean R. Eddy, The C-value paradox, junk DNA, and ENCODE
- Doolittle WF, Sapienza C (1980). "Selfish genes, the phenotype paradigm and genome evolution". Nature 284 (5757): 601–603. Bibcode:1980Natur.284..601D. doi:10.1038/284601a0. PMID 6245369.
- Another source is genome duplication followed by a loss of function due to redundancy.
- Orgel LE, Crick FH (April 1980). "Selfish DNA: the ultimate parasite". Nature 284 (5757): 604–7. Bibcode:1980Natur.284..604O. doi:10.1038/284604a0. PMID 7366731.
- Khajavinia A, Makalowski W (May 2007). "What is "junk" DNA, and what is it worth?". Scientific American 296 (5): 104. doi:10.1038/scientificamerican0307-104. PMID 17503549. "The term "junk DNA" repelled mainstream researchers from studying noncoding genetic material for many years"
- Biémont, Christian; Vieira, C (2006). "Genetics: Junk DNA as an evolutionary force". Nature 443 (7111): 521–4. Bibcode:2006Natur.443..521B. doi:10.1038/443521a. PMID 17024082.
- Ludwig MZ (December 2002). "Functional evolution of noncoding DNA". Current Opinion in Genetics & Development 12 (6): 634–9. doi:10.1016/S0959-437X(02)00355-6. PMID 12433575.
- Cobb J, Büsst C, Petrou S, Harrap S, Ellis J (April 2008). "Searching for functional genetic variants in non-coding DNA". Clin. Exp. Pharmacol. Physiol. 35 (4): 372–5. doi:10.1111/j.1440-1681.2008.04880.x. PMID 18307723.
- E Khurana, Y Fu, V Colonna, XJ Mu, HM Kang, T Lappalainen, A Sboner, L Lochovsky, J Chen, A Harmanci, J Das, A Abyzov, S Balasubramanian, K Beal, D Chakravarty, D Challis, Y Chen, D Clarke, L Clarke, F Cunningham, US Evani, P Flicek, R Fragoza, E Garrison, R Gibbs, ZH Gumus, J Herrero, N Kitabayashi, Y Kong, K Lage, V Liluashvili, SM Lipkin, DG MacArthur, G Marth, D Muzny, TH Pers, GR Ritchie, JA Rosenfeld, C Sisu, X Wei, M Wilson, Y Xue, F Yu, 1000 Genomes Project Consortium, ET Dermitzakis, H Yu, MA Rubin, C Tyler-Smith, M Gerstein (October 2013). "Integrative annotation of variants from 1092 humans: application to cancer genomics.". Science 342 (6154): 1235587. doi:10.1126/science.1235587. PMID 24092746.
- Subirana JA, Messeguer X (March 2010). "The most frequent short sequences in non-coding DNA". Nucleic Acids Res. 38 (4): 1172–81. doi:10.1093/nar/gkp1094. PMC 2831315. PMID 19966278.
- S. E. Ahnert, T. M. A. Fink and A. Zinovyev (2008). "How much non-coding DNA do eukaryotes require?" (PDF). J. Theor. Biol. 252 (4): 587–592. doi:10.1016/j.jtbi.2008.02.005. PMID 18384817.
- Carroll, Sean B., et al. (May 2008). "Regulating Evolution". Scientific American 298 (5): 60–67. doi:10.1038/scientificamerican0508-60. PMID 18444326.
- Callaway, Ewen (March 2010). "Junk DNA gets credit for making us who we are". New Scientist.
- "Plagiarized Errors and Molecular Genetics", talkorigins, by Edward E. Max, M.D., Ph.D.
- Balakirev ES, Ayala FJ (2003). "Pseudogenes: are they "junk" or functional DNA?". Annu. Rev. Genet. 37: 123–51. doi:10.1146/annurev.genet.37.040103.103949. PMID 14616058.
- C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H. E. Stanley (1992). "Long-range correlations in nucleotide sequences". Nature 356 (6365): 168–70. Bibcode:1992Natur.356..168P. doi:10.1038/356168a0. PMID 1301010.
- W. Li and K. Kaneko (1992). "Long-Range Correlation and Partial 1/f^α Spectrum in a Non-Coding DNA Sequence". Europhys. Lett 17: 655–660. Bibcode:1992EL.....17..655L. doi:10.1209/0295-5075/17/7/014.
- S. V. Buldyrev, A. L. Goldberger, S. Havlin, R. N. Mantegna, M. Matsa, C.-K. Peng, M. Simons, and H. E. Stanley (1995). "Long-range correlations properties of coding and noncoding DNA sequences: GenBank analysis". Phys. Rev. E 51 (5): 5084. Bibcode:1995PhRvE..51.5084B. doi:10.1103/PhysRevE.51.5084.
- Slip opinion for Maryland v. King from the U.S. Supreme Court. Retrieved on 2013-06-04.
Further reading
- Bennett, M.D. and I.J. Leitch (2005). "Genome size evolution in plants". In T.R. Gregory (ed.). The Evolution of the Genome. San Diego: Elsevier. pp. 89–162.
- Gregory, T.R (2005). "Genome size evolution in animals". In T.R. Gregory (ed.). The Evolution of the Genome. San Diego: Elsevier. ISBN 0-12-301463-8.
- Shabalina SA, Spiridonov NA (2004). "The mammalian transcriptome and the function of non-coding DNA sequences". Genome Biol. 5 (4): 105. doi:10.1186/gb-2004-5-4-105. PMC 395773. PMID 15059247.
- Castillo-Davis CI (October 2005). "The evolution of noncoding DNA: how much junk, how much func?". Trends Genet. 21 (10): 533–6. doi:10.1016/j.tig.2005.08.001. PMID 16098630.