Non-coding DNA: Difference between revisions

From Wikipedia, the free encyclopedia
→‎Noncoding genes: Added an additional link to the Non-coding RNA article because it discusses junk RNA and spurious transcription..
I edited the GWAS section to make it less biased and more scientifically accurate.
Line 106:
Furthermore, the much lower estimates of functionality prior to ENCODE were based on '''genomic conservation''' estimates across mammalian lineages.<ref name="eddy" /><ref name="doolittle2013" /><ref name="PalazzoGregory2014" /><ref name="graur" /> Widespread transcription and splicing in the human genome have been discussed as another indicator of genetic function in addition to genomic conservation, which may miss poorly conserved functional sequences.<ref name="kellis" /> Furthermore, much of the apparent junk DNA is involved in [[epigenetic]] regulation and appears to be necessary for the development of complex organisms.<ref name="Nessa" /><ref name="extent functionality" /><ref name="Morris Epigenetics" /> '''Genetic approaches''' may miss functional elements that do not manifest physically on the organism, '''evolutionary approaches''' have difficulties using accurate multispecies sequence alignments since genomes of even closely related species vary considerably, and with '''biochemical approaches''', though having high reproducibility, the biochemical signatures do not always automatically signify a function.<ref name="kellis" /> Kellis et al. noted that 70% of the transcription coverage was less than 1 transcript per cell (and may thus be based on spurious background transcription). On the other hand, they argued that a 12–15% fraction of human DNA may be under functional constraint, and that this may still be an underestimate when lineage-specific constraints are included. Ultimately, genetic, evolutionary, and biochemical approaches can all be used in a complementary way to identify regions that may be functional in human biology and disease.<ref name="kellis" /> Some critics have argued that functionality can only be assessed in reference to an appropriate [[null hypothesis]].
In this case, the null hypothesis would be that these parts of the genome are non-functional and have properties, be it on the basis of conservation or biochemical activity, that would be expected of such regions based on our general understanding of [[molecular evolution]] and [[biochemistry]]. According to these critics, until a region in question has been shown to have additional features, beyond what is expected of the null hypothesis, it should provisionally be labelled as non-functional.<ref name="PalazzoLee2015">{{cite journal | vauthors = Palazzo AF, Lee ES | title = Non-coding RNA: what is functional and what is junk? | journal = Frontiers in Genetics | volume = 6 | pages = 2 | year = 2015 | pmid = 25674102 | pmc = 4306305 | doi = 10.3389/fgene.2015.00002 | doi-access = free }}</ref>


==Genome-wide association studies (GWAS) and non-coding DNA==


[[Genome-wide association studies]] (GWAS) identify linkages between alleles and observable traits such as phenotypes and diseases. Most of the associations are between [[Single-nucleotide polymorphisms|single-nucleotide polymorphisms]] (SNPs) and the trait being examined, and most of these SNPs are located in junk DNA. The association establishes a linkage that helps map the DNA region responsible for the trait but it doesn't necessarily identify the mutations responsible for the trait.<ref>{{cite journal | vauthors = Korte A, Farlow A | date = 2013 | title = The advantages and limitations of trait analysis with GWAS: a review | journal = Plant Methods | volume = 9 | pages = 29 | doi = 10.1186/1746-4811-9-29}}</ref><ref name = Manolio>{{cite journal | vauthors = Manolio TA | title = Genomewide association studies and assessment of the risk of disease | journal = The New England Journal of Medicine | volume = 363 | issue = 2 | pages = 166–76 | date = July 2010 | pmid = 20647212 | doi = 10.1056/NEJMra0905980 }}</ref><ref>{{cite journal | vauthors = Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J | date = 2017 | title = 10 Years of GWAS Discovery: Biology, Function, and Translation | journal = American Journal of Human Genetics | volume = 101 | pages = 5–22 | doi = 10.1016/j.ajhg.2017.06.005}}</ref><ref>{{cite journal | vauthors = Gallagher MD, Chen-Plotkin AS | date = 2018 | title = The Post-GWAS Era: From Association to Function | journal = American Journal of Human Genetics | volume = 102 | pages = 717–730 | doi = 10.1016/j.ajhg.2018.04.002}}</ref><ref>{{cite journal | vauthors = Marigorta UM, Rodríguez JA, Gibson G, Navarro A | date = 2018 | title = Replicability and Prediction: Lessons and Challenges from GWAS | journal = Trends in Genetics | volume = 34 | pages = 504–517 | doi = 10.1016/j.tig.2018.03.005}}</ref>
=== Evidence from Polygenic Scores and GWAS ===
SNPs that are tightly linked to traits are the ones most likely to identify a causal mutation. (The association is referred to as tight [[Linkage disequilibrium|linkage disequilibrium]].) About 12% of these polymorphisms are found in coding regions; about 40% are located in introns; and most of the rest are found in intergenic regions, including regulatory sequences.<ref name=Manolio />
[[File:VarianceAccountedFor YongRabenLelloHsu.png|thumb|upright=1.35|The fraction of predictor SNPs in various [[Polygenic score|polygenic risk predictors]] that are within, or close to, protein coding regions; the horizontal axis denotes the inclusion also of SNPs that are within 0-30,000 base pairs from coding regions. These predictors were trained using LASSO.<ref name="Yong2020">{{cite journal | vauthors = Yong SY, Raben TG, Lello L, Hsu SD | title = Genetic architecture of complex traits and disease risk predictors | journal = Scientific Reports | volume = 10 | issue = 1 | pages = 12055 | date = July 2020 | pmid = 32694572 | pmc = 7374622 | doi = 10.1038/s41598-020-68881-8 | bibcode = 2020NatSR..1012055Y }}</ref>]]
[[Genome-wide association study|Genome-wide association studies]] (GWAS) and machine learning analysis of large genomic datasets have led to the construction of [[Polygenic score|polygenic predictors]] for human traits such as height, bone density, and many disease risks. Similar predictors exist for plant and animal species and are used in agricultural breeding.<ref name="Wray2019">{{cite journal | vauthors = Wray NR, Kemper KE, Hayes BJ, Goddard ME, Visscher PM | title = Complex Trait Prediction from Genome Data: Contrasting EBV in Livestock to PRS in Humans: Genomic Prediction | journal = Genetics | volume = 211 | issue = 4 | pages = 1131–1141 | date = April 2019 | pmid = 30967442 | pmc = 6456317 | doi = 10.1534/genetics.119.301859 }}</ref> The detailed genetic architecture of human predictors has been analyzed and significant effects used in prediction are associated with DNA regions far outside coding regions. The fraction of variance accounted for (i.e., fraction of predictive power captured by the predictor) in coding vs. non-coding regions varies widely for different complex traits. For example, atrial fibrillation and coronary artery disease risk are mostly controlled by variants in non-coding regions (non-coding variance fraction over 70 percent), whereas diabetes and high cholesterol display the opposite pattern (non-coding variance roughly 20–30 percent).<ref name="Yong2020"/>
Individual differences between humans are clearly affected in a significant way by non-coding genetic loci, which is strong evidence for functional effects. Whole exome genotypes (i.e., which contain information restricted to coding regions only) do not contain enough information to build or even evaluate polygenic predictors for many well-studied complex traits and disease risks.

In 2013, it was estimated that, in general, up to 85% of GWAS loci have non-coding variants as the likely causal association. The variants are often common in populations and were predicted to affect disease risks through small phenotypic effects, as opposed to the large effects of [[Mendelian inheritance|Mendelian variants]].<ref name=Pennachio2013>{{cite journal | vauthors = Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G | title = Enhancers: five essential questions | journal = Nature Reviews. Genetics | volume = 14 | issue = 4 | pages = 288–295 | date = April 2013 | pmid = 23503198 | pmc = 4445073 | doi = 10.1038/nrg3458 }}</ref>


==Uses==

Revision as of 14:31, 25 May 2022

Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, microRNA, piRNA, ribosomal RNA, and regulatory RNAs). Other functional regions of the non-coding DNA fraction include regulatory sequences that control gene expression; scaffold attachment regions; origins of DNA replication; centromeres; and telomeres. Some regions appear to be mostly nonfunctional such as introns, pseudogenes, intergenic DNA, and fragments of transposons and viruses. These apparently non-functional regions take up most of the genome of many eukaryotes and many scientists think that they are junk DNA.

Fraction of non-coding genomic DNA

Utricularia gibba has only 3% non-coding DNA.[1]

The amount of total genomic DNA varies widely between organisms, and the proportion of coding and non-coding DNA within these genomes varies greatly as well. For example, it was originally suggested that over 98% of the human genome does not encode protein sequences, including most sequences within introns and most intergenic DNA,[2] while 20% of a typical prokaryote genome is non-coding.[3]

In eukaryotes, genome size, and by extension the amount of non-coding DNA, is not correlated with organism complexity, an observation known as the C-value enigma.[4] For example, the genome of the unicellular Polychaos dubium (formerly known as Amoeba dubia) has been reported to contain more than 200 times the amount of DNA in humans.[5] The pufferfish Takifugu rubripes genome is only about one eighth the size of the human genome, yet seems to have a comparable number of genes; approximately 90% of the Takifugu genome is non-coding DNA.[2] Therefore, most of the difference in genome size is not due to variation in the amount of coding DNA; rather, it is due to a difference in the amount of non-coding DNA.[6]

In 2013, a new "record" for the most efficient eukaryotic genome was discovered with Utricularia gibba, a bladderwort plant that has only 3% non-coding DNA and 97% coding DNA. Parts of the non-coding DNA were being deleted by the plant and this suggested that non-coding DNA may not be as critical for plants, even though non-coding DNA is useful for humans.[1] Other studies on plants have discovered crucial functions in portions of non-coding DNA that were previously thought to be negligible and have added a new layer to the understanding of gene regulation.[7]

Types of non-coding DNA sequences

Noncoding genes

There are two types of genes: protein coding genes and noncoding genes.[8] Noncoding genes are an important part of non-coding DNA and they include genes for transfer RNA and ribosomal RNA. These genes were discovered in the 1960s. Prokaryotic genomes contain genes for a number of other noncoding RNAs but noncoding RNA genes are much more common in eukaryotes.

Typical classes of noncoding genes in eukaryotes include genes for small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), microRNAs (miRNAs), short interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), and long noncoding RNAs (lncRNAs). In addition, there are a number of unique RNA genes that produce catalytic RNAs.[9]

Noncoding genes account for only a few percent of prokaryotic genomes[10] but they can represent a vastly higher fraction in eukaryotic genomes.[11] In humans, the noncoding genes take up at least 6% of the genome, largely because there are hundreds of copies of ribosomal RNA genes.[citation needed] Protein-coding genes occupy about 38% of the genome; a fraction that is much higher than the coding region because genes contain large introns.[citation needed]

The total number of noncoding genes in the human genome is controversial. Some scientists think that there are only about 5,000 noncoding genes while others believe that there may be more than 100,000 (see the article on Non-coding RNA). The difference is largely due to debate over the number of lncRNA genes.[12]

Promoters and regulatory elements

Promoters are DNA segments near the 5' end of the gene where transcription begins. They are the sites where RNA polymerase binds to initiate RNA synthesis. Every gene has a noncoding promoter.

Regulatory elements are sites that control the transcription of a nearby gene. They are almost always sequences where transcription factors bind to DNA and these transcription factors can either activate transcription (activators) or repress transcription (repressors). Regulatory elements were discovered in the 1960s and their general characteristics were worked out in the 1970s by studying specific transcription factors in bacteria and bacteriophage.

Promoters and regulatory sequences represent an abundant class of noncoding DNA but they mostly consist of a collection of relatively short sequences so they don't take up a very large fraction of the genome. The exact amount of regulatory DNA in mammalian genomes is unclear because it is difficult to distinguish between spurious transcription factor binding sites and those that are functional. The binding characteristics of typical DNA-binding proteins were characterized in the 1970s and the biochemical properties of transcription factors predict that in cells with large genomes the majority of binding sites will be fortuitous and not biologically functional.

Many regulatory sequences occur near promoters, usually upstream of the transcription start site of the gene. Some occur within a gene and a few are located downstream of the transcription termination site. In eukaryotes, there are some regulatory sequences that are located at a considerable distance from the promoter region. These distant regulatory sequences are often called enhancers but there is no rigorous definition of enhancer that distinguishes it from other transcription factor binding sites.[13][14]

Origins of replication

DNA synthesis begins at specific sites called origins of replication. These are regions of the genome where the DNA replication machinery is assembled and the DNA is unwound to begin DNA synthesis. In most cases, replication proceeds in both directions from the replication origin.

The main features of replication origins are sequences where specific initiation proteins are bound. A typical replication origin covers about 100-200 base pairs of DNA. Prokaryotes have one origin of replication per chromosome or plasmid but there are usually multiple origins in eukaryotic chromosomes. The human genome contains about 100,000 origins of replication representing about 0.3% of the genome.[15][16][17]
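The 0.3% figure can be checked with simple arithmetic. The sketch below is only illustrative: it assumes a haploid human genome of roughly 3.1 billion base pairs (a value not stated in the text) and uses the origin count and the 100-200 bp size range given above.

```python
# Rough check of the fraction of the genome occupied by replication origins.
# GENOME_BP is an assumption (approximate haploid human genome size);
# the origin count and origin sizes come from the paragraph above.
GENOME_BP = 3.1e9
N_ORIGINS = 100_000


def origin_fraction(origin_bp: int) -> float:
    """Return the fraction of the genome covered by all origins combined."""
    return N_ORIGINS * origin_bp / GENOME_BP


for origin_bp in (100, 200):
    print(f"{origin_bp} bp per origin: {origin_fraction(origin_bp):.2%}")
```

Under these assumptions the lower end of the size range gives a value close to 0.3%, and the upper end gives roughly twice that, so the quoted figure corresponds to origins of about 100 bp.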

Centromeres

Centromeres are the sites where spindle fibers attach to newly replicated chromosomes in order to segregate them into daughter cells when the cell divides. Each eukaryotic chromosome has a single functional centromere that's seen as a constricted region in a condensed metaphase chromosome. Centromeric DNA consists of a number of repetitive DNA sequences that often take up a significant fraction of the genome because each centromere can be millions of base pairs in length. In humans, for example, the sequences of all 24 centromeres have been determined[18] and they account for about 6% of the genome. However, it's unlikely that all of this noncoding DNA is essential since there is considerable variation in the total amount of centromeric DNA in different individuals.[19] Centromeres are another example of functional noncoding DNA sequences that have been known for almost half a century and it's likely that they are more abundant than coding DNA.

Telomeres

Telomeres are regions of repetitive DNA at the end of a chromosome, which provide protection from chromosomal deterioration during DNA replication. Recent studies have shown that telomeres function to aid in their own stability. Telomeric repeat-containing RNAs (TERRA) are transcripts derived from telomeres. TERRA has been shown to maintain telomerase activity and lengthen the ends of chromosomes.[20]

Scaffold attachment regions

Both prokaryotic and eukaryotic genomes are organized into large loops of protein-bound DNA. In eukaryotes, the bases of the loops are called scaffold attachment regions (SARs) and they consist of stretches of DNA that bind an RNA/protein complex to stabilize the loop. There are about 100,000 loops in the human genome and each SAR consists of about 100 bp of DNA. The total amount of DNA devoted to SARs accounts for about 0.3% of the human genome.[21]

Introns

Illustration of an unspliced pre-mRNA precursor, with five introns and six exons (top). After the introns have been removed via splicing, the mature mRNA sequence is ready for translation (bottom).

Introns are the parts of a gene that are transcribed into the precursor RNA sequence, but ultimately removed by RNA splicing during the processing to mature RNA. Introns are found in both types of genes: protein-coding genes and noncoding genes. They are present in prokaryotes but they are much more common in eukaryotic genomes.

Group I and group II introns take up only a small percentage of the genome when they are present. Spliceosomal introns (see Figure) are only found in eukaryotes and they can represent a substantial proportion of the genome. In humans, for example, introns in protein-coding genes cover 37% of the genome. Combining that with about 1% coding sequences means that protein-coding genes occupy about 39% of the human genome. The calculations for noncoding genes are more complicated because there's considerable dispute over the total number of noncoding genes but taking only the well-defined examples means that noncoding genes occupy at least 6% of the genome.[22][23]

Thus, genes take up 45% of the human genome and most of this is noncoding DNA in introns.
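The tally behind this statement can be written out explicitly. The following is a minimal sketch that uses only the percentages quoted in the preceding paragraph; the variable names are illustrative.

```python
# Percentages of the human genome, as quoted in the text above.
protein_coding_genes = 39        # introns plus coding sequences
noncoding_genes = 6              # lower bound ("at least 6%")
introns_in_protein_coding = 37   # introns of protein-coding genes only

# Total genic fraction and the share of it made up of these introns.
genes_total = protein_coding_genes + noncoding_genes
intron_share = introns_in_protein_coding / genes_total

print(f"genes: {genes_total}% of the genome")
print(f"introns of protein-coding genes: {intron_share:.0%} of genic DNA")
```

Because the figure for noncoding genes is a lower bound, the total should be read as "at least" this value; introns within noncoding genes are not counted separately here.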

There are good reasons to believe that most of the intron DNA is junk DNA (see the discussion in the separate Wikipedia article on introns).

Pseudogenes

Pseudogenes are mostly former genes that have become non-functional due to mutation but the term also refers to inactive DNA sequences that are derived from RNAs produced by functional genes (processed pseudogenes). Pseudogenes are only a small fraction of noncoding DNA in prokaryotic genomes because they are eliminated by negative selection. In some eukaryotes, however, pseudogenes can accumulate because selection isn't powerful enough to eliminate them (see Nearly neutral theory of molecular evolution).

The human genome contains about 15,000 pseudogenes derived from protein-coding genes and an unknown number derived from noncoding genes.[24] They may cover a substantial fraction of the genome (~5%) since many of them contain former intron sequences.

Pseudogenes are junk DNA by definition and they evolve at the neutral rate as expected for junk DNA.[25] Some former pseudogenes have secondarily acquired a function and this leads some scientists to speculate that most pseudogenes are not junk because they have a yet-to-be-discovered function.[26]

Repeat sequences, transposons and viral elements

Mobile genetic elements in the cell (left) and how they can be acquired (right)

Transposons and retrotransposons are mobile genetic elements. Retrotransposon repeated sequences, which include long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), account for a large proportion of the genomic sequences in many species. Alu sequences, classified as a short interspersed nuclear element, are the most abundant mobile elements in the human genome. Some examples have been found of SINEs exerting transcriptional control of some protein-encoding genes.[27][28][29]

Endogenous retrovirus sequences are the product of reverse transcription of retrovirus genomes into the genomes of germ cells. Mutation within these retro-transcribed sequences can inactivate the viral genome.[30]

Over 8% of the human genome is made up of (mostly decayed) endogenous retrovirus sequences, as part of the over 42% fraction that is recognizably derived from retrotransposons, while another 3% can be identified to be the remains of DNA transposons. Much of the remaining half of the genome that is currently without an explained origin is expected to have found its origin in transposable elements that were active so long ago (> 200 million years) that random mutations have rendered them unrecognizable.[31] Genome size variation in at least two kinds of plants is mostly the result of retrotransposon sequences.[32][33]

Highly repetitive DNA

Highly repetitive DNA consists of short stretches of DNA that are repeated many times in tandem (one after the other). The repeat segments are usually between 2 bp and 10 bp but longer ones are known. Highly repetitive DNA is rare in prokaryotes but common in eukaryotes, especially those with large genomes. It is sometimes called satellite DNA.

Most of the highly repetitive DNA is found in centromeres and telomeres (see above) and most of it is functional although some might be redundant. The other significant fraction resides in short tandem repeats (STRs; also called microsatellites) consisting of short stretches of a simple repeat such as ATC. There are about 350,000 STRs in the human genome and they are scattered throughout the genome with an average length of about 25 repeats.[34][35]

Variations in the number of STR repeats can cause genetic diseases when they lie within a gene, but most of these regions appear to be non-functional junk DNA where the number of repeats can vary considerably from individual to individual. This is why these length differences are used extensively in DNA fingerprinting.
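The idea of comparing repeat counts at an STR locus can be illustrated with a short sketch. The sequences below are made up for illustration, and the function name `longest_tandem_run` is hypothetical; real STR genotyping relies on PCR fragment sizing or sequencing pipelines rather than a simple pattern search.

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical alleles at one STR locus: identical flanking sequence,
# different numbers of the ATC repeat unit.
allele_a = "GGCT" + "ATC" * 7 + "TTGA"
allele_b = "GGCT" + "ATC" * 11 + "TTGA"

print(longest_tandem_run(allele_a, "ATC"))
print(longest_tandem_run(allele_b, "ATC"))
```

Two individuals carrying these alleles would differ by four repeat units at the locus, and it is this kind of length difference, measured across many loci, that fingerprinting compares.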

Junk DNA

The term "junk DNA" became popular in the 1960s.[36][37] According to T. Ryan Gregory, the nature of junk DNA was first discussed explicitly in 1972 by a genomic biologist, David Comings, who applied the term to all non-coding DNA.[38] The term was formalized that same year by Susumu Ohno,[6] who noted that the mutational load from deleterious mutations placed an upper limit on the number of functional loci that could be expected given a typical mutation rate. Ohno hypothesized that mammal genomes could not have more than 30,000 loci under selection before the "cost" from the mutational load would cause an inescapable decline in fitness, and eventually extinction. This prediction remains robust, with the human genome containing approximately 20,000 protein-coding genes[citation needed]. Another source for Ohno's theory was the observation that even closely related species can have widely (orders-of-magnitude) different genome sizes, which had been dubbed the C-value paradox in 1971.[39]

The term "junk DNA" has been questioned on the grounds that it provokes a strong a priori assumption of total non-functionality and some have recommended using more neutral terminology such as "non-coding DNA" instead.[38] Yet "junk DNA" remains a label for the portions of a genome sequence for which no discernible function has been identified and that through comparative genomics analysis appear under no functional constraint suggesting that the sequence itself has provided no adaptive advantage.

Since the late 1970s it has become apparent that the majority of non-coding DNA in large genomes finds its origin in the selfish amplification of transposable elements, of which W. Ford Doolittle and Carmen Sapienza in 1980 wrote in the journal Nature: "When a given DNA, or class of DNAs, of unproven phenotypic function can be shown to have evolved a strategy (such as transposition) which ensures its genomic survival, then no other explanation for its existence is necessary."[40] The amount of junk DNA can be expected to depend on the rate of amplification of these elements and the rate at which non-functional DNA is lost.[41] In the same issue of Nature, Leslie Orgel and Francis Crick wrote that junk DNA has "little specificity and conveys little or no selective advantage to the organism".[42] The term occurs mainly in popular science and in a colloquial way in scientific publications, and it has been suggested that its connotations may have delayed interest in the biological functions of non-coding DNA.[43]

Some evidence indicates that some "junk DNA" sequences are sources for (future) functional activity in evolution through exaptation of originally selfish or non-functional DNA.[44]

Since the 1990s, the degree of non-functionality has been disputed among scientists since epigenetic and regulatory functions have been discovered,[45] and the ENCODE project results in 2012 provided insight into what these non-coding parts of DNA do.[46]

ENCODE Project

The Encyclopedia of DNA Elements (ENCODE) project uncovered, by direct biochemical approaches, that at least 80% of human genomic DNA has biochemical activity such as "transcription, transcription factor association, chromatin structure, and histone modification".[47] Though this was not necessarily unexpected due to previous decades of research discovering many functional non-coding regions,[3][45] some scientists criticized the conclusion for conflating biochemical activity with biological function.[48][39][49][50][51] Estimates for the biologically functional fraction of the human genome based on comparative genomics range between 8 and 15%.[52][53][54] However, others have argued against relying solely on estimates from comparative genomics due to its limited scope since non-coding DNA has been found to be involved in epigenetic activity and complex networks of genetic interactions and is explored in evolutionary developmental biology.[45][53][55][56] One consistent indication of biological functionality of a genomic region is if the sequence of that genomic region was maintained by purifying selection (or if mutating away the sequence is deleterious to the organism). Under this definition, 90% of the genome is 'junk'. However, some stress that 'junk' is not 'garbage'[57] and the large body of nonfunctional transcripts produced by 'junk DNA' can evolve functional elements de novo.[58][59]

The meaning of the results has been disputed by other scientists,[48] who argue that neither accessibility of segments of the genome to transcription factors nor their transcription guarantees that those segments have biochemical function and that their transcription is selectively advantageous. After all, non-functional sections of the genome can be transcribed, given that transcription factors typically bind to short sequences that are found (randomly) all over the whole genome.[60]

Furthermore, the much lower estimates of functionality prior to ENCODE were based on genomic conservation estimates across mammalian lineages.[39][49][50][51] Widespread transcription and splicing in the human genome have been discussed as another indicator of genetic function in addition to genomic conservation, which may miss poorly conserved functional sequences.[53] Furthermore, much of the apparent junk DNA is involved in epigenetic regulation and appears to be necessary for the development of complex organisms.[45][55][56] Genetic approaches may miss functional elements that do not manifest physically on the organism, evolutionary approaches have difficulties using accurate multispecies sequence alignments since genomes of even closely related species vary considerably, and with biochemical approaches, though having high reproducibility, the biochemical signatures do not always automatically signify a function.[53] Kellis et al. noted that 70% of the transcription coverage was less than 1 transcript per cell (and may thus be based on spurious background transcription). On the other hand, they argued that a 12–15% fraction of human DNA may be under functional constraint, and that this may still be an underestimate when lineage-specific constraints are included. Ultimately, genetic, evolutionary, and biochemical approaches can all be used in a complementary way to identify regions that may be functional in human biology and disease.[53] Some critics have argued that functionality can only be assessed in reference to an appropriate null hypothesis. In this case, the null hypothesis would be that these parts of the genome are non-functional and have properties, be it on the basis of conservation or biochemical activity, that would be expected of such regions based on our general understanding of molecular evolution and biochemistry. According to these critics, until a region in question has been shown to have additional features, beyond what is expected of the null hypothesis, it should provisionally be labelled as non-functional.[61]

Genome-wide association studies (GWAS) and non-coding DNA

Genome-wide association studies (GWAS) identify linkages between alleles and observable traits (phenotypes), including diseases. Most of the associations are between single-nucleotide polymorphisms (SNPs) and the trait being examined, and most of these SNPs are located in junk DNA. The association establishes a linkage that helps map the DNA region responsible for the trait, but it does not necessarily identify the mutations responsible for the trait.[62][63][64][65][66]

SNPs that are tightly linked to traits are the ones most likely to identify a causal mutation. (The association is referred to as tight linkage disequilibrium.) About 12% of these polymorphisms are found in coding regions; about 40% are located in introns; and most of the rest are found in intergenic regions, including regulatory sequences.[63]
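
The idea of tight linkage disequilibrium between two variants can be made concrete with a small calculation. The sketch below uses the standard coefficient D and the squared correlation r² for two biallelic loci; the cited sources do not specify which measure they use, so this choice, the function name `linkage_disequilibrium`, and the example frequencies are assumptions made for illustration.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("both loci must be polymorphic (allele frequencies strictly between 0 and 1)")
    # r^2 rescales D^2 to lie between 0 (independent) and 1 (perfectly correlated).
    return d, d * d / denominator


if __name__ == "__main__":
    # Hypothetical frequencies: a marker SNP and a causal variant that
    # almost always occur together on the same haplotype.
    d, r2 = linkage_disequilibrium(p_ab=0.28, p_a=0.30, p_b=0.30)
    print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

An r² close to 1 means the marker SNP is a near-perfect proxy for the other variant, which is why a strongly associated SNP can point to a region without itself being the causal mutation.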

The fraction of predictor SNPs in various polygenic risk predictors that are within, or close to, protein coding regions; the horizontal axis denotes the inclusion also of SNPs that are within 0-30,000 base pairs from coding regions. These predictors were trained using LASSO.[67]

Uses

Evolution

Shared sequences of apparently non-functional DNA are a major line of evidence of common descent.[68]

Pseudogene sequences appear to accumulate mutations more rapidly than coding sequences due to a loss of selective pressure.[69] This allows for the creation of mutant alleles that incorporate new functions that may be favored by natural selection; thus, pseudogenes can serve as raw material for evolution and can be considered "protogenes".[70]

A study published in 2019 shows that new genes can be fashioned from non-coding regions, a process termed de novo gene birth.[71] Some studies suggest that at least one-tenth of genes could be made in this way.[71]

See also

References

  1. ^ a b "Worlds Record Breaking Plant: Deletes its Noncoding "Junk" DNA". Design & Trend. May 12, 2013. Retrieved 2013-06-04.
  2. ^ a b Elgar G, Vavouri T (July 2008). "Tuning in to the signals: noncoding sequence conservation in vertebrate genomes". Trends in Genetics. 24 (7): 344–352. doi:10.1016/j.tig.2008.04.005. PMID 18514361.
  3. ^ a b Costa F (2012). "7 Non-coding RNAs, Epigenomics, and Complexity in Human Cells". In Morris KV (ed.). Non-coding RNAs and Epigenetic Regulation of Gene Expression: Drivers of Natural Selection. Caister Academic Press. ISBN 978-1904455943.
  4. ^ Thomas CA (1971). "The genetic organization of chromosomes". Annual Review of Genetics. 5: 237–256. doi:10.1146/annurev.ge.05.120171.001321. PMID 16097657.
  5. ^ Gregory TR, Hebert PD (April 1999). "The modulation of DNA content: proximate causes and ultimate consequences". Genome Research. 9 (4): 317–324. doi:10.1101/gr.9.4.317. PMID 10207154. S2CID 16791399.
  6. ^ a b Ohno S (1972). Smith HH (ed.). "So much "junk" DNA in our genome". Brookhaven Symposia in Biology. 23. Gordon and Breach, New York: 366–370. PMID 5065367. Retrieved 2013-05-15.
  7. ^ Waterhouse PM, Hellens RP (April 2015). "Plant biology: Coding in non-coding RNAs". Nature. 520 (7545): 41–42. Bibcode:2015Natur.520...41W. doi:10.1038/nature14378. PMID 25807488. S2CID 205243381.
  8. ^ Kampourakis K (2017). Making sense of genes. Cambridge UK: Cambridge University Press. ISBN 978-1-107-12813-2.
  9. ^ Cech TR, Steitz JA (2014). "The Noncoding RNA Revolution - Trashing Old Rules to Forge New Ones". Cell. 157 (1): 77–94. doi:10.1016/j.cell.2014.03.008. PMID 24679528. S2CID 14852160.
  10. ^ Rogozin, I. B. (1 October 2002). "Congruent evolution of different classes of non-coding DNA in prokaryotic genomes". Nucleic Acids Research. 30 (19): 4264–4271. doi:10.1093/nar/gkf549. PMC 140549. PMID 12364605.
  11. ^ Bielawski, J. P.; Jones, C. (1 January 2016). "Adaptive Molecular Evolution: Detection Methods". Encyclopedia of Evolutionary Biology: 16–25. doi:10.1016/B978-0-12-800049-6.00171-2. ISBN 9780128004265.
  12. ^ Ponting CP, Haerty W (2022). "Genome-Wide Analysis of Human Long Noncoding RNAs: A Provocative Review". Annual Review of Genomics and Human Genetics. 23. doi:10.1146/annurev-genom-112921-123710. PMID 35395170. S2CID 248049706.
  13. ^ Compe E, Egly JM (2021). "The Long Road to Understanding RNAPII Transcription Initiation and Related Syndromes". Annual Review of Biochemistry. 90: 193–219. doi:10.1146/annurev-biochem-090220-112253. PMID 34153211. S2CID 235595550.
  14. ^ Visel A, Rubin EM, Pennacchio LA (September 2009). "Genomic views of distant-acting enhancers". Nature. 461 (7261): 199–205. Bibcode:2009Natur.461..199V. doi:10.1038/nature08451. PMC 2923221. PMID 19741700.
  15. ^ Leonard AC, Méchali M (2013). "DNA replication origins". Cold Spring Harbor Perspectives in Biology. 5 (10): a010116. doi:10.1101/cshperspect.a010116. PMC 3783049. PMID 23838439.
  16. ^ Urban JM, Foulk MS, Casella C, Gerbi SA (2015). "The hunt for origins of DNA replication in multicellular eukaryotes". F1000Prime Reports. 7: 30. doi:10.12703/P7-30. PMC 4371235. PMID 25926981.
  17. ^ Prioleau M, MacAlpine DM (2016). "DNA replication origins—where do we begin?". Genes & Development. 30 (15): 1683–1697. doi:10.1101/gad.285114.116. PMC 5002974. PMID 27542827.
  18. ^ Altemose N, Logsdon GA, Bzikadze AV, Sidhwani P, Langley SA, Caldas GV, et al. (2021). "Complete genomic and epigenetic maps of human centromeres". Science. 376 (6588): 56. doi:10.1126/science.abl4178. PMID 35357911. S2CID 247853627.
  19. ^ Miga KH (2019). "Centromeric satellite DNAs: hidden sequence variation in the human population". Genes. 10 (5): 352. doi:10.3390/genes10050352. PMC 6562703. PMID 31072070.
  20. ^ Cusanelli E, Chartrand P (May 2014). "Telomeric noncoding RNA: telomeric repeat-containing RNA in telomere biology". Wiley Interdisciplinary Reviews. RNA. 5 (3): 407–419. doi:10.1002/wrna.1220. PMID 24523222. S2CID 36918311.
  21. ^ Misteli T (2020). "The self-organizing genome: Principles of genome architecture and function". Cell. 183 (1): 28–45. doi:10.1016/j.cell.2020.09.014. PMC 7541718. PMID 32976797.
  22. ^ Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S (2012). "GENCODE: the reference human genome annotation for The ENCODE Project". Genome Research. 22 (9): 1760–1774. doi:10.1101/gr.135350.111. PMC 3431492. PMID 22955987.
  23. ^ Piovesan A, Antonaros F, Vitale L, Strippoli P, Pelleri MC, Caracausi M (2019). "Human protein-coding genes and gene feature statistics in 2019". BMC Research Notes. 12 (1): 315. doi:10.1186/s13104-019-4343-8. PMC 6549324. PMID 31164174.
  24. ^ "Ensembl Human reference genome GRCh38.p13".
  25. ^ Xu J, Zhang J (2015). "Are human translated pseudogenes functional?". Molecular Biology and Evolution. 33 (3): 755–760. doi:10.1093/molbev/msv268. PMC 5009996. PMID 26589994.
  26. ^ Wen YZ, Zheng LL, Qu LH, Ayala FJ, Lun ZR (2012). "Pseudogenes are not pseudo any more". RNA Biology. 9 (1): 27–32. doi:10.4161/rna.9.1.18277. PMID 22258143. S2CID 13161678.
  27. ^ Ponicsan SL, Kugel JF, Goodrich JA (April 2010). "Genomic gems: SINE RNAs regulate mRNA production". Current Opinion in Genetics & Development. 20 (2): 149–155. doi:10.1016/j.gde.2010.01.004. PMC 2859989. PMID 20176473.
  28. ^ Häsler J, Samuelsson T, Strub K (July 2007). "Useful 'junk': Alu RNAs in the human transcriptome". Cellular and Molecular Life Sciences (Submitted manuscript). 64 (14): 1793–1800. doi:10.1007/s00018-007-7084-0. PMID 17514354. S2CID 5938630.
  29. ^ Walters RD, Kugel JF, Goodrich JA (August 2009). "InvAluable junk: the cellular impact and function of Alu and B2 RNAs". IUBMB Life. 61 (8): 831–837. doi:10.1002/iub.227. PMC 4049031. PMID 19621349.
  30. ^ Nelson PN, Hooley P, Roden D, Davari Ejtehadi H, Rylance P, Warren P, et al. (October 2004). "Human endogenous retroviruses: transposable elements with potential?". Clinical and Experimental Immunology. 138 (1): 1–9. doi:10.1111/j.1365-2249.2004.02592.x. PMC 1809191. PMID 15373898.
  31. ^ Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. (February 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. Bibcode:2001Natur.409..860L. doi:10.1038/35057062. PMID 11237011.
  32. ^ Piegu B, Guyot R, Picault N, Roulin A, Sanyal A, Saniyal A, et al. (October 2006). "Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice". Genome Research. 16 (10): 1262–1269. doi:10.1101/gr.5290206. PMC 1581435. PMID 16963705.
  33. ^ Hawkins JS, Kim H, Nason JD, Wing RA, Wendel JF (October 2006). "Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium". Genome Research. 16 (10): 1252–1261. doi:10.1101/gr.5282906. PMC 1581434. PMID 16954538.
  34. ^ Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, Daly MJ, Price AL, Pritchard JK, Sharp AJ, Erlich Y (2016). "Abundant contribution of short tandem repeats to gene expression variation in humans". Nature Genetics. 48: 22–29. doi:10.1038/ng.3461.
  35. ^ Kronenberg ZN, Fiddes IT, Gordon D, Murali S, Cantsilieris S, Meyerson OS, Underwood JG, Nelson BJ, Chaisson MJ, Dougherty ML (2018). "High-resolution comparative analysis of great ape genomes". Science. 360: 1085. doi:10.1126/science.aar6343.
  36. ^ Ehret CF, De Haller G (October 1963). "Origin, development, and maturation of organelles and organelle systems of the cell surface in Paramecium". Journal of Ultrastructure Research. 23: SUPPL6:1–SUPPL642. doi:10.1016/S0022-5320(63)80088-X. PMID 14073743.
  37. ^ Dan Graur, The Origin of Junk DNA: A Historical Whodunnit
  38. ^ a b TR, ed. (2005). The Evolution of the Genome. Elsevier. pp. 29–31. ISBN 978-0123014634. "Comings (1972), on the other hand, gave what must be considered the first explicit discussion of the nature of "junk DNA," and was the first to apply the term to all non-coding DNA."; "For this reason, it is unlikely that any one function for non-coding DNA can account for either its sheer mass or its unequal distribution among taxa. However, dismissing it as no more than "junk" in the pejorative sense of "useless" or "wasteful" does little to advance the understanding of genome evolution. For this reason, the far less loaded term "noncoding DNA" is used throughout this chapter and is recommended in preference to "junk DNA" for future treatments of the subject."
  39. ^ a b c Eddy SR (November 2012). "The C-value paradox, junk DNA and ENCODE". Current Biology. 22 (21): R898–R899. doi:10.1016/j.cub.2012.10.002. PMID 23137679. S2CID 28289437.
  40. ^ Doolittle WF, Sapienza C (April 1980). "Selfish genes, the phenotype paradigm and genome evolution". Nature. 284 (5757): 601–603. Bibcode:1980Natur.284..601D. doi:10.1038/284601a0. PMID 6245369. S2CID 4311366.
  41. ^ Another source is genome duplication followed by a loss of function due to redundancy.
  42. ^ Orgel LE, Crick FH (April 1980). "Selfish DNA: the ultimate parasite". Nature. 284 (5757): 604–607. Bibcode:1980Natur.284..604O. doi:10.1038/284604a0. PMID 7366731. S2CID 4233826.
  43. ^ Khajavinia A, Makalowski W (May 2007). "What is "junk" DNA, and what is it worth?". Scientific American. 296 (5): 104. Bibcode:2007SciAm.296c.104.. doi:10.1038/scientificamerican0307-104. PMID 17503549. The term "junk DNA" repelled mainstream researchers from studying noncoding genetic material for many years
  44. ^ Biémont C, Vieira C (October 2006). "Genetics: junk DNA as an evolutionary force". Nature. 443 (7111): 521–524. Bibcode:2006Natur.443..521B. doi:10.1038/443521a. PMID 17024082. S2CID 205033991.
  45. ^ a b c d Carey M (2015). Junk DNA: A Journey Through the Dark Matter of the Genome. Columbia University Press. ISBN 9780231170840.
  46. ^ Pennisi E (September 2012). "Genomics. ENCODE project writes eulogy for junk DNA". Science. 337 (6099): 1159, 1161. doi:10.1126/science.337.6099.1159. PMID 22955811.
  47. ^ Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, et al. (The ENCODE Project Consortium) (September 2012). "An integrated encyclopedia of DNA elements in the human genome". Nature. 489 (7414): 57–74. Bibcode:2012Natur.489...57T. doi:10.1038/nature11247. PMC 3439153. PMID 22955616.
  48. ^ a b McKie R (24 February 2013). "Scientists attacked over claim that 'junk DNA' is vital to life". The Observer.
  49. ^ a b Doolittle WF (April 2013). "Is junk DNA bunk? A critique of ENCODE". Proceedings of the National Academy of Sciences of the United States of America. 110 (14): 5294–5300. Bibcode:2013PNAS..110.5294D. doi:10.1073/pnas.1221376110. PMC 3619371. PMID 23479647.
  50. ^ a b Palazzo AF, Gregory TR (May 2014). "The case for junk DNA". PLOS Genetics. 10 (5): e1004351. doi:10.1371/journal.pgen.1004351. PMC 4014423. PMID 24809441.
  51. ^ a b Graur D, Zheng Y, Price N, Azevedo RB, Zufall RA, Elhaik E (2013). "On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE". Genome Biology and Evolution. 5 (3): 578–590. doi:10.1093/gbe/evt028. PMC 3622293. PMID 23431001.
  52. ^ Ponting CP, Hardison RC (November 2011). "What fraction of the human genome is functional?". Genome Research. 21 (11): 1769–1776. doi:10.1101/gr.116814.110. PMC 3205562. PMID 21875934.
  53. ^ a b c d e Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, et al. (April 2014). "Defining functional DNA elements in the human genome". Proceedings of the National Academy of Sciences of the United States of America. 111 (17): 6131–6138. Bibcode:2014PNAS..111.6131K. doi:10.1073/pnas.1318948111. PMC 4035993. PMID 24753594.
  54. ^ Rands CM, Meader S, Ponting CP, Lunter G (July 2014). "8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage". PLOS Genetics. 10 (7): e1004525. doi:10.1371/journal.pgen.1004525. PMC 4109858. PMID 25057982.
  55. ^ a b Liu G, Mattick JS, Taft RJ (July 2013). "A meta-analysis of the genomic and transcriptomic composition of complex life". Cell Cycle. 12 (13): 2061–2072. doi:10.1186/1877-6566-7-2. PMC 4685169. PMID 23759593.
  56. ^ a b Morris K, ed. (2012). Non-Coding RNAs and Epigenetic Regulation of Gene Expression: Drivers of Natural Selection. Norfolk, UK: Caister Academic Press. ISBN 978-1904455943.
  57. ^ Brenner S (1998-09-24). "Refuge of spandrels". Current Biology. 8 (19): R669. doi:10.1016/S0960-9822(98)70427-0. PMID 9776723. S2CID 2918533.
  58. ^ Palazzo AF, Koonin EV (November 2020). "Functional Long Non-coding RNAs Evolve from Junk Transcripts". Cell. 183 (5): 1151–1161. doi:10.1016/j.cell.2020.09.047. PMID 33068526. S2CID 222815635.
  59. ^ Graur D, Zheng Y, Azevedo RB (January 2015). "An evolutionary classification of genomic function". Genome Biology and Evolution. 7 (3): 642–645. doi:10.1093/gbe/evv021. PMC 5322545. PMID 25635041.
  60. ^ Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M, et al. (February 2018). "The Human Transcription Factors". Cell. 172 (4): 650–665. doi:10.1016/j.cell.2018.01.029. PMID 29425488. S2CID 3599827.
  61. ^ Palazzo AF, Lee ES (2015). "Non-coding RNA: what is functional and what is junk?". Frontiers in Genetics. 6: 2. doi:10.3389/fgene.2015.00002. PMC 4306305. PMID 25674102.
  62. ^ Korte A, Farlow A (2013). "The advantages and limitations of trait analysis with GWAS: a review". Plant Methods. 9: 29. doi:10.1186/1746-4811-9-29.
  63. ^ a b Manolio TA (July 2010). "Genomewide association studies and assessment of the risk of disease". The New England Journal of Medicine. 363 (2): 166–76. doi:10.1056/NEJMra0905980. PMID 20647212.
  64. ^ Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J (2017). "10 Years of GWAS Discovery: Biology, Function, and Translation". American Journal of Human Genetics. 101: 5–22. doi:10.1016/j.ajhg.2017.06.005.
  65. ^ Gallagher MD, Chen-Plotkin AS (2018). "The Post-GWAS Era: From Association to Function". American Journal of Human Genetics. 102: 717–730. doi:10.1016/j.ajhg.2018.04.002.
  66. ^ Marigorta UM, Rodríguez JA, Gibson G, Navarro A (2018). "Replicability and Prediction: Lessons and Challenges from GWAS". Trends in Genetics. 34: 504–517. doi:10.1016/j.tig.2018.03.005.
  67. ^ Yong SY, Raben TG, Lello L, Hsu SD (July 2020). "Genetic architecture of complex traits and disease risk predictors". Scientific Reports. 10 (1): 12055. Bibcode:2020NatSR..1012055Y. doi:10.1038/s41598-020-68881-8. PMC 7374622. PMID 32694572.
  68. ^ "Plagiarized Errors and Molecular Genetics", talkorigins, by Edward E. Max, M.D., Ph.D.
  69. ^ Petrov DA, Hartl DL (2000). "Pseudogene evolution and natural selection for a compact genome". The Journal of Heredity. 91 (3): 221–227. doi:10.1093/jhered/91.3.221. PMID 10833048.
  70. ^ Balakirev ES, Ayala FJ (2003). "Pseudogenes: are they "junk" or functional DNA?". Annual Review of Genetics. 37: 123–151. doi:10.1146/annurev.genet.37.040103.103949. PMID 14616058. S2CID 24683075.
  71. ^ a b Levy A (October 2019). "How evolution builds genes from scratch". Nature. 574 (7778): 314–316. Bibcode:2019Natur.574..314L. doi:10.1038/d41586-019-03061-x. PMID 31619796. S2CID 204707405.

Further reading

External links