Graphical representation of the idealized human diploid karyotype, showing the organization of the genome into chromosomes. This drawing shows both the female (XX) and male (XY) versions of the 23rd chromosome pair. Chromosomes are shown aligned at their centromeres. The mitochondrial DNA is not shown.
|NCBI genome ID|
|Genome size||3,234.83 Mb (Mega-basepairs)|
|Number of chromosomes||23 pairs|
The human genome is the complete set of nucleic acid sequences for humans (Homo sapiens), encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. Human genomes include both protein-coding DNA genes and noncoding DNA. Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization creates a zygote), consist of three billion DNA base pairs, while diploid genomes (found in somatic cells) have twice the DNA content. While there are significant differences among the genomes of human individuals (on the order of 0.1%), these are considerably smaller than the differences between humans and their closest living relatives, the chimpanzees (approximately 4%) and bonobos.
The Human Genome Project produced the first complete sequences of individual human genomes, with the first draft sequence and initial analysis being published on February 12, 2001. The human genome was the first of all vertebrates to be completely sequenced. As of 2012, thousands of human genomes have been completely sequenced, and many more have been mapped at lower levels of resolution. The resulting data are used worldwide in biomedical science, anthropology, forensics and other branches of science. There is a widely held expectation that genomic studies will lead to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution.
Although the sequence of the human genome has been (almost) completely determined by DNA sequencing, it is not yet fully understood. Most (though probably not all) genes have been identified by a combination of high throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products. Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance.
There are an estimated 20,000-25,000 human protein-coding genes. The estimate of the number of human genes has been repeatedly revised down from initial predictions of 100,000 or more as genome sequence quality and gene finding methods have improved, and could continue to drop further. Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA molecules, regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been elucidated.
- 1 Molecular organization and gene content
- 2 Coding vs. noncoding DNA
- 3 Mutation Rate of Human Genome
- 4 Coding sequences (protein-coding genes)
- 5 Noncoding DNA (ncDNA)
- 6 Genomic variation in humans
- 7 Human genetic disorders
- 8 Evolution
- 9 Mitochondrial DNA
- 10 Epigenome
- 11 See also
- 12 References
- 13 External links
Molecular organization and gene content
The total length of the human genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, plus the X chromosome (one in males, two in females) and, in males only, one Y chromosome. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in each mitochondrion. Basic information about these molecules and their gene content, based on a reference genome that does not represent the sequence of any specific individual, are provided in the following table. (Data source: Ensembl genome browser release 68, July 2012)
|Chromosome||Length (mm)||Base pairs||Variations||Confirmed proteins||Putative proteins||Pseudogenes||miRNA||rRNA||snRNA||snoRNA||Misc ncRNA||Links||Centromere position (Mbp)||Cumulative (%)|
Table 1 (above) summarizes the physical organization and gene content of the human reference genome, with links to the original analysis, as published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers, the distance between base pairs in the DNA double helix. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
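The length estimate described above is a single multiplication, and can be sketched as follows. This is a minimal illustration, not the Ensembl procedure: the function name and the 100-million-base-pair input are hypothetical, while the 0.34 nm spacing and the 3,234.83 Mb genome size are taken from the text.

```python
BP_RISE_NM = 0.34  # distance between adjacent base pairs in the double helix (from the text)


def contour_length_mm(base_pairs: int) -> float:
    """Return the stretched-out length of a DNA molecule in millimetres."""
    nanometres = base_pairs * BP_RISE_NM
    return nanometres * 1e-6  # 1 nm = 1e-6 mm


# Hypothetical chromosome of 100 million base pairs (illustrative value only).
example_bp = 100_000_000
print(f"{example_bp:,} bp -> {contour_length_mm(example_bp):.1f} mm")

# Genome size quoted in the infobox: 3,234.83 Mb.
genome_bp = 3_234_830_000
print(f"{genome_bp:,} bp -> {contour_length_mm(genome_bp) / 1000:.2f} m")
```

On these figures, a single haploid copy of the genome would be roughly one metre long if fully extended.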
The number of variations is a summary of unique DNA sequence changes that have been identified within the sequences analyzed by Ensembl as of July, 2012; that number is expected to increase as further personal genomes are sequenced and examined. In addition to the gene content shown in this table, a large number of non-expressed functional sequences have been identified throughout the human genome (see below). Links open windows to the reference chromosome sequence in the EBI genome browser. The table also describes prevalence of genes encoding structural RNAs in the genome.
MiRNA, or MicroRNA, functions as a post-transcriptional regulator of gene expression. Ribosomal RNA, or rRNA, makes up the RNA portion of the ribosome and is critical in the synthesis of proteins. Small nuclear RNA, or snRNA, is found in the nucleus of the cell. Its primary function is in the processing of pre-mRNA molecules and also in the regulation of transcription factors. SnoRNA, or Small nucleolar RNA, primarily functions in guiding chemical modifications to other RNA molecules.
Completeness of the human genome sequence
Although the human genome has been completely sequenced for all practical purposes, there are still hundreds of gaps in the sequence. A recent study noted more than 160 euchromatic gaps, of which 50 were closed. However, numerous gaps remain in the heterochromatic parts of the genome, which are much harder to sequence because of their many repeats and other intractable sequence features.
Coding vs. noncoding DNA
The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (ca. 98% of the genome) that are not used to encode proteins.
Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity.
Mutation Rate of Human Genome
The mutation rate of the human genome is an important factor in dating evolutionary events. Researchers counted the genetic differences between humans and apes and divided that number by the age of the fossil of the most recent common ancestor of humans and apes to obtain a mutation rate. Recent studies using next-generation sequencing technologies found a slower mutation rate, which does not agree with the established dates for human migration patterns and suggests a new evolutionary time scale. A 100,000-year-old human fossil found in Israel has raised further questions about the timing of human migrations.
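The fossil-calibrated calculation described above can be sketched as a simple division. The input values below are placeholders chosen for round arithmetic, not measurements, and the function and parameter names are hypothetical. Dividing by two lineages (because changes accumulate independently on both branches after a split) is a common convention and is an assumption that goes beyond the text's description; pass `lineages=1` to reproduce the plain division.

```python
def mutation_rate_per_year(divergence_per_site: float,
                           years_since_split: float,
                           lineages: int = 2) -> float:
    """Estimate substitutions per site per year from pairwise divergence.

    divergence_per_site: fraction of aligned sites that differ between two species.
    years_since_split:   fossil-based age of their most recent common ancestor.
    lineages:            number of branches on which changes accumulated (assumed 2).
    """
    if years_since_split <= 0 or lineages <= 0:
        raise ValueError("years_since_split and lineages must be positive")
    return divergence_per_site / (lineages * years_since_split)


# Placeholder inputs: 1.2% divergence and a 6-million-year-old common ancestor.
rate = mutation_rate_per_year(0.012, 6_000_000)
print(f"{rate:.2e} substitutions per site per year")
```

Because the estimate scales inversely with the assumed split time, a slower measured rate implies older dates for the same observed divergence, which is why revised rates shift the evolutionary time scale.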
Coding sequences (protein-coding genes)
Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can lead to the production of many more unique proteins than the number of protein-coding genes.
The complete modular protein-coding capacity of the genome is contained within the exome, and consists of DNA sequences encoded by exons that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project.
Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as Uniprot. Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early 1970s that the estimated mutational load from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes).
The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. Human complexity may instead derive in part from the extensive use of alternative pre-mRNA splicing, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.
Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000, with an especially high gene density within chromosomes 19, 11, and 1 (Table 1). Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood.
Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability (Table 2). For example, the gene for histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding an mRNA of 781 nt and a 215 amino acid protein (648 nt open reading frame). Dystrophin (DMD) is the largest protein-coding gene in the human reference genome, spanning a total of 2.2 Mb, while Titin (TTN) has the longest coding sequence (80,780 bp), the largest number of exons (364), and the longest single exon (17,106 bp). Over the whole genome, the median size of an exon is 122 bp (mean = 145 bp), the median number of exons is 7 (mean = 8.8), and the median coding sequence encodes 367 amino acids (mean = 447 amino acids).
|Protein||Chrom||Gene||Length (bp)||Exons||Exon length (bp)||Intron length (bp)||Alt splicing|
|Breast cancer type 2 susceptibility protein||13||BRCA2||83,736||27||11,386||72,350||yes|
|Cystic fibrosis transmembrane conductance regulator||7||CFTR||202,881||27||4,440||198,441||yes|
|Hemoglobin beta subunit||11||HBB||1,605||3||626||979||no|
Table 2. Examples of human protein-coding genes. Chrom, chromosome. Alt splicing, alternative pre-mRNA splicing. (Data source: Ensembl genome browser release 68, July 2012)
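The rows of Table 2 can be checked and summarized with a few lines of code. This is a minimal sketch: the three records are copied from the table, the lengths are assumed to be in base pairs, and the dictionary layout and function name are illustrative choices.

```python
# (total length, exon length, intron length), copied from Table 2; units assumed to be bp.
TABLE_2 = {
    "BRCA2": (83_736, 11_386, 72_350),
    "CFTR": (202_881, 4_440, 198_441),
    "HBB": (1_605, 626, 979),
}


def intron_exon_ratio(exon_bp: int, intron_bp: int) -> float:
    """Return how many intron base pairs there are per exon base pair."""
    return intron_bp / exon_bp


for gene, (total_bp, exon_bp, intron_bp) in TABLE_2.items():
    # Consistency check: exon and intron lengths should add up to the gene length.
    assert exon_bp + intron_bp == total_bp
    ratio = intron_exon_ratio(exon_bp, intron_bp)
    print(f"{gene}: exons are {exon_bp / total_bp:.1%} of the gene, "
          f"intron/exon ratio {ratio:.1f}")
```

For these three genes the intron-to-exon ratio ranges from under 2 (HBB) to more than 40 (CFTR), which illustrates how much of a typical gene is intronic.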
Noncoding DNA (ncDNA)
Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA.
Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements.
Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA).
Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human genome. In addition, about 26% of the human genome is introns. Aside from genes (exons and introns) and known regulatory sequences (8–20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. Recent analysis by the ENCODE project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity.
However, it remains controversial whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism. Excluding protein-coding sequences, introns, and regulatory regions, much of the noncoding DNA is composed of pseudogenes, repetitive DNA sequences, and mobile genetic elements and their relics.
Many DNA sequences that do not play a role in gene expression have important biological functions. Comparative genomics studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong evolutionary pressure and purifying selection.
Many of these sequences regulate the structure of chromosomes by limiting the regions of heterochromatin formation and regulating structural features of the chromosomes, such as the telomeres and centromeres. Other noncoding regions serve as origins of DNA replication. Finally, several regions are transcribed into functional noncoding RNAs that regulate the expression of protein-coding genes, mRNA translation and stability (see miRNA), chromatin structure (including histone modifications), DNA methylation, and DNA recombination, and that cross-regulate other noncoding RNAs. It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.
Pseudogenes
Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. Table 1 shows that the number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during molecular evolution.
The olfactory receptor gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.
Genes for noncoding RNA (ncRNA)
Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and RNA processing. ncRNAs include tRNA, ribosomal RNA, microRNA, snRNA and other noncoding RNA genes, including about 60,000 long noncoding RNAs (lncRNAs). The number of reported lncRNA genes continues to rise and the exact number in the human genome is yet to be defined, but many of them are argued to be non-functional.
Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.
Introns and untranslated regions of mRNA
In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of protein coding genes usually contain extensive noncoding sequences, in the form of introns, 5'-untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding genes of the human genome, the length of intron sequences is 10- to 100-times the length of exon sequences (Table 2).
Regulatory DNA sequences
The human genome has many different regulatory sequences which are crucial to controlling gene expression. Conservative estimates indicate that these sequences make up 8% of the genome; however, extrapolations from the ENCODE project suggest that 20–40% of the genome is gene regulatory sequence. Some types of noncoding DNA, called enhancers, are genetic "switches" that do not encode proteins but do regulate when and where genes are expressed.
Regulatory sequences have been known since the late 1960s. The first identification of regulatory sequences in the human genome relied on recombinant DNA technology. Later, with the advent of genomic sequencing, these sequences could be identified by their evolutionary conservation. The evolutionary branch between the primates and mouse, for example, occurred 70–90 million years ago, so noncoding sequences that computer comparisons show to be conserved between the two are likely to be important in functions such as gene regulation.
Other genomes, for example the pufferfish genome, have been sequenced with the same intention of aiding conservation-guided methods. However, regulatory sequences disappear and re-evolve during evolution at a high rate.
As of 2012, the efforts have shifted toward finding interactions between DNA and regulatory proteins by the technique ChIP-Seq, or gaps where the DNA is not packaged by histones (DNase hypersensitive sites), both of which tell where there are active regulatory sequences in the investigated cell type.
Repetitive DNA sequences
About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies (e.g. "CAGCAGCAG..."). The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for genealogical DNA testing and forensic DNA analysis.
Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)n) are termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of particular importance, as they sometimes occur within coding regions of genes for proteins and may lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)n within the Huntingtin gene on human chromosome 4. Telomeres (the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)n.
Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides long) are termed minisatellites.
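Counting the adjacent copies of a repeat unit, and labelling the unit by its length, can be sketched as follows. This is an illustrative toy, not a forensic or clinical method: the function names and the test sequence are hypothetical, and the size thresholds simply follow the definitions given above (fewer than ten nucleotides for microsatellites, 10–60 for minisatellites).

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    # The lookahead lets a run be measured from every start position,
    # so overlapping candidates are not skipped.
    pattern = re.compile(f"(?=((?:{re.escape(unit)})+))")
    runs = (len(m.group(1)) // len(unit) for m in pattern.finditer(sequence))
    return max(runs, default=0)


def repeat_class(unit: str) -> str:
    """Classify a repeat unit by length, using the thresholds from the text."""
    n = len(unit)
    if n < 10:
        return "microsatellite"
    if n <= 60:
        return "minisatellite"
    return "longer repeat"


# Toy sequence: four CAG copies flanked by unrelated bases.
toy = "TT" + "CAG" * 4 + "TT"
print(longest_tandem_run(toy, "CAG"), repeat_class("CAG"))
```

Because the number of copies varies between individuals, comparing such counts at several loci is the basic idea behind the repeat-based testing mentioned above.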
Mobile genetic elements (transposons) and their relics
Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, Alu, has about 50,000 active copies, and can be inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active copies per genome (the number varies between people). Together with non-functional relics of old transposons, they account for over half of total human DNA. Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent endogenous retroviruses, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations.
Mobile elements within the human genome can be classified into LTR retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu elements, LINEs (20.4% of total genome), SVAs and Class II DNA transposons (2.9% of total genome).
Genomic variation in humans
Human Reference Genome
With the exception of identical twins, all humans show significant variation in genomic DNA sequences. The Human Reference Genome (HRG) is used as a standard sequence reference.
There are several important points concerning the Human Reference Genome:
- The HRG is a haploid sequence. Each chromosome is represented once.
- The HRG is a composite sequence, and does not correspond to any actual human individual.
- The HRG is periodically updated to correct errors and ambiguities.
- The HRG in no way represents an "ideal" or "perfect" human individual. It is simply a standardized representation or model that is used for comparative purposes.
Measuring human genetic variation
Most studies of human genetic variation have focused on single-nucleotide polymorphisms (SNPs), which are substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur 1 in 1000 base pairs, on average, in the euchromatic human genome, although they do not occur at a uniform density. Thus follows the popular statement that "we are all, regardless of race, genetically 99.9% the same", although this would be somewhat qualified by most geneticists. For example, a much larger fraction of the genome is now thought to be involved in copy number variation. A large-scale collaborative effort to catalog SNP variations in the human genome is being undertaken by the International HapMap Project.
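The arithmetic behind the "99.9% the same" statement can be sketched briefly. This is a simplified illustration that assumes a uniform density of one SNP per 1,000 base pairs and a haploid genome of three billion base pairs, and it ignores copy number variation and other structural differences; the function names are hypothetical.

```python
def expected_snp_differences(genome_bp: int, bp_per_snp: int = 1000) -> int:
    """Return the expected number of single-base differences at a uniform SNP density."""
    return genome_bp // bp_per_snp


def percent_identity(bp_per_snp: int = 1000) -> float:
    """Return the percentage of positions that match at the given SNP spacing."""
    return 100 * (1 - 1 / bp_per_snp)


haploid_genome_bp = 3_000_000_000  # "three billion DNA base pairs" from the introduction
print(f"{expected_snp_differences(haploid_genome_bp):,} expected single-base differences")
print(f"{percent_identity():.1f}% identical at the single-base level")
```

Even at 99.9% identity, the absolute number of differing positions is in the millions, which is one reason most geneticists qualify the popular statement.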
The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, which is the basis of DNA fingerprinting and DNA paternity testing technologies. The heterochromatic portions of the human genome, which total several hundred million base pairs, are also thought to be quite variable within the human population (they are so repetitive and so long that they cannot be accurately sequenced with current technology). These regions contain few genes, and it is unclear whether any significant phenotypic effect results from typical variation in repeats or heterochromatin.
Most gross genomic mutations in gamete germ cells probably result in inviable embryos; however, a number of human diseases are related to large-scale genomic abnormalities. Down syndrome, Turner Syndrome, and a number of other diseases result from nondisjunction of entire chromosomes. Cancer cells frequently have aneuploidy of chromosomes and chromosome arms, although a cause and effect relationship between aneuploidy and cancer has not been established.
Mapping human genomic variation
Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome.
An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence variation." It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or bases.
Researchers published the first sequence-based map of large-scale structural variation across the human genome in the journal Nature in May 2008. Large-scale structural variations are differences in the genome among people that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome sequence and others appear as re-arrangements of stretches of sequence. These variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions.
A personal genome sequence is a (nearly) complete sequence of the chemical base pairs that make up the DNA of a single person. Because medical treatments have different effects on different people due to genetic variations such as single-nucleotide polymorphisms (SNPs), the analysis of personal genomes may lead to personalized medical treatment based on individual genotypes.
The first personal genome sequence to be determined was that of Craig Venter in 2007. Personal genomes had not been sequenced in the public Human Genome Project to protect the identity of volunteers who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse population. However, early in the Venter-led Celera Genomics genome sequencing effort the decision was made to switch from sequencing a composite sample to using DNA from a single individual, later revealed to have been Venter himself. Thus the Celera human genome sequence released in 2000 was largely that of one man. Subsequent replacement of the early composite-derived data and determination of the diploid sequence, representing both sets of chromosomes, rather than a haploid sequence originally reported, allowed the release of the first personal genome. In April 2008, that of James Watson was also completed. Since then hundreds of personal genome sequences have been released, including those of Desmond Tutu, and of a Paleo-Eskimo. In November 2013, a Spanish family made their personal genomics data obtained by direct-to-consumer genetic testing with 23andMe publicly available under a Creative Commons public domain license. This is believed to be the first such public genomics dataset for a whole family.
The sequencing of individual genomes further unveiled levels of genetic complexity that had not been appreciated before. Personal genomics helped reveal the significant level of diversity in the human genome attributed not only to SNPs but structural variations as well. However, the application of such knowledge to the treatment of disease and in the medical field is only in its very beginnings. Exome sequencing has become increasingly popular as a tool to aid in diagnosis of genetic disease because the exome contributes only 1% of the genomic sequence but accounts for roughly 85% of mutations that contribute significantly to disease.
Human genetic disorders
Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene, and is the most common recessive disorder in caucasian populations with over 1,300 different mutations known.
Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare, so genetic disorders are individually rare as well. However, since there are many genes that can vary to cause genetic disorders, in aggregate they constitute a significant component of known medical conditions, especially in pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has been identified; currently there are approximately 2,200 such disorders annotated in the OMIM database.
Studies of genetic disorders are often performed by means of family-based studies. In some instances population based approaches are employed, particularly in the case of so-called founder populations such as those in Finland, French-Canada, Utah, Sardinia, etc. Diagnosis and treatment of genetic disorders are usually performed by a geneticist-physician trained in clinical/medical genetics. The results of the Human Genome Project are likely to provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. Parents can be screened for hereditary conditions and counselled on the consequences, the probability it will be inherited, and how to avoid or ameliorate it in their offspring.
As noted above, there are many different kinds of DNA sequence variation, ranging from complete extra or missing chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic variation in human populations is phenotypically neutral, i.e. has little or no detectable effect on the physiology of the individual (although there may be fractional differences in fitness defined over evolutionary time frames). Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the clinical disease under investigation. Such studies constitute the realm of human molecular genetics.
With the advent of the Human Genome Project and the International HapMap Project, it has become feasible to explore subtle genetic influences on many common disease conditions such as diabetes, asthma, migraine, schizophrenia, etc. Although some causal links have been made between genomic sequence variants in particular genes and some of these diseases, often with much publicity in the general media, these are usually not considered to be genetic disorders per se as their causes are complex, involving many different genetic and environmental factors. Thus there may be disagreement in particular cases whether a specific medical condition should be termed a genetic disorder. The categorized table below provides the prevalence as well as the genes or chromosomes associated with some human genetic disorders.
|Disorder||Prevalence||Chromosome or gene involved|
|Down syndrome||1:600||Chromosome 21|
|Klinefelter syndrome||1:500–1000 males||Additional X chromosome|
|Turner syndrome||1:2000 females||Loss of X chromosome|
|Sickle cell anemia||1 in 50 births in parts of Africa; rarer elsewhere||β-globin (on chromosome 11)|
|Breast/Ovarian cancer (susceptibility)||~5% of cases of these cancer types||BRCA1, BRCA2|
|Familial adenomatous polyposis (FAP)||1:3500||APC|
|Lynch syndrome||5–10% of all cases of bowel cancer||MLH1, MSH2, MSH6, PMS2|
|Alzheimer disease ‐ early onset||1:2500||PS1, PS2, APP|
|Duchenne muscular dystrophy||1:3500 boys||Dystrophin|
Evolution
Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been conserved by evolution since the divergence of extant lineages approximately 200 million years ago, containing the vast majority of genes. The published chimpanzee genome differs from that of the human genome by 1.23% in direct sequence comparisons. Around 20% of this figure is accounted for by variation within each species, leaving only ~1.06% consistent sequence divergence between humans and chimps at shared genes. This nucleotide by nucleotide difference is dwarfed, however, by the portion of each genome that is not shared, including around 6% of functional genes that are unique to either humans or chimps.
In other words, the considerable observable differences between humans and chimps may be due as much or more to genome-level variation in the number, function and expression of genes than to DNA sequence changes in shared genes. Indeed, even within humans, a previously unappreciated amount of copy number variation (CNV) has been found, which can make up as much as 5–15% of the human genome. Two humans could therefore differ by as much as roughly 500,000,000 base pairs of DNA, some being active genes, others inactivated, or active at different levels. The full significance of this finding remains to be seen. On average, a typical human protein-coding gene differs from its chimpanzee ortholog by only two amino acid substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee orthologs. A major difference between the two genomes is human chromosome 2, which is equivalent to a fusion product of chimpanzee chromosomes 12 and 13 (later renamed to chromosomes 2A and 2B, respectively).
Humans have undergone an extraordinary loss of olfactory receptor genes during their recent evolution, which explains their relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that the emergence of color vision in humans and several other primate species has diminished the need for the sense of smell.
The human mitochondrial DNA is of tremendous interest to geneticists, since it undoubtedly plays a role in mitochondrial disease. It also sheds light on human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of a recent common ancestor for all humans on the maternal line of descent (see Mitochondrial Eve).
Because it lacks a system for checking for copying errors, mitochondrial DNA (mtDNA) varies more rapidly than nuclear DNA. This 20-fold higher mutation rate allows mtDNA to be used for more accurate tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, such as the migration of Native Americans from Siberia or of Polynesians from southeastern Asia. It has also been used to show that there is no trace of Neanderthal DNA in the European gene mixture inherited through the purely maternal lineage. Because mtDNA is inherited in a restrictive all-or-none manner, this result (no trace of Neanderthal mtDNA) would be likely unless there were a large percentage of Neanderthal ancestry or strong positive selection for that mtDNA. For example, going back five generations, only 1 of a person's 32 ancestors contributed that person's mtDNA; if one of these 32 were pure Neanderthal, about 3% of the person's autosomal DNA would be expected to be of Neanderthal origin, yet there would be a ~97% chance of carrying no trace of Neanderthal mtDNA.
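The five-generation example can be reproduced with exact fractions. The sketch below follows the passage's own simplification, which is an assumption rather than a population-genetic model: exactly one ancestor in the given generation is Neanderthal, that ancestor is equally likely to occupy any of the ancestral positions, and each ancestor contributes an equal expected share of autosomal DNA. The function name `single_ancestor_model` is hypothetical.

```python
from fractions import Fraction


def single_ancestor_model(generations):
    """Return (ancestor count, expected autosomal share from one ancestor,
    probability that this ancestor is NOT the one on the purely maternal line)."""
    ancestors = 2 ** generations               # 2 parents, 4 grandparents, ...
    autosomal_share = Fraction(1, ancestors)   # equal expected contribution (assumed)
    p_no_mtdna = 1 - Fraction(1, ancestors)    # only one position passes on mtDNA
    return ancestors, autosomal_share, p_no_mtdna


ancestors, share, p_none = single_ancestor_model(5)
print(f"ancestors five generations back: {ancestors}")
print(f"expected autosomal share: {share} = {float(share) * 100:.3f}%")
print(f"chance of no Neanderthal mtDNA: {p_none} = {float(p_none) * 100:.3f}%")
```

With five generations the exact values are 1/32 and 31/32, which the text rounds to ~3% and ~97%.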
Epigenetics describes a variety of features of the human genome that transcend its primary DNA sequence, such as chromatin packaging, histone modifications and DNA methylation, and which are important in regulating gene expression, genome replication and other cellular processes. Epigenetic markers strengthen or weaken transcription of certain genes but do not affect the actual sequence of DNA nucleotides. DNA methylation is a major form of epigenetic control over gene expression and one of the most highly studied topics in epigenetics. During development, the human DNA methylation profile experiences dramatic changes. In early germ line cells, the genome has very low methylation levels. Such low levels are generally associated with active genes. As development progresses, parental imprinting tags lead to increased methylation activity.
Differences in epigenetic patterns can be identified between tissues within an individual as well as between individuals. Identical genes that differ only in their epigenetic state are called epialleles. Epialleles can be placed into three categories: those directly determined by an individual’s genotype, those influenced by genotype, and those entirely independent of genotype. The epigenome is also influenced significantly by environmental factors. Diet, toxins, and hormones affect the epigenetic state. Studies in dietary manipulation have demonstrated that methyl-deficient diets are associated with hypomethylation of the epigenome. Such studies establish epigenetics as an important interface between the environment and the genome.