Chargaff's rules

Chargaff's rules state that in the DNA of any species and any organism, the amount of guanine should equal the amount of cytosine and the amount of adenine should equal the amount of thymine. It follows that purine and pyrimidine bases occur in a 1:1 stoichiometric ratio (i.e., A+G=T+C). This pattern is found in both strands of the DNA. The rules were discovered by Austrian-born chemist Erwin Chargaff^[1]^[2] in the late 1940s.

Definitions

First parity rule

The first rule holds that a double-stranded DNA molecule globally has percentage base pair equality: A% = T% and G% = C%. The rigorous validation of the rule constitutes the basis of Watson–Crick base pairs in the DNA double helix model.
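The equality follows directly from base pairing: every A on one strand faces a T on the other, and every G faces a C. A minimal Python sketch illustrates this, using a made-up toy sequence and hypothetical helper names (`reverse_complement`, `duplex_base_counts`) that do not come from the cited sources:

```python
from collections import Counter

# Watson-Crick pairing: A<->T, G<->C
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction."""
    return strand.translate(COMPLEMENT)[::-1]

def duplex_base_counts(strand: str) -> Counter:
    """Count bases over both strands of the duplex formed by `strand`."""
    return Counter(strand) + Counter(reverse_complement(strand))

# Toy sequence chosen for illustration only (not real genomic data).
counts = duplex_base_counts("ATGGCGTACCGGTTAAGC")
print(counts["A"] == counts["T"], counts["G"] == counts["C"])  # True True
```

Whatever single strand is supplied, the duplex totals satisfy A = T and G = C exactly, because each base is counted together with its pairing partner.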

Second parity rule

The second rule holds that both A% ≈ T% and G% ≈ C% are valid for each of the two DNA strands.^[3] This describes only a global feature of the base composition in a single DNA strand.^[4]

Research

The second parity rule was discovered in 1968.^[3] It states that, in single-stranded DNA, the number of adenine units is approximately equal to that of thymine (%A ≈ %T), and the number of cytosine units is approximately equal to that of guanine (%C ≈ %G).
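Unlike the first rule, the second rule is only approximate, so it is natural to measure how far a single strand departs from it. The following Python sketch is one possible way to do so; the function name `parity_deviation` and the relative-difference measure are assumptions made for illustration, not a metric taken from the cited studies:

```python
from collections import Counter

def parity_deviation(strand: str) -> tuple[float, float]:
    """Relative imbalance |X - Y| / (X + Y) for the A/T and G/C pairs
    within ONE strand. 0.0 means the second parity rule holds exactly."""
    counts = Counter(strand.upper())

    def rel_diff(x: str, y: str) -> float:
        total = counts[x] + counts[y]
        # Define the imbalance as 0.0 when neither base occurs.
        return abs(counts[x] - counts[y]) / total if total else 0.0

    return rel_diff("A", "T"), rel_diff("G", "C")

# Toy strand: A=3, T=1, G=2, C=2
print(parity_deviation("AAATGGCC"))  # (0.5, 0.0)
```

On short sequences such as this toy example large deviations are expected; the rule is a statement about long single strands.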

The first empirical generalization of Chargaff's second parity rule, called the Symmetry Principle, was proposed by Vinayakumar V. Prabhu^[5] in 1993. This principle states that, within a single strand, the frequency of any given oligonucleotide is approximately equal to the frequency of its reverse-complementary oligonucleotide. A theoretical generalization^[6] was mathematically derived by Michel E. B. Yamagishi and Roberto H. Herai in 2011.^[7]
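The Symmetry Principle can be checked by counting overlapping k-mers on one strand and comparing each count with that of its reverse complement. The sketch below is a minimal Python illustration; the helper names and the toy sequences are assumptions, and it is not the procedure used in the cited papers:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers on a single strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def max_symmetry_gap(seq: str, k: int) -> int:
    """Largest |count(kmer) - count(reverse complement)| over observed k-mers."""
    counts = kmer_counts(seq, k)
    return max(
        (abs(n - counts[reverse_complement(kmer)]) for kmer, n in counts.items()),
        default=0,
    )

toy = "ATGGCGTACCGGTTAAGC"            # illustrative sequence, not real data
inverted = toy + reverse_complement(toy)  # strand equal to its own reverse complement
print(max_symmetry_gap(inverted, 3))   # 0
```

A strand that equals its own reverse complement satisfies the principle exactly for every k, which is why the constructed example gives a gap of 0; real genomes only approach this symmetry.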

In 2006, it was shown that this rule applies to four^[2] of the five types of double-stranded genomes: eukaryotic chromosomes, bacterial chromosomes, double-stranded DNA viral genomes, and archaeal chromosomes.^[8] It does not apply to organellar genomes (mitochondria and plastids) smaller than ~20–30 kbp, nor does it apply to single-stranded DNA (viral) genomes or any type of RNA genome. The basis for this rule is still under investigation, although genome size may play a role.

Histogram showing how 20309 chromosomes adhere to Chargaff's second parity rule

The rule itself has consequences. In most bacterial genomes (which are generally 80–90% coding), genes are arranged in such a fashion that approximately 50% of the coding sequence lies on either strand. In the 1960s, Wacław Szybalski showed that in bacteriophage coding sequences purines (A and G) exceed pyrimidines (C and T).^[9] This rule has since been confirmed in other organisms and should probably now be termed "Szybalski's rule". While Szybalski's rule generally holds, exceptions are known to exist.^[10]^[11]^[12] The biological basis for Szybalski's rule is not yet known.

The combined effect of Chargaff's second rule and Szybalski's rule can be seen in bacterial genomes where the coding sequences are not equally distributed. The genetic code has 64 codons, of which 3 function as termination codons, while only 20 amino acids are normally present in proteins. (Two uncommon amino acids, selenocysteine and pyrrolysine, are found in a limited number of proteins and are encoded by the stop codons TGA and TAG respectively.) The mismatch between the number of codons and amino acids allows several codons to code for a single amino acid; such codons normally differ only at the third codon base position.

Multivariate statistical analysis of codon use within genomes with unequal quantities of coding sequences on the two strands has shown that codon use in the third position depends on the strand on which the gene is located. This seems likely to be the result of Szybalski's and Chargaff's rules. Because of the asymmetry in pyrimidine and purine use in coding sequences, the strand with the greater coding content will tend to have the greater number of purine bases (Szybalski's rule). Because the number of purine bases will, to a very good approximation, equal the number of their complementary pyrimidines within the same strand, and because the coding sequences occupy 80–90% of the strand, there appears to be a selective pressure on the third base to minimize the number of purine bases in the strand with the greater coding content. This pressure appears to be proportional to the mismatch in the length of the coding sequences between the two strands.
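The tension described here can be made concrete by measuring the purine fraction of a strand: any purine excess on one strand is necessarily a pyrimidine excess on its complement. The Python sketch below is only an illustration; `purine_fraction` is a hypothetical helper and the sequence is an invented purine-rich toy example, not a real gene:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def purine_fraction(seq: str) -> float:
    """Fraction of bases in `seq` that are purines (A or G)."""
    if not seq:
        return 0.0
    return sum(base in "AG" for base in seq) / len(seq)

# Invented purine-rich "coding" sequence, for illustration only.
coding_strand = "ATGGAAGGAGCAGAAAGGTAA"
top = purine_fraction(coding_strand)
bottom = purine_fraction(reverse_complement(coding_strand))

# The two fractions always sum to 1: purine loading of one strand
# is pyrimidine loading of the other.
print(f"coding strand: {top:.2f}, complementary strand: {bottom:.2f}")
```

Chargaff's second rule pushes each strand toward a purine fraction of 0.5, whereas Szybalski's rule pushes coding sequences above 0.5, which is the conflict the paragraph above attributes to third-position codon choice.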

Chargaff's 2nd parity rule for prokaryotic 6-mers

The origin of the deviation from Chargaff's rule in the organelles has been suggested to be a consequence of the mechanism of replication.^[13] During replication the DNA strands separate. In single-stranded DNA, cytosine slowly and spontaneously deaminates to uracil, which is read as thymine (a C to T transition). The longer the strands are separated, the greater the quantity of deamination. For reasons that are not yet clear, the strands tend to exist longer in single-stranded form in mitochondria than in chromosomal DNA. This process tends to yield one strand that is enriched in guanine (G) and thymine (T), with its complement enriched in cytosine (C) and adenine (A), and it may have given rise to the deviations found in the mitochondria.

Chargaff's second rule appears to be the consequence of a more complex parity rule: within a single strand of DNA, any oligonucleotide (k-mer or n-gram; length ≤ 10) is present in approximately equal numbers to its reverse-complementary oligonucleotide. Because of the computational requirements, this has not been verified in all genomes for all oligonucleotides. It has been verified for triplet oligonucleotides for a large data set.^[14] Albrecht-Buehler has suggested that this rule is the consequence of genomes evolving by a process of inversion and transposition.^[14] This process does not appear to have acted on the mitochondrial genomes. Chargaff's second parity rule also appears to extend from the nucleotide level to populations of codon triplets, in the case of whole single-stranded human genome DNA.^[15] A kind of "codon-level second Chargaff's parity rule" is proposed as follows:

Intra-strand relation among percentages of codon populations
First codon	Second codon	Relation proposed	Details
`Twx` (1st base position is T)	`yzA` (3rd base position is A)	% `Twx` $\simeq$ % `yzA`	`Twx` and `yzA` are mirror codons, e.g. `TCG` and `CGA`
`Cwx` (1st base position is C)	`yzG` (3rd base position is G)	% `Cwx` $\simeq$ % `yzG`	`Cwx` and `yzG` are mirror codons, e.g. `CTA` and `TAG`
`wTx` (2nd base position is T)	`yAz` (2nd base position is A)	% `wTx` $\simeq$ % `yAz`	`wTx` and `yAz` are mirror codons, e.g. `CTG` and `CAG`
`wCx` (2nd base position is C)	`yGz` (2nd base position is G)	% `wCx` $\simeq$ % `yGz`	`wCx` and `yGz` are mirror codons, e.g. `TCT` and `AGA`
`wxT` (3rd base position is T)	`Ayz` (1st base position is A)	% `wxT` $\simeq$ % `Ayz`	`wxT` and `Ayz` are mirror codons, e.g. `CTT` and `AAG`
`wxC` (3rd base position is C)	`Gyz` (1st base position is G)	% `wxC` $\simeq$ % `Gyz`	`wxC` and `Gyz` are mirror codons, e.g. `GGC` and `GCC`

Examples: counting codons over the whole human genome in the first reading frame gives:

36530115 TTT and 36381293 AAA (ratio = 1.00409); 2087242 TCG and 2085226 CGA (ratio = 1.00096); etc.
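Ratios of this kind can be obtained by splitting one strand into non-overlapping triplets in a fixed reading frame and dividing each codon count by the count of its mirror (reverse-complement) codon. The following Python sketch shows the idea on an invented toy sequence; the function names are assumptions and this is not the code used in the cited study:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mirror_codon(codon: str) -> str:
    """Reverse complement of a codon, e.g. TCG -> CGA."""
    return codon.translate(COMPLEMENT)[::-1]

def codon_counts(seq: str, frame: int = 0) -> Counter:
    """Count non-overlapping triplets starting at offset `frame`."""
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))

def mirror_ratio(counts: Counter, codon: str) -> float:
    """count(codon) / count(mirror codon); NaN if the mirror is absent."""
    mirror = counts[mirror_codon(codon)]
    return counts[codon] / mirror if mirror else float("nan")

# Toy strand made of the codons TCG, CGA, TTT, AAA (illustration only).
counts = codon_counts("TCGCGATTTAAA")
print(mirror_ratio(counts, "TCG"), mirror_ratio(counts, "TTT"))  # 1.0 1.0
```

On a real chromosome the same functions would be applied to the full single-strand sequence, and the proposed rule predicts ratios close to 1 for each mirror pair.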

In 2020, it was suggested that the physical properties of dsDNA (double-stranded DNA) and the tendency of all physical systems toward maximum entropy are the cause of Chargaff's second parity rule.^[16] The symmetries and patterns present in dsDNA sequences can emerge from the physical peculiarities of the dsDNA molecule and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure.

Percentages of bases in DNA

The following table is a representative sample of Erwin Chargaff's 1952 data, listing the base composition of DNA from various organisms; it supports both of Chargaff's rules.^[17] A significant deviation of A/T and G/C from one, as in φX174, is indicative of single-stranded DNA.

Organism	Taxon	%A	%G	%C	%T	A / T	G / C	%GC	%AT
Maize	Zea	26.8	22.8	23.2	27.2	0.99	0.98	46.1	54.0
Octopus	Octopus	33.2	17.6	17.6	31.6	1.05	1.00	35.2	64.8
Chicken	Gallus	28.0	22.0	21.6	28.4	0.99	1.02	43.7	56.4
Rat	Rattus	28.6	21.4	20.5	28.4	1.01	1.00	42.9	57.0
Human	Homo	29.3	20.7	20.0	30.0	0.98	1.04	40.7	59.3
Grasshopper	Orthoptera	29.3	20.5	20.7	29.3	1.00	0.99	41.2	58.6
Sea urchin	Echinoidea	32.8	17.7	17.3	32.1	1.02	1.02	35.0	64.9
Wheat	Triticum	27.3	22.7	22.8	27.1	1.01	1.00	45.5	54.4
Yeast	Saccharomyces	31.3	18.7	17.1	32.9	0.95	1.09	35.8	64.4
E. coli	Escherichia	24.7	26.0	25.7	23.6	1.05	1.01	51.7	48.3
φX174	PhiX174	24.0	23.3	21.5	31.2	0.77	1.08	44.8	55.2
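The A/T and G/C columns are simple quotients of the percentage columns, so the single-stranded outlier can be picked out programmatically. The Python sketch below recomputes the ratios for three rows copied from the table above; the 0.10 tolerance is an arbitrary cut-off assumed for illustration, not a threshold taken from the source:

```python
# (%A, %G, %C, %T) copied from three rows of the table above
COMPOSITION = {
    "Human": (29.3, 20.7, 20.0, 30.0),
    "E. coli": (24.7, 26.0, 25.7, 23.6),
    "φX174": (24.0, 23.3, 21.5, 31.2),
}

TOLERANCE = 0.10  # assumed cut-off, for illustration only

def parity_ratios(a: float, g: float, c: float, t: float) -> tuple[float, float]:
    """Return (A/T, G/C) from base percentages."""
    return a / t, g / c

def violates_first_rule(a, g, c, t, tolerance=TOLERANCE) -> bool:
    """True if either ratio differs from 1 by more than `tolerance`."""
    at, gc = parity_ratios(a, g, c, t)
    return abs(at - 1) > tolerance or abs(gc - 1) > tolerance

for name, row in COMPOSITION.items():
    at, gc = parity_ratios(*row)
    print(f"{name}: A/T={at:.2f} G/C={gc:.2f} flagged={violates_first_rule(*row)}")
```

With this tolerance only φX174 is flagged, consistent with its single-stranded genome; a stricter tolerance would start to flag rows whose deviation reflects measurement scatter rather than strandedness.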

References

[1] Elson D, Chargaff E (1952). "On the deoxyribonucleic acid content of sea urchin gametes". Experientia. 8 (4): 143–145. doi:10.1007/BF02170221. PMID 14945441. S2CID 36803326.
[2] Chargaff E, Lipshitz R, Green C (1952). "Composition of the deoxypentose nucleic acids of four genera of sea-urchin". J Biol Chem. 195 (1): 155–160. doi:10.1016/S0021-9258(19)50884-5. PMID 14938364. S2CID 11358561.
[3] Rudner R, Karkas JD, Chargaff E (1968). "Separation of B. Subtilis DNA into complementary strands. 3. Direct analysis". Proceedings of the National Academy of Sciences of the United States of America. 60 (3): 921–2. Bibcode:1968PNAS...60..921R. doi:10.1073/pnas.60.3.921. PMC 225140. PMID 4970114.
[4] Zhang CT, Zhang R, Ou HY (2003). "The Z curve database: a graphic representation of genome sequences". Bioinformatics. 19 (5): 593–599. doi:10.1093/bioinformatics/btg041. PMID 12651717.
[5] Prabhu VV (1993). "Symmetry observation in long nucleotide sequences". Nucleic Acids Research. 21 (12): 2797–2800. doi:10.1093/nar/21.12.2797. PMC 309655. PMID 8332488.
[6] Yamagishi MEB (2017). Mathematical Grammar of Biology. SpringerBriefs in Mathematics. Springer. arXiv:1112.1528. doi:10.1007/978-3-319-62689-5. ISBN 978-3-319-62688-8. S2CID 16742066.
[7] Yamagishi ME, Herai RH (2011). Chargaff's "Grammar of Biology": New Fractal-like Rules. SpringerBriefs in Mathematics. arXiv:1112.1528. doi:10.1007/978-3-319-62689-5. ISBN 978-3-319-62688-8. S2CID 16742066.
[8] Mitchell D, Bridge R (2006). "A test of Chargaff's second rule". Biochem Biophys Res Commun. 340 (1): 90–94. doi:10.1016/j.bbrc.2005.11.160. PMID 16364245.
[9] Szybalski W, Kubinski H, Sheldrick O (1966). "Pyrimidine clusters on the transcribing strand of DNA and their possible role in the initiation of RNA synthesis". Cold Spring Harb Symp Quant Biol. 31: 123–127. doi:10.1101/SQB.1966.031.01.019. PMID 4966069.
[10] Cristillo AD (1998). Characterization of G0/G1 switch genes in cultured T lymphocytes. Kingston, Ontario, Canada: Queen's University.
[11] Bell SJ, Forsdyke DR (1999). "Deviations from Chargaff's second parity rule correlate with direction of transcription". J Theor Biol. 197 (1): 63–76. Bibcode:1999JThBi.197...63B. doi:10.1006/jtbi.1998.0858. PMID 10036208.
[12] Lao PJ, Forsdyke DR (2000). "Thermophilic Bacteria Strictly Obey Szybalski's Transcription Direction Rule and Politely Purine-Load RNAs with Both Adenine and Guanine". Genome Research. 10 (2): 228–236. doi:10.1101/gr.10.2.228. PMC 310832. PMID 10673280.
[13] Nikolaou C, Almirantis Y (2006). "Deviations from Chargaff's second parity rule in organellar DNA. Insights into the evolution of organellar genomes". Gene. 381: 34–41. doi:10.1016/j.gene.2006.06.010. PMID 16893615.
[14] Albrecht-Buehler G (2006). "Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions". Proc Natl Acad Sci USA. 103 (47): 17828–17833. Bibcode:2006PNAS..10317828A. doi:10.1073/pnas.0605553103. PMC 1635160. PMID 17093051.
[15] Perez JC (September 2010). "Codon populations in single-stranded whole human genome DNA are fractal and fine-tuned by the Golden Ratio 1.618". Interdisciplinary Sciences: Computational Life Sciences. 2 (3): 228–240. doi:10.1007/s12539-010-0022-0. PMID 20658335. S2CID 54565279.
[16] Fariselli P, Taccioli C, Pagani L, Maritan A (April 2020). "DNA sequence symmetries from randomness: the origin of the Chargaff's second parity rule". Briefings in Bioinformatics. 22 (2): 2172–2181. doi:10.1093/bib/bbaa041. PMC 7986665. PMID 32266404.
[17] Bansal M (2003). "DNA structure: Revisiting the Watson-Crick double helix" (PDF). Current Science. 85 (11): 1556–1563. Archived from the original (PDF) on 2014-07-26. Retrieved 2013-07-26.

External links

CBS Genome Atlas Database Archived 2016-05-16 at the Portuguese Web Archive — contains hundreds of examples of base skews and had problems.^[1]
The Z curve database of genomes — a 3-dimensional visualization and analysis tool of genomes.^[2]

[1] Hallin PF, Ussery D (2004). "CBS Genome Atlas Database: A dynamic storage for bioinformatic results and sequence data". Bioinformatics. 20 (18): 3682–3686. doi:10.1093/bioinformatics/bth423. PMID 15256401.
[2] Zhang CT, Zhang R, Ou HY (2003). "The Z curve database: a graphic representation of genome sequences". Bioinformatics. 19 (5): 593–599. doi:10.1093/bioinformatics/btg041. PMID 12651717.
