GC-content

Nucleotide bonds showing AT and GC pairs. Arrows point to the hydrogen bonds.

In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine (out of four possible bases, the others being adenine and thymine in DNA, or adenine and uracil in RNA).[1] The term may refer to a particular fragment of DNA or RNA or to the whole genome. When it refers to a fragment of the genetic material, it may denote the GC-content of a section of a gene (domain), a single gene, a group of genes (or gene cluster), or even a non-coding region. G (guanine) and C (cytosine) pair with each other through specific hydrogen bonding, whereas A (adenine) pairs specifically with T (thymine, in DNA) or U (uracil, in RNA).

The GC pair is bound by three hydrogen bonds, while AT and AU pairs are bound by two. To emphasize this difference, the base pairings can be represented as G≡C versus A=T and A=U, respectively. DNA with low GC-content is less stable than DNA with high GC-content; however, the hydrogen bonds themselves do not have a particularly significant impact on stabilization, which is due mainly to base-stacking interactions.[2]

In spite of the higher thermostability conferred on the genetic material, it is thought that cells with DNA of high GC-content undergo autolysis, thereby reducing the longevity of the cell.[3] Because of the thermostability of the genetic material in high-GC organisms, it was commonly believed that GC content played a necessary role in adaptation to high temperatures, a hypothesis that was refuted in 2001.[4] However, the optimal growth temperature of prokaryotes has been shown to correlate strongly with the GC content of structured RNAs (such as ribosomal RNA, transfer RNA, and many other non-coding RNAs).[4][5] AU base pairs have two hydrogen bonds and GC base pairs have three, so AU pairs are less stable, which makes high-GC-content RNA structures more resistant to the effects of high temperatures. More recently, one of the earliest large-scale systematic gene-centric association analyses demonstrated a correlation between GC content and temperature for certain genomic regions but not for others.[6]

In PCR experiments, the GC-content of primers is used to predict their annealing temperature to the template DNA. A higher GC-content indicates a higher melting temperature.
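As a rough illustration of how GC-content feeds into a melting-temperature estimate, the sketch below uses the widely quoted "Wallace rule" for short oligonucleotides, which assigns about 2 °C per A or T and 4 °C per G or C. The rule and the function name `wallace_tm` are assumptions introduced here, not taken from the sources cited in this article, and the result is only a first approximation; nearest-neighbour models are generally preferred for real primer design.

```python
def wallace_tm(primer: str) -> float:
    """Rule-of-thumb melting temperature in deg C: 2*(A+T) + 4*(G+C).

    Assumption: the Wallace rule, intended only for short primers.
    """
    s = primer.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    # Reject empty input and any symbol other than A, C, G, T.
    if not s or at + gc != len(s):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    return 2.0 * at + 4.0 * gc


# Two 16-nt primers of equal length but opposite composition:
print(wallace_tm("ATATATATATATATAT"))  # AT-only primer
print(wallace_tm("GCGCGCGCGCGCGCGC"))  # GC-only primer, higher estimate
```

Under this rule, two primers of the same length differ in estimated melting temperature by 2 °C for every A or T that is swapped for a G or C.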

Determination

GC content is usually expressed as a percentage value, but sometimes as a ratio (called G+C ratio or GC-ratio). GC-content percentage is calculated as[7]

${\displaystyle {\cfrac {G+C}{A+T+G+C}}\times 100\%}$

where A, T, G, and C denote the number of each base in the molecule, whereas the AT/GC ratio is calculated as[8]

${\displaystyle {\cfrac {A+T}{G+C}}}$.
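The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, uracil is counted together with thymine so that RNA sequences work, and any other symbol (such as an ambiguous `N`) is simply ignored, which is an assumption about how to treat such bases.

```python
from collections import Counter


def base_counts(seq: str) -> Counter:
    """Count bases case-insensitively; U (RNA) is counted together with T."""
    return Counter(seq.upper().replace("U", "T"))


def gc_percent(seq: str) -> float:
    """GC-content as a percentage: (G + C) / (A + T + G + C) * 100."""
    n = base_counts(seq)
    total = n["A"] + n["T"] + n["G"] + n["C"]
    if total == 0:
        raise ValueError("sequence contains no A, T/U, G or C")
    return 100.0 * (n["G"] + n["C"]) / total


def at_gc_ratio(seq: str) -> float:
    """AT/GC ratio: (A + T) / (G + C)."""
    n = base_counts(seq)
    gc = n["G"] + n["C"]
    if gc == 0:
        raise ValueError("sequence contains no G or C")
    return (n["A"] + n["T"]) / gc


# "ATGCGCTA" has two of each base, so half of its bases are G or C.
print(gc_percent("ATGCGCTA"))
print(at_gc_ratio("ATGCGCTA"))
```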

Both the GC-content percentage and the GC ratio can be measured by several means, but one of the simplest methods is to measure the melting temperature of the DNA double helix using spectrophotometry. The absorbance of DNA at a wavelength of 260 nm increases fairly sharply when double-stranded DNA, heated sufficiently, separates into two single strands.[9] The most commonly used protocol for determining GC ratios in large numbers of samples uses flow cytometry.[10]
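To make the link between a measured melting temperature and GC-content concrete, the sketch below inverts the empirical linear relation of Marmur and Doty, Tm = 69.3 + 0.41 × (GC%). That relation is an assumption brought in for illustration and is not taken from the references cited here; its coefficients were reported for DNA in standard saline citrate buffer and do not carry over to other salt conditions. The function name is hypothetical.

```python
def gc_percent_from_tm(tm_celsius: float) -> float:
    """Estimate GC% by inverting Tm = 69.3 + 0.41 * GC%.

    Assumption: Marmur-Doty coefficients, valid only for DNA melted
    in standard saline citrate buffer.
    """
    gc = (tm_celsius - 69.3) / 0.41
    if not 0.0 <= gc <= 100.0:
        raise ValueError("Tm lies outside the range the linear relation covers")
    return gc


# A melting temperature of 89.8 deg C corresponds to roughly 50% GC.
print(round(gc_percent_from_tm(89.8), 1))
```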

Alternatively, if the DNA or RNA molecule under investigation has been sequenced, the GC-content can be calculated accurately by simple arithmetic or with a free online GC calculator.

GC ratio of genomes

The GC ratio within a genome is found to be markedly variable. In more complex organisms, these variations result in a mosaic-like formation with islet regions called isochores.[11] This results in variations in staining intensity along the chromosomes.[12] GC-rich isochores contain many protein-coding genes, so determining the GC ratio of these specific regions contributes to mapping the gene-rich regions of the genome.[13][14]
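Regional variation of this kind is commonly explored by computing GC-content in windows along a sequence. The following is a minimal sketch under stated assumptions: the function name, window size, and step are hypothetical, and the short toy sequence is for demonstration only, since real isochores span far longer stretches of DNA.

```python
def gc_windows(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start index, GC%) for each full window along seq.

    GC% is taken relative to the window length, so any symbol other
    than G or C (including ambiguous bases) counts as non-GC.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    s = seq.upper()
    profile = []
    for start in range(0, len(s) - window + 1, step):
        chunk = s[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, 100.0 * gc / window))
    return profile


# A GC-rich stretch followed by an AT-rich stretch.
for start, gc in gc_windows("GGGGAAAA", window=4, step=2):
    print(start, gc)
```

Plotting such a profile for a chromosome-scale sequence, with windows of many kilobases, is one simple way to visualise GC-rich and GC-poor regions.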

GC ratios and coding sequence

Within a long region of genomic sequence, genes are often characterised by a higher GC-content than the background GC-content of the entire genome. Comparisons of GC ratio with the length of the coding region of a gene have shown that the length of the coding sequence is directly proportional to G+C content.[15] This has been attributed to the fact that the stop codon has a bias towards A and T nucleotides; thus, the shorter the sequence, the higher the AT bias.[16]

Application in systematics

GC content varies between organisms; this variation is thought to arise from differences in selection, mutational bias, and biased recombination-associated DNA repair.[17] The species problem in prokaryotic taxonomy has led to various suggestions for classifying bacteria, and the ad hoc committee on reconciliation of approaches to bacterial systematics has recommended the use of GC ratios in higher-level hierarchical classification.[18] For example, the Actinobacteria are characterised as "high GC-content bacteria".[19] In Streptomyces coelicolor A3(2), the GC content is 72%.[20] The GC-content of yeast (Saccharomyces cerevisiae) is 38%,[21] and that of another common model organism, thale cress (Arabidopsis thaliana), is 36%.[22] Because of the nature of the genetic code, it is virtually impossible for an organism to have a genome with a GC-content approaching either 0% or 100%. A species with an extremely low GC-content is Plasmodium falciparum (GC% ≈ 20%),[23] and it is common to refer to such examples as being AT-rich instead of GC-poor.[24]

References

1. ^
2. ^ Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006). "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix". Nucleic Acids Res. 34 (2): 564–74. PMC . PMID 16449200. doi:10.1093/nar/gkj454.
3. ^ Levin RE, Van Sickle C (1976). "Autolysis of high-GC isolates of Pseudomonas putrefaciens". Antonie Van Leeuwenhoek. 42 (1–2): 145–55. PMID 7999. doi:10.1007/BF00399459.
4. ^ a b Hurst LD, Merchant AR (March 2001). "High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes". Proc. Biol. Sci. 268 (1466): 493–7. PMC . PMID 11296861. doi:10.1098/rspb.2000.1397.
5. ^ Galtier N, Lobry JR (1997). "Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in Prokaryotes". J. Mol. Evol. 44 (6): 632–6. PMID 9169555. doi:10.1007/PL00006186.
6. ^ Zheng H, Wu H (December 2010). "Gene-centric association analysis for the correlation between the guanine-cytosine content levels and temperature range conditions of prokaryotic species". BMC Bioinformatics. 11: S7. PMC . PMID 21172057. doi:10.1186/1471-2105-11-S11-S7.
7. ^ Madigan MT, Martinko JM (2003). Brock biology of microorganisms (10th ed.). Pearson-Prentice Hall. ISBN 84-205-3679-2.
8. ^ Definition of GC-ratio on Northwestern University, IL, USA
9. ^ Wilhelm J, Pingoud A, Hahn M (May 2003). "Real-time PCR-based method for the estimation of genome sizes". Nucleic Acids Res. 31 (10): e56. PMC . PMID 12736322. doi:10.1093/nar/gng056.
10. ^ Vinogradov AE (May 1994). "Measurement by flow cytometry of genomic AT/GC ratio and genome size". Cytometry. 16 (1): 34–40. PMID 7518377. doi:10.1002/cyto.990160106.
11. ^ Bernardi G (January 2000). "Isochores and the evolutionary genomics of vertebrates". Gene. 241 (1): 3–17. PMID 10607893. doi:10.1016/S0378-1119(99)00485-0.
12. ^ Furey TS, Haussler D (May 2003). "Integration of the cytogenetic map with the draft human genome sequence". Hum. Mol. Genet. 12 (9): 1037–44. PMID 12700172. doi:10.1093/hmg/ddg113.
13. ^ Sumner AT, de la Torre J, Stuppia L (August 1993). "The distribution of genes on chromosomes: a cytological approach". J. Mol. Evol. 37 (2): 117–22. PMID 8411200. doi:10.1007/BF02407346.
14. ^ Aïssani B, Bernardi G (October 1991). "CpG islands, genes and isochores in the genomes of vertebrates". Gene. 106 (2): 185–95. PMID 1937049. doi:10.1016/0378-1119(91)90198-K.
15. ^ Pozzoli U, Menozzi G, Fumagalli M, et al. (2008). "Both selective and neutral processes drive GC content evolution in the human genome". BMC Evol. Biol. 8: 99. PMC . PMID 18371205. doi:10.1186/1471-2148-8-99.
16. ^ Wuitschick JD, Karrer KM (1999). "Analysis of genomic G + C content, codon usage, initiator codon context and translation termination sites in Tetrahymena thermophila". J. Eukaryot. Microbiol. 46 (3): 239–47. PMID 10377985. doi:10.1111/j.1550-7408.1999.tb05120.x.
17. ^ Birdsell JA (1 July 2002). "Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution". Mol. Biol. Evol. 19 (7): 1181–97. PMID 12082137. doi:10.1093/oxfordjournals.molbev.a004176.
18. ^ Wayne LG, et al. (1987). "Report of the ad hoc committee on reconciliation of approaches to bacterial systematics". International Journal of Systematic Bacteriology. 37 (4): 463–4. doi:10.1099/00207713-37-4-463.
19. ^ Taxonomy browser on NCBI
20. ^ Whole genome data of Streptomyces coelicolor A3(2) on NCBI
21. ^ Whole genome data of Saccharomyces cerevisiae on NCBI
22. ^ Whole genome data of Arabidopsis thaliana on NCBI
23. ^ Whole genome data of Plasmodium falciparum on NCBI
24. ^ Musto H, Cacciò S, Rodríguez-Maseda H, Bernardi G (1997). "Compositional constraints in the extremely GC-poor genome of Plasmodium falciparum" (PDF). Mem. Inst. Oswaldo Cruz. 92 (6): 835–41. PMID 9566216. doi:10.1590/S0074-02761997000600020.