= Codon usage bias =

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (a triplet) that either encodes a specific amino acid residue in a polypeptide chain or signals the termination of translation (stop codons).

There are 64 different codons (61 codons encoding amino acids and 3 stop codons) but only 20 different translated amino acids. The overabundance in the number of codons allows many amino acids to be encoded by more than one codon. Because of this redundancy, the genetic code is said to be degenerate. The genomes of different organisms are often biased towards using one of the several codons that encode the same amino acid over the others; that is, one codon is found at a greater frequency than expected by chance. How such biases arise is a much debated area of molecular evolution. Codon usage tables detailing genomic codon usage bias for organisms in GenBank and RefSeq can be found in the HIVE-Codon Usage Tables (HIVE-CUTs) project, which contains two distinct databases, CoCoPUTs and TissueCoCoPUTs. Together, these two databases provide comprehensive, up-to-date codon, codon pair and dinucleotide usage statistics for all organisms with available sequence information and 52 human tissues, respectively.
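
As an illustration of how "greater frequency than expected by chance" can be quantified, the following minimal sketch counts codons in a coding sequence and computes relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The function names (`codon_counts`, `rscu`) are chosen here for illustration and are not taken from any database or tool mentioned above; the sketch assumes the standard genetic code and an in-frame DNA sequence.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds):
    """Count in-frame codons; incomplete or ambiguous triplets are skipped."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(cds):
    """Return {codon: RSCU} for every amino acid observed in the sequence.

    RSCU = observed count * (number of synonymous codons) / (total count
    for that amino acid). A value of 1.0 means no bias within the family.
    """
    counts = codon_counts(cds)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    values = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values
```

For example, in the toy sequence `"CTGCTGCTGCTA"` leucine is encoded three times by CTG and once by CTA, so `rscu(...)["CTG"]` is 4.5 and the four unused leucine codons score 0.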

It is generally acknowledged that codon biases reflect the contributions of three main factors: GC-biased gene conversion that favors GC-ending codons in diploid organisms, arrival biases reflecting mutational preferences (typically favoring AT-ending codons), and natural selection for codons that are favorable in regard to translation. Optimal codons in fast-growing microorganisms, like Escherichia coli or Saccharomyces cerevisiae (baker's yeast), reflect the composition of their respective genomic transfer RNA (tRNA) pool. It is thought that optimal codons help to achieve faster translation rates and high accuracy. As a result of these factors, translational selection is expected to be stronger in highly expressed genes, as is indeed the case for the above-mentioned organisms. In other organisms that do not show high growth rates or that have small genomes, codon usage optimization is normally absent, and codon preferences are determined by the characteristic mutational biases seen in that particular genome. Examples of this are Homo sapiens (human) and Helicobacter pylori. Organisms that show an intermediate level of codon usage optimization include Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode worm), Strongylocentrotus purpuratus (sea urchin), and Arabidopsis thaliana (thale cress). Several viral families (herpesvirus, lentivirus, papillomavirus, polyomavirus, adenovirus, and parvovirus) are known to encode structural proteins that display heavily skewed codon usage compared to the host cell. It has been suggested that these codon biases play a role in the temporal regulation of their late proteins.

The nature of the codon usage-tRNA optimization has been fiercely debated. It is not clear whether codon usage drives tRNA evolution or vice versa. At least one mathematical model has been developed in which codon usage and tRNA expression co-evolve in feedback fashion (i.e., codons already present in high frequencies drive up the expression of their corresponding tRNAs, and tRNAs normally expressed at high levels drive up the frequency of their corresponding codons). However, this model does not yet seem to have experimental confirmation. Another problem is that the evolution of tRNA genes has been a very inactive area of research.

== Contributing factors ==

Different factors have been proposed to be related to codon usage bias, including gene expression level (reflecting selection for optimizing the translation process by tRNA abundance), species effective population size (also reflecting selection), guanine-cytosine content (GC content, reflecting horizontal gene transfer or mutation bias), guanine-cytosine skew (GC skew, reflecting strand-specific mutational bias), amino acid conservation, protein hydropathy, transcriptional selection, RNA stability, optimal growth temperature, hypersaline adaptation, and dietary nitrogen.
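
Two of the compositional quantities named above, GC content and GC skew, are simple to compute. The sketch below uses their usual definitions, (G+C)/(A+C+G+T) and (G-C)/(G+C), and adds GC content at third codon positions (often called GC3), which is a common way of relating base composition to synonymous codon choice. The function names are illustrative, and GC3 is an addition made here rather than a factor listed in the text.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc3(cds):
    """GC content restricted to the third position of each in-frame codon."""
    # Slicing from index 2 with step 3 picks the third base of every
    # complete codon; a trailing partial codon contributes nothing.
    return gc_content(cds.upper()[2::3])


def gc_skew(seq):
    """(G - C) / (G + C); positive values mean an excess of G over C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)
```

In practice GC skew is usually computed in sliding windows along one strand of a chromosome, since the strand-specific signal changes sign at replication origins and termini; the whole-sequence version here is only the basic building block.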

== Evolutionary theories ==

=== Mutation bias versus selection ===
Although the mechanism of codon bias selection remains controversial, possible explanations for this bias fall into two general categories. One explanation revolves around natural selection, in which codon bias contributes to the efficiency and/or accuracy of protein expression and therefore undergoes positive selection. Selection can explain why more frequent codons are recognized by more abundant tRNA molecules, as well as the correlation between preferred codons, tRNA levels, and gene copy numbers. Amino acid incorporation at more frequent codons occurs faster than at rare codons. Faster translation can increase the cellular concentration of free ribosomes and potentially the rate of initiation for messenger RNAs (mRNAs). Codon usage can also have substantial effects on cotranslational folding.

The second explanation for codon usage invokes neutral effects. These include both mutation bias (which tends to make codons with As and Ts more common than codons with Gs and Cs) and GC-biased gene conversion (which does the reverse). Different organisms exhibit different mutation biases, and the level of genome-wide GC content is a significant parameter in explaining codon bias differences among organisms.

=== Mutation-selection-drift balance model ===
To reconcile the evidence for both mutational pressures and selection, the prevailing hypothesis for codon bias is the mutation-selection-drift balance model, or more broadly a version that also includes GC-biased gene conversion. This hypothesis states that codon usage is shaped by both neutral forces and selection, with genetic drift strong enough to prevent selection from entirely dominating codon choice at every site in the genome. It also suggests that selection is generally weak, but that it tends to be stronger in genes with higher expression and with more functional constraints on their coding sequences.
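
The balance described above can be made concrete with a simple two-codon approximation that is widely used in population genetics. It is not taken from this article, and the parameter names are chosen here for illustration. Suppose an amino acid has one preferred and one unpreferred codon, mutation converts preferred to unpreferred at rate u and back at rate v, and S is the scaled selection coefficient (selection strength multiplied by effective population size, with the exact scaling factor depending on ploidy). At equilibrium the expected frequency of the preferred codon is then 1 / (1 + (u/v)·exp(-S)).

```python
import math


def preferred_codon_frequency(scaled_s, mutation_bias=1.0):
    """Equilibrium frequency of the preferred codon in a two-codon model.

    scaled_s      -- scaled selection coefficient S (assumed parameter;
                     S = 0 means codon choice is effectively neutral)
    mutation_bias -- u / v, the rate of mutation away from the preferred
                     codon divided by the rate towards it (assumed parameter)
    """
    return 1.0 / (1.0 + mutation_bias * math.exp(-scaled_s))
```

With `scaled_s = 0` the result depends on mutation bias alone (0.5 when mutation is unbiased, 1/3 when mutation away from the preferred codon is twice as frequent). Increasing `scaled_s`, as would be expected for highly expressed genes or for species with large effective population sizes, pushes the frequency towards 1 without ever fixing the preferred codon at every site.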

The "codon adaptation index" (CAI) measures differences among genes in how strong selection on codon usage is relative to mutation, drift, and gene conversion. Alternatives include the "frequency of optimal codons" (Fop), the "relative codon adaptation" (RCA), and the "effective number of codons" (Nc). Multivariate statistical methods, such as correspondence analysis and principal component analysis, are widely used to analyze variations in codon usage among genes. Codon optimization has applications in designing synthetic genes and DNA vaccines.
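
A minimal sketch of how a codon adaptation index can be computed is given below, following the commonly described formulation: each codon receives a weight equal to its frequency in a reference set of highly expressed genes divided by the frequency of the most used synonymous codon for the same amino acid, and the index for a gene is the geometric mean of the weights of its codons. The function names, the exclusion of stop codons and single-codon amino acids, and the use of a count of 0.5 for codons absent from the reference set (to avoid taking the logarithm of zero) are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Return the in-frame codons of a sequence, skipping ambiguous triplets."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return [c for c in codons if c in CODON_TABLE]


def relative_adaptiveness(reference_cds_list):
    """Compute per-codon weights from a reference set of coding sequences."""
    counts = Counter()
    for cds in reference_cds_list:
        counts.update(split_codons(cds))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:
            continue  # stops, Met and Trp carry no synonymous-choice signal
        # Assumption: unseen codons get a count of 0.5 so their weight is > 0.
        adjusted = {c: counts[c] or 0.5 for c in codons}
        top = max(adjusted.values())
        for c in codons:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the codons in one coding sequence."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))
```

The result depends strongly on the choice of reference set, so values are comparable only between genes scored against the same weights.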

The "codon adaptation index of species" measures the degree to which codon usage across an entire genome departs from what would be expected under mutation, gene conversion, and genetic drift. This captures effective population size, representing the strength of genetic drift in limiting the power of selection to overcome mutation bias and GC-biased gene conversion.

== Consequences of codon composition ==

=== Effect on RNA secondary structure ===
Because secondary structure at the 5′ end of mRNA influences translational efficiency, synonymous changes in this region of the mRNA can have profound effects on gene expression. Codon usage in noncoding DNA regions can therefore play a major role in RNA secondary structure and downstream protein expression, which can undergo further selective pressures. In particular, strong secondary structure at the ribosome-binding site or initiation codon can inhibit translation, and mRNA folding at the 5′ end generates a large amount of variation in protein levels.

=== Effect on transcription or gene expression ===
Heterologous gene expression is used in many biotechnological applications, including protein production and metabolic engineering. Because tRNA pools vary between different organisms, the rate of transcription and translation of a particular coding sequence can be less efficient when placed in a non-native context. For an overexpressed transgene, the corresponding mRNA makes up a large percentage of total cellular RNA, and the presence of rare codons along the transcript can lead to inefficient use and depletion of ribosomes and ultimately reduce levels of heterologous protein production. In addition, the composition of the gene (e.g. the total number of rare codons and the presence of consecutive rare codons) may also affect translation accuracy. However, using codons that are optimized for tRNA pools in a particular host to overexpress a heterologous gene may also cause amino acid starvation and alter the equilibrium of tRNA pools. This method of adjusting codons to match host tRNA abundances, called codon optimization, has traditionally been used for expression of a heterologous gene. However, newer strategies for optimization of heterologous expression consider global nucleotide content such as local mRNA folding, codon pair bias, a codon ramp, codon harmonization or codon correlations. Given the number of nucleotide changes introduced, artificial gene synthesis is often necessary for the creation of such an optimized gene.
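
The traditional form of codon optimization described above can be sketched as a simple recoding step: for each amino acid, pick the synonymous codon most used by the host and substitute it throughout the gene. In the sketch below, host preferences are estimated from codon counts in a set of host coding sequences, used as a stand-in for tRNA abundance; this proxy, the function names, and the tie-breaking rule are assumptions made for illustration. The sketch deliberately ignores mRNA folding, codon pairs, ramps, and harmonization, which the newer strategies take into account.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Return the in-frame codons of a sequence, skipping ambiguous triplets."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return [c for c in codons if c in CODON_TABLE]


def translate(cds):
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def preferred_codons(host_cds_list):
    """Map each amino acid to the codon most frequent in the host sequences."""
    counts = Counter()
    for cds in host_cds_list:
        counts.update(split_codons(cds))
    best = {}
    for codon, aa in CODON_TABLE.items():
        # Ties (including all-zero counts) keep the first codon in table order.
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def recode(cds, best):
    """Replace every codon by the host-preferred synonymous codon."""
    return "".join(best[CODON_TABLE[c]] for c in split_codons(cds))
```

Because only synonymous substitutions are made, `translate(recode(gene, best))` equals `translate(gene)` for any in-frame input; checking this invariant is a useful safeguard before ordering a synthetic gene.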

Specialized codon bias is further seen in some endogenous genes such as those involved in amino acid starvation. For example, amino acid biosynthetic enzymes preferentially use codons that are poorly adapted to normal tRNA abundances, but have codons that are adapted to tRNA pools under starvation conditions. Thus, codon usage can introduce an additional level of transcriptional regulation for appropriate gene expression under specific cellular conditions.

=== Effect on speed of translation elongation ===
Generally, for highly expressed genes, translation elongation rates are faster along transcripts with higher codon adaptation to tRNA pools, and slower along transcripts with rare codons. This correlation between codon translation rates and cognate tRNA concentrations provides additional modulation of translation elongation rates, which can provide several advantages to the organism. Specifically, codon usage can allow for global regulation of these rates, and rare codons may contribute to the accuracy of translation at the expense of speed.
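
The relationship between tRNA availability and elongation speed can be illustrated with a deliberately simple toy model. It assumes that the time needed to decode a codon is inversely proportional to the relative availability of its cognate tRNA, and ignores ribosome queuing, wobble pairing, and mRNA structure. The model, the function names, and the availability values in the usage note are hypothetical and are not measurements.

```python
def decoding_profile(cds, availability, default=1.0):
    """Per-codon relative decoding times under a 1/availability model.

    availability -- dict mapping codons to a relative cognate-tRNA
                    availability (hypothetical units; larger means faster)
    default      -- availability assumed for codons missing from the dict
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [1.0 / availability.get(c, default) for c in codons]


def elongation_time(cds, availability, default=1.0):
    """Total relative time to elongate through the whole coding sequence."""
    return sum(decoding_profile(cds, availability, default))
```

With the hypothetical table `{"CTG": 4.0, "CTA": 0.5}`, the sequence `"CTGCTG"` takes 0.5 time units while the synonymous `"CTACTA"` takes 4.0, an eightfold difference for the same dipeptide. The per-codon profile can also be scanned for clusters of slow codons, which is one way candidate pause sites are located in such models.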

=== Effect on protein folding ===
Protein folding in vivo is vectorial, such that the N-terminus of a protein exits the translating ribosome and becomes solvent-exposed before its more C-terminal regions. As a result, co-translational protein folding introduces several spatial and temporal constraints on the nascent polypeptide chain in its folding trajectory. Because mRNA translation rates are coupled to protein folding, and codon adaptation is linked to translation elongation, it has been hypothesized that manipulation at the sequence level may be an effective strategy to regulate or improve protein folding. Several studies have shown that pausing of translation as a result of local mRNA structure occurs for certain proteins, which may be necessary for proper folding. Furthermore, synonymous mutations have been shown to have significant consequences in the folding process of the nascent protein and can even change substrate specificity of enzymes. These studies suggest that codon usage influences the speed at which polypeptides emerge vectorially from the ribosome, which may further impact protein folding pathways throughout the available structural space.
