In genetics, the Ka/Ks ratio (also written ω or dN/dS) is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks). It can be used as an indicator of the selective pressure acting on a protein-coding gene. Homologous genes with a high Ka/Ks ratio are usually said to be evolving under positive selection. The terms Ka/Ks and dN/dS are used interchangeably. Note, however, that Dn and Ds are different parameters from dN and dS (or Ka and Ks): Dn and Ds are count estimates, which represent the total numbers of nonsynonymous and synonymous substitutions.
Methods for estimating Ka and Ks use a sequence alignment of two or more nucleotide sequences of homologous protein-coding genes. Methods can be classified into three groups: approximate methods, maximum-likelihood methods, and counting methods. However, unless the sequences to be compared are distantly related (in which case maximum-likelihood methods are preferable), the class of method used has little impact on the results obtained; more important are the assumptions implicit in the chosen method.
Approximate methods involve three basic steps:
- counting the number of synonymous and nonsynonymous sites in the two sequences – usually by multiplying the sequence length by the proportion of each class of substitution;
- counting the number of synonymous and nonsynonymous substitutions; and
- correcting for multiple substitutions.
These steps, particularly the last, require simplifying assumptions if they are to be achieved computationally; for reasons discussed later, it is impossible to determine the number of multiple substitutions exactly.
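The three steps above can be sketched in Python in the style of the Nei & Gojobori counting approach with a Jukes-Cantor correction. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the sequences are assumed to be aligned, in frame and free of gaps and stop codons, and pathways through stop codons are simply dropped.

```python
from itertools import permutations, product
from math import log

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Step 1: synonymous sites in a codon, as the fraction of the three
    possible changes at each position that leave the amino acid unchanged."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODE[mutant] == CODE[codon]:
                s += 1 / 3
    return s


def count_diffs(c1, c2):
    """Step 2: (synonymous, nonsynonymous) differences between two codons,
    averaged over every order in which the differing positions could change."""
    diff_pos = [i for i in range(3) if c1[i] != c2[i]]
    paths = []
    for order in permutations(diff_pos):
        cur, sd, nd = c1, 0, 0
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":  # simplification: drop paths via stop codons
                break
            if CODE[nxt] == CODE[cur]:
                sd += 1
            else:
                nd += 1
            cur = nxt
        else:  # runs only when the path was not dropped
            paths.append((sd, nd))
    if not paths:
        return 0.0, 0.0
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))


def jukes_cantor(p):
    """Step 3: correct a proportion of differences for multiple substitutions."""
    return -0.75 * log(1 - 4 * p / 3)


def ka_ks(seq1, seq2):
    """Return (Ka, Ks) for two aligned, gap-free, in-frame coding sequences."""
    pairs = [(seq1[i:i + 3], seq2[i:i + 3]) for i in range(0, len(seq1), 3)]
    S = sum(syn_sites(a) + syn_sites(b) for a, b in pairs) / 2  # synonymous sites
    N = 3 * len(pairs) - S                                      # nonsynonymous sites
    diffs = [count_diffs(a, b) for a, b in pairs]
    sd = sum(d[0] for d in diffs)
    nd = sum(d[1] for d in diffs)
    return jukes_cantor(nd / N), jukes_cantor(sd / S)


# Made-up toy sequences: two synonymous and one nonsynonymous difference.
ka, ks = ka_ks("ATGAACCTGGGTAAA", "ATGAATCTAGGTAGA")
print(f"Ka = {ka:.3f}, Ks = {ks:.3f}, Ka/Ks = {ka / ks:.3f}")
```

The correction in step 3 is undefined once the proportion of differences reaches 3/4, which is one concrete way in which very divergent sequences defeat approximate methods.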
The maximum-likelihood approach uses probability theory to complete all three steps simultaneously. It estimates critical parameters, including the divergence between sequences and the transition/transversion ratio, by deducing the most likely values to produce the input data.
To quantify the number of substitutions, one may reconstruct the ancestral sequence and record the inferred changes at sites (straight counting, which is likely to provide an underestimate); fit the substitution rates at sites into predetermined categories (a Bayesian approach, which performs poorly for small data sets); or generate an individual substitution rate for each codon (which is computationally expensive). Given enough data, all three of these approaches will tend to the same result.
The dN/dS ratio is used to infer the direction and magnitude of natural selection acting on protein-coding genes. A ratio greater than one implies positive or Darwinian selection; a ratio less than one implies purifying (stabilizing) selection; and a ratio of one indicates neutral evolution (i.e. no selection). However, a combination of positive and purifying selection at different points within the gene or at different times along its evolution may cancel each other out, giving an average value that may be lower than, equal to, or higher than one.
A statistical analysis is necessary to determine whether a result is significantly different from 1, or whether any apparent difference may be an artefact of a limited data set. The appropriate statistical test for an approximate method involves approximating the distribution of dN − dS with a normal distribution, and determining whether zero falls within the central region of that distribution. More sophisticated likelihood techniques can be used to analyse the results of a maximum-likelihood analysis, by performing a chi-squared test to distinguish between a null model (dN/dS = 1) and the observed results.
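Both tests can be sketched with the Python standard library alone. The function names and the numbers in the usage lines are hypothetical. The first function assumes that the variances of dN and dS are already available (for example from a bootstrap) and that their covariance can be ignored; the second assumes that the alternative model has exactly one more free parameter than the null model, so that twice the log-likelihood difference is compared with a chi-squared distribution with one degree of freedom.

```python
from math import erfc, sqrt


def z_test_dn_minus_ds(dn, ds, var_dn, var_ds):
    """Two-sided normal-approximation test of the null hypothesis dN - dS = 0.

    Assumes the covariance between the two estimates is negligible.
    """
    z = (dn - ds) / sqrt(var_dn + var_ds)
    p_value = erfc(abs(z) / sqrt(2))  # P(|Z| > |z|) for a standard normal Z
    return z, p_value


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Chi-squared test (1 degree of freedom) on two nested models.

    lnl_null: log-likelihood with dN/dS fixed at 1
    lnl_alt:  log-likelihood with dN/dS estimated freely
    """
    stat = max(2 * (lnl_alt - lnl_null), 0.0)
    p_value = erfc(sqrt(stat / 2))  # survival function of chi-squared, 1 d.f.
    return stat, p_value


# Hypothetical inputs, for illustration only.
print(z_test_dn_minus_ds(dn=0.12, ds=0.30, var_dn=0.0004, var_ds=0.0025))
print(likelihood_ratio_test(lnl_null=-1234.5, lnl_alt=-1230.2))
```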
The ratio is a more powerful test of the neutral model of evolution than many others available in population genetics as it requires fewer assumptions.
There is often a systematic bias in the frequency at which various nucleotides are swapped, as certain mutations are more probable than others. For instance, some lineages may swap C to T more frequently than they swap C to A. In the case of the amino acid asparagine, which is coded by the codons AAT or AAC, a high C→T exchange rate will increase the proportion of synonymous substitutions at this codon, whereas a high C→A exchange rate will increase the rate of nonsynonymous substitutions. Because it is rather common for transitions (T↔C and A↔G) to be favoured over transversions (all other changes), models must account for the possibility of non-homogeneous rates of exchange. Some simpler approximate methods, such as those of Miyata & Yasunaga and Nei & Gojobori, neglect to take these into account, which shortens computation time at the expense of accuracy; these methods will systematically overestimate N, the number of nonsynonymous sites, and underestimate S, the number of synonymous sites.
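The asparagine example can be checked with a short Python sketch that labels each possible change at the third position of AAC as a transition or transversion and as synonymous or nonsynonymous. The helper names are hypothetical; the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and (a in PURINES) == (b in PURINES)


def classify(codon, pos, new_base):
    """Describe the effect of replacing codon[pos] with new_base."""
    mutant = codon[:pos] + new_base + codon[pos + 1:]
    kind = "transition" if is_transition(codon[pos], new_base) else "transversion"
    effect = "synonymous" if CODE[mutant] == CODE[codon] else "nonsynonymous"
    return mutant, kind, effect


for base in "TAG":
    print(classify("AAC", 2, base))
# ('AAT', 'transition', 'synonymous')
# ('AAA', 'transversion', 'nonsynonymous')
# ('AAG', 'transversion', 'nonsynonymous')
```

At this codon the only synonymous change is a transition, so a method that weights all three changes equally counts one third of a synonymous site at this position, which is too few whenever transitions are in fact more frequent than transversions.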
Further, there may be a bias in which certain codons are preferred in a gene, as a certain combination of codons may improve translational efficiency.
In addition, as time progresses, it is possible for a site to undergo multiple modifications. For instance, a codon may switch from AAA→AAC→AAT→AAA. Multiple substitutions at a single site cannot be observed directly, so an uncorrected count of substitutions is always an underestimate. In the example above, two nonsynonymous substitutions and one synonymous substitution occurred at the third site; however, because the substitutions restored the original sequence, there is no evidence of any substitution. As the divergence time between two sequences increases, so too does the number of multiple substitutions. Thus "long branches" in a dN/dS analysis can lead to underestimates of both dN and dS, and the longer the branch, the harder it is to correct for the introduced noise. Moreover, the ancestral sequence is usually unknown, and the two lineages being compared will have been evolving in parallel since their last common ancestor. This effect can be mitigated by reconstructing the ancestral sequence; the accuracy of this reconstruction is enhanced by having a large number of sequences descended from that common ancestor to constrain its sequence by phylogenetic methods.
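The AAA→AAC→AAT→AAA example can be traced step by step. The sketch below, with a hypothetical helper name, counts the substitutions that actually occurred along the path and compares them with what a pairwise comparison of the first and last codon would observe; it assumes each step in the path is a single nucleotide change.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_steps(path):
    """Count (synonymous, nonsynonymous) substitutions along a codon path."""
    syn = nonsyn = 0
    for before, after in zip(path, path[1:]):
        if CODE[before] == CODE[after]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


path = ["AAA", "AAC", "AAT", "AAA"]
syn, nonsyn = count_steps(path)
observed = sum(a != b for a, b in zip(path[0], path[-1]))

print(f"actual substitutions: {syn} synonymous, {nonsyn} nonsynonymous")
print(f"differences visible between first and last codon: {observed}")
# actual substitutions: 1 synonymous, 2 nonsynonymous
# differences visible between first and last codon: 0
```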
Methods that account for biases in codon usage and transition/transversion rates are substantially more reliable than those that do not.
Although dN/dS is a good indicator of selective pressure at the sequence level, evolutionary change can often take place in the regulatory region of a gene, affecting the level, timing or location of gene expression. Ka/Ks analysis will not detect such change; it only measures selective pressure within protein-coding regions. In addition, selection that does not cause differences at the amino acid level (for instance, balancing selection) cannot be detected by these techniques.
Another issue is that heterogeneity within a gene can make a result hard to interpret. For example, if Ka/Ks = 1, it could be due to relaxed selection, or to a chimera of positive and purifying selection at the locus. A solution to this limitation would be to apply Ka/Ks analysis across many species at individual codons.
The dN/dS method requires a rather strong signal in order to detect selection. To detect selection between lineages, the selection, averaged over all sites in the sequence, must produce a dN/dS greater than one, which is quite a feat if regions of the gene are strongly conserved. To detect selection at specific sites, the dN/dS ratio must be greater than one when averaged over all included lineages at that site, implying that the site must be under selective pressure in all sampled lineages. This limitation can be moderated by allowing the dN/dS ratio to take multiple values across sites and across lineages; the inclusion of more lineages also increases the power of a sites-based approach.
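A rough numerical illustration shows how a small class of positively selected sites disappears in a gene-wide average. The proportions and ratios below are invented, and treating the gene-wide value as a proportion-weighted mean of the per-class ratios is a simplifying assumption rather than how any particular program computes it.

```python
def gene_wide_omega(site_classes):
    """Proportion-weighted mean dN/dS over (proportion, omega) site classes."""
    total = sum(proportion for proportion, _ in site_classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(proportion * omega for proportion, omega in site_classes)


# Hypothetical gene: 95% of sites strongly conserved, 5% under positive selection.
site_classes = [(0.95, 0.05), (0.05, 4.0)]
print(round(gene_wide_omega(site_classes), 4))  # 0.2475
```

Although one site in twenty has a ratio of 4 in this made-up gene, the average is well below one, so a single gene-wide estimate would suggest purifying selection only.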
Further, the method lacks the capability to distinguish between positive and negative nonsynonymous substitutions. Some amino acids are chemically similar to one another, whereas other substitutions may introduce an amino acid with wildly different properties from its precursor. In most situations, a smaller chemical change is more likely to allow the protein to continue to function, and a large chemical change is likely to disrupt the chemical structure and cause the protein to malfunction. However, incorporating this into a model is not straightforward, as the relationship between a nucleotide substitution and the effects of the modified chemical properties is very difficult to determine.
An additional concern is that the effects of time must be incorporated into an analysis if the lineages being compared are closely related, because it can take a number of generations for natural selection to "weed out" deleterious mutations from a population, especially if their effect on fitness is weak. This limits the usefulness of Ka/Ks for comparing closely related populations.
Individual codon approach
Additional information can be gleaned by determining the dN/dS ratio at specific codons within a gene sequence. For instance, the frequency-tuning region of an opsin may be under enhanced selective pressure when a species colonises and adapts to a new environment, whereas the region responsible for initializing a nerve signal may be under purifying selection. In order to detect such effects, one would ideally calculate the dN/dS ratio at each site. However, this is computationally expensive, and in practice a number of dN/dS classes are established and each site is shoehorned into the best-fitting class.
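Assigning sites to classes can be sketched as a Bayes-rule calculation. In the sketch below every number is hypothetical: the class proportions and the per-site likelihoods under each class are taken as given, whereas a real analysis would compute them from a codon substitution model fitted to an alignment and a phylogeny.

```python
def assign_classes(site_likelihoods, proportions):
    """For each site, return (best class index, posterior probabilities).

    site_likelihoods: one list per site, holding the likelihood of that
                      site's data under each dN/dS class
    proportions:      prior proportion of sites in each class
    """
    results = []
    for likelihoods in site_likelihoods:
        joint = [p * lik for p, lik in zip(proportions, likelihoods)]
        total = sum(joint)
        posterior = [j / total for j in joint]
        best = max(range(len(posterior)), key=posterior.__getitem__)
        results.append((best, posterior))
    return results


# Hypothetical classes: 0 = purifying, 1 = neutral, 2 = positive selection.
proportions = [0.80, 0.15, 0.05]
site_likelihoods = [
    [5e-2, 1e-3, 1e-4],  # site whose data fit the purifying class best
    [1e-3, 2e-3, 5e-2],  # site whose data fit the positive class best
]
for site, (best, posterior) in enumerate(assign_classes(site_likelihoods, proportions)):
    print(site, best, [round(x, 3) for x in posterior])
```

Because the prior proportion of the positive class is small, a site needs a markedly higher likelihood under that class before it is assigned to it.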
The first step in identifying whether positive selection acts on sites is to compare a model in which the dN/dS ratio is constrained to be < 1 at all sites with one in which it may take any value, and to see whether permitting dN/dS to exceed 1 at some sites improves the fit of the model. If this is the case, then sites fitting into the class where dN/dS > 1 are candidates for experiencing positive selection. This form of test can identify sites that further laboratory research can examine to determine possible selective pressure; alternatively, sites believed to have functional significance can be assigned to different dN/dS classes before the model is run.
- Free online server tool that calculates KaKs ratios among multiple sequences
- SeqinR: A free and open biological sequence analysis package for the R language that includes KaKs calculation
- For a simple introduction, see Hurst, L. (2002). "The Ka/Ks ratio: diagnosing the form of sequence evolution". Trends in Genetics 18: 486–489. doi:10.1016/S0168-9525(02)02722-1.
- Yang, Z.; Bielawski, J. P. (2000). "Statistical methods for detecting molecular adaptation". Trends in Ecology & Evolution 15 (12): 496–503. doi:10.1016/S0169-5347(00)01994-7. PMID 11114436.
- Kosakovsky Pond, S. L.; Frost, S. D. W. (2005). "Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection". Molecular Biology and Evolution 22 (5): 1208. doi:10.1093/molbev/msi105. PMID 15703242.
- Rocha, E. P. C.; Smith, J. M.; Hurst, L. D.; Holden, M. T. G.; Cooper, J. E.; Smith, N. H.; Feil, E. J. (2006). "Comparisons of dN/dS are time dependent for closely related bacterial genomes". Journal of Theoretical Biology 239 (2): 226. doi:10.1016/j.jtbi.2005.08.037. PMID 16239014.
- Kryazhimskiy S, Plotkin JB (2008). "The Population Genetics of dN/dS". PLoS Genetics 4 (12): e1000304. doi:10.1371/journal.pgen.1000304. PMC 2596312. PMID 19081788.
- Peterson GI, Masel J (2009). "Quantitative Prediction of Molecular Clock and Ka/Ks at Short Timescales". Molecular Biology & Evolution 26 (11): 2595–2603. doi:10.1093/molbev/msp175. PMC 2912466. PMID 19661199.
- Li WH, Wu CI, Luo CC (1985). "A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes". Molecular Biology and Evolution 2 (2): 150–174.
- Nei M, Gojobori T (1986). "Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions". Molecular Biology and Evolution 3 (5): 418–426.
- Li WH (1993). "Unbiased estimation of the rates of synonymous and nonsynonymous substitution". Journal of Molecular Evolution 36: 96–99.
- Pamilo P, Bianchi NO (1993). "Evolution of the Zfx and Zfy genes: rates and interdependence between the genes". Molecular Biology and Evolution 10 (2): 271–281.
- Muse SV, Gaut BS (1994). "A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome". Molecular Biology and Evolution 11 (5): 715–724.
- Goldman N, Yang Z (1994). "A codon-based model of nucleotide substitution for protein-coding DNA sequences". Molecular Biology and Evolution 11 (5): 725–736.
- Comeron JM (1995). "A method for estimating the numbers of synonymous and nonsynonymous substitutions per site". Journal of Molecular Evolution 41: 1152–1159.
- Ina Y (1995). "New methods for estimating the numbers of synonymous and nonsynonymous substitutions". Journal of Molecular Evolution 40: 190–226.
- Yang Z (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood". CABIOS 13: 555–556.
- Yang Z, Nielsen R (2000). "Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models". Molecular Biology and Evolution 17 (1): 32–43.
- Zhang Z, Li J, Yu J (2006). "Computing Ka and Ks with a consideration of unequal transitional substitutions". BMC Evolutionary Biology 6: 44.
- Zhang Z, Li J, Zhao X, Wang J, Wong GK, Yu J (2006). "KaKs_Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging". Genomics, Proteomics & Bioinformatics 4 (4): 259–263. doi:10.1016/S1672-0229(07)60007-2.