Substitution matrix: Difference between revisions

Content deleted Content added

Inline

Revision as of 21:22, 1 October 2009

In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.

Background

In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence

ALEIRYLRD

could mutate into the sequence

ALEINYLRD

in one generation and possibly

AQEINYQRD

over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophobic residue such as valine is more likely to stay hydrophobic than not, since replacing it with a hydrophilic residue could affect the folding and/or activity of the protein.

If we have two amino acid sequences in front of us, we should be able to say something about how likely they are to be derived from a common ancestor, or homologous. If we can line up the two sequences using a sequence alignment algorithm such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to assign a high score to the comparison of the sequences.

To this end, we will construct a 20x20 matrix where the $(i,j)$ th entry is equal to the probability of the $i$ th amino acid being transformed into the $j$ th amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:

Identity matrix

The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like:

${\begin{bmatrix}1&0&\cdots &0&0\\0&1&&0&0\\\vdots &&\ddots &&\vdots \\0&0&&1&0\\0&0&\cdots &0&1\end{bmatrix}}$

This identity matrix will succeed in the alignment of very similar amino acid sequences but will be miserable at aligning two distantly related sequences. We need to figure out all the probabilities in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.

Log-odds matrices

We express the probabilities of transformation in what are called log-odds scores. The scores matrix S is defined as

$S_{i,j}=\log {\frac {p_{i}\cdot M_{i,j}}{p_{i}\cdot p_{j}}}=\log {\frac {M_{i,j}}{p_{j}}}=\log {\frac {observed\;frequency}{expected\;frequency}}$

where $M_{i,j}$ is the probability that amino acid $i$ transforms into amino acid $j$ and $p_{i}$ is the frequency of amino acid i. The base of the logarithm is not important, and you will often see the same substitution matrix expressed in different bases.

PAM

One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used.

A matrix for divergent sequences can be calculated from a matrix for closely related sequences by taking the second matrix to a power. For instance, we can roughly approximate the WIKI2 matrix from the WIKI1 matrix by saying $W_{2}=W_{1}^{2}$ where $W_{1}$ is WIKI1 and $W_{2}$ is WIKI2. This is how the PAM250 matrix is calculated.

BLOSUM

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins. To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered giving weight 1 to each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pairs frequencies were then counted between clusters, hence pairs were only counted between segments less than 62% identical. One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.

Differences between PAM and BLOSUM

PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.
The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than Blosum50.

Extensions and improvements

Many specialized substitution matrices have been developed that describe the amino acid substitution rates in specific structural or sequence contexts, such as in transmembrane alpha helices,^[1] for combinations of secondary structure states and solvent accessiblity states,^[2]^[3]^[4] or for local sequence-structure contexts.^[5] These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used. Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular BLAST program has been demonstrated to achieve a two-fold sensitivity improvement for remotely related sequences over BLAST at similar speeds (CS-BLAST).

References

^ . PMID 11473008. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)
^ . PMID 9135128. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)
^ . PMID 18833291. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)
^ . PMID 18004781. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)
^ . PMID 16352653. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

External links

PAM Matrix calculator

[1] . PMID 11473008. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

[2] . PMID 9135128. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

[3] . PMID 18833291. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

[4] . PMID 18004781. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

[5] . PMID 16352653. {{cite journal}}: Cite journal requires |journal= (help); Missing or empty |title= (help)

[1]

[2]

[3]

[4]

[5]

@@ Line 1: / Line 1: @@
 {{dablink|This page is about the concept in bioinformatics.  For the economics concept also called the Slutsky matrix, see the article [[Slutsky equation]].}}
+{{Morefootnotes|article|date=October 2009}}
 In [[bioinformatics]] and [[evolutionary biology]], a '''substitution matrix''' describes the rate at which one character in a sequence changes to other character states over time. Substitution matrices are usually seen in the context of [[amino acid]] or [[DNA]] [[sequence alignment]]s, where the similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.
 == Background ==
 In the process of [[evolution]], from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence
@@ Line 24: / Line 23: @@
 == Identity matrix ==
 The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like:
@@ Line 40: / Line 38: @@
 == Log-odds matrices ==
 We express the [[probability|probabilities]] of transformation in what are called [[log-odds]] [[Score (statistics)|score]]s. The scores matrix S is defined as
@@ Line 48: / Line 45: @@
 === PAM ===
 One of the first amino acid [[substitution matrix|substitution matrices]], the PAM ''([[Point accepted mutation|Point Accepted Mutation]])'' matrix was developed by [[Margaret Oakley Dayhoff|Margaret Dayhoff]] in the 1970s. This matrix is calculated by observing the differences in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the [[PAM30|PAM 30]] and the PAM70 are used.
@@ Line 54: / Line 50: @@
 === BLOSUM ===
 Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The [[BLOSUM]] ''(BLOck SUbstitution Matrix)'' series of matrices rectifies this problem. Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins.  The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments.  These conserved sequences are assumed to be of functional importance within related proteins. To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered giving weight 1 to each such cluster (Henikoff and Henikoff). For the BLOSUM62 matrix, this threshold was set at 62%. Pairs frequencies were then counted between clusters, hence pairs were only counted between segments less than 62% identical. One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.
@@ Line 66: / Line 61: @@
 == Extensions and improvements ==
+Many specialized substitution matrices have been developed that describe the amino acid substitution rates in specific structural or sequence contexts, such as in transmembrane alpha helices,<ref>{{cite journal |pmid=11473008}}</ref> for combinations of secondary structure states and solvent accessiblity states,<ref>{{cite journal |pmid=9135128}}</ref><ref>{{cite journal |pmid=18833291}}</ref><ref>{{cite journal |pmid=18004781}}</ref> or for local sequence-structure contexts.<ref>{{cite journal |pmid=16352653}}</ref> These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used. Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular [[BLAST]] program has been demonstrated to achieve a two-fold sensitivity improvement for remotely related sequences over BLAST at similar speeds ([[CS-BLAST]]).
-Many specialized substitution matrices have been developed that describe the amino acid substitution rates in specific structural or sequence contexts, such as in transmembrane alpha helices ([http://www.ncbi.nlm.nih.gov/pubmed/11473008 Muller ''et al.'' 2001]), for combinations of secondary structure states and solvent accessiblity states ([http://www.ncbi.nlm.nih.gov/pubmed/9135128 Rice and Eisenberg 1997], [http://www.ncbi.nlm.nih.gov/pubmed/18833291 Gong and Blundell 2008], [http://www.ncbi.nlm.nih.gov/pubmed/18004781 Goonesekere and Lee, 2008]), or for local sequence-structure contexts ([http://www.ncbi.nlm.nih.gov/pubmed/16352653 HMMSUM]). These context-specific substitution matrices lead to generally improved alignment quality at some cost of speed but are not yet widely used. Recently, sequence context-specific amino acid similarities have been derived that do not need substitution matrices but that rely on a library of sequence contexts instead. Using this idea, a context-specific extension of the popular [[BLAST]] program has been demonstrated to achieve a two-fold sensitivity improvement for remotely related sequences over BLAST at similar speeds ([[CS-BLAST]]).
 == References ==
+{{reflist}}
-* Altschul, S.F. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=2051488&query_hl=1&itool=pubmed_docsum Amino acid substitutions matrices from an information theoretic perspective.] ''J. Mol. Biol.'' 219, 555-665 (1991).
-* Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. A model of evolutionary change in proteins. In "Atlas of Protein Sequence and Structure" 5(3) M.O. Dayhoff (ed.), 345 - 352 (1978).
+==Further reading==
-* Henikoff, S. and Henikoff, J. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=1438297&query_hl=6&itool=pubmed_docsum Amino acid substitution matrices from protein blocks.] ''Proc. Natl. Acad. Sci. USA.'' 89(biochemistry): 10915 - 10919 (1992).
+*{{cite journal |pmid=2051488}}
-* Eddy S.R. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&list_uids=15286655 Where did the BLOSUM62 alignment score matrix come from?] ''Nat Biotechnol.'' 22(8), 1035-6. Review. (2004).
+*{{cite journal |last1=Dayhoff |first1=M. O. |last2=Schwartz |first2=R. M. |last3=Orcutt |first3=B. C. |title=A model of evolutionary change in proteins |journal=Atlas of Protein Sequence and Structure |volume=5 |issue=3 |pages=345-352 |year=1978}}
+*{{cite journal |pmid=1438297}}
+*{{cite journal |pmid=15286655}}
+*{{cite journal |pmid=1438297 |url=http://www.pnas.org/cgi/reprint/89/22/10915}}
 ==External links==
-* [http://www.pnas.org/cgi/reprint/89/22/10915 Henikoff BLOSUM paper, 1992]
 * [http://www.bioinformatics.nl/tools/pam.html PAM Matrix calculator]