# CS-BLAST

Developer(s) Angermueller C, Biegert A, and Soeding J 2.2.3 / December 7, 2013 1.1 / April 14, 2009; 9 years ago C++ English Bioinformatics tool GNU GPL v3 http://wwwuser.gwdg.de/~compbiol/data/csblast/releases/, https://github.com/soedinglab/csblast

CS-BLAST [1] [2] [3] (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST (Basic Local Alignment Search Tool),[4] using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences [4]. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST (Context-Specific Iterated BLAST) is the context-specific analog of PSI-BLAST [5] (Position-Specific Iterated BLAST), which computes the mutation profile with substitution probabilities and mixes it with the query profile [2]. CSI-BLAST (Context-Specific Iterated BLAST) is the context specific analog of PSI-BLAST (Position-Specific Iterated BLAST). Both of these programs are available as web-server and are available for free download.

## Background

Homology is the relationship between biological structures or sequences derived from a common ancestor. Homologous proteins (proteins who have common ancestry) are inferred from their sequence similarity. Inferring homologous relationships involves calculating scores of aligned pairs minus penalties for gaps. Aligning pairs of proteins identify regions of similarity indicating a relationship between the two, or more, proteins. In order to have a homologous relationship, the sum of scores over all the aligned pairs of amino acids or nucleotides must be sufficiently high [2]. Standard methods of sequence comparisons use a substitution matrix to accomplish this [4]. Similarities between amino acids or nucleotides are quantified in these substitution matrices. The substitution score (${\displaystyle S}$) of amino acids ${\displaystyle a}$ and ${\displaystyle b}$ can we written as follows:

${\displaystyle S(a,b)=const\times \log \left({\frac {P(a|b)}{P(a)}}\right)}$

where ${\displaystyle P(a|b)}$ denotes the probability of amino acid ${\displaystyle a}$ mutating into amino acid ${\displaystyle b}$ [2]. In a large set of sequence alignments, counting the number of amino acids as well as the number of aligned pairs ${\displaystyle (a,b)}$ will allow you to derive the probabilities ${\displaystyle P(a|b)}$ and ${\displaystyle P(a)}$.

Since protein sequences need to maintain a stable structure, a residue’s substitution probabilities are largely determined by the structural context of where it is found. As a result, substitution matrices are trained for structural contexts. Since context information is encoded in transition probabilities between states, mixing mutation probabilities from substitution matrices weighted for corresponding states achieves improved alignment qualities when compared to standard substitution matrices. CS-BLAST improves further upon this concept. The figure illustrates the sequence to sequence and profile to sequence equivalence with the alignment matrix. The query profile results from the artificial mutations in which the bar heights are proportional to the corresponding amino acid probabilities [4].

(A FIGURE NEEDS TO GO HERE THIS IS THE CAPTION) “Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns” [4].

## Performance

CS-BLAST greatly improves alignment quality over the entire range of sequence identities and especially for difficult alignments in comparison to regular BLAST and PSI-BLAST. PSI-BLAST (Position-Specific Iterated BLAST) runs at about the same speed per iteration as regular BLAST, but is able to detect weaker sequence similarities that are still biologically relevant [3]. Alignment quality is based on alignment sensitivity and alignment precision [4].

### Alignment Quality

Alignment sensitivity is measured by correctly comparing predicted alignments of residue pairs to the total number of possible alignable pairs. This is calculated with the fraction: (pairs correctly aligned)/(pairs structurally alignable)

Alignment precision is measured by the correctness of aligned residue pairs. This is calculated with the fraction: (pairs correctly aligned)/(pairs aligned)

### Search Performance

The graph is the benchmark Biegert and Söding used to evaluate homology detection. The benchmark compares CS-BLAST to BLAST using true positives from the same superfamily versus false positive of pairs from different folds [4]. (A GRAPH NEEDS TO GO HERE)

The other graph uses detects true positives (with a different scale than the previous graph) and false positives of PSI-BLAST and CSI-BLAST and compares the two for one to five iterations [4]. (A DIFFERENT GRAPH NEEDS TO GO HERE)

CS-BLAST offers improved sensitivity and alignment quality in sequence comparison. Sequence searches with CS-BLAST are more than twice as sensitive as BLAST [4]. It produces higher quality alignments and generates reliable E-values without a loss of speed. CS-BLAST detects 139% more homologous proteins at a cumulative error rate of 20% [2]. At a 10% error rate, 138% more homologs are detected, and for the easiest cases at a 1% error rate, CS-BLAST was still 96% more effective than BLAST [2]. Additionally, CS-BLAST in 2 iterations is more sensitive than 5 iterations of PSI-BLAST. About 15% more homologs were detected in comparison [4].

## Method

The CS-BLAST method derives similarities between sequence context-specific amino acids for 13 residue windows centered on each residue. CS-BLAST works by generating a sequence profile for a query sequence by using context-specific mutations and then jumpstarting a profile-to-sequence search method.

CS-BLAST starts by predicting the expected mutation probabilities for each position. For a certain residue, a sequence window of ten total surrounding residues is selected as seen in the image. Then, Biegert and Söding compared the sequence window to a library with thousands of context profiles. The library is generated by clustering a representative set of sequence profile windows. The actual predicting of mutation probabilities is achieved by weighted mixing of the central columns of the most similar context profiles [4]. This aligns short profiles that are nonhomologous and ungapped which gives higher weight to better matching profiles, making them easier to detect [4]. A sequence profile represents a multiple alignment of homologous sequences and describes what amino acids are likely to occur at each position in related sequences. With this method substitution matrices are unnecessary. In addition, there is no need for transition probabilities as a result of the fact that context information is encoded within the context profiles. This makes computation simpler and allows for runtime to be scaled linearly instead of quadratically.

(A FIGURE GOES HERE AND THIS IS THE CAPTION) “Computation of context-specific pseudocounts. The Expected mutations (i.e., pseudocounts) for a residue (highlighted in yellow) are calculated based on the sequence context around it (red box). Library profiles contribute to the context-specific sequence profile with weights determined by their similarity to the sequence context. The resulting profile can be used to jump-start PSI-BLAST, which will then perform a sequence-to-sequence search with context-specific amino acid similarities” [4].

The context specific mutation probability, the probability of observing a specific amino acid in a homologous sequence given a context, is calculated by a weighted mixing of the amino acids in the central columns of the most similar context profiles. The image illustrates the calculation of expected mutation probabilities for a specific residue at a certain position. As seen in the image, the library of context profiles all contribute based on similarity to the context specific sequence profile for the query sequence [4].

## Models

In predicting substitution probabilities using only the amino acid’s local sequence context, you gain the advantage of not needing to know the structure of the query protein while still allowing for the detection of more homologous proteins than standard substitution matrices [4]. Bigert and Söding’s approach to predicting substitution probabilities was based on a generative model. In another paper in collaboration with Angermüller, they develop a discriminative machine learning method that improves prediction accuracy [2].

### Generative Model

Given an observed variable ${\displaystyle x}$ and a target variable ${\displaystyle y}$, a generative model defines the probabilities ${\displaystyle P(x,y)}$ and ${\displaystyle P(y)}$ separately. In order to predict the unobserved target variable, ${\displaystyle y}$, Bayes’ theorem, ${\displaystyle P(y|x)=\left({\frac {P(x|y)P(y)}{[\textstyle \sum _{y}P(x|y)P(y)\displaystyle ]}}\right)}$

is used. A generative model, as the name suggests, allows one to generate new data points ${\displaystyle (x,y)}$. The joint distribution is described as ${\displaystyle P(x,y)=P(x|y)P(y)}$. To train a generative model, the following equation is used to maximize the joint probability ${\displaystyle \prod \left({\frac {P(x_{n},y_{n})}{trainingData(x_{n},y_{n})}}\right)}$.

### Discriminative Model

The discriminative model is a logistic regression maximum entropy classifier. With the discriminative model, the goal is to predict a context specific substitution probability given a query sequence. The discriminative approach for modeling substitution probabilities,${\displaystyle P(a|C_{l})}$ where ${\displaystyle C_{l}}$ describes a sequence of amino acids around position ${\displaystyle l}$ of a sequence, is based on ${\displaystyle K}$ context states. Context states are characterized by parameters emission weight (${\displaystyle v_{k}(a)}$), bias weight (${\displaystyle \pi _{k}}$), and context weight (${\displaystyle \lambda _{k}(j,a)}$) [2]. Emission probabilities from a context state are given by the emission weights as follows for ${\displaystyle d=1}$ to ${\displaystyle 20}$: ${\displaystyle P(a|k)=\left({\frac {exp(v_{k}(a))}{\sum exp(v_{k}(a'))}}\right)}$

where ${\displaystyle P(a|k)}$ is the emission probability and is the context state. In the discriminative approach, probability for a context state ${\displaystyle k}$ given context ${\displaystyle C_{l}}$ is modeled directly by the exponential of an affine function of the context account profile where ${\displaystyle C_{l}(j,a)}$ is the context count profile with a normalization constant ${\displaystyle Z(C_{l})}$ normalizes the probability to 1. This equation is as follows where the first summation takes ${\displaystyle j=-d}$ to ${\displaystyle d}$ and the second summation takes ${\displaystyle a=1}$ to ${\displaystyle 20}$: ${\displaystyle P(k|C_{l})=\left({\frac {1}{Z(C_{l})}}exp(\pi _{k}+\pi \sum \sum \lambda _{k}(j,a)(C_{l}(j,a))\right)}$.

As with the generative model, target distribution is obtained by mixing the emission probabilities of each context state weighted by the similarity.

## Using CS-BLAST

The MPI Bioinformatics toolkit in an interactive website and service that allows anyone to do comprehensive and collaborative protein analysis with a variety of different tools including CS-BLAST as well as PSI-BLAST [1]. This tool allows for input of a protein and select options for you to customize your analysis. It also can forward the output to other tools as well.

## References

1. ^ Angermüller, C.; Biegert, A.; Söding, J. (Dec 2012). "Discriminative modelling of context-specific amino acid substitution probabilities". Bioinformatics. 28 (24): 3240–7. doi:10.1093/bioinformatics/bts622. PMID 23080114.
2. ^ Biegert, A.; Söding, J. (Mar 2009). "Sequence context-specific profiles for homology searching" (PDF). Proc Natl Acad Sci U S A. 106 (10): 3770–5. doi:10.1073/pnas.0810767106. PMC 2645910. PMID 19234132.
3. ^ "Better Sequence Searches Of Genes And Proteins Devised". ScienceDaily. Mar 7, 2009. Retrieved 2009-08-14.
4. ^ Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.
5. ^ Altschul SF; Madden TL; Schäffer AA; Zhang J; Zhang Z; Miller W; Lipman DJ. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Res. 25 (17): 3389–3402. doi:10.1093/nar/25.17.3389. PMC 146917. PMID 9254694.

[1] Alva, Vikram, Seung-Zin Nam, Johannes Söding, and Andrei N. Lupas. “The MPI Bioinformatics Toolkit as an Integrative Platform for Advanced Protein Sequence and Structure Analysis.” Nucleic Acids Research 44.Web server Issue (2016): W410-415. NCBI. Web. 2 Nov. 2016.

[2] Angermüller, Christof, Andreas Biegert, and Johannes Söding. “Discriminative Modelling of Context-specific Amino Acid Substitution Properties” BIOINFORMATICS 28.24 (2012): 3240-247. Oxford Journals. Web. 2 Nov. 2016.

[3] Astschul, Stephen F., et al. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25.17 (1997): 3389-402. Oxford University Press. Print

[4] Bigert, A., and J. Söding. “Sequence Context-specific Profiles for Homology Searching.” Proceedings of the National Academy of Sciences 106.10 (2009): 3770-3775. PNAS. Web. 23 Oct. 2016.