# ProbCons

ProbCons is an open source probabilistic consistency-based multiple alignment of amino acid sequences. It is an efficient protein multiple sequence alignment program, which has demonstrated a statistically significant improvement in accuracy compared to several leading alignment tools.[1][2]

## Algorithm

The following describes the basic outline of the ProbCons algorithm.[3]

### Step 1: Reliability of an alignment edge

For every pair of sequences compute the probability that letters ${\displaystyle x_{i}}$ and ${\displaystyle y_{i}}$ are paired in ${\displaystyle a^{*}}$ an alignment that is generated by the model.

{\displaystyle {\begin{aligned}P(x_{i}\sim y_{i}|x,y)&{\stackrel {def}{=}}Pr[x_{i}\sim y_{i}{\text{ in some a }}|x,y]\\&=\sum _{{\text{alignment a with }}x_{i}-y_{i}}Pr[a|x,y]\\&=\sum _{\text{alignment a}}\mathbf {1} \{x_{i}-y_{i}\in a\}Pr[a|x,y]\end{aligned}}}

(Where ${\displaystyle \mathbf {1} \{x_{i}\sim y_{i}\in a\}}$ is equal to 1 if ${\displaystyle x_{i}}$ and ${\displaystyle y_{i}}$ are in the alignment and 0 otherwise.)

### Step 2: Maximum expected accuracy

The accuracy of an alignment ${\displaystyle a^{*}}$ with respect to another alignment ${\displaystyle a}$ is defined as the number of common aligned pairs divided by the length of the shorter sequence.

Calculate expected accuracy of each sequence:

{\displaystyle {\begin{aligned}E_{Pr[a|x,y]}(acc(a^{*},a))&=\sum _{a}Pr[a|x,y]acc(a^{*},a)\\&={\frac {1}{min(|x|,|y|)}}\cdot \sum _{a}\mathbf {1} \{x_{i}\sim y_{i}\in a\}Pr[a|x,y]\\&={\frac {1}{min(|x|,|y|)}}\cdot \sum _{x_{i}-y_{i}}P(x_{i}\sim y_{j}|x,y)\end{aligned}}}

This yields a maximum expected accuracy (MEA) alignment:

${\displaystyle E(x,y)=\arg \max _{a^{*}}\;E_{Pr[a|x,y]}(acc(a^{*},a))}$

### Step 3: Probabilistic Consistency Transformation

All pairs of sequences x,y from the set of all sequences ${\displaystyle {\mathcal {S}}}$ are now re-estimated using all intermediate sequences z:

${\displaystyle P'(x_{i}-y_{i}|x,y)={\frac {1}{|{\mathcal {S}}|}}\sum _{z}\sum _{1\leq k\leq |z|}P(x_{i}\sim z_{i}|x,z)\cdot P(z_{i}\sim y_{i}|z,y)}$

This step can be iterated.

### Step 4: Computation of guide tree

Construct a guide tree by hierarchical clustering using MEA score as sequence similarity score. Cluster similarity is defined using weighted average over pairwise sequence similarity.

### Step 5: Compute MSA

Finally compute the MSA using progressive alignment or iterative alignment.

## References

1. ^ Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005). "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment". Genome Research. 15 (2): 330–340. doi:10.1101/gr.2821705. PMC . PMID 15687296.
2. ^ Roshan, Usman (2014-01-01). Russell, David J, ed. Multiple Sequence Alignment Methods. Methods in Molecular Biology. Humana Press. pp. 147–153. doi:10.1007/978-1-62703-646-7_9. ISBN 9781627036450.
3. ^ Lecture "Bioinformatics II" at University of Freiburg