Gap penalty values are designed to reduce the score when a sequence alignment has been disturbed by indels. Typically the central elements used to measure the score of an alignment have been matches, mismatches and spaces. Another important element to measure alignment scores are gaps. A gap is a consecutive run of spaces in an alignment and are used to create alignments that are better conformed to underlying biological models and more closely fit patterns that one expects to find in meaningful alignments. Gaps are represented as dashes on a protein/DNA sequence alignment. The length of a gap is scored by the number of indels (insertions/deletions) in the sequence alignment. In protein and DNA sequence matching, two sequences are aligned to determine if they have a segment each that is significantly similar. A local alignment score is assigned according to the quality of the matches in the alignment subtracted by penalties for gaps present within the alignment. The best gap costs to use with a given substitution matrix are determined empirically. Gap penalties are used with local alignment that match a contiguous sub-sequence of the first sequence with a contiguous sub-sequence of the second sequence. When comparing proteins, one uses a similarity matrix which assigns a score to each possible residue. The score should be positive for similar residues and negative for dissimilar residues pair. Gaps are usually penalized using a linear gap function that assigns an initial penalty for a gap opening, and an additional penalty for gap extensions which increase the gap length.
Introduction to Gap Penalties
A local sequence alignment matches a contiguous sub-section of one sequence with a contiguous sub-section of another. The Smith-Waterman algorithm is motivated by giving scores for matches and mismatches. Matches increase the overall score of an alignment whereas mismatches decrease the score. A good alignment then has a positive score and a poor alignment has a negative score. The local algorithm finds an alignment with the highest score by considering only alignments that score positives and picking the best one from those. The algorithm is a Dynamic programming algorithm. When comparing proteins, one uses a similarity matrix which assigns a score to each possible residue. The score should be positive for similar residues and negative for dissimilar residues pair. Gaps are usually penalized using a linear gap function that assigns an initial penalty for a gap opening, and an additional penalty for gap extensions, increasing the gap length.
Substitution matrices such as BLOSUM are used for sequence alignment of proteins. A Substitution matrix assigns a score for aligning any possible pair of residues. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees. A single matrix may be reasonably efficient over a relatively broad range of evolutionary change. The BLOSUM-62 matrix is one of the best substitution matrices for detecting weak protein similarities. BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distant related sequences. For example, BLOSUM-80 is used for alignments that are more similar in sequence, and BLOSUM-45 is used for alignments that have diverged from each other. For particularly long and weak alignments, the BLOSUM-45 matrix may provide the best results. Short alignments are more easily detected using a matrix with a higher "relative entropy" than that of BLOSUM-62. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries.
During DNA Replication, the replication machinery is prone to making two types of errors while duplicating the DNA. These two replication errors are insertions and deletions of single DNA bases from the DNA strand (indels). Indels can have severe biological consequences by causing mutations in the DNA strand that could result in the inactivation or over activation of the target protein. For example if a one or two nucleotide indel occurs in a coding sequence the result will be a shift in the reading frame, or a frameshift mutation that may render the protein inactive. The biological consequences of indels are often deleterious and are frequently associated with human pathologies such as cancer. However not all indels are frameshift mutations. If indels occur in trinucleotides, the result is an extension of the protein sequence that may also have implications on protein function.
Types of Gap Penalties
The most widely used gap penalty function is the affine gap penalty. Affine gap penalty defines the basic linear form of a gap penalty function. For a given combination of a scoring method and a linear gap penalty the gap penalty parameters remain fixed in aligning different residue positions. Therefore affine gap penalty has the advantage of simplicity and easy use in dynamic programming. The penalty is composed of two parts: a penalty for the existence of a gap (gap open), and a further length-dependent penalty (gap extension).
Profile-based variable gap penalties
Profile–profile alignment algorithms are powerful tools for detecting protein homology relationships with improved alignment accuracy. Profile-profile alignments are based on the statistical indel frequency profiles from multiple sequence alignments generated by PSI-BLAST searches. Rather than using substitution matrices to measure the similarity of amino acid pairs, profile–profile alignment methods require a profile-based scoring function to measure the similarity of profile vector pairs. Profile-profile alignments employ gap penalty functions. The gap information is usually used in the form of indel frequency profiles, which is more specific for the sequences to be aligned. ClustalW and MAFFT adopted this kind of gap penalty determination for their multiple sequence alignments. Alignment accuracies can be improved using this model, especially for proteins with low sequence identity. Some profile–profile alignment algorithms also run the secondary structure information as one term in their scoring functions, which improves alignment accuracy.
Assigning Gap Penalty Values
Gap penalty values are designed to reduce the score when an alignment has been disturbed by indels. The value should be small enough to allow a previously accumulated alignment to continue with an insertion in one of the sequences but should not be so large that this previous alignment score is removed completely. There are two strategies when assigning values to gaps:
- Keep the score similar regardless of gap length. Allow a constant overall gap penalty regardless of gap length. Therefore assign no gap extension penalty and only penalize the sequence when there is a gap open. This will penalize a large gap by the same extent as a small gap.
- Make the score becomes larger as a linear function of gap length. Have a larger gap opening penalty followed by a gap extension penalty that is smaller than the gap open penalty. This will penalize several small gaps by the same extent as 1 large gap.
There are a few challenges when it comes to working with gaps. When working with popular algorithms there seems to be little theoretical basis for the form of the gap penalty functions. Consequently, for any alignment situation gap placement must be empirically determined. Also, pairwise alignment gap penalties,such as the affine gap penalty, are often implemented independent of the amino acid types in the inserted or deleted fragment or at the broken ends, despite evidence that specific residue types are preferred in gap regions. Finally, alignment of sequences implies alignment of the corresponding structures, but the relationships between structural features of gaps in proteins and their corresponding sequences are only imperfectly known. Because of this incorporating structural information into gap penalties is difficult to do. Some algorithms use predicted or actual structural information to bias the placement of gaps. However, only a minority of sequences have known structures, and most alignment problems involve sequences of unknown secondary and tertiary structure.
- Vingron, M.; Waterman, M. S. (1994). "Sequence alignment and penalty choice. Review of concepts, case studies and implications". Journal of molecular biology 235 (1): 1–12. doi:10.1016/S0022-2836(05)80006-3. PMID 8289235.
- "BLAST substitution matrices". NCBI. Retrieved 2012-11-27.
- Garcia-Diaz, Miguel (2006). "Trends in Biochemical Sciences". Trends in Biochemical Sciences 31 (4).
- Wang C, Yan RX, Wang XF, Si JN, Zhang Z (12 October 2011). "Comparison of linear gap penalties and profile-based variable gap penalties in profile-profile alignments". Comput Biol Chem 35 (5): 308–318. doi:10.1016/j.compbiolchem.2011.07.006. PMID 22000802.
- "About Gaps In Sequence Alignments". EMBL-EBI. Retrieved 2012-11-27.[dead link]
- Wrabl JO, Grishin NV (1 January 2004). "Gaps in structurally similar proteins: towards improvement of multiple sequence alignment". Proteins 54 (1): 71–87. doi:10.1002/prot.10508. PMID 14705025.
- Taylor WR, Munro RE (1997). "Multiple sequence threading: conditional gap placement". Fold Des 2 (4): S33–9. doi:10.1016/S1359-0278(97)00061-8.
- Taylor WR (1996). "A non-local gap-penalty for profile alignment". Bull Math Biol 58 (1): 1–18. doi:10.1007/BF02458279. PMID 8819751.
- Vingron M, Waterman MS (1994). "Sequence alignment and penalty choice. Review of concepts, case studies and implications". J Mol Biol 235 (1): 1–12. doi:10.1016/S0022-2836(05)80006-3. PMID 8289235.
- Panjukov VV (1993). "Finding steady alignments: similarity and distance". Comput Appl Biosci 9 (3): 285–90. doi:10.1093/bioinformatics/9.3.285. PMID 8324629.
- Alexandrov NN (1992). "Local multiple alignment by consensus matrix". Comput Appl Biosci 8 (4): 339–45. doi:10.1093/bioinformatics/8.4.339. PMID 1498689.
- Hein J (1989). "A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given". Mol Biol Evol 6 (6): 649–68. PMID 2488477.
- Henneke CM (1989). "A multiple sequence alignment algorithm for homologous proteins using secondary structure information and optionally keying alignments to functionally important sites". Comput Appl Biosci 5 (2): 141–50. doi:10.1093/bioinformatics/5.2.141. PMID 2751764.
- Reich JG, Drabsch H, Daumler A (1984). "On the statistical assessment of similarities in DNA sequences". Nucleic Acids Res 12 (13): 5529–43. doi:10.1093/nar/12.13.5529. PMC 318937. PMID 6462914.