# Needleman–Wunsch algorithm


The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was published in 1970 by Saul B. Needleman and Christian D. Wunsch;[1] it uses dynamic programming, and was the first application of dynamic programming to biological sequence comparison. It is sometimes referred to as the optimal matching algorithm.

Needleman-Wunsch pairwise sequence alignment
Sequences    Best Alignments
---------    ----------------------
GATTACA      G-ATTACA      G-ATTACA      G-ATTACA
GCATGCU      GCATG-CU      GCA-TGCU      GCAT-GCU


## A modern presentation

Scores for aligned characters are specified by a similarity matrix. Here, $S(a, b)$ is the similarity of characters a and b. The algorithm uses a linear gap penalty, here called $d$.

For example, if the similarity matrix is

|   | A | G | C | T |
|---|---|---|---|---|
| A | 10 | -1 | -3 | -4 |
| G | -1 | 7 | -5 | -3 |
| C | -3 | -5 | 9 | 0 |
| T | -4 | -3 | 0 | 8 |

then the alignment:

AGACTAGTTAC
CGA---GACGT


with a gap penalty of -5, would have the following score:

$S(A,C) + S(G,G) + S(A,A) + (3\times d) + S(G,G) + S(T,A) + S(T,C) + S(A,G) + S(C,T)$
$= -3 + 7 + 10 - (3\times 5) + 7 - 4 + 0 - 1 + 0 = 1$
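
The arithmetic above can be checked mechanically. The following is a minimal sketch, not part of the original presentation: the nested-dictionary layout of `S` and the helper name `alignment_score` are assumptions made for illustration only.

```python
# Similarity matrix from the example, stored as a nested dict (layout is an assumption).
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def alignment_score(row_a, row_b, S, d):
    """Sum S(a, b) over aligned pairs and add d for every column containing a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += d          # linear gap penalty: d per gap position
        else:
            total += S[a][b]    # similarity of the two aligned characters
    return total


print(alignment_score("AGACTAGTTAC", "CGA---GACGT", S, d=-5))  # prints 1
```

Note that `d` is passed as a negative number here, matching the gap penalty of -5 in the example.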

To find the alignment with the highest score, a two-dimensional array (or matrix) F is allocated. The entry in row i and column j is denoted here by $F_{ij}$. There is one row for each character in sequence A and one column for each character in sequence B, plus a row and a column with index 0 for the empty prefixes. Thus, if we are aligning sequences of sizes n and m, the amount of memory used is in $O(nm)$. Hirschberg's algorithm holds only a subset of the array in memory and uses $\Theta(\min \{n,m\})$ space, but is otherwise similar to Needleman–Wunsch (and still requires $O(nm)$ time).

As the algorithm progresses, $F_{ij}$ is assigned the optimal score for aligning the first $i$ characters of A ($i=0,\dotsc,n$) with the first $j$ characters of B ($j=0,\dotsc,m$). The principle of optimality is then applied as follows:

• Basis:
$F_{0j} = d \cdot j$
$F_{i0} = d \cdot i$
• Recursion, based on the principle of optimality:
$F_{ij} = \max(F_{i-1,j-1} + S(A_{i}, B_{j}), \; F_{i,j-1} + d, \; F_{i-1,j} + d)$
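
Because each $F_{ij}$ depends only on the current and the previous row, the optimal score alone can be computed while keeping two rows in memory. The sketch below illustrates that observation; it is not Hirschberg's algorithm, which additionally recovers the alignment itself. The names `nw_score` and `simple_score` are hypothetical, `S` is assumed to be a symmetric function of two characters, and the scoring values (+1 match, -1 mismatch, d = -1) are chosen only for the demonstration.

```python
def nw_score(A, B, S, d):
    """Optimal global alignment score, keeping only two rows of F at a time."""
    # Put the shorter sequence along the row so a row has min(n, m) + 1 entries.
    # Swapping the sequences assumes S is symmetric: S(a, b) == S(b, a).
    if len(B) > len(A):
        A, B = B, A
    prev = [d * j for j in range(len(B) + 1)]      # basis: F[0][j] = d * j
    for i in range(1, len(A) + 1):
        curr = [d * i] + [0] * len(B)              # basis: F[i][0] = d * i
        for j in range(1, len(B) + 1):
            curr[j] = max(
                prev[j - 1] + S(A[i - 1], B[j - 1]),  # diagonal: align two characters
                curr[j - 1] + d,                       # left: gap in one sequence
                prev[j] + d,                           # up: gap in the other sequence
            )
        prev = curr
    return prev[len(B)]


def simple_score(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(nw_score("GATTACA", "GCATGCU", simple_score, d=-1))  # prints 0
```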

The pseudo-code for the algorithm to compute the F matrix therefore looks like this:

    for i=0 to length(A)
        F(i,0) ← d*i
    for j=0 to length(B)
        F(0,j) ← d*j
    for i=1 to length(A)
        for j=1 to length(B)
        {
            Match ← F(i-1,j-1) + S(Ai, Bj)
            Delete ← F(i-1, j) + d
            Insert ← F(i, j-1) + d
            F(i,j) ← max(Match, Insert, Delete)
        }
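
A direct Python transcription of this pseudo-code might look as follows. It is a sketch under stated assumptions: the function names `nw_fill` and `simple_score` are hypothetical, `S` is passed as a function of two characters, and Python's 0-based strings mean that $A_i$ is written `A[i - 1]`. The demonstration scoring (+1 match, -1 mismatch, d = -1) is an assumption and is not taken from the text.

```python
def nw_fill(A, B, S, d):
    """Return the (len(A) + 1) x (len(B) + 1) score matrix F as a list of rows."""
    n, m = len(A), len(B)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = d * i                      # prefix of A aligned against gaps only
    for j in range(m + 1):
        F[0][j] = d * j                      # prefix of B aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1][j - 1] + S(A[i - 1], B[j - 1])
            delete = F[i - 1][j] + d         # A_i aligned with a gap
            insert = F[i][j - 1] + d         # B_j aligned with a gap
            F[i][j] = max(match, insert, delete)
    return F


def simple_score(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


F = nw_fill("GATTACA", "GCATGCU", simple_score, d=-1)
print(F[-1][-1])  # prints 0, the optimal score under this scoring
```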


Once the F matrix is computed, the entry $F_{nm}$ gives the maximum score among all possible alignments. To compute an alignment that actually gives this score, start from the bottom-right cell and compare its value with the three possible sources (Match, Insert, and Delete above) to see which it came from. If Match, then $A_i$ and $B_j$ are aligned; if Delete, then $A_i$ is aligned with a gap; and if Insert, then $B_j$ is aligned with a gap. Then move to the source cell and repeat until the top-left cell is reached. (In general, more than one choice may have the same value, leading to alternative optimal alignments.)

    AlignmentA ← ""
    AlignmentB ← ""
    i ← length(A)
    j ← length(B)
    while (i > 0 or j > 0)
    {
        if (i > 0 and j > 0 and F(i,j) == F(i-1,j-1) + S(Ai, Bj))
        {
            AlignmentA ← Ai + AlignmentA
            AlignmentB ← Bj + AlignmentB
            i ← i - 1
            j ← j - 1
        }
        else if (i > 0 and F(i,j) == F(i-1,j) + d)
        {
            AlignmentA ← Ai + AlignmentA
            AlignmentB ← "-" + AlignmentB
            i ← i - 1
        }
        else    (here j > 0 and F(i,j) == F(i,j-1) + d)
        {
            AlignmentA ← "-" + AlignmentA
            AlignmentB ← Bj + AlignmentB
            j ← j - 1
        }
    }
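
Putting the fill and the traceback together, a self-contained Python sketch could read as below. The names `needleman_wunsch` and `simple_score` are hypothetical, and the scoring used in the demonstration (+1 match, -1 mismatch, d = -1) is an assumption. When several sources tie, this version prefers Match, then Delete, then Insert, so it returns only one of the possibly many optimal alignments.

```python
def needleman_wunsch(A, B, S, d):
    """Return (score, aligned_A, aligned_B) for one optimal global alignment."""
    n, m = len(A), len(B)
    # Fill the score matrix F using the basis and recursion.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = d * i
    for j in range(m + 1):
        F[0][j] = d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S(A[i - 1], B[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
    # Trace back from the bottom-right cell, collecting columns in reverse order.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S(A[i - 1], B[j - 1]):
            out_a.append(A[i - 1])
            out_b.append(B[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] + d):
            # In column 0 the only possible move is up, so no comparison is needed.
            out_a.append(A[i - 1])
            out_b.append("-")
            i -= 1
        else:
            # Remaining case: the value came from the left neighbour.
            out_a.append("-")
            out_b.append(B[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def simple_score(a, b):
    """Hypothetical similarity function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


score, row_a, row_b = needleman_wunsch("GATTACA", "GCATGCU", simple_score, d=-1)
print(score)
print(row_a)
print(row_b)
```

Lists are appended to and reversed at the end rather than prepending to strings as in the pseudo-code; this keeps the traceback linear in the alignment length but does not change the result.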


## Historical notes

Needleman and Wunsch describe their algorithm explicitly for the case when the alignment is penalized solely by the matches and mismatches, and gaps have no penalty (d=0). The original publication[1] from 1970 suggests the recursion $F_{ij} = \max_{h<i,\,k<j} \{ F_{h,j-1} + S(A_{i}, B_{j}), \; F_{i-1,k} + S(A_{i}, B_{j}) \}$.

The corresponding dynamic programming algorithm takes cubic time. The paper also points out that the recursion can accommodate arbitrary gap penalization formulas:

A penalty factor, a number subtracted for every gap made, may be assessed as a barrier to allowing the gap. The penalty factor could be a function of the size and/or direction of the gap. [page 444]

A better dynamic programming algorithm with quadratic running time for the same problem (no gap penalty) was first introduced[2] by David Sankoff in 1972. Similar quadratic-time algorithms were discovered independently by T. K. Vintsyuk[3] in 1968 for speech processing ("time warping"), and by Robert A. Wagner and Michael J. Fischer[4] in 1974 for string matching.

Needleman and Wunsch formulated their problem in terms of maximizing similarity. Another possibility is to minimize the edit distance between sequences, introduced by Vladimir Levenshtein. Peter H. Sellers showed[5] in 1974 that the two problems are equivalent.

In modern terminology, "Needleman-Wunsch" refers to a global alignment algorithm that takes quadratic time for a linear or affine gap penalty.