Talk:Needleman–Wunsch algorithm

From Wikipedia, the free encyclopedia
Jump to: navigation, search
          This article is of interest to the following WikiProjects:
WikiProject Computational Biology (Rated Start-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Computational Biology, a collaborative effort to improve the coverage of Computational Biology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the quality scale.
 Mid  This article has been rated as Mid-importance on the importance scale.
 
WikiProject Computer science (Rated Start-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.
 
WikiProject Molecular and Cellular Biology (Rated Start-class, Low-importance)
WikiProject icon This article is within the scope of the WikiProject Molecular and Cellular Biology. To participate, visit the WikiProject for more information.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
 

Error in formula[edit]

I think there's an error in the formula for the iteration of the algorithm: it currently reads:

Fij = max(Fi − 1,j − 1 + S(Ai − 1,Bj − 1),Fi,j − 1 + d,Fi − 1,j + d)

I think it should read

Fij = max(Fi − 1,j − 1 + S(Ai,Bj),Fi,j − 1 + d,Fi − 1,j + d)

(the suffices in S(A,B) need changing) as i'm new to dynamic programming, i'm not sure about this.

Any ideas anyone?

The section following this, starting with 'Basis':

As the algorithm progresses, the Fij will be assigned to be the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows.

Does this sequence begin at i=0? or i=1?

==> There were numerous inconsistencies in the article regarding indexing (zero based or one based)? I made everything zero based. The pseudo code is now self-consistent and consistent with the linked Java code. The above quote is now correct: "Fij represents optimal score for the alignment of the first i ...", where i and j can be zero. -- Kipton Barros, 4/18/2010.

"To compute which alignment actually gives this score, you can start from the bottom left cell" - should this not be right?

Huge error with the initialization concept?==[edit]

If you google search Needleman-Wunsch algorithm you will find one of the top matches is this: http://www.ludwig.edu.au/course/lectures2005/Likic.pdf

This shows that the initialization is based on something other than zeroes. In my solution I implemented this using the (Java) code:

// Initialize the 0,0 element
F[0][0] = 0;

// The top and left sides need to be assigned as if they were dashes
for (int i = 1; i <= A.length(); i++)
{
  F[i][0] = d * i;
}
for (int j = 1; j <= B.length(); j++)
{
  F[0][j] = d * j;
}

(Sorry about probably not following Wiki etiquette, I am a complete newbie here) —The preceding unsigned comment was added by Ravi Luthra (talkcontribs) .

Errors with the algorithm as it is presented here:[edit]

1) In the matrix calculation, the loop iterates from 1 to length(A) and length(B) where the sequences start from 0. Therefore, S(Ai,Bj) should be S(Ai-1, Bj-1) since S(Ai,Bj) would return an error.

==> I have made the suggested change here (and elsewhere) which corresponds to zero based indexing for the sequences. The pseudo-code should now be self consistent. -- Kipton Barros, 4/18/2010

====> This is still not fixed, it is true that S(Ai,Bj) should be changed to S(Ai-1, Bj-1), as well as changing Ai and Bi to Ai-1 and Bi-1 respectively in the whole pseudo-code for it to work properly (I tried it in C with this fix and it now works fine). The whole problem is that the F matrix goes from [0][0] to [lengthA][lengthB], while the sequences go from [0] to [length-1]. So the pseudo code should be changed to this:

for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j
for i=1 to length(A)
  for j=1 to length(B)
  {
    Match ← F(i-1,j-1) + S(Ai-1, Bj-1)
    Delete ← F(i-1, j) + d
    Insert ← F(i, j-1) + d
    F(i,j) ← max(Match, Insert, Delete)
  }
AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 or j > 0)
{
  if (i > 0 and j > 0 and F(i,j) == F(i-1,j-1) + S(Ai-1, Bj-1))
  {
    AlignmentA ← Ai-1 + AlignmentA
    AlignmentB ← Bj-1 + AlignmentB
    i ← i - 1
    j ← j - 1
  }
  else if (i > 0 and F(i,j) == F(i-1,j) + d)
  {
    AlignmentA ← Ai-1 + AlignmentA
    AlignmentB ← "-" + AlignmentB
    i ← i - 1
  }
  else (j > 0 and F(i,j) == F(i,j-1) + d)
  {
    AlignmentA ← "-" + AlignmentA
    AlignmentB ← Bj-1 + AlignmentB
    j ← j - 1
  }
}

192.55.54.40 (talk) 23:19, 6 August 2013 (UTC) Vittorio Capra.

-- IMHO, this algorithm uses the matrix with length(B) + 1 rows and length(A) + 1 columns; a cells in first row and column contain an initial gap penalty; the example of matrix is not from this algorithm —Preceding unsigned comment added by 212.96.160.35 (talk) 20:01, 3 December 2009 (UTC)

2) As in the previous post ('huge error...'), the initialization needs to change according to the "Genomic Perl" book.

3) The code in which the alignments are calculated also gives wrong results. I am using the code in "Genomic Perl" book and it works fine.

Defining d and edit distance[edit]

Note that everyone so far has failed to define the variable d in all their formulas. What is this? Also, I just backtracked up an edit distance matrix and the alignments generated definitely seemed fine to me (inserting "-" when you don't move diagonally). Is this formula a generalization of that? It seems like it. Should this be noted and explained in this page or the Levenshtein page? —Preceding unsigned comment added by 65.88.178.10 (talk) 18:17, 17 September 2007 (UTC)

d appears to be the gap penalty...

Ok, following the link posted above gives an excellent explanation. If you're confused by the wiki page, don't feel bad, it's not very clear. As for the Levenshtein relationship, the matrix you generate in that algorithim is equivalent to F where the (cost of insertion) = (cost of deletion) = (gap penalty), and S(i,j) = the substitution cost.

Error ?[edit]

I made an implementation of this algorithm in C. It seems there is some misstake : (well, I am not completly sure of me, but I did the following change in my source code to make it works correctly :

{

 i <- length(B) - 1
 j <- length(A) - 1

}

instead of

{

 i <- length(A) - 1
 j <- length(B) - 1

}

According we consider F(x,y) (x number of line, y number of column). Which has been the choice in my implementation, and seems to be the choice in this algorithm.

Thanks to confirm it

Valeuf (talk) 17:51, 17 May 2009 (UTC)

I found the same error in my code. The algorithm seems to begin assuming array indices start at 0 and then later on is assumes indices start at 1. For an algorithm in C you also need to alter the while loops so that they exit when i, j <= 0, not when i, j < 0. This still has yet to be fixed. —Preceding unsigned comment added by 18.42.1.137 (talk) 00:10, 20 April 2011 (UTC)

Illustration ?[edit]

This article really needs a visual illustration of what a typical F matrix would contain. Can anyone provide ?

Adding a link[edit]

The following link gives a clear explanation of Needleman-Wunsch algorithm, and can be included in the external links.

http://technology66.blogspot.com/

--122.169.92.175 (talk) 21:11, 12 September 2008 (UTC)

full URL: http://technology66.blogspot.de/2008/08/sequence-alignment-techniques.html --Hubalu (talk) 08:58, 16 July 2013 (UTC)

Wat is S?[edit]

In the pseudo code, what does S mean? It's never mentioned in the text above

Match ← F(i-1,j-1) + S(Ai, Bj)

I suggest a short explaination.

Mcemperor (talk) 16:27, 30 November 2011 (UTC)

Code optimization?[edit]

Why is the F(0,0) initialized two times?

for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j

Either initialize the F(0,0) in a separate statement or adjust the start index of the second loop:

F(0,0) ← 0
for i=1 to length(A)
  F(i,0) ← d*i
for j=1 to length(B)
  F(0,j) ← d*j
for i=0 to length(A)
  F(i,0) ← d*i
for j=1 to length(B)
  F(0,j) ← d*j

--Hubalu (talk) 09:00, 16 July 2013 (UTC)

Indexing problem[edit]

I think there's an issue with the indexes in the verbal description of the algorithm. There are i rows, one for each element of sequence B. We then index i over A rather than B. Likewise there are j columns based on the number of elements in A and then we index j over B. In a square matrix, this should result in the transpose of the desired match matrix and in a non-square matrix this will result in an index going out of bounds. — Preceding unsigned comment added by 107.191.32.203 (talk) 14:48, 14 August 2014 (UTC)

The ″U″ in DNA sequence[edit]

Hi! I've just noticed a mistake in one of the example sequences. The last base in the first sequence is ″U″ (GCATGCU). As it is DNA it should only contain A, T, C and G, shouldn't it?--Zlir'a (talk) 09:52, 30 December 2014 (UTC)