Talk:Levenshtein distance

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated C-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.

Time Complexity[edit]

What is the time complexity for calculating Levenshtein distance? For example, is it O(n*m) where n and m are the lengths of the two strings being compared? I think this information would be very helpful to include for any implementations described in the article. --Showeropera (talk) 13:32, 14 January 2016 (UTC)

The Wagner-Fisher algorithm takes O(n*m) in the worst case, which is O(n²) assuming that m = Θ(n). This follows trivially from the structure of the algorithm; count the number of times the loops run, realize that each iteration performs O(1) work, and multiply. QVVERTYVS (hm?) 10:12, 18 January 2016 (UTC)

Incorrect Calculation of 'cost'[edit]

"if s[i-1] = t[j-1]" should have been "if s[i] = t[j]", since s and t are declared as "char s[1..m], char t[1..n]", and the loops go from 1 to m and 1 to n. i-1 and j-1 are needed in implementations in other languages where string character indexing starts at 0 (e.g C). Fixed. --Anon

Someone at put the "-1"s back in. We should keep an eye one this to make sure they don't do it again. Fixed, Again. Adam Butt (talk) 17:01, 26 February 2009 (UTC)

I don't get where the following has appeared from?

   int cost = (s[i-1] == t[j-1) ? 0 : 1;
   d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost ) the pseudo-code, it is suggested that the code is actually...

   if( s[i-1] == t[j-1] )
       d[i,j] = d[i-1,j-1];
       d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+1 )

I tried the latter and it works, so why do all of the other implementations I've seen use the former code??? -- Lee. — Preceding unsigned comment added by (talk) 17:34, 25 November 2011 (UTC)

Appeared where? There is no such code currently in the article. -- X7q (talk) 19:34, 25 November 2011 (UTC)
There isn't any in the actual article, but I've seen this as the implementation everywhere I've looked, as well as within this discussion page. -- Lee. — Preceding unsigned comment added by (talk) 21:42, 25 November 2011 (UTC)
Both codes work. Essentially the difference between them is only that the latter code incorporates the fact that edit distance between two strings doesn't change if you drop their common suffixes/prefixes. So you don't need to check all possibilities in this case to compute minimum, it is always optimal to just drop the common suffix. -- X7q (talk) 22:45, 25 November 2011 (UTC)

Incorrect Date[edit]

Levenshtein's paper was published in 1966, not 1965. cf. V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physice – Doklady 10, pp. 707-710, 1966. ralian 03:15, 11 April 2007 (UTC)

Mention Competing Utilities[edit]

This page should link to the competing utilities such as wdiff since the Levenshtein distance and diff, wdiff, sdiff, etc all address the same set of problems (i.e., what is the difference between two strings/files)

Inconsistent Pseudocode[edit]

Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?

Because it simplifies the algorithm. Anything else would require confusing offsets to be inserted for one array or the other. If you read the correctness section you can see why this numbering fits with the invariants. Deco 02:15, 19 Feb 2005 (UTC)

Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array out of bound errors. It was easy enough to fix when I looked at what it was actually doing, but is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.

I agree that the array indicing is confusing. Adding offsets doesnt change the algorithm, its just another way to show it. How it is at the moment is not especially useful given how most language lack features as nice as: "declare int d[0..lenStr1, 0..lenStr2]", "declare int d[size1,size2]" is much more common. More importantly though you can express code using the first type with the second, but not the other way around, which makes it more practical for psuedocode. Jheriko 05:41, 8 March 2006 (UTC)

The purpose of pseudocode is not to make the algorithm easily implementable in your-limited-language-of-choice (if it was, we would write algorithms in it, not pseudocode.) It is to give the clearest explanation of the algorithm, and conceptually, this algorithm indexes strings from one and the distance matrix from 0. As stated in the invariants section, this is because D[0][0] means the distance from the zero-length string to the zero-length string (and the algorithm needs this) and string[..0] must be the zero-length string, not a length-1 string. I've just implemented this in my-limited-language-of-choice, and I thought about it this way (without looking at this page -- I came here to check myself), although of course I didn't implement it that way. On a lighter note, this reminded me of XKCD #163 --Taejo|대조 08:15, 7 September 2007 (UTC)

I agree it may make sense for the pseudocode to be indexed this way, but could we at least put a comment to indicate it is the ith or jth character? The array syntax implies it is a index, not the character number James Van Boxtel (talk) 20:12, 28 May 2010 (UTC)

This pseudocode is garbage and needs to be cleaned up — Preceding unsigned comment added by (talk) 06:53, 26 January 2013 (UTC)

Haskell Example[edit]

Complicated Haskell[edit]

Why is the Haskell code so complicated? Isn't this enough?

editDistance :: Eq a => [a] -> [a] -> Int
editDistance s [] = length s
editDistance [] t = lenght t
editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts,
                                       1 + editDistance ss (t:ts),
                                       1 + editDistance (s:ss) ts ]

Okay, I replaced the original code below with the shorter code above. Jirka

min3 :: Int->Int->Int->Int
min3 x y z = min x (min y z)

cost :: Char->Char->Int
cost a b
    | a == b = 0
    | otherwise = 1

sumCosts :: String->Char->Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a

editDistance :: String->String->Int
editDistance [] [] = 0
editDistance s [] = sumCosts s '-'
editDistance [] t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
                                  (cost s '-' + editDistance ss (t:ts))
                                  (cost '-' t + editDistance (s:ss) ts)

On memoization[edit]

Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?

--Daren Brantley 04:15, 16 July 2005 (UTC)

If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. -- 21:38, 18 July 2005 (UTC)

On the other hand, I wrote both these statements, so I may just be wrong. Deco 05:14, 20 August 2005 (UTC)

I think the current Haskell implementation is wrong. The levenshtein algorithm should use no more than O(n²) time, and this is exponential. It's possible to write a memorizing implementation, but Haskell doesn't do it automatically. (The ruby one is wrong, too, and probably some others.)

An O(n²) version:

 distance str1 str2 = let
     (l1, l2) = (length str1, length str2)
     istr1 = (0, undefined) : zip [1..] str1
     istr2 = (0, undefined) : zip [1..] str2
     table = array ((0, 0), (l1, l2))
         [ ((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2 ]
     value 0 _  j _  = j
     value i _  0 _  = i
     value i c1 j c2 = minimum [ table ! (i-1, j) + 1,
                                 table ! (i, j-1) + 1,
                                 table ! (i-1, j-1) + cost ]
         where cost = if c1 == c2 then 0 else 1
     table ! (l1, l2)

The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:

 distance str1 str2 =
    last $ elems $ foldl update table [ (i,j) | i <- [1..length str1] , j <- [1..length str2] ]
    where table = initial (length str1  , length str2 )
          update table (i,j) =  table // [((i,j),value)]
              where value =
                        minimum [  table ! (i-1 , j)     + 1     -- deletion
                                ,  table ! (i   , j-1)   + 1     -- insertion
                                ,  table ! (i-1 , j-1)   + cost  -- substitution
                            where cost = if str1!!(i-1) == str2!!(j-1) then 0 else 1 
 initial (b1,b2) = array ((0,0),(b1,b2)) [ ((i,j), value (i,j)) | i <- [0 .. b1] , j <- [0..b2]]
     where value (0,j) = j
           value (i,0) = i
           value (i,j) = 0

An O(n) in space, faster, stricter, tail recursive version[edit]

Here is an O(n) version, using iteration over an initial row. This is much faster (with GHC; have not tried others) since

  • it is O(n) in space,
  • laziness left away by GHC strictness analysis is removed at one point using seq,
  • it is tail recursive in the outer iterations
       distance :: String -> String -> Int
       distance s1 s2 = iter s1 s2 [0..length s2] where
               iter (c:cs) s2 row@(e:es) =
                       iter cs s2 (e' : rest e' c s2 row) where
                               e' = e + 1
               iter [] _ row = last row
               iter _ _ _ = error "iter (distance): unexpected arguments"
               rest e c (c2:c2s) (e1:es@(e2:es')) =
                       seq k (k : rest k c c2s es) where
                               k = (min (e1 + if c == c2 then 0 else 1) $
                                       min (e+1) (e2+1))
               rest _ _ [] _ = []
               rest _ _ _ _ = error "rest (distance): unexpected arguments"

-- Abhay Parvate 05:33, 4 July 2006 (UTC)

Levenshtein's nationality/ethnicity[edit]

I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.--SarekOfVulcan 21:22, 3 November 2005 (UTC)

I believe the point of the edit was to indicate that Levenshtein was not an ethnic Russian, just a Jewish guy who lived in Russia. As you suggest, I think such fine points of Levenshtein's nationality are best reserved for his own article, if anyone can drag up enough info to write it. Deco 00:23, 4 November 2005 (UTC)
I dragged up enough on the web to stub it.--SarekOfVulcan 00:41, 4 November 2005 (UTC)
It might help when learning to pronounce the name of the algorithm. -- Mikeblas 08:13, 3 January 2006 (UTC)


I can transform "kitten" into "sitting" in two steps:

  • kitten
  • (delete "kitten")
  • sitting (insert "sitting" at end)

Can someone explain why this is not acceptable? Are we talking single-character edits here? --Doradus 17:48, 16 January 2006 (UTC)

I went ahead and added "of a single character" to the intro sentence. Please feel free to revert if I have this wrong. --Doradus 17:51, 16 January 2006 (UTC)
I guess it doesn't hurt to be pedantic. Deco 18:20, 16 January 2006 (UTC)
Heh... Well, if you use "diff" you wind up with a list of edits which are most definitely not single-character edits. Plus, I think I'm a fairly reasonable person, and I didn't realize it was single-character edits until I read the algorithm. --Doradus 23:46, 16 January 2006 (UTC)
Sorry for the confusion, I hope it's clearer now. :-) Deco 00:15, 17 January 2006 (UTC)
Try to do this two strings of 40 billion characters. Or, do it with your examples, but 99 billion times in a minute. (talk) 23:48, 31 January 2010 (UTC)


I don't believe the implementations are relevant to the article. Some of them are even wrong - and they certainly aren't more illustrative than the pseudo-code. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, Actionscript, Perl, Ocaml and Lua. I vote for removing all of them from the article; if anyone find these useful, just create a page (like with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)

This has a way of happening on wikis. This is why I started a separate wiki for source code. Moving them there is not an option though due to license conflicts - GFDL-licensed code just isn't very useful. Wikisource no longer accepts source code. I suggest we eradicate them all. Deco 00:37, 12 July 2006 (UTC)

Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas (talk to me - I'll listen) 16:49, 12 July 2006 (UTC)

I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)

Potential license compatability downstream is no reason to delete material, even if the material is source code. That's like saying there should be no programming books at WikiBooks released under the GFDL.

If there's a wider proposal to remove all source code from WikiPedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of implementations are lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.

The implementations provide useful, working examples of the pseudocode for readers. -- 01:54, 13 July 2006 (UTC)

I didn't say license compatibility was a problem. You totally misinterpreted me. I think source code at Wikipedia is totally okay. I would not have removed all of the implementations, maybe just all but 2 or 3. Deco 02:11, 13 July 2006 (UTC)
I don't think that it's reasonable to have that many implementations. The page is "unprintable", because there's too much irrelevant stuffs on it (who needs a Prolog implementation anyways?). But I agree that we should be consistent, and choose just n (for small values of n) languages to be used on all algorithm articles; everything else should not be in an ecyclopedia. So indeed, better lets use this article as an extreme example on how bad the situation may become, and propose a sister project for algorithms. --DanielKO 22:49, 14 July 2006 (UTC)

Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.

I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to thinks like "album", "film", "computer" and not flavors of programming language? -- 00:55, 17 July 2006 (UTC)

That's certainly not reasonable; but maybe a Levenshtein distance (implementations). I think we should try asking this in other algorithm articles saturated with implementations and see if together we can create some basic guidelines. I would suggest that if an article has more than 3 implementations, they should be moved from the main article. What do you think? --DanielKO 08:57, 17 July 2006 (UTC)

That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. -- 15:01, 17 July 2006 (UTC)

Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code is so irrelevant. 01:43, 27 July 2006 (UTC)

The number of implementations are a known problem. What do you think of the proposal?

I think the technicalities of the haskell code are extremely relevant, but the Haskell implementation is disputed.

I don't see any sign of the page's history material removed. -- 03:20, 27 July 2006 (UTC)

I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudo code and supporting implementation in a well known language such as C. Can anyone cite me a good standard algorithm text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations), if such an article were to exist in an encyclopedia then it's purpose would be to describe existing implementations not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)

I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the VisualBasic implementation? -- 23:28, 18 September 2006 (UTC)

There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm the links should be kept. I remember there was a link to code for python and perl. These links should be restored! -- 07:54, 20 September 2006 (UTC)

I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous VisualBasic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. -- 12:41, 20 September 2006 (UTC)

Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. --SarekOfVulcan 21:09, 9 October 2006 (UTC)

I would allow PL/SQL, Pascal or even COBOL before I had VFP. But none of those are there. Not just because of its market adoption (read "popularity") or lack thereof, but simply because it's not pedagocial or contributing anything except bytes to the article. -- 02:43, 10 October 2006 (UTC)
COBOL I'd buy for historical interest. :-) I don't see how Pascal is more pedagogical than VFP, though.--SarekOfVulcan 22:21, 10 October 2006 (UTC)

The size of the rendered page is about 44kB. If it get's too large, the VFP implementation should proably be first to go. -- 22:21, 23 October 2006 (UTC)

I think this article should include as many implementations as possible. I came here via Google search: 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless if the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.

That's actually a common misconception. Wikipedia is not intended as a collection of all knowledge. It is intended as an encyclopedia, which is one form of collection of knowledge. For instance, Wikipedia does not include primary source material (for which see Wikisource), even though primary source material is clearly knowledge.
It is inappropriate to include a large number of implementations here, because it is redundant, distracting, and each additional is unlikely to provide much marginal benefit. It is better, if we need any at all, to prefer one or two in languages which are both (1) reasonably common, so readers are more likely to have seen them; and (2) easy to read for this algorithm, that is, not requiring a lot of hacks or bookkeeping. --FOo 07:35, 15 November 2006 (UTC)
According to Wikipedia:Algorithms_on_Wikipedia#Code_samples:
...avoid writing sample code unless it contributes significantly to a fundamental understanding of the encyclopedic content'
See also the rest of the arguments there. I feel that this is an algorithm where the implementations to very little extent contribute to the understanding. The implementations should be move somewhere under wikibooks:Algorithm_implementation. See wikibooks:Algorithm_implementation/Sorting/Quicksort for an excellent example. Klem fra Nils Grimsmo 08:43, 8 December 2006 (UTC)


I removed the Haskell example, for two reasons:

  1. it was the only instance of Haskell code on Wikipedia outside the Haskell article itself
  2. it relied on compiler support for memoization, which is "not guaranteed" according to the accompanying text.

If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.

For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. --Quuxplusone 01:34, 4 December 2006 (UTC)

Common Lisp version broken[edit]

Using the Common Lisp implementation described in the article, I get notably wrong results:

[11]> (levenshtein-distance "cow" "cow")

Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. --FOo 01:11, 28 August 2006 (UTC)

Thanks for reporting this. I've deleted until it can be fixed. It wasn't a very elegant implementation anyway, combined with it having been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. -- 03:55, 28 August 2006 (UTC)

Dynamic Time Warping[edit]

Can someone please explain the difference between DTW and Levenshtein distance?

The algorithms look almost identical. Prehaps a translation of both into C would claify this?

RE: Implementations.[edit]

I removed the implementations and replaced with a link even though we already had one. It seems they were ignoring it anyway. I didn't even look at the source code to see how good the implementations were... I'm not sure I care. I'm not against people sharing their implementations, even multiple implementations per language, especially if one happens to use o(m) space instead of o(mn), or if there is some advantage to a particular implementation over another. If two implementations in the article were compared and contrasted, with benefits and disadvantages for both, that might be encyclopedic and useful for a more in depth investigation of the calculation of this metric. But all I saw was just code. And I'm not even sure it was good code at that. Root4(one) 03:17, 28 September 2007 (UTC)

Confusing wording?[edit]

In the upper and lower bounds section, point 5 seems to be merely re-stating point 1. I have taken the liberty of removing point 5, as it is the more complicated of the two. If this is wrong, feel free to correct me. --Gnack (talk) 05:21, 7 July 2008 (UTC)

Levenshtein automaton as Finite State Machine approach[edit]

Could someone please extend Levenshtein automaton article, as other possible implementation of this string metric? Current stub version is merely nothing. For the particular case when we do not need a transition graph (memory consuming n*m sized matrix) but scalar result only, FSM approach looks more appealing than dynamic one. —Preceding unsigned comment added by (talk) 13:17, 28 July 2008 (UTC)

Spell checking or correction?[edit]

The second para of the article currently says this "is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers." Shouldn't that be "spell *correctors*"? A spell checker need only check for exact matches, whereas a spell corrector needs to find the closest match(es). Mcswell (talk) 13:41, 13 October 2008 (UTC)

I think that your remark makes sense but spell checker IS the commonly used word for such applications whether or not the intended end is correction or just exact match lookup. [[User:D'Artagnol|]] (talk) 16:51, 31 July 2009 (UTC)
Don't some of them hold only a limited set of exact words and guess variations (tenses etc) of the words based on rules and stuff? --TiagoTiago (talk) 19:13, 20 October 2011 (UTC)

Bad layout[edit]

The layout of this article is a total failure. Rather then slapping the pseudocode in the reader’s face, go and figure it out, it should provide an introduction and explain the ideas behind the algorithm first. --Yecril (talk) 13:01, 25 November 2009 (UTC)

I agree completely. It was written by programmers, who program first and specify later. I have expanded the introduction, which should help a little, but the algorithm section should still be reordered:
  • first, present the matrix in which every d[i,j] is the Levensthein distance up to i and j;
  • then, explain how d[i,j] can always be computed from d[i-1,j-1], d[i,j-1], and d[i,j-1];
  • conclude that the distance can be computed by flood-filling the matrix from start to end;
  • only then, give the algorithm, of which the correctness is now obvious;
  • continue with the improvements, starting with the version that only keeps two rows of the matrix in memory at once.
I'll attempt to do this if I find the time but please beat me to it. Rp (talk) 19:16, 21 January 2010 (UTC)


There was an article about Levenshtein distance in Dr. Dobbs Magazine: s. Maybe someone who knows how to include citations in the article could add this. — Preceding unsigned comment added by (talk) 16:49, 2 July 2011 (UTC)

Metric with only additions and deletions[edit]

The section Relationship with other edit distance metrics states that the length of the longest common subsequence gives an edit distance metric given addition and deletion as the allowable edit operations. This seems incorrect. If two strings have length and with a longest common subsequence of symbols, then the edit distance should be . Am I making some silly mistake? Otherwise this should be fixed. Augurar (talk) 02:01, 5 January 2012 (UTC)

You're right, what should be said is that the work required to do one or the other is the same, or that the metrics can be expressed in terms of the lengths of the strings and their longest common subsequence. Rp (talk) 16:09, 6 January 2012 (UTC)

Proposed merge with Damerau–Levenshtein distance[edit]

When the OR is removed from this page, what remains is a rather trivial modification of Levenshtein distance. QVVERTYVS (hm?) 21:21, 25 November 2013 (UTC)

In favor of merger. What is "OR"? I found the modified D-L (as described in Lowrance and Wagner in the references) to be quite distinct from simple L. distance. I would not want to see those references removed (or the pseudocode which is the only real explanation provided for how the algorithm works at the moment.) --Yoderj (talk) 20:44, 14 July 2014 (UTC)
Original research. Sorry for the jargon. What I'd like to keep is exactly the well-sourced bits. QVVERTYVS (hm?) 21:49, 14 July 2014 (UTC)
  • Oppose. The adjacent transpositions variation makes the two algorithms sufficiently different. Glrx (talk) 18:06, 15 July 2014 (UTC)
  • Oppose. It's a different distance, with quite different algorithms because of its difference, and different applications (this one makes more sense in bioinformatics where transpositions are apparently uncommon; the other one makes more sense in spelling checking where transpositions are common). I don't think much of either article is actually original research, just badly sourced. And both articles are long enough that a merged combination of both of them would be both confusing (how would readers be able to tell which distance a particular section refers to) and hard to read (because of the length). Accordingly I have removed the merge tag. (It's over a year old with inadequate support for performing the merge.) —David Eppstein (talk) 01:24, 11 December 2014 (UTC)

Most text is duplicate from Edit distance[edit]

Does this variation of edit distance really deserve an article of its own? --Sigmundur (talk) 13:34, 31 December 2014 (UTC)

The thing is that I started rewriting and generalizing this article's text as the article Edit distance. Levenshtein distance is the most common edit distance, so I think it deserves its own article, but we need to move some content around to put everything in its proper place. QVVERTYVS (hm?) 09:55, 16 March 2015 (UTC)

Inconsistent Start Index[edit]

The start index of the string is 1 but the start index of the 2D int array is 0. This is an inconsistency that can cause confusion, especially when translated into a programming language. Most programming languages stick with either 1 or 0 but not both. — Preceding unsigned comment added by Blackraven36 (talkcontribs) 18:19, 15 March 2015 (UTC)

Incorrect pseudocode[edit]

There is a problem in 4.3 pseudocode: line v1[i] = i+1; causes writing outside the array boundary when s.Length > t.Length+1. — Preceding unsigned comment added by (talk) 12:47, 3 July 2015 (UTC)

Recursive C implementation[edit]

I'm not sure why we're presenting a (presumably) fully working C implementation of the naïve recursive algorithm. No serious source presents this algorithm because it's so extremely inefficient. Anyone who wants to try it can derive it from the mathematical definition if they want. QVVERTYVS (hm?) 12:29, 14 October 2015 (UTC)

I'm happy keeping the slow and stupid algorithm. If nothing else, it serves as a set up for the more efficient versions. Glrx (talk) 16:19, 20 October 2015 (UTC)

The calculation of 'cost' don't need branching[edit]

The calculation of 'cost' is not branching. The indicator function is only boolean evaluation expression. If ... else statement is redundant and confusing. Please see in WikiWikiWeb. So the recursive C implementation source code is the following --- Cedar101 (talk) 02:13, 20 October 2015 (UTC)

 1 // len_s and len_t are the number of characters in string s and t respectively
 2 int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
 3   // base case: empty strings
 4   if (len_s == 0) return len_t;
 5   if (len_t == 0) return len_s;
 7   // check mismatch of last characters of the strings
 8   bool mismatch = (s[len_s-1] != t[len_t-1]); // The bool type requires C++ or modern C(C99 or above) (#include <stdbool.h>)
 9   int cost = (int)mismatch;
11   // return minimum of delete char from s, delete char from t, and delete char from both
12   return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
13                  LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
14                  LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
15 }
Your reference discusses returning a boolean from a function. That's different from converting a boolean into a number, which is what is going on here. (I still think we should just delete this code, it serves no encyclopedic purpose.) QVVERTYVS (hm?) 12:56, 20 October 2015 (UTC)
Qwertyus is right about your reference; a cost is not a bool. The typecast to an int should also raise a flag as dirty code; the code relies on a specific representation of boolean values. Furthermore, WP is interested in describing algorithms rather than presenting code that compiles. WP avoids compiler issues such as include files because it does not expect its readers to be familiar with all languages and their nuances. Using constructs such as += or bizzare typecasts just makes the code more difficult for WP's readers. Glrx (talk) 16:17, 20 October 2015 (UTC)
Discussions like this are just one of the reasons why the WikiProject Computer science Manual of style recommends explaining algorithms like this in pseudocode. The point of an article like this is not to provide cut-and-paste source code, but rather to explain the algorithm in an understandable fashion. RossPatterson (talk) 22:50, 20 October 2015 (UTC)
RossPatterson rewrote it. Thanks for pointing out that MOS. QVVERTYVS (hm?) 19:52, 21 October 2015 (UTC)
I reverted the partial programming language style change. Needs discussion; further fracture. Glrx (talk) 00:35, 23 October 2015 (UTC)
What part do you object to? My version is mostly a decluttered version of the quasi-C code (indentation instead of brackets, fewer declarations), with one-indexing to directly match the equations. QVVERTYVS (hm?) 08:56, 23 October 2015 (UTC)
Okay. This point is a kind of MOS. By the way, In the above code, I think to have more emphasis the indicator function using the reference (Return Boolean Evaluations). -- Cedar101 (talk) 01:10, 22 October 2015 (UTC)
 1 // len_s and len_t are the number of characters in string s and t respectively
 2 int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
 3   // base case: empty strings
 4   if (len_s == 0) return len_t;
 5   if (len_t == 0) return len_s;
 7   // return minimum(std::min) of delete char from s, delete char from t, and delete char from both
 8   return min({LevenshteinDistance(s, len_s-1, t, len_t  ) + 1,
 9               LevenshteinDistance(s, len_s  , t, len_t-1) + 1,
10               LevenshteinDistance(s, len_s-1, t, len_t-1) + mismatch_cost(s, len_s, t, len_t)});
11 }
13 // The indicator function: check (and count) mismatch of last characters of the strings
14 int mismatch_cost(string s, int len_s, string t, int len_t) { 
15   return s[len_s-1] != t[len_t-1];
16 }

The implied type cast is just ugly. Glrx (talk) 00:36, 23 October 2015 (UTC)
The thing that is just ugly is your insistence on using C instead of pseudocode. Why do you think this is a good idea? But if we are programming in C, I would use
  int mismatch_cost = (s[len_s-1] != t[len_t-1]) ? 1 : 0
It's a little redundant but avoids any weird casting. —David Eppstein (talk) 00:45, 23 October 2015 (UTC)
My opposition is quick style changes.
You are correct about the conditional expression in C, but C idioms should not be used because they are inaccessible. My comment about the ugly typecast, however, is directed at the failed understanding of the typecast and not how a conditional expression should be coded. Glrx (talk) 01:05, 23 October 2015 (UTC)
Glrx Would it have been better if my pseudocode had matched more closely the Wagner-Fisher algorithm below it? Do we have consensus that pseudocode is better than C (David Eppstein, Cedar101)? QVVERTYVS (hm?) 10:38, 27 October 2015 (UTC)
I mean to get to your question, but I just looked at the code and got distracted into reverting the changes by Cedar101. I'm running out of time. Cedar101's group of changes insert style changes and some poor indirection.
Generally, there's a WP notion to avoid gratuitous style changes to an article. If there is not a set style, then the first editor sets the style and other editors should follow it. That's what happens with reference formating. Lots of people have their own preference, and it crazy for editors to driveby and change the article to their preferred style. Style changes should be discussed and consensus obtained before the change.
There is a notion that if a style is inconsistent, then a WP:BRD could be done to make the entire article consistent. I don't think BRD is appropriate here; I've reverted changes so I'm asking for discussion.
There is no policy requiring pseudo code. I'm not opposed to it, but there are some gotchas going on in the string matching code. Strings are being passed, and there are subtle issues about the length of such strings, the 0/1 indexing, and need for an extra fencepost in the rows.
I'm not a fan of C/C++/etc., but if it's there I won't kill it. I will remove obscure C idioms (p++; p+=2; (x)?y:z;) as confusing to the reader, and I will bludgeon dirty typecasts.
Your syntactic changes to the code were minor, but you also removed informative comments such as "base case:empty strings".
Another reason for avoiding change is that errors creep in. It's minor, but you dropped a trailing paren.
Sorry, but I gotta go. Glrx (talk) 01:11, 29 October 2015 (UTC)

Please, prior to entering code into this article, take a look at its history. Many code examples have been introduced, gone through changes, and removed, according to the whims of the editors at the time. I have little doubt that the present examples will go through the same process. On top of that, there is a plethora of other articles on this subject such as edit distance, Jaro-Winkler distance, Damerau-Levenshtein distance, Needleman–Wunsch algorithm, string metric, Longest common subsequence problem, Approximate string matching, etcetera, which explain basically the same thing over and over again, often with their own code examples. It would be nice if we could repeat ourselves less often on this topic - but how to achieve that? Rp (talk) 11:12, 23 October 2015 (UTC)

The simple solution would be to limit the pages on edit distance to the actual definition of the distances, and explain the algorithm to compute it for a pair strings at Wagner–Fischer algorithm. On the plus side, that introduces a clear separation of concerns, pointing out that Levenshtein distance is not to be equated with with expensive pairwise comparisons (they could, and often should, use Smith–Waterman, Levenshtein automata, etc.). However, I have an inkling that the popularity of this page is largely due to its exposition of Wagner–Fischer rather than the theory... QVVERTYVS (hm?) 12:06, 23 October 2015 (UTC)

Dispute over memory consumed by a naive recursive implementation[edit]

The article claims that

"A more efficient method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][] where d[i][j] is the distance between the first i characters of string s and the first j characters of string t. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]. While this technique is significantly faster, it will consume len_s * len_t more memory than the straightforward recursive implementation.”

Does anybody have any reference for the last statement, because I think it’s not right. Consider the huge stack space occupied by a recursive implementation…

Xrisk (talk) 19:53, 29 October 2015 (UTC)

Why huge? The recursive implementation makes a lot of recursive calls, but the recursion is shallow (linear in total input length) and requires only constant storage per level of recursion. That said, I don't see why we're talking about the recursive implementation at all. It's a bad algorithm. In sorting, do we spend much time talking about the algorithm that generates all permutations and tests whether each one is sorted until it finds the right one? —David Eppstein (talk) 19:57, 29 October 2015 (UTC)
I spent way too much time on that problem once...
But the OP does have a point: brute-force recursion is linear in memory usage, naive Wagner-Fischer quadratic, optimized W-F linear again. The text suggests that brute-force is quadratically better, which is incorrect. QVVERTYVS (hm?) 20:49, 29 October 2015 (UTC)

Problem with iterative with two matrix rows implementation[edit]

I have implemented this method with unexpected results: the distance obtained is asymmetrical and gives fairly different results than the straightforward implementation. I will read the article and check if the implementation on this article is correct. --Farru ES (talk) 18:38, 4 October 2016 (UTC)

Mea culpa, the line
var cost = (s[i] == t[j]) ? 0 : 1;
should read
int cost = (s[i-1] == t[j-1]) ? 0 : 1;
for C++, as the index ranges from 0. --Farru ES (talk) 13:04, 6 October 2016 (UTC)