Talk:Knuth–Morris–Pratt algorithm

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated Start-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Start-Class article Start  This article has been rated as Start-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 

Mistake? "Thus, in the fourth step..."[edit]

Mistake? "Thus in the fourth step, m = 0 and i = 3." should that be m = 3 and i = 3?

First, I've removed...[edit]

First, I've removed the rather antique discussion topic that was already here, as the article has been revised substantially since then and anyway, it was more than a year old.

Second, are there any thoughts on whether I should replace the verbal algorithms with the plain C code? Barely any knowledge of C is necessary to comprehend what is going on in the code; I have tried to write it in such a way that it avoids all idioms even at the cost of being a little too verbose, and the language itself is self-documenting in its usage for the most part. But now that the code is there it seems to reproduce very closely the English-language algorithms and creates a redundancy. I think, given the choice, I would rather keep the code, since it's clearer for the most part, in addition to being more useful, but then, I am comfortable with C and others may not be. So, keep both forms, or just one?

The verbal description reflects the C description too closely, probably because it was written to the C description. Replace it with something more high-level, if you can. No verbal description should say stuff like "if i > 0, set i = e." Deco 07:59, 8 December 2005 (UTC)
There are two verbal descriptions of the code; the lengthy example was the first thing written (actually, it was the reason I rewrote the article, since in the original it was all there was, and quite bad). The second verbal description, which I wrote in vague emulation of Knuth's style for describing algorithms, was written to the C code after I decided that it was unwise to have the only formal description of the algorithm written in a single programming language that some casual readers might not know. Those who do know can verify the two are the same and, if curious, can compile my code and run the algorithm. Ryan Reich 14:12, 21 February 2006 (UTC)
When an old discussion is taking up space on a talk page, please archive it instead of deleting it! ᛭ LokiClock (talk) 13:35, 28 May 2010 (UTC)

In the the efficiency of the search algorithm section...[edit]

In the The efficiency of the search algorithm section, the second paragraph starts:

  • Using this fact, we will show that the loop can execute at most l times.

but the third paragraph has this contradiction:

  • Thus the loop executes at most 2l times,

It seems it should be 2l times (for example searching AAAAA for AB, if I've read the code right), so I changed it. Secondly I tried but failed to make the math tags work on the letters l, m i, and so on, but they stayed obstinately unclear (1 l and I looked practically indistinguisale on my browser) so I resorted to boldifying each math term. I guess there's some preference setting for making the < math > terms look consistent? -Wikibob 02:23, 18 December 2005 (UTC)

That's a fine solution. Computer users have been unable to distinguish the letters l and I, and the number 1, for at least the last fifteen years. I blame the invention of Times New Roman. Ryan Reich 14:12, 21 February 2006 (UTC)

T[i-1] instead of T[i+1]?[edit]

Is the line below supposed to read "T[i-1]" instead of "T[i+1]"?--

"...T[i] in the code is equivalent to T[i + 1] above..."

Yeah... Ryan Reich 14:12, 21 February 2006 (UTC)

O(n+l)[edit]

Shouldn't O(n) + O(1) be O(n) and not O(n+1)? Constants are not taken into account in O() notation. PedroCR

You're talking about the very last section? That's not a 1 ("one"), it's a l ("ell"). It's the length of the string. Ryan Reich 12:10, 10 April 2006 (UTC)
Maybe it would be better to write uppercase 'L' and 'N', but lowercase 'i'? T0ljan 12:53, 17 April 2006 (UTC)

Code -> pseudocode[edit]

Since some IP editor decided that it would be neat to write Java code for this algorithm, I realized that having any language-specific code in here is a bad idea. Of course, the Java code was practically identical to the C code, with minor differences mostly centered on how the length of a string is determined, so it added nothing. I've replaced my C code, their Java code, and also my previous attempt at pseudocode with new pseudocode formatted as suggested in the WikiProject Computer Science manual of style. If you are a future editor reading this article, and for some reason you also read the talk page before editing, please don't implement the algorithm in your favorite language, as it adds nothing. I've also warned you in the page's source. Ryan Reich 19:10, 10 August 2006 (UTC)

History of the algorithm[edit]

The algorithm described in Item 179 of HAKMEM seems quite similar to KMP, despite that HAKMEM is from 1972 so it predates the publication by KMP. Were variants of the algorithm for fast string matching known long before this publication or am I misinterpreting something? Should this be mentioned in the article? Thanks, – b_jonas 19:07, 9 November 2006 (UTC)

I've now also asked this at the Reference desk/Mathematics. – b_jonas 15:53, 29 September 2007 (UTC)


The efficiency of the KMP algorithm[edit]

"A word such as "ABCDEFG" works well with this algorithm because it has no repetitions of its beginning, so that its table is simply all zeroes with a -1 at the beginning. On the opposite end of the spectrum, W = "AAAAAAA" works terribly"

- here where this algorithm is being compared to the naive search, surely 'ABCDEFG' with a table of all zeroes has no actual gain over the naive search, and the whole point of this algorithm is to reduce the no. of comparisons made when patterns like 'AAAAAA' are encountered?


e.g. if string is 'aaaaaaaaaabcdef' (15 chars) and pattern to find is 'abcdef': both naive search and KMP will make 15 comparisons (after computing fail array)

but if string is 'aaaaaaaaaaaaaab' (15 chars) and pattern to find is 'aaaaab': naive search will make (6x9)+6 = 60 comparisons whereas KMP will only do 15 comparisons again.

fail table would be [-1][1][2][3][4][5] and the gain is made from the fact KMP only needs to check for the last character instead of checking through the whole pattern each time - the fail table is a way of 'remembering' you already matched 5 a's, hence this being a good algorithm for small alphabets or bitstreams.

(I'm not confident enough of this to make changes, nor can I think of a concise way to word it, but I think it should be looked at again)


The comment about the Boyer-Moore algorithm's worst case:

"...while the Boyer-Moore algorithm achieves its best case at the great expense of its worst."

is incorrect. The worst case is still linear in the text size. Sustik 22:28, 15 June 2007 (UTC)

So fix it? Ryan Reich 22:07, 16 June 2007 (UTC)
I came here precisely to point out the same thing, and seeing that I'm
not the first, I will fix it. So there. :) 128.210.4.214 (talk) 18:58, 7 March 2008 (UTC)
Funny, I was just remembering this question this morning, after months of neglect. I'm glad to see my goading has encouraged people to action :) Aside from my enthusiasm for the KMP algorithm I am not even distantly connected to this area, and I don't know how these algorithms compare; it seemed, from a long-ago look at the Boyer-Moore page, that the claim was right, but I guess not. Since obviously I'm uninformed, I didn't feel like I could make the change. Ryan Reich (talk) 19:25, 7 March 2008 (UTC)

The confusion may involve the fact that the version of Boyer-Moore in some textbooks is a simplified one with a slower worst case, not the full linear-time algorithm. —David Eppstein (talk) 19:17, 7 March 2008 (UTC)

Terminal substring[edit]

What is terminal substring (this term is used w/o definition and I cannot find one with google)? 217.21.164.43 09:18, 10 October 2007 (UTC)

Perhaps I was writing too much under the influence of mathematical jargon (though you won't find it there either). "Terminal" just means "at the end", so a "terminal substring" of a string S is a substring located at the end of S. For example if S is the string "abcdefg", then the substrings "efg" and even "bcdefg" are terminal, whereas "a" and "cde" are not (to pick just a few examples). Ryan Reich 13:40, 10 October 2007 (UTC)
Somehow, I'd feel a LOT more comfortable if somebody used "proper prefix" or "proper suffix" rather than "proper initial substring" or "proper terminal substring". I mean, it's a mouthful of syllables that clutters the rest of the article --202.168.251.139 16:22, 10 October 2007 (UTC)
Here's an opportunity to gain comfort with editing Wikipedia, then. Why don't you change the terminology you don't like to something clearer? Ryan Reich 18:13, 10 October 2007 (UTC)

How to go on[edit]

Well, the pseudocode doesn't state how far to shift once we found a match (there could be more). I've learned the algo with a table building function that returns an array that is one longer than the pattern: e.g.

 ABCDABCD
-10000123

wouldn't tell you that you have to shift by 4 to find a potential match in ABCDABCDABCD, but

 ABCDABCD
-100001234

does. I think there is something wrong here. Or did I miss something? Cheers, 88.73.111.77 16:49, 3 November 2007 (UTC)

Well, I don't get your example since the string ABCDABCDABCD contains the pattern ABCDABCD, so in fact no shifting is ever necessary. However, say the text were ABCDABCCABCDABCD (in other words, it fails when matching the second D when starting from the beginning of the text). So we do the following:
m: 0123456789012345
S: ABCDABCCABCDABCD
W: ABCDABCD
i: 01234567
and you see that with m = 0, the search fails once when i = 7. According to the pseudocode, we then replace m with m + i - T[i] which is 0 + 7 - 3 = 4. And indeed, that where the first false match "ABC..." begins. So all is well. I certainly hope the code is right, since I agonized over it for some days when writing this article, including writing it into an actual C program and testing it. I'm fairly sure I've got the indices correct. I also think the "worked example" is general enough that the details of the algorithm can be verified by checking them against the computations done there. Do they not match up after all? Ryan Reich 19:26, 3 November 2007 (UTC)

Error in "working example of ..."?[edit]

In the following text: "As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character.", shouldn't it be "m = 10, i = 2"? m[10] and i[2] are the next characters to be compared, so this seems wrong to me. --132.199.235.61 (talk) 19:06, 15 May 2008 (UTC)

No. You don't understand the notation: m is the beginning of the word, and i is the position within the word. So, we position W so that W[0] = S[8] (that's what m = 8 means) and then start matching at W[2] (which is S[10]). Ryan Reich (talk) 20:47, 15 May 2008 (UTC)

Possible Error in Pseudocode[edit]

In the pseudocode for kmp_search, should the line "if T[i] is greater than 0," be "if T[i] is greater than -1,"? —Preceding unsigned comment added by Byronknoll (talkcontribs) 21:20, 30 November 2009 (UTC)

I discussed the pseudocode with two friends and they also agreed it is a mistake. I have corrected it on the main article.--Byron (talk) 06:21, 1 December 2009 (UTC).

A very minor change, may be not work[edit]

I just inserted a new row above the m:0123... with the numbers 1,2, in order to read that columns of m as 10, 20, vertically. That change is very minor, it looks correctly aligned in my browser (Mozilla FireFox). It should be reverted, i.e. erase the row with 1 2 above m:01234..., if it is not seen aligned in other browsers.

Error in the example???[edit]

Shouldn't the example read:

i 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
W[i] P A R T I C I P A T E I N P A R A C H U T E
T[i] -1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 2 *3* 0 0 0 0 0

Instead of:

i 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
W[i] P A R T I C I P A T E I N P A R A C H U T E
T[i] -1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 2 *0* 0 0 0 0 0

Apologies if I am mistaken. Tony (talk) 16:06, 5 October 2010 (UTC)

I ran the C code last night, and it does indeed output a 3. I am changing the article. Tony (talk) 15:52, 6 October 2010 (UTC)

The example output is:
[-1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0]
But the output from the pseudo-code function is:
[-1, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 2, 3, 4, 0, 0, 0, 0, 0]
This is the result in an off by one error in the pseudo-code. The two lines cnd ← cnd + 1 and T[pos] ← cnd got reversed a few months ago. Fixed. NeilFraser (talk) 04:37, 22 October 2010 (UTC)

T used before it is defined[edit]

The article is confusing to read because T is referred to before it is defined in the description of the algorithm in the first section. —Preceding unsigned comment added by 12.47.208.58 (talk) 00:30, 13 April 2011 (UTC)

Agree. Confuses me as well. — Preceding unsigned comment added by 80.254.148.43 (talk) 09:43, 3 November 2011 (UTC)
Yes, this is confusing. Anubhab91 (talk) 15:27, 23 February 2012 (UTC)
Also confused at first by this methodology. It's unclear. Vote for prioritizing the above suggestion. Dan McCarty (talk) 15:08, 6 September 2012 (UTC)

Difficulty in understanding Algorithms[edit]

It seems to m that the Algorithms given here are difficult to comprehend. Can anyone please modify the algorithms in simpler forms? Thank you. And especially in the explanation of building the T table, it is really obscure to understand the process. As for the pseudocode, the second branch (if cnd > 0, let cnd ← T[cnd]) looks very weird. I hope this part can be better presented with a good example if possible. Anubhab91 (talk) 15:33, 23 February 2012 (UTC)

Do you have any concrete suggestions? Otherwise I'm tempted to point to the story of Euclid and Alexander. I'm sure there are things that could be improved about our description but the main reason that it's difficult to understand is that it's a highly-nontrivial algorithm. —David Eppstein (talk) 05:01, 4 February 2013 (UTC)

Explanation skips a step[edit]

This part of the explanation--"we note that no 'A' occurs between positions 0 and 3 in S except at 0"--leaves out the step of how we "note" it. Is it related to the table T[]? Were we tracking previous starts of the search string? It's unclear. Thx. Dan McCarty (talk) 15:16, 6 September 2012 (UTC)