Talk:Bitap algorithm


'It performs best on patterns less than a constant length'[edit]

Someone wrote 'it performs best on patterns less than a constant length'. Such statements should not be made without adequate analysis. It is true that patterns that are 33 characters long may take twice as long as patterns of length 32, but if the algorithm beats the hell out of all its competitors, or it takes 2 nanoseconds instead of 1, then there's no reason not to use it.

Then the paragraph says the complexity in this case is O(m+n). But if m is limited to 32, then O(m+n) is the same as O(n), because a bounded m contributes only a constant.

Perhaps we should say that for arbitrary m and n, the algorithm has complexity O(kmn). Now this may look inefficient, but if you consider that modern processors can perform in the region of 64 billion of these operations every second, you'll understand why the algorithm is so fast. (unsigned comment by User:Nroets, 8 Nov 2005)
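(To put a rough number on that reasoning, taking the figure above at face value: searching n = 1,000,000 characters of text for an m = 32 pattern with k = 2 errors is on the order of kmn = 64 million elementary operations, i.e. about a millisecond at the quoted rate, and still only about a tenth of a second even at a hundredth of that rate.)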

First point: Bitap does perform best on patterns less than the word length of the machine, because then it can be optimized to use bitwise operations and shifts. If you want to use bitap for longer patterns, you'll need to "pessimize" it either by using the array-of-BIT approach given in the article's first snippet, or by doing awkward things with carried bits. This has nothing to do with comparing bitap to other search algorithms; it's simply stating that bitap itself performs better on small patterns than on long ones.
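To illustrate the "carried bits" option, here is a rough sketch (not the article's code; the word count, the names and the use of CHAR_BIT are mine) of the per-character update when the state R is spread over several machine words:

#include <limits.h>

enum { WORDS = 4 };  /* supports patterns up to WORDS * (bits in an unsigned long) */

/* One shift-or step with the state R spread over WORDS machine words.
   The top bit of each word has to be carried into the word above it,
   which is exactly the extra O(m / wordsize) work per text character. */
static void shift_or_step(unsigned long R[WORDS], const unsigned long mask[WORDS])
{
    unsigned long carry = 0;
    int w;
    for (w = 0; w < WORDS; ++w) {
        unsigned long next_carry =
            R[w] >> (sizeof(unsigned long) * CHAR_BIT - 1);  /* save the bit that falls off the top */
        R[w] = ((R[w] << 1) | carry) | mask[w];              /* shift, pull in the carry, OR the mask */
        carry = next_carry;
    }
}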
Second point: On rereading, I can't figure out why O(m+n), either. So I've changed it to O(mn) — we iterate once for each of n text characters, and O(m) times over the bit array for each of those characters. There is an additional O(m) term for setting up the bit array, and in the article's optimized implementations there is an O(k) term where k is the size of the alphabet. So it's more like O(mn+n+k), but the O(mn) term is definitely the dominant one in non-pathological cases. --Quuxplusone 17:19, 8 November 2005 (UTC)[reply]
I think you can say O(m+n) assuming that the inner loop (iteration over the bit array) can be implemented with a shift and an OR operation, each taking constant time. Since this seems to be the whole point of the algorithm (hence the name "shift-or"), I find it misleading to read O(mn). If I'm right, maybe this could be clarified in the article. -- TripleF (talk) 17:14, 28 February 2009 (UTC)[reply]
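To make the two readings concrete (my own accounting, not taken from a source): building the masks touches each alphabet symbol and each pattern position once, O(k + m); if m fits in one machine word, each of the n text characters then costs one shift and one OR, so the scan is O(n) and the whole run is O(n + m + k), which is where the O(m+n) figure comes from. If m exceeds the word size w, the state has to be kept in ⌈m/w⌉ words, every text character costs that many word operations, and the bound becomes O(mn/w), i.e. O(mn) in the unit-cost model that the article's first snippet uses.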

Comment on code sample[edit]

User:134.2.247.43 added a comment to the code in the article pointing out a potential error. I'm not judging at the moment whether they're correct, but it should be discussed here on the talk page to avoid self-contradiction in the article. Dcoetzee 20:25, 20 October 2008 (UTC)[reply]


#include <limits.h>
#include <string.h>

const char *bitap_bitwise_search(const char *text, const char *pattern)
{
    int m = strlen(pattern);
    unsigned long R;
    unsigned long pattern_mask[CHAR_MAX+1];
    int i;

    if (pattern[0] == '\0') return text;
    if (m > 31) return "The pattern is too long!";
 
    /* Initialize the bit array R */
    R = ~1;  /* I think this is wrong, because then we have no match if text starts with pattern. It should be (~1)<<1. */

    /* Initialize the pattern bitmasks */
    for (i=0; i <= CHAR_MAX; ++i)
      pattern_mask[i] = ~0;
    for (i=0; i < m; ++i)
      pattern_mask[pattern[i]] &= ~(1UL << i);

    for (i=0; text[i] != '\0'; ++i) {
        /* Update the bit array */
        R <<= 1;          
        R |= pattern_mask[text[i]]; 
        
        if (0 == (R & (1UL << (m - 1))))
          return (text+i - m) + 1;
    }

    return NULL;
}

Tracing the code (or thirty seconds with a test case and a C compiler) shows that User:134.2.247.43's concern is unfounded. --Quuxplusone (talk) 06:19, 4 January 2009 (UTC)[reply]
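For anyone who wants to repeat that thirty-second experiment, a quick check along these lines (pasting the snippet above into the same file; the test strings are of course arbitrary) returns the expected pointers:

#include <stdio.h>

int main(void)
{
    const char *r1 = bitap_bitwise_search("a", "a");
    const char *r2 = bitap_bitwise_search("this is a test", "is");
    const char *r3 = bitap_bitwise_search("banana", "xyz");

    printf("%s\n", r1 ? r1 : "(no match)");   /* prints "a" */
    printf("%s\n", r2 ? r2 : "(no match)");   /* prints "is is a test" */
    printf("%s\n", r3 ? r3 : "(no match)");   /* prints "(no match)" */
    return 0;
}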

I believe the original is shift then AND, instead of AND then shift. Matching "a" against "a" will clearly fail. Another place that might be wrong is the fuzzy search. Can Quuxplusone comment on this? Weipingh (talk) 03:17, 15 November 2009 (UTC)[reply]

Sorry, I hadn't checked the talk page in a long time. bitap_bitwise_search("a","a") returns "a", which indicates success. Changing the code (as some un-logged-in editor did a while back, and I just saw and reverted) seems to work too — and I'm pretty sure the two versions are equivalent — but the shift-then-or version makes the complicated expression (0 == (R & (1UL << m))) a little more complicated. --Quuxplusone (talk) 18:22, 5 May 2010 (UTC)[reply]
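For reference, a sketch of the or-then-shift variant under discussion (a rewrite of the snippet above, not taken from the article) shows the equivalence: keeping R = ~1, the same information lands one bit higher, so bit m is tested instead of bit m-1.

#include <limits.h>
#include <string.h>

const char *bitap_or_then_shift(const char *text, const char *pattern)
{
    int m = strlen(pattern);
    unsigned long R = ~1UL;                  /* bit 0 clear: the empty prefix "matches" */
    unsigned long pattern_mask[CHAR_MAX+1];
    int i;

    if (pattern[0] == '\0') return text;
    if (m > 31) return NULL;

    for (i = 0; i <= CHAR_MAX; ++i)
        pattern_mask[i] = ~0;
    for (i = 0; i < m; ++i)
        pattern_mask[pattern[i]] &= ~(1UL << i);

    for (i = 0; text[i] != '\0'; ++i) {
        R |= pattern_mask[text[i]];          /* OR first ... */
        R <<= 1;                             /* ... then shift */
        if (0 == (R & (1UL << m)))           /* the test moves up to bit m */
            return text + (i - m + 1);
    }
    return NULL;
}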

Canonical name[edit]

Where does “bitap” come from? Why is it the canonical name of the lemma? I’m assuming that it stands for “bitwise approximate”, in which case it’s wildly inaccurate – one of the algorithms (arguably the fundamental one, of which the others are variations) does exact matching; only certain extensions allow approximate matching. Furthermore, none of the original papers use that name, nor can I find it in the literature (in particular, Navarro & Raffinot). I propose to rename the lemma and/or split it up.--87.77.158.189 (talk) 15:40, 7 December 2010 (UTC)[reply]

Is there a lemma? Or did you mean algorithm? If "bitap" isn't mentioned in academic papers, try looking in the source code of "agrep". A name doesn't need to mean anything. —Preceding unsigned comment added by Nroets (talkcontribs) 18:44, 7 December 2010 (UTC)[reply]
With “lemma” I meant a Wikipedia article (“lemma” in the linguistics sense, not the mathematical sense). I don’t think the article should take its name from a fairly obscure (and seemingly obsolete) tool, unless it were exclusively related to that tool. Which it’s not: these algorithms existed before agrep, after all, and are widely used in text search. “A name doesn't need to mean anything.” True, but I was searching for a plausible origin of the name that would justify naming the article after it.--88.75.180.85 (talk) 16:04, 8 December 2010 (UTC)[reply]
I started this page when I was looking at approximate matching. I'm sure a lot of programmers found this page educational, if not extremely useful. So I think it should remain focused on explaining this simple, fast approximate matching algorithm. If the Baeza–Yates–Gonnet algorithm is used extensively in exact matching applications, then start a new page. -- Nic Roets (talk) 17:16, 9 December 2010 (UTC)[reply]
Nic, thanks for the explanation. That sounds like a reasonable approach. I’ll probably do that once I find the time (currently busy finishing my master’s thesis).--130.133.50.96 (talk) 13:31, 10 December 2010 (UTC)[reply]

Google, an important player[edit]

Google uses the bitap algorithm in many places. Free code here:

http://code.google.com/p/google-diff-match-patch/ — Preceding unsigned comment added by Maasha (talkcontribs) 08:21, 30 March 2011 (UTC)[reply]

Improve code quality[edit]

While there are comments in the code, they cover the trivial things and miss the non-trivial ones. There's no need to explain that malloc allocates memory, but it would make a lot of sense if the author of the code explained why R[m] has to be non-zero in order to... what? Also, in my humble opinion, the code would benefit a lot from:

  1. Giving proper names to the variables ("proper" meaning a whole word that reflects the purpose of the variable; `R' is not a whole word and reflects nothing. What's more, it is good practice to name all variables with the same case convention, so `R' next to `m' is the opposite of good practice). One possible shape is sketched below this list.
  2. Writing a short description of what the function does (for example: given the string TEXT, search for the string PATTERN inside TEXT and return... whoops, there is no consensus on what the function returns; the value appears to be somewhat arbitrary in corner cases).
  3. Removing the Captain Obvious comments. It is trivial knowledge that malloc allocates memory; only complex or non-intuitive parts need commenting.
  4. Adding comments where the code is not trivial. For example, it is not clear why the outer loop should terminate when R[m] is non-zero (hint: this is the most important part of the algorithm).
  5. Now you will hate me even more, but... spacing and indentation! At least be consistent. Why write x=y in one place and x = y in another?
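For what it's worth, a sketch of what points 1–4 could look like, applied to the bitwise snippet quoted higher up on this page (the names are only suggestions, not anything from the article):

#include <limits.h>
#include <string.h>

/*
 * Search for the first occurrence of `pattern` in `text` using the bitwise
 * (shift-or) form of bitap.  Returns a pointer to the start of the first
 * match, or NULL if there is no match or the pattern is longer than 31
 * characters (it has to fit in one unsigned long with a spare bit).
 */
const char *bitap_search(const char *text, const char *pattern)
{
    size_t pattern_length = strlen(pattern);
    unsigned long match_state;                 /* bit i clear = pattern[0..i] matched ending here */
    unsigned long char_mask[UCHAR_MAX + 1];    /* bit i clear = pattern[i] equals this character */
    size_t i;

    if (pattern_length == 0) return text;
    if (pattern_length > 31) return NULL;

    for (i = 0; i <= UCHAR_MAX; ++i)
        char_mask[i] = ~0UL;
    for (i = 0; i < pattern_length; ++i)
        char_mask[(unsigned char) pattern[i]] &= ~(1UL << i);

    match_state = ~1UL;
    for (i = 0; text[i] != '\0'; ++i) {
        match_state <<= 1;
        match_state |= char_mask[(unsigned char) text[i]];
        /* The whole pattern has been seen when bit (pattern_length - 1) is clear. */
        if (0 == (match_state & (1UL << (pattern_length - 1))))
            return text + (i + 1 - pattern_length);
    }
    return NULL;
}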

79.176.215.66 (talk) 13:36, 31 October 2012 (UTC)[reply]

External links modified[edit]

Hello fellow Wikipedians,

I have just modified one external link on Bitap algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 09:05, 3 November 2016 (UTC)[reply]

The Myers algorithm is not a modification of bitap[edit]

Let me preface this by saying that it is true that both the bitap algorithm and the Myers algorithm are bit-parallel approximate string matching algorithms. But they are based on completely different techniques.

The bitap algorithm can be thought of as being based on nondeterministic automata[1]. The Myers algorithm uses a completely different technique based on the dynamic programming matrix[2]. So it is incorrect to say that Myers modified the algorithm for long patterns. I've removed the line that said that the bitap algorithm was modified by Myers in 1998 for long patterns.

References

  1. ^ Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365. See section 7.2.1.
  2. ^ Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365. See section 7.3.2.
  2. ^ Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:https://doi.org/10.1145/375360.375365. {{cite journal}}: Check |doi= value (help); External link in |doi= (help) See section 7.3.2