Talk:Bitap algorithm

Someone wrote 'it performs best on patterns less than a constant length'. Such statements should not be made without adequate analysis. It is true that a 33-character pattern may take twice as long as a 32-character one, but if the algorithm beats the hell out of all its competitors, or it takes 2 nanoseconds instead of 1, then there's no reason not to use it.

Then the paragraph says the complexity in this case is O(m+n). But if m is limited to 32, then O(m+n) is the same as O(n), because a bounded m contributes only a constant.

Perhaps we should say that for arbitrary m and n, the algorithm has complexity O(kmn). Now this may look inefficient, but if you consider that modern processors can perform in the region of 64 billion of these operations every second, you'll understand why the algorithm is so fast. (unsigned comment by User:Nroets, 8 Nov 2005)
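To put a number on that (my own illustrative figures, not Nroets's): with k = 2 errors, m = 32 pattern characters and n = 10^6 text characters, kmn is about 6.4 × 10^7 word operations, which takes well under a tenth of a second even at a conservative 10^9 operations per second.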

First point: Bitap does perform best on patterns less than the word length of the machine, because then it can be optimized to use bitwise operations and shifts. If you want to use bitap for longer patterns, you'll need to "pessimize" it either by using the array-of-BIT approach given in the article's first snippet, or by doing awkward things with carried bits. This has nothing to do with comparing bitap to other search algorithms; it's simply stating that bitap itself performs better on small patterns than on long ones.
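To make "awkward things with carried bits" concrete, here is a rough sketch (mine, not from the article or any paper) of how the single-word update R = (R << 1) | mask generalizes to a pattern longer than one machine word; the names WORDS and shift_or_update are purely illustrative:

#include <limits.h>

#define WORDS 4  /* words holding the m-bit state, m <= WORDS * word size */

/* Multi-word version of the update  R = (R << 1) | pattern_mask[c]:
   shift the whole WORDS-word state left by one bit, propagating the
   top bit of each word into the bottom bit of the next, then OR in
   the (equally wide) mask for the current text character. */
static void shift_or_update(unsigned long R[WORDS], const unsigned long mask[WORDS])
{
    unsigned long carry = 0;  /* a zero bit is shifted into position 0, as in the one-word code */
    int w;
    for (w = 0; w < WORDS; ++w) {
        unsigned long out = R[w] >> (sizeof(unsigned long) * CHAR_BIT - 1);
        R[w] = ((R[w] << 1) | carry) | mask[w];
        carry = out;
    }
}

Each text character now costs O(m / word size) word operations instead of O(1), which is the sense in which long patterns are slower.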
Second point: On rereading, I can't figure out why O(m+n), either. So I've changed it to O(mn) — we iterate once for each of n text characters, and O(m) times over the bit array for each of those characters. There is an additional O(m) term for setting up the bit array, and in the article's optimized implementations there is an O(k) term where k is the size of the alphabet. So it's more like O(mn+n+k), but the O(mn) term is definitely the dominant one in non-pathological cases. --Quuxplusone 17:19, 8 November 2005 (UTC)
I think you can say O(m+n) assuming that the inner loop (iteration over the bit array) can be implemented with a shift and an OR operation, each taking constant time. Since this seems to be the whole point of the algorithm (hence the name "shift-or"), I find it misleading to read O(mn). If I'm right, maybe this could be clarified in the article. -- TripleF (talk) 17:14, 28 February 2009 (UTC)

Comment on code sample

User:134.2.247.43 added a comment to the code in the article pointing out a potential error. I'm not judging at the moment whether they're correct, but it should be discussed here on the talk page to avoid self-contradiction in the article. Dcoetzee 20:25, 20 October 2008 (UTC)


#include <limits.h>
#include <string.h>  /* for strlen(), which the snippet uses */

const char *bitap_bitwise_search(const char *text, const char *pattern)
{
    int m = strlen(pattern);
    unsigned long R;
    unsigned long pattern_mask[CHAR_MAX+1];
    int i;

    if (pattern[0] == '\0') return text;
    if (m > 31) return "The pattern is too long!";
 
    /* Initialize the bit array R */
    R = ~1;  /* I think this is wrong, because then we have no match if text starts with pattern. It should be (~1)<<1. */

    /* Initialize the pattern bitmasks */
    for (i=0; i <= CHAR_MAX; ++i)
      pattern_mask[i] = ~0;
    for (i=0; i < m; ++i)
      pattern_mask[pattern[i]] &= ~(1UL << i);

    for (i=0; text[i] != '\0'; ++i) {
        /* Update the bit array */
        R <<= 1;          
        R |= pattern_mask[text[i]]; 
        
        if (0 == (R & (1UL << (m - 1))))
          return (text+i - m) + 1;
    }

    return NULL;
}

Tracing the code (or thirty seconds with a test case and a C compiler) shows that User:134.2.247.43's concern is unfounded. --Quuxplusone (talk) 06:19, 4 January 2009 (UTC)
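For anyone who wants to repeat the experiment, a minimal harness along these lines (the test strings are mine, not from the article) does the trick when compiled together with the snippet above:

#include <stdio.h>

/* Prototype for the function quoted above; paste it into the same file or link it in. */
const char *bitap_bitwise_search(const char *text, const char *pattern);

int main(void)
{
    /* Text that begins with the pattern, the case 134.2.247.43 was worried about. */
    const char *hit = bitap_bitwise_search("abcdef", "abc");
    printf("%s\n", hit ? hit : "(no match)");  /* prints "abcdef", i.e. a match at offset 0 */

    /* The one-character case discussed below. */
    hit = bitap_bitwise_search("a", "a");
    printf("%s\n", hit ? hit : "(no match)");  /* prints "a" */

    return 0;
}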

I believe the original is shift then AND, instead of AND then shift. Matching "a" against "a" will clearly fail. Another place that might be wrong is the fuzzy search. Can Quuxplusone comment on this? Weipingh (talk) 03:17, 15 November 2009 (UTC)

Sorry, I hadn't checked the talk page in a long time. bitap_bitwise_search("a","a") returns "a", which indicates success. Changing the code (as some un-logged-in editor did a while back, and I just saw and reverted) seems to work too — and I'm pretty sure the two versions are equivalent — but the shift-then-or version makes the complicated expression (0 == (R & (1UL << m))) a little more complicated. --Quuxplusone (talk) 18:22, 5 May 2010 (UTC)
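For reference, the other ordering, as I understand the reverted edit, ORs first and then shifts, and tests bit m rather than bit m-1. The following is a sketch for comparison only, not a quote of either version of the article, and the name bitap_bitwise_search_v2 is mine:

#include <limits.h>
#include <string.h>

/* Same search with the OR-then-shift ordering; a match is signalled by bit m. */
const char *bitap_bitwise_search_v2(const char *text, const char *pattern)
{
    int m = strlen(pattern);
    unsigned long R = ~1UL;
    unsigned long pattern_mask[CHAR_MAX + 1];
    int i;

    if (pattern[0] == '\0') return text;
    if (m > 31) return "The pattern is too long!";

    /* Initialize the pattern bitmasks, exactly as in the quoted snippet. */
    for (i = 0; i <= CHAR_MAX; ++i)
        pattern_mask[i] = ~0UL;
    for (i = 0; i < m; ++i)
        pattern_mask[pattern[i]] &= ~(1UL << i);

    for (i = 0; text[i] != '\0'; ++i) {
        /* Update the bit array: OR first, then shift. */
        R |= pattern_mask[text[i]];
        R <<= 1;

        if (0 == (R & (1UL << m)))
            return (text + i - m) + 1;
    }

    return NULL;
}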

Canonical name

Where does “bitap” come from? Why is it the canonical name for this article? I’m assuming that it stands for “bitwise approximate”, in which case it’s wildly inaccurate, since one of the algorithms (arguably the fundamental one, of which the others are variations) does exact matching; only certain extensions allow approximate matching. Furthermore, none of the original papers use that name, nor can I find it in the literature (in particular, Navarro & Raffinot). I propose to drop the name and/or split the article up.--87.77.158.189 (talk) 15:40, 7 December 2010 (UTC)