User:RMcPhillip/sandbox/boyer-moore

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The Boyer–Moore string search algorithm[1] is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature.[2]It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

How the algorithm works[edit]

The Boyer-Moore algorithm performs its matching – perhaps surprisingly – backwards, from right to left. For instance, to start searching a string for the pattern "AT-THAT" it checks position 6 of the string (counting from 0 at the left) to see if it contains an T. If it finds the T, it checks position 5 for an A, position 4 for an H and so on until it checks the position 0 of the string for an A. Why Boyer-Moore takes this backward approach becomes clear when we consider what happens if the match fails—for instance, if instead of a T in the seventh position, we find an F. The F doesn't appear anywhere in "AT-THAT", and this means there is no match for the pattern at the very start of the string—or at the next six positions following it, since those would all fall across the F as well. After checking the seven characters of the word "AT-THAT" for just one character F, we are able to skip ahead and start looking for a match ending at the position 13 of the string. If instead of an F we find some character that does occur in the pattern, such as “-“, we are also able to skip ahead. The skip amount depends on the where the matching character appears last in the pattern. The earlier, the greater the skip amount.

The algorithm has a second feature that improves performance when some suffix of the pattern recurs earlier in the pattern as “AT” does in “AT-THAT”. If we have a mismatch on H after matching “AT”, we can skip ahead seven characters. Example 1d, below, illustrates this.

Navarro (reference here) has shown that the average performance of the algorithm for a string of length n and a fixed pattern of length m is n / m: in the best case, only one in m characters needs to be checked. It also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.

The following example, which is taken from the original Boyer-Moore paper, illustrates all these features. We use “↓” to indicate the current character and ≠ to indicate where a mismatch is found.

Example 1a
current:             ↓
pattern:       AT-THAT
string:    ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
mismatch:            ≠

We start matching at the pattern’s right end and proceed from right to left. A mismatch occurs immediately – F ≠ T – and since F does not occur in the pattern, we can shift the pattern right by seven and begin matching again:

Example 1b
current:                    ↓
pattern:              AT-THAT
string:    ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
mismatch:                   ≠

Again there is an immediate mismatch – dash ≠ T – however dash does occur in the pattern. We shift the pattern to align the dashes and begin matching again:

Example 1c
current:                        ↓
pattern:                  AT-THAT
string:    ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
mismatch:                      ≠

This time we match the Ts but a mismatch occurs when comparing L in the string with A in the pattern. Since L does not occur in the pattern, we can shift it right by six, which is the length of the unmatched portion of the pattern. (Previously, when we found F ≠ T, the unmatched portion was seven characters long.)

Example 1d
current:                              ↓
pattern:                        AT-THAT
string:    ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
mismatch:                           ≠

Now the second feature -- suffix recurrence -- comes into play. The pattern’s suffix “AT” matches the current characters in the string and a mismatch is found where dash ≠ H. We can shift the pattern right to align the recurring “AT” with the matched characters in the string.

Example 1e
current:                                   ↓
pattern:                             AT-THAT
string:    ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
match:                              =

Finally we are able to match the entire pattern with characters in the string. The search scanned a 22 character string while doing only 14 character matches of which seven were in the final match.

The algorithm proceeds by iteratively matching, shifting the pattern, matching, shifting, etc. The shift amount, as mentioned above, is based on one of two features. It can be (1) based on the mismatched character and where if anywhere it occurs in the pattern, or (2) based on the pattern suffix that was successfully matched and where it recurs in the pattern. At each mismatch, the shift amount is computed using both features and the larger of the two is used.

The matching algorithm can quickly determine the shift amount, delta, from precomputed tables. Boyer-Moore refers to these tables as delta1 – the “bad character” table – and delta2 – the “good suffix” table.

Delta1 - The “Bad Character” Table[edit]

This table contains an entry for every character in the alphabet. The entry for char specifies how far the pattern should be right shifted when charis found in the string and it does not match the current pattern character.

The table is constructed as follows. If a character does not occur in the pattern then its entry in the table is set to patlen - 1, the length of the pattern. If a character does occur in the pattern then its entry is set to patlen - j - 1 where j is its rightmost position in the pattern, counting from zero. (Note: Boyer-Moore counts character positions from 1 on the left. We count from 0 on the left so as to match the example C++ code below.)

Delta2 - The “Good Suffix” Table[edit]

This table contains an entry for each character in the pattern. The entry for pattern[j] specifies how far the current string position should shift to the right when pattern[j-1] does not match the string but the suffix at pattern[j .. patlen-1] does match.

Its construction is less obvious than the delta1 table and might best be explained after first introducing several terms. We use this table when some maximal suffix of the pattern matches some substring found in the string and when that suffix recurs earlier (to the left) in the pattern. The substring is delimited by the character before the suffix, the pre-suffix character, which does not match the character before the substring. (If these two characters match then the suffix is not maximal.)

In example 1d, suffix “AT” in “AT-THAT” matches substring “AT” in “...-AT-T...”. The substring length is 2 because the pre-suffix character H does not match the dash that precedes the substring. The suffix recurs at the left of the pattern, “AT-THAT”.

To populate the delta2 table, we need to determine for each position j in the pattern where the suffix starting at j+1 recurs. We are interested in only the rightmost recurrence. (For pattern “AT-THAT”, the suffix T occurs at the far right and the rightmost recurrence of T is “AT-THAT”.) In addition Boyer-Moore deems a recurrence to be “plausible” only if either:

(a) its preceding character differs from the pre-suffix character, or
(b) there is no preceding character because the recurrence is at the left of the pattern.

For example, in pattern “AT-FAT”, the first occurrence of T is not a plausible recurrence of suffix T because both Ts are preceded by A. If a string matches on T but mismatches on A then shifting the pattern right by 4 would not be optimal because we would immediately mismatch on A again.

So, to populate the delta2 table we must determine the rightmost plausible recurrence (RPR) of each character in the pattern.

At first glance it might appear that there is no RPR for many suffixes; what is the RPR for “HAT” in “AT-THAT”? Boyer-Moore handles this by saying any suffix character matches pattern[j] where j < 0. For example:

Example 2a
j -2 -1 0 1 2 3 4 5 6
suffix A T - T H A T
RPR * A T - T H A T

RPR(3) is -1 because the suffix at pattern[4] recurs at pattern[-1].

Note: pattern[-1] matches pattern[4] because -1 < 0,pattern[0] = pattern[5] = A, pattern[1] = pattern[6] = T.

The recurrence is plausible since it is at the left of the pattern.

Boyer-Moore shows that delta2[j] is simply patlen - rpr[j].


The following examples, derived from Boyer-Moore, illustrate this:

Example 2b
j 0 1 2 3 4 5 6
pattern A T - T H A T
rpr(j) -4 -3 -2 -1 0 3 6
delta2(j) 11 10 9 8 7 4 1
Example 2c
j 0 1 2 3 4 5 6 7 8
pattern A B C X X X A B C
rpr(j) -5 -4 -3 -2 -1 0 -2 -1 8
delta2(j) 14 13 12 11 10 9 11 10 1
Example 2d
j 0 1 2 3 4 5 6 7 8
pattern A B Y X C D E Y X
rpr(j) -8 -7 -6 -5 -4 -3 2 -1 8
delta2(j) 17 16 15 14 13 12 7 10 1

Performance of the Boyer-Moore string search algorithm[edit]

The worst-case to find all occurrences in a text needs approximately comparisons, hence the complexity is , regardless whether the text contains a match or not.[3] This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than ; in 1980 it was shown to be no more than, until Cole's result in Sep 1991.

Implementation[edit]

C++[edit]

#include <string.h>  // strlen()
#include <stdlib.h>  // __max()

#define ALPHABET_SIZE (1 << (sizeof(char)*8))

// Enable any/all to trace intermediate results
// #define TRACE_DELTA1
// #define TRACE_DELTA2
// #define TRACE_BM

#if defined TRACE_DELTA1 || defined TRACE_DELTA2 || defined TRACE_BM
#include <stdio.h>
#include <ctype.h>
#endif

void calc_delta1(const char *pat, int patlen, int delta1[]) 
{
   for (int j = 0; j < ALPHABET_SIZE; j++)
      delta1[j] = patlen;

   for (int j = 0; j < patlen; j++)
       // By scanning pat from left to right, the final 
       // value in delta1[char] is the *rightmost* occurrence of
       // char in pat
      delta1[pat[j]] = patlen - j - 1;

#ifdef TRACE_DELTA1
    printf("[ delta1\n");
    for (int j = 0; j < ALPHABET_SIZE; j++) {
        if (delta1[j] != patlen) {
            printf("       %c:%d\n", (char)j, delta1[j]);
        }
    }
    printf("  others:%d\n", patlen);
    printf("] delta1\n");
#endif
}

void calc_delta2(const char *pat, int patlen, int * delta2)
{
    // rpr[j] : where we can find rightmost plausible recurrence of pat[j+1 .. patlen-1]
    int *rpr = new int[patlen];

    // Mark each uninitialized rpr value with a large negative index
    const int def = -2*patlen;
    for (int i = 0; i != patlen; i++) 
        rpr[i] = def;

    // r: number of uninitialized entries in rpr[]
    int r = patlen;

    // Scan pattern from right-to-left until all rpr[] are initialized.
    // s: scan position.  
    // Examine all substrings that end at pat[s] including null string pat[s .. s]
    for (int s = patlen - 1; r > 0; s--) {
        // m: length of substring  pat[s-m .. s]
        for (int m = 0; m <= patlen - 1 && r > 0; m++) {
            // Introduce j and k (as used in the BM paper)
            // j: index of leftmost character of suffix
            int j = patlen - m - 1;
            // k: index of leftmost character of (possible) recurrence.
            int k = s - m;

#ifdef TRACE_DELTA2
            const int indent = patlen;
            printf("\nj:%d k:%d\n", j, k);
            printf("p  :%*s%s\n", indent, "", pat);
            printf("j  :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j] );
            printf(  "k-1:%*s", indent+k-1, "");
            for (int n = 0; n <= m; n++) {
                printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]) );
            }
            printf("\n");
#endif
            // We have a match of pat[j+1 .. j+1+m] with pat[k .. k+m]
            // Compare pat[j] to pat[k-1].
            // Match: extend the substring to the left by increasing m
            // Mismatch: terminate the substring and check if plausible RPR

            bool mismatch = false;
            if (k > 0) {
                if (pat[j] == pat[k-1]) // extend substring
                    continue;
                mismatch = true;
            }
            // else preceding char, pat[k-1] lies to the left of pat[0]
            // which terminates the substring

            // We have a match of m (possibly zero) characters.
            // pat[j+1 .. j+1+m] matches pat[k .. k+m] and
            // either pat[j] != pat[k-1] or k <= 0.
            // So rpr[j] = k (unless rpr[j] is already > k)
            if (rpr[j] < k) {
#ifdef TRACE_DELTA2
                printf("2  :%*s%c%*.*s%*s s:%d m:%d j[%d]:%d r:%d\n",
                        indent+j, "",
                        toupper(pat[j]),
                        m, m, &pat[j+1],
                        (patlen-j-1-m), "",
                        s, m, j, k, r);
#endif
                rpr[j] = k;
                r--;
            }
#ifdef TRACE_DELTA2
            else {
                printf("rpr[%d]=%d already\n", j, rpr[j]); 
            }
#endif
            // Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless
            // to examine further substrings ending at pat[s].
            if (mismatch)
                break;
        }
    }

    for (int j = 0; j != patlen; j++) {
        delta2[j] = patlen - rpr[j];
    }
#ifdef TRACE_DELTA2
    printf("R:"); // trace rpr[] values
    for (int j = 0; j != patlen; j++) {
        printf(" %3d", rpr[j] );
    }
    printf("\n");
    printf("D:"); // trace delta2[] values
    for (int j = 0; j != patlen; j++) {
        printf(" %3d", delta2[j] );
    }
    printf("\n");
#endif
    delete [] rpr;
}

/*
* Boyer-Moore search algorithm
*/
const char *boyermoore_search(const char * string, const char *pat) 
{
    const char *result = NULL;

    int patlen = strlen(pat);
    int *delta1 = NULL;
    int *delta2 = NULL;

    if (patlen == 0)
        goto out;

    int stringlen = strlen(string);

    if (patlen > stringlen)
        goto out;

    delta1 = new int[ALPHABET_SIZE];
    delta2 = new int[patlen];

#ifdef TRACE_BM
    printf("p:%s\n", pat);
#endif
    calc_delta1(pat, patlen, delta1);
    calc_delta2(pat, patlen, delta2) ;

    // i: index of current string character
    for (int i = patlen-1;;) {
        if (i > stringlen) {
            result = NULL;
            goto out;
        }

        // j: index of current pattern character
        int j = patlen-1;
        for (;;) {
            if (j == 0) {
                result = &string[i];
                goto out;
            }
            if (string[i] == pat[j]) {
#ifdef TRACE_BM
                printf("p:%*s%*.*s%c%*.*s\n",
                    (i-j), "",
                    j, j, pat,
                    toupper(pat[j]), // mark matched char with upcase
                    patlen-j-1, patlen-j-1, &pat[j+1]);
#endif
                j--;
                i--;
                continue;
            }
            break;
        }
#ifdef TRACE_BM
        printf("p:%*s%*.*s%c%*.*s\n",
            (i-j), "",
            j, j, pat,
            L'?', // mark mismatch char
            patlen-j-1, patlen-j-1, &pat[j+1]); // which-finally-halts.--at-that-point ...
        printf("c:%s\n", string);
#endif
        // bc: "bad character" shift amount
        int bc = delta1[string[i]];

        // gs: "good suffix" shift amount
        int gs = delta2[j];
#ifdef TRACE_BM
        printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
#endif
        i += __max(bc, gs);
    }

    /* not found */
out:
    delete [] delta1;
    delete [] delta2;
    return result;
}

Variants[edit]

The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within comparisons (as opposed to for Boyer-Moore), where is the number of characters in the text to be searched.[4]

The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out delta2, the Good Suffix table. The Boyer-Moore-Horspool algorithm requires (in the worst case) comparisons, while the Boyer-Moore algorithm requires (in the worst case) only comparisons.

See also[edit]

References[edit]

  1. ^ R. S. Boyer (1977). "A fast string searching algorithm". Comm. ACM. 20: 762–772. doi:10.1145/359842.359859.  Unknown parameter |coauthors= ignored (|author= suggested) (help)
  2. ^ Hume and Sunday (1991) [Fast String Searching] SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)
  3. ^ Richard Cole (1991). "Tight bounds on the complexity of the Boyer-Moore algorithm". Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 224–233. 
  4. ^ http://www-igm.univ-mlv.fr/~lecroq/string/node15.html

External links[edit]


Category:Algorithms on strings Category:Search algorithms Category:Articles with example C code