Boyer–Moore string search algorithm
|
|
This article may require cleanup to meet Wikipedia's quality standards. (Consider using more specific cleanup instructions.) Please help improve this article if you can. The talk page may contain suggestions. (May 2007) |
In computer science, the Boyer–Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature.[1] It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.
Contents |
[edit] How the algorithm works
| - | - | - | - | - | - | - | X | - | - | - | - | - | - | - |
| A | N | P | A | N | M | A | N | - | - | - | - | - | - | - |
| - | A | N | P | A | N | M | A | N | - | - | - | - | - | - |
| - | - | A | N | P | A | N | M | A | N | - | - | - | - | - |
| - | - | - | A | N | P | A | N | M | A | N | - | - | - | - |
| - | - | - | - | A | N | P | A | N | M | A | N | - | - | - |
| - | - | - | - | - | A | N | P | A | N | M | A | N | - | - |
| - | - | - | - | - | - | A | N | P | A | N | M | A | N | - |
| - | - | - | - | - | - | - | A | N | P | A | N | M | A | N |
What people frequently find surprising about the Boyer-Moore algorithm, when they first encounter it, is that its verifications—its attempts to check whether a match exists at a particular position—work backwards. If it starts a search at the beginning of a text for the word "ANPANMAN", for instance, it checks the eighth position of the text to see if it contains an "N". If it finds the "N", it moves to the seventh position to see if that contains the last "A" of the word, and so on until it checks the first position of the text for an "A".
Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails—for instance, if instead of an "N" in the eighth position, we find an "X". The "X" doesn't appear anywhere in "ANPANMAN", and this means there is no match for the search string at the very start of the text—or at the next seven positions following it, since those would all fall across the "X" as well. After checking the eight characters of the word "ANPANMAN" for just one character "X", we're able to skip ahead and start looking for a match ending at the sixteenth position of the text.
This explains why the best-case performance of the algorithm, for a text of length n and a fixed pattern of length m, is n / m: in the best case, only one in m characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.
The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the value of the character that caused the mismatch; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed. (Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables in computer science.) The algorithm will shift the larger of the two jump values when a mismatch occurs.
[edit] The first table
| - | - | - | - | A | M | A | N | - | - | - | - | - | - | - |
| A | N | P | A | N | M | A | N | - | - | - | - | - | - | - |
| - | A | N | P | A | N | M | A | N | - | - | - | - | - | - |
| - | - | A | N | P | A | N | M | A | N | - | - | - | - | - |
| - | - | - | A | N | P | A | N | M | A | N | - | - | - | - |
| - | - | - | - | A | N | P | A | N | M | A | N | - | - | - |
| - | - | - | - | - | A | N | P | A | N | M | A | N | - | - |
| - | - | - | - | - | - | A | N | P | A | N | M | A | N | - |
Populate the first table as follows. For each i less than the length of the search string, construct the pattern consisting of the last i characters of the string preceded by a mis-matched character, right-align the pattern and string, and record the fewest characters the pattern must shift left for a match.
For instance, for the search string ANPANMAN, the table would be as follows:
(NMAN signifies a substring in ANPANMAN consisting of a character that is not 'N' plus the characters 'MAN'.)
| i | Pattern | Left Shift |
|---|---|---|
| 0 | It is true that the next letter to the left in 'ANPANMAN' is not N (it is A), therefore the pattern |
|
| 1 | ||
| 2 | Substring |
|
| 3 | We see that ' |
|
| 4 | 6 | |
| 5 | 6 | |
| 6 | 6 | |
| 7 | 6 |
[edit] The second table
The second table is easier to calculate: Start at the last character of the sought string and move towards the first character. Each time you move left, if the character you are on is not in the table already, add it; its Shift value is its distance from the rightmost character. All other characters receive a count equal to the length of the search string.
Example: For the string ANPANMAN, the second table would be as shown (for clarity, entries are shown in the order they would be added to the table): (The N which is supposed to be zero is based on the second N from the right because we only record the calculation for the first m − 1 letters)
| Character | Shift |
|---|---|
| A | 1 |
| M | 2 |
| N | 3 |
| P | 5 |
| all other characters | 8 |
The amount of shift calculated by the second table is sometimes called the "bad character shift".[2]
[edit] Performance of the Boyer-Moore string search algorithm
The worst-case to find all occurrences in an aperiodic text needs approximately 3n comparisons, hence the complexity is O(n), regardless whether the text contains a match or not.[3] This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than 6n; in 1980 it was shown to be no more than 4n, until Cole's result in Sep 1991.
[edit] Implementation
[edit] C
#include <stdint.h> #include <stdlib.h> #define ALPHABET_LEN 255 #define NOT_FOUND patlen #define max(a, b) ((a < b) ? b : a) // delta1 table: delta1[c] contains the distance between the last // character of pat and the rightmost occurence of c in pat. // If c does not occur in pat, then delta1[c] = patlen. // If c is at string[i] and c != pat[patlen-1], we can // safely shift i over by delta1[c], which is the minimum distance // needed to shift pat forward to get string[i] lined up // with some character in pat. // this algorithm runs in alphabet_len+patlen time. void make_delta1(int *delta1, uint8_t *pat, int32_t patlen) { int i; for (i=0; i < ALPHABET_LEN; i++) { delta1[i] = NOT_FOUND; } for (i=0; i < patlen-1; i++) { delta1[pat[i]] = patlen-1 - i; } } // true if the suffix of word starting from word[pos] is a prefix // of word int is_prefix(uint8_t *word, int wordlen, int pos) { int i; int suffixlen = wordlen - pos; // could also use the strncmp() library function here for (i = 0; i < suffixlen; i++) { if (word[i] != word[pos+i]) { return 0; } } return 1; } // length of the longest suffix of word ending on word[pos]. // suffix_length("dddbcabc", 8, 4) = 2 int suffix_length(uint8_t *word, int wordlen, int pos) { int i; // increment suffix length i to the first mismatch or beginning // of the word for (i = 0; (word[pos-i] == word[wordlen-1-i]) && (i < pos); i++); return i; } // delta2 table: given a mismatch at pat[pos], we want to align // with the next possible full match could be based on what we // know about pat[pos+1] to pat[patlen-1]. // // In case 1: // pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat, // the next plausible match starts at or after the mismatch. // If, within the substring pat[pos+1 .. patlen-1], lies a prefix // of pat, the next plausible match is here (if there are multiple // prefixes in the substring, pick the longest). Otherwise, the // next plausible match starts past the character aligned with // pat[patlen-1]. // // In case 2: // pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The // mismatch tells us that we are not looking at the end of a match. // We may, however, be looking at the middle of a match. // // The first loop, which takes care of case 1, is analogous to // the KMP table, adapted for a 'backwards' scan order with the // additional restriction that the substrings it considers as // potential prefixes are all suffixes. In the worst case scenario // pat consists of the same letter repeated, so every suffix is // a prefix. This loop alone is not sufficient, however: // Suppose that pat is "ABYXCDEYX", and text is ".....ABYXCDEYX". // We will match X, Y, and find B != E. There is no prefix of pat // in the suffix "YX", so the first loop tells us to skip forward // by 9 characters. // Although superficially similar to the KMP table, the KMP table // relies on information about the beginning of the partial match // that the BM algorithm does not have. // // The second loop addresses case 2. Since suffix_length may not be // unique, we want to take the minimum value, which will tell us // how far away the closest potential match is. void make_delta2(int *delta2, uint8_t *pat, int32_t patlen) { int p; int last_prefix_index = patlen-1; // first loop for (p=patlen-1; p>=0; p--) { if (is_prefix(pat, patlen, p+1)) { last_prefix_index = p+1; } delta2[p] = last_prefix_index + (patlen-1 - p); } // second loop for (p=0; p < patlen-1; p++) { int slen = suffix_length(pat, patlen, p); if (pat[p - slen] != pat[patlen-1 - slen]) { delta2[patlen-1 - slen] = patlen-1 - p + slen; } } } char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) { int i; int delta1[ALPHABET_LEN]; int *delta2 = malloc(patlen * sizeof(int)); make_delta1(&delta1, pat, patlen); make_delta2(delta2, pat, patlen); i = patlen-1; while (i < stringlen) { int j = patlen-1; while (j >= 0 && (string[i] == pat[j])) { --i; --j; } if (j < 0) { free(delta2); return (string + i+1); } i += max(delta1[string[i]], delta2[j]); } free(delta2); return NULL; }
[edit] Java
/** * Returns the index within this string of the first occurrence of the * specified substring. If it is not a substring, return -1. * * @param haystack The string to be scanned * @param needle The target string to search * @return The start index of the substring */ public static int indexOf(char[] haystack, char[] needle) { if (needle.length == 0) { return 0; } int charTable[] = makeCharTable(needle); int offsetTable[] = makeOffsetTable(needle); for (int i = needle.length - 1, j; i < haystack.length;) { for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) { if (j == 0) { return i; } } // i += needle.length - j; // For naive method i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]); } return -1; } /** * Makes the jump table based on the mismatched character information. */ private static int[] makeCharTable(char[] needle) { final int ALPHABET_SIZE = 256; int[] table = new int[ALPHABET_SIZE]; for (int i = 0; i < table.length; ++i) { table[i] = needle.length; } for (int i = 0; i < needle.length - 1; ++i) { table[needle[i]] = needle.length - 1 - i; } return table; } /** * Makes the jump table based on the scan offset which mismatch occurs. */ private static int[] makeOffsetTable(char[] needle) { int[] table = new int[needle.length]; int lastPrefixPosition = needle.length; for (int i = needle.length - 1; i >= 0; --i) { if (isPrefix(needle, i + 1)) { lastPrefixPosition = i + 1; } table[needle.length - 1 - i] = lastPrefixPosition - i + needle.length - 1; } for (int i = 0; i < needle.length - 1; ++i) { int slen = suffixLength(needle, i); table[slen] = needle.length - 1 - i + slen; } return table; } /** * Is needle[p:end] a prefix of needle? */ private static boolean isPrefix(char[] needle, int p) { for (int i = p, j = 0; i < needle.length; ++i, ++j) { if (needle[i] != needle[j]) { return false; } } return true; } /** * Returns the maximum length of the substring ends at p and is a suffix. */ private static int suffixLength(char[] needle, int p) { int len = 0; for (int i = p, j = needle.length - 1; i >= 0 && needle[i] == needle[j]; --i, --j) { len += 1; } return len; }
[edit] Variants
The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within 2n comparisons (as opposed to 3n for Boyer-Moore), where n is the number of characters in the text to be searched.[4]
The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out the "first table". The Boyer-Moore-Horspool algorithm requires (in the worst case) mn comparisons, while the Boyer-Moore algorithm requires (in the worst case) only 3n comparisons.
[edit] See also
[edit] References
- ^ Hume and Sunday (1991) [Fast String Searching] SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 21(11), 1221–1248 (NOVEMBER 1991)
- ^ http://www.movsd.com/bm.htm
- ^ Richard Cole (1991). "Tight bounds on the complexity of the Boyer-Moore algorithm". Proceedings of the 2nd Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 224–233. http://portal.acm.org/citation.cfm?id=127830.
- ^ http://www-igm.univ-mlv.fr/~lecroq/string/node15.html
[edit] External links
- String Searching Applet animation
- Original article
- An example of the Boyer-Moore algorithm from the homepage of J Strother Moore, co-inventor of the algorithm
- An explanation of the algorithm (with sample C code)
- Cole et al., Tighter lower bounds on the exact complexity of string matching
- An implementation of the algorithm in Ruby
- Scala functional implementation with source code