# Boyer–Moore string-search algorithm

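| Property | Value |
|---|---|
| Class | String search |
| Data structure | String |
| Worst-case performance | Θ(m) preprocessing + O(mn) matching[note 1] |
| Best-case performance | Θ(m) preprocessing + Ω(n/m) matching |
| Worst-case space complexity | Θ(k+m)[note 2] |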

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the pattern is much shorter than the text or where it persists across multiple searches. The Boyer–Moore algorithm uses information gathered during the preprocess step to skip sections of the text, resulting in a lower constant factor than many other string search algorithms. In general, the algorithm runs faster as the pattern length increases. The key features of the algorithm are to match on the tail of the pattern rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single character in the text.

## Definitions

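| A | N | P | A | N | M | A | N | - |
|---|---|---|---|---|---|---|---|---|
| P | A | N | - | - | - | - | - | - |
| - | P | A | N | - | - | - | - | - |
| - | - | P | A | N | - | - | - | - |
| - | - | - | P | A | N | - | - | - |
| - | - | - | - | P | A | N | - | - |
| - | - | - | - | - | P | A | N | - |

Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.
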
• T denotes the input text to be searched. Its length is n.
• P denotes the string to be searched for, called the pattern. Its length is m.
• S[i] denotes the character at index i of string S, counting from 1.
• S[i..j] denotes the substring of string S starting at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in range [1, l], where l is the length of S.
• A suffix of S is a substring S[i..l] for some i in range [1, l], where l is the length of S.
• An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.
• A match or occurrence of P occurs at an alignment k if P is equivalent to T[(k-m+1)..k].
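
The definitions above can be made concrete with a minimal sketch that lists every alignment of the example pattern PAN against the text ANPANMAN. The helper name `alignments` is hypothetical, and the code converts the article's 1-based indices into Python's 0-based slices.

```python
def alignments(P: str, T: str) -> list[tuple[int, str, bool]]:
    """List (k, T[(k-m+1)..k], is_match) for every alignment k (1-based)."""
    m, n = len(P), len(T)
    result = []
    for k in range(m, n + 1):      # k = m .. n, so n - m + 1 alignments in total
        window = T[k - m:k]        # 1-based T[(k-m+1)..k] as a 0-based slice
        result.append((k, window, window == P))
    return result


for k, window, is_match in alignments("PAN", "ANPANMAN"):
    print(k, window, "match" if is_match else "")
```

This is the brute-force view of the problem: it inspects all $n-m+1$ alignments, which is exactly the work that Boyer–Moore tries to avoid.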

## Description

The Boyer–Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are $n-m+1$), Boyer–Moore uses information gained by preprocessing P to skip as many alignments as possible.

Before the introduction of this algorithm, the usual way to search within text was to examine each character of the text for the first character of the pattern. Once that was found, the subsequent characters of the text would be compared to the characters of the pattern. If no match occurred, the text would again be checked character by character in an effort to find a match. Thus almost every character in the text had to be examined.

The key insight in this algorithm is that if the end of the pattern is compared to the text, then jumps along the text can be made rather than checking every character of the text. The reason that this works is that in lining up the pattern against the text, the last character of the pattern is compared to the character in the text. If the characters do not match, there is no need to continue searching backwards along the text. If the character in the text does not match any of the characters in the pattern, then the next character in the text to check is located m characters farther along the text, where m is the length of the pattern. If the character in the text is in the pattern, then a partial shift of the pattern along the text is done to line up along the matching character and the process is repeated. Jumping along the text to make comparisons rather than checking every character in the text decreases the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.

More formally, the algorithm begins at alignment $k=m$, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index m in P and k in T, moving backward. The strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted forward (to the right) according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.
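
The control flow described above can be sketched as a loop that is independent of any particular shift rule. In this minimal sketch the function name `search_right_to_left` and the `shift_rule` callback are hypothetical. The default callback always returns 1, which gives a brute-force search that merely compares from right to left; the real algorithm would plug in its table lookups there.

```python
from typing import Callable


def search_right_to_left(
    P: str,
    T: str,
    shift_rule: Callable[[int, int], int] = lambda i, k: 1,
) -> list[int]:
    """Return the 0-based start positions of P in T.

    shift_rule(i, k) receives the 0-based mismatch index i in P
    (-1 after a full match) and the 0-based alignment k.
    """
    m, n = len(P), len(T)
    matches: list[int] = []
    if m == 0 or m > n:
        return matches
    k = m - 1                          # text index under the last character of P
    while k < n:
        i, h = m - 1, k                # compare P[i] with T[h], moving backward
        while i >= 0 and P[i] == T[h]:
            i -= 1
            h -= 1
        if i < 0:                      # the beginning of P was reached: a match
            matches.append(k - m + 1)
        k += max(1, shift_rule(i, k))  # always move forward by at least one
    return matches
```

For example, `search_right_to_left("PAN", "ANPANMAN")` returns `[2]`, the 0-based counterpart of the match at alignment k=5.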

## Shift rules

A shift is calculated by applying two rules: the bad-character rule and the good-suffix rule. The actual shifting offset is the maximum of the shifts calculated by these rules.

### The bad-character rule

#### Description

| - | - | - | - | X | - | - | K | - | - | - |
|---|---|---|---|---|---|---|---|---|---|---|
| A | N | P | A | N | M | A | N | A | M | - |
| - | N | N | A | A | M | A | N | - | - | - |
| - | - | - | N | N | A | A | M | A | N | - |

Demonstration of bad-character rule with pattern P = NNAAMAN. There is a mismatch between N (in the input text) and A (in the pattern) in the column marked with an X. The pattern is shifted right (in this case by 2) so that the next occurrence of the character N in P to the left of the current character (the middle A) is aligned with the mismatched N in the text.

The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a shift is proposed that moves the entirety of P past the point of mismatch.
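
The rule can be stated directly in code before any table is built. The sketch below (the function name `bad_character_shift` is hypothetical) scans the pattern leftwards from the mismatch position on every call, so it takes time linear in the pattern length rather than the constant time that the preprocessed tables provide. Indices are 0-based.

```python
def bad_character_shift(P: str, i: int, c: str) -> int:
    """Shift proposed when text character c mismatches P[i] (0-based i)."""
    for j in range(i - 1, -1, -1):  # look for c to the left of the mismatch
        if P[j] == c:
            return i - j            # align that occurrence with c in the text
    return i + 1                    # c is absent to the left: move P past the mismatch
```

With the figure's example, `bad_character_shift("NNAAMAN", 3, "N")` gives a shift of 2: the mismatch is at the middle A (0-based index 3) and the nearest N to its left is at index 1.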

#### Preprocessing

Methods vary on the exact form the table for the bad-character rule should take, but a simple constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and second by the index i in the pattern. This lookup returns the occurrence of c in P with the next-highest index $j<i$, or -1 if there is no such occurrence. The proposed shift is then $i-j$, with $O(1)$ lookup time and $O(km)$ space, assuming a finite alphabet of size k.

The C and Java implementations below use a table with $O(k)$ space complexity (make_delta1, makeCharTable). This is the same as the original delta1 and the BMH bad-character table. The table maps a character at position $i$ to a shift of $\operatorname{len}(p)-1-i$, with the last instance (the one with the least shift amount) taking precedence. All unused characters are set to $\operatorname{len}(p)$ as a sentinel value.
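
A compact table of this kind might look as follows in Python. This is only a sketch: the names `make_char_table` and `char_shift` are hypothetical, and a dictionary with a default value stands in for the fixed-size array that the C and Java versions allocate.

```python
def make_char_table(P: str) -> dict[str, int]:
    """Map each character of P to len(P) - 1 - i for its last position i."""
    m = len(P)
    table: dict[str, int] = {}
    for i, c in enumerate(P):
        table[c] = m - 1 - i        # later occurrences overwrite earlier ones
    return table


def char_shift(table: dict[str, int], P: str, c: str) -> int:
    """Look up the shift for c; characters absent from P get len(P)."""
    return table.get(c, len(P))
```

For P = NNAAMAN the table is `{"N": 0, "A": 1, "M": 2}`. A value of 0 for the last pattern character is harmless only when another rule, such as the good-suffix rule, guarantees a positive shift.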

### The good-suffix rule

#### Description

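| - | - | - | - | X | - | - | K | - | - | - | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M | A | N | P | A | N | A | M | A | N | A | P | - |
| A | N | A | M | P | N | A | M | - | - | - | - | - |
| - | - | - | - | A | N | A | M | P | N | A | M | - |

Demonstration of good-suffix rule with pattern P = ANAMPNAM. Here, t is T[6..8] and t' is P[2..4].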

The good-suffix rule is markedly more complex in both concept and implementation than the bad-character rule. Like the bad-character rule, it also exploits the algorithm's feature of comparisons beginning at the end of the pattern and proceeding towards the pattern's start. It can be described as follows:

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left.

1. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P aligns with substring t in T.
2. If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T.
3. If no such shift is possible, then shift P by m (length of P) places to the right.
4. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T.
5. If no such shift is possible, then shift P by m places, that is, shift P past t.

#### Preprocessing

The good-suffix rule requires two tables: one for use in the general case, and another for use when either the general case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their definitions are as follows:

For each i, $L[i]$ is the largest position less than m such that string $P[i..m]$ matches a suffix of $P[1..L[i]]$ and such that the character preceding that suffix is not equal to $P[i-1]$. $L[i]$ is defined to be zero if there is no position satisfying the condition.

Let $H[i]$ denote the length of the largest suffix of $P[i..m]$ that is also a prefix of P, if one exists. If none exists, let $H[i]$ be zero.

Both of these tables are constructible in $O(m)$ time and use $O(m)$ space. The alignment shift for index i in P is given by $m-L[i]$ or $m-H[i]$. H should only be used if $L[i]$ is zero or a match has been found.
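
The two definitions can be checked on small patterns with a brute-force sketch. It is deliberately not the $O(m)$ construction: it simply tries every candidate position. The names `good_suffix_L` and `good_suffix_H` are hypothetical, the returned lists are 1-based (index 0 is unused), and one assumption is made explicit in the code: a copy that starts at the very beginning of P has no preceding character and is treated as satisfying the "preceding character differs" condition.

```python
def good_suffix_L(P: str) -> list[int]:
    """Brute-force L table, 1-based; L[i] == 0 means no valid position."""
    m = len(P)
    L = [0] * (m + 1)
    for i in range(2, m + 1):               # suffix P[i..m] in 1-based terms
        suffix = P[i - 1:]
        s = len(suffix)
        for j in range(m - 1, s - 1, -1):   # candidate end positions j < m, largest first
            start = j - s                   # 0-based start of the candidate copy
            if P[start:j] != suffix:
                continue
            # Assumption: a copy at the very start of P has no preceding
            # character and therefore counts as "differing".
            if start == 0 or P[start - 1] != P[i - 2]:
                L[i] = j
                break
    return L


def good_suffix_H(P: str) -> list[int]:
    """Brute-force H table, 1-based: longest suffix of P[i..m] that is a prefix of P."""
    m = len(P)
    H = [0] * (m + 1)
    for i in range(1, m + 1):
        for length in range(m - i + 1, 0, -1):
            if P[:length] == P[m - length:]:
                H[i] = length
                break
    return H
```

For P = ANAMPNAM this yields $L[6] = 4$, so the shift $m - L[6] = 8 - 4 = 4$ agrees with the figure, where t' = P[2..4] is moved under t = T[6..8].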

## The Galil rule

A simple but important optimization of Boyer–Moore was put forth by Zvi Galil in 1979. As opposed to shifting, the Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring T[(k2 - m + 1)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1. In addition to increasing the efficiency of Boyer–Moore, the Galil rule is required for proving linear-time execution in the worst case.

The Galil rule, in its original version, is only effective for versions that output multiple matches. It updates the substring range only on c = 0, i.e. a full match. A generalized version for dealing with submatches was reported in 1985 as the Apostolico–Giancarlo algorithm.

## Performance

The Boyer–Moore algorithm as presented in the original paper has worst-case running time of $O(n+m)$ only if the pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977, followed by Guibas and Odlyzko in 1980 with an upper bound of 5n comparisons in the worst case. Richard Cole gave a proof with an upper bound of 3n comparisons in the worst case in 1991.

When the pattern does occur in the text, running time of the original algorithm is $O(nm)$ in the worst case. This is easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil rule results in linear runtime across all cases.
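
As a rough worked count (an illustration under stated assumptions, not a figure taken from the cited proofs), let the pattern be m copies of one character and the text n copies of the same character. Every one of the $n-m+1$ alignments is then a match, and the shift after each match is 1:

```latex
\begin{aligned}
\text{without the Galil rule:}\quad & m\,(n-m+1) \text{ comparisons} = O(nm),\\
\text{with the Galil rule:}\quad & m + (n-m) = n \text{ comparisons} = O(n).
\end{aligned}
```

The second line assumes that the first alignment costs m comparisons and that each later alignment compares only the single text character not already covered by the previous match. For m = 4 and n = 20 this gives 68 comparisons against 20.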

## Implementations

Various implementations exist in different programming languages. In C++ it has been part of the Standard Library since C++17, and Boost provides a generic Boyer–Moore search implementation in its Algorithm library. In Go there is an implementation in search.go. D uses a BoyerMooreFinder for predicate-based matching within ranges as a part of the Phobos Runtime Library.

The Boyer–Moore algorithm is also used in GNU's grep.

### Python implementation

```python
# This version is sensitive to the English alphabet in ASCII for case-insensitive matching.
# To remove this feature, define alphabet_index as ord(c), and replace instances of "26"
# with "256" or any maximum code-point you want. For Unicode you may want to match in UTF-8
# bytes instead of creating a 0x10FFFF-sized table.

ALPHABET_SIZE = 26

def alphabet_index(c: str) -> int:
    """Return the index of the given character in the English alphabet, counting from 0."""
    val = ord(c.lower()) - ord("a")
    assert val >= 0 and val < ALPHABET_SIZE
    return val

def match_length(S: str, idx1: int, idx2: int) -> int:
    """Return the length of the match of the substrings of S beginning at idx1 and idx2."""
    if idx1 == idx2:
        return len(S) - idx1
    match_count = 0
    while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
        match_count += 1
        idx1 += 1
        idx2 += 1
    return match_count

def fundamental_preprocess(S: str) -> list[int]:
    """Return Z, the Fundamental Preprocessing of S.

    Z[i] is the length of the substring beginning at i which is also a prefix of S.
    This pre-processing is done in O(n) time, where n is the length of S.
    """
    if len(S) == 0:  # Handles case of empty string
        return []
    if len(S) == 1:  # Handles case of single-character string
        return [1]
    z = [0 for x in S]
    z[0] = len(S)
    z[1] = match_length(S, 0, 1)
    for i in range(2, 1 + z[1]):  # Optimization from exercise 1-5
        z[i] = z[1] - i + 1
    # Defines lower and upper limits of z-box
    l = 0
    r = 0
    for i in range(2 + z[1], len(S)):
        if i <= r:  # i falls within existing z-box
            k = i - l
            b = z[k]
            a = r - i + 1
            if b < a:  # b ends within existing z-box
                z[i] = b
            else:  # b ends at or after the end of the z-box, we need to do an explicit match to the right of the z-box
                z[i] = a + match_length(S, a, r + 1)
                l = i
                r = i + z[i] - 1
        else:  # i does not reside within existing z-box
            z[i] = match_length(S, 0, i)
            if z[i] > 0:
                l = i
                r = i + z[i] - 1
    return z

def bad_character_table(S: str) -> list[list[int]]:
    """
    Generates R for S, which is an array indexed by the position of some character c in the
    English alphabet. At that index in R is an array of length |S|+1, specifying for each
    index i in S (plus the index after S) the next location of character c encountered when
    traversing S from right to left starting at i. This is used for a constant-time lookup
    for the bad-character rule in the Boyer-Moore string search algorithm, although it has
    a much larger size than non-constant-time solutions.
    """
    if len(S) == 0:
        return [[] for a in range(ALPHABET_SIZE)]
    R = [[-1] for a in range(ALPHABET_SIZE)]
    alpha = [-1 for a in range(ALPHABET_SIZE)]
    for i, c in enumerate(S):
        alpha[alphabet_index(c)] = i
        for j, a in enumerate(alpha):
            R[j].append(a)
    return R

def good_suffix_table(S: str) -> list[int]:
    """
    Generates L for S, an array used in the implementation of the strong good-suffix rule.
    L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches
    a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to
    shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]]
    matches the substring of T matched by a suffix of P in the previous match attempt.
    Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given
    by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used.
    Since only proper suffixes matter, L[0] = -1.
    """
    L = [-1 for c in S]
    N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
    N.reverse()
    for j in range(0, len(S) - 1):
        i = len(S) - N[j]
        if i != len(S):
            L[i] = j
    return L

def full_shift_table(S: str) -> list[int]:
    """
    Generates F for S, an array used in a special case of the good-suffix rule in the Boyer-Moore
    string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a
    prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the
    text T is len(P) - F[i] for a mismatch occurring at i-1.
    """
    F = [0 for c in S]
    Z = fundamental_preprocess(S)
    longest = 0
    for i, zv in enumerate(reversed(Z)):
        longest = max(zv, longest) if zv == i + 1 else longest
        F[-i - 1] = longest
    return F

def string_search(P, T) -> list[int]:
    """
    Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
    in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
    amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
    sublinear) time, where m is the length of T. This implementation performs a case-insensitive
    search on ASCII alphabetic characters, spaces not included.
    """
    if len(P) == 0 or len(T) == 0 or len(T) < len(P):
        return []

    matches = []

    # Preprocessing
    R = bad_character_table(P)
    L = good_suffix_table(P)
    F = full_shift_table(P)

    k = len(P) - 1      # Represents alignment of end of P relative to T
    previous_k = -1     # End of the previous full match (Galil's rule), -1 if none applies
    while k < len(T):
        i = len(P) - 1  # Character to compare in P
        h = k           # Character to compare in T
        while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
            i -= 1
            h -= 1
        if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
            matches.append(k - len(P) + 1)
            # Galil's rule: after shifting by the period of P, the prefix of P that still
            # overlaps T[..k] is already known to match, so remember where this match ended.
            previous_k = k
            k += len(P) - F[1] if len(P) > 1 else 1
        else:  # No match, shift by max of bad character and good-suffix rules
            char_shift = i - R[alphabet_index(T[h])][i]
            if i + 1 == len(P):  # Mismatch happened on first attempt
                suffix_shift = 1
            elif L[i + 1] == -1:  # Matched suffix does not appear anywhere in P
                suffix_shift = len(P) - F[i + 1]
            else:               # Matched suffix appears in P
                suffix_shift = len(P) - 1 - L[i + 1]
            shift = max(char_shift, suffix_shift)
            # After a mismatch nothing is known about the prefix of the shifted pattern
            # (a bad-character shift gives no such guarantee), so the Galil range is reset.
            previous_k = -1
            k += shift
    return matches
```

### C implementation

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_LEN 256
#define max(a, b) (((a) < (b)) ? (b) : (a))

// delta1 table: delta1[c] contains the distance between the last
// character of pat and the rightmost occurrence of c in pat.
//
// If c does not occur in pat, then delta1[c] = patlen.
// If c is at string[i] and c != pat[patlen-1], we can safely shift i
//   over by delta1[c], which is the minimum distance needed to shift
//   pat forward to get string[i] lined up with some character in pat.
// c == pat[patlen-1] returning zero is only a concern for BMH, which
//   does not have delta2. BMH makes the value patlen in such a case.
//   This code keeps the original 0, which is harmless here because
//   delta2 always supplies a positive shift.
//
// This algorithm runs in alphabet_len+patlen time.
void make_delta1(ptrdiff_t *delta1, uint8_t *pat, size_t patlen) {
    for (size_t i = 0; i < ALPHABET_LEN; i++) {
        delta1[i] = (ptrdiff_t)patlen;
    }
    for (size_t i = 0; i < patlen; i++) {
        delta1[pat[i]] = (ptrdiff_t)(patlen - 1 - i);
    }
}

// true if the suffix of word starting from word[pos] is a prefix
// of word
bool is_prefix(uint8_t *word, size_t wordlen, ptrdiff_t pos) {
    size_t suffixlen = wordlen - (size_t)pos;
    // could also use the strncmp() library function here
    // return ! strncmp(word, &word[pos], suffixlen);
    for (size_t i = 0; i < suffixlen; i++) {
        if (word[i] != word[pos + i]) {
            return false;
        }
    }
    return true;
}

// length of the longest suffix of word ending on word[pos].
// suffix_length("dddbcabc", 8, 4) = 2
size_t suffix_length(uint8_t *word, size_t wordlen, ptrdiff_t pos) {
    size_t i;
    // increment suffix length i to the first mismatch or beginning
    // of the word; the bound is tested first so that word[pos-i] is
    // never read before the start of the array
    for (i = 0; (i <= (size_t)pos) && (word[pos - i] == word[wordlen - 1 - i]); i++);
    return i;
}

// GOOD-SUFFIX RULE.
// delta2 table: given a mismatch at pat[pos], we want to align
// with the next possible full match, based on what we know about
// pat[pos+1] to pat[patlen-1].
//
// In case 1:
// pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat,
// the next plausible match starts at or after the mismatch.
// If, within the substring pat[pos+1 .. patlen-1], lies a prefix
// of pat, the next plausible match is here (if there are multiple
// prefixes in the substring, pick the longest). Otherwise, the
// next plausible match starts past the character aligned with
// pat[patlen-1].
//
// In case 2:
// pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The
// mismatch tells us that we are not looking at the end of a match.
// We may, however, be looking at the middle of a match.
//
// The first loop, which takes care of case 1, is analogous to
// the KMP table, adapted for a 'backwards' scan order with the
// additional restriction that the substrings it considers as
// potential prefixes are all suffixes. In the worst case scenario
// pat consists of the same letter repeated, so every suffix is
// a prefix. This loop alone is not sufficient, however:
// Suppose that pat is "ABYXCDBYX", and text is ".....ABYXCDEYX".
// We will match X, Y, and find B != E. There is no prefix of pat
// in the suffix "YX", so the first loop tells us to skip forward
// by 9 characters.
// Although superficially similar to the KMP table, the KMP table
// relies on information about the beginning of the partial match
// that the BM algorithm does not have.
//
// The second loop addresses case 2. Since suffix_length may not be
// unique, we want to take the minimum value, which will tell us
// how far away the closest potential match is.
void make_delta2(ptrdiff_t *delta2, uint8_t *pat, size_t patlen) {
    ptrdiff_t p;
    size_t last_prefix_index = 1;

    // first loop
    for (p = (ptrdiff_t)patlen - 1; p >= 0; p--) {
        if (is_prefix(pat, patlen, p + 1)) {
            last_prefix_index = (size_t)p + 1;
        }
        delta2[p] = (ptrdiff_t)(last_prefix_index + (patlen - 1 - (size_t)p));
    }

    // second loop
    for (p = 0; p < (ptrdiff_t)patlen - 1; p++) {
        size_t slen = suffix_length(pat, patlen, p);
        // slen > p means the match reached the start of pat, so there is
        // no preceding character to compare (and pat[p - slen] must not be read)
        if (slen > (size_t)p || pat[p - slen] != pat[patlen - 1 - slen]) {
            delta2[patlen - 1 - slen] = (ptrdiff_t)(patlen - 1 - (size_t)p + slen);
        }
    }
}

// Returns pointer to first match.
uint8_t* boyer_moore(uint8_t *string, size_t stringlen, uint8_t *pat, size_t patlen) {
    // The empty pattern must be considered specially; it is handled
    // first so that no zero-length array is created below
    if (patlen == 0) {
        return string;
    }

    ptrdiff_t delta1[ALPHABET_LEN];
    ptrdiff_t delta2[patlen]; // C99 VLA
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    size_t i = patlen - 1;        // str-idx
    while (i < stringlen) {
        ptrdiff_t j = (ptrdiff_t)patlen - 1; // pat-idx
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            return &string[i + 1];
        }

        ptrdiff_t shift = max(delta1[string[i]], delta2[j]);
        i += (size_t)shift;
    }
    return NULL;
}
```

### Java implementation

```java
// Wrapper class added so that the methods compile on their own.
public final class BoyerMoore {
    /**
     * Returns the index within this string of the first occurrence of the
     * specified substring. If it is not a substring, return -1.
     *
     * There is no Galil because it only generates one match.
     *
     * @param haystack The string to be scanned
     * @param needle The target string to search
     * @return The start index of the substring
     */
    public static int indexOf(char[] haystack, char[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        int[] charTable = makeCharTable(needle);
        int[] offsetTable = makeOffsetTable(needle);
        for (int i = needle.length - 1, j; i < haystack.length;) {
            for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
                if (j == 0) {
                    return i;
                }
            }
            // i += needle.length - j; // For naive method
            i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
        }
        return -1;
    }

    /**
     * Makes the jump table based on the mismatched character information.
     */
    private static int[] makeCharTable(char[] needle) {
        final int ALPHABET_SIZE = Character.MAX_VALUE + 1; // 65536
        int[] table = new int[ALPHABET_SIZE];
        for (int i = 0; i < table.length; ++i) {
            table[i] = needle.length;
        }
        for (int i = 0; i < needle.length; ++i) {
            table[needle[i]] = needle.length - 1 - i;
        }
        return table;
    }

    /**
     * Makes the jump table based on the scan offset at which the mismatch occurs.
     */
    private static int[] makeOffsetTable(char[] needle) {
        int[] table = new int[needle.length];
        int lastPrefixPosition = needle.length;
        for (int i = needle.length; i > 0; --i) {
            if (isPrefix(needle, i)) {
                lastPrefixPosition = i;
            }
            table[needle.length - i] = lastPrefixPosition - i + needle.length;
        }
        for (int i = 0; i < needle.length - 1; ++i) {
            int slen = suffixLength(needle, i);
            table[slen] = needle.length - 1 - i + slen;
        }
        return table;
    }

    /**
     * Is needle[p:end] a prefix of needle?
     */
    private static boolean isPrefix(char[] needle, int p) {
        for (int i = p, j = 0; i < needle.length; ++i, ++j) {
            if (needle[i] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the maximum length of the substring that ends at p and is a suffix.
     * (good-suffix rule)
     */
    private static int suffixLength(char[] needle, int p) {
        int len = 0;
        for (int i = p, j = needle.length - 1;
                 i >= 0 && needle[i] == needle[j]; --i, --j) {
            len += 1;
        }
        return len;
    }
}
```

## Variants

The Boyer–Moore–Horspool algorithm is a simplification of the Boyer–Moore algorithm using only the bad-character rule.
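
A minimal sketch of this simplification is shown below. The function name `horspool_search` is hypothetical, and a dictionary replaces the fixed-size table. The shift table is built from every pattern character except the last, so each shift is at least 1 even though there is no good-suffix rule to fall back on.

```python
def horspool_search(P: str, T: str) -> list[int]:
    """Return the 0-based start positions of P in T using only a bad-character table."""
    m, n = len(P), len(T)
    if m == 0 or m > n:
        return []
    # Distance from each character's last occurrence (excluding the final
    # pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(P[:-1])}
    matches = []
    k = m - 1                      # text index under the last character of P
    while k < n:
        if T[k - m + 1:k + 1] == P:
            matches.append(k - m + 1)
        k += shift.get(T[k], m)    # the shift depends only on the text character T[k]
    return matches
```

Here the window is compared with a slice for brevity; a production version would compare character by character from the right and stop at the first mismatch.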

The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths requires an additional table equal in size to the text being searched.

The Raita algorithm improves the performance of the Boyer–Moore–Horspool algorithm. It differs in the order in which pattern characters are compared with the text: the last, first, and middle characters of the pattern are checked before the remaining ones.