Approximate string matching

From Wikipedia, the free encyclopedia

  (Redirected from Fuzzy string searching)
Jump to: navigation, search

In computing, approximate string matching is the technique of finding approximate matches to a pattern in a string.

Contents

[edit] Overview

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance — also called the Levenshtein distance — between the string and the pattern. The usual primitive operations are:[1]

  • insertion (e.g., changing cot to coat),
  • deletion (e.g. changing coat to cot), and
  • substitution (e.g. changing coat to cost).

Or an advanced overview is just

  • substitution (e.g., cot to coat where NULL has been substituted by 'a', coat to cot where 'a' has been substituted by NULL, coat to cost where 'a' has been substituted by 's').

Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, to be a primitive operation. Changing cost to cots is an example of a transposition.[1]

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is coil, foil differs by one substitution, coils by one insertion, oil by one deletion, and foal by two substitutions. If all operations count as a single unit of cost and the limit is set to one, foil, coils, and oil will count as matches while foal will not.

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers allow separate assignments of limits and weights to individual groups in the pattern.

Most approximate matchers used for text processing are regular expression matchers.[citation needed] The distance between a candidate and the pattern is therefore computed as the minimum distance between the candidate and a fixed string matching the regular expression. Thus, if the pattern is co.l, using the POSIX notation in which a dot matches any single character, both coal and coil are exact matches, while soil differs by one substitution.

[edit] Problem formulation and algorithms

One possible definition of the approximate string matching problem is the following: Given a pattern string P = p1p2...pm and a text string T = t_1t_2\dots t_n, find a substring T_{j',j} = t_{j'}\dots t_j in T, which, of all substrings of T, has the smallest edit distance to the pattern P.

A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n3 m)

A better solution[2][3], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the i first characters of the pattern, Pi, and any substring Tj',j of T that ends at position j.

For each position j in the text T, and each position i in the pattern P, go through all substrings of T ending at position j, and determine which one of them has the minimal edit distance to the i first characters of the pattern P. Write this minimal distance as E(ij). After computing E(ij) for all i and j, we can easily find a solution to the original problem: it is the substring for which E(mj) is minimal (m being the length of the pattern P.)

Computing E(mj) is very similar to computing the edit distance between two strings. In fact, we can use the edit distance computing algorithm for E(mj), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, if we used E(i − 1,j), E(i,j − 1) or E(i − 1,j − 1) in computing E(ij).

In the array containing the E(xy) values, we then choose the minimal value in the last row, let it be E(x2y2), and follow the path of computation backwards, back to the row number 0. If the field we arrived at was E(x1, 0), then T[x1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P.

Computing the E(xy) array takes O(mn) time with the dynamic programming algorithm, while the backwards-working phase takes O(n + m) time.

[edit] On-line versus off-line

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques do searching without an index. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher[1] and by Sellers. [2] Both algorithms are based on dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.

On-line searching techniques have been repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The Bitap algorithm is the heart of the Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro.[3]

Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing or indexing makes searching dramatically faster. Today, a variety of indexing algorithms have been presented. Among them are suffix trees[4], metric trees[5] and n-gram methods.[6][7] For a detailed list of indexing techniques see the paper of Navarro et al.[8]

[edit] Applications

The most common application of approximate matchers until recently has been spell checking.[9] With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application.[1] Approximate matching is also used to identify pieces of music from small snatches and in spam filtering.[10]

[edit] See also

[edit] External links

[edit] References

  1. ^ a b c Cormen, Thomas; Leiserson, Rivest (2001). Introduction to Algorithms (2nd ed.). MIT Press. pp. 364–367. ISBN 0-262-03293-7. 
  2. ^ Skiena, Steve (1998). Algorithm Design Manual (1st ed.). Springer. ISBN 978-0387948607. 
  3. ^ Ottmann, Thomas; Peter Widmayer (2002) (in German). Algorithmen und Datenstrukturen (4th ed.). Spektrum Akademischer Verlag. pp. 636–639. ISBN 3-8274-1029-0. 
Personal tools
Languages