In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of macromolecules such as DNA, which can be viewed as strings of the letters A, C, G and T.
Several definitions of edit distance exist, using different sets of string operations. One of the most common variants is called Levenshtein distance, named after the Soviet Russian computer scientist Vladimir Levenshtein. In this version, the allowed operations are the removal or insertion of a single character, or the substitution of one character for another.
Formal definition and properties
Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:
- Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
- Deletion of a single symbol changes uxv to uv (x→ε).
- Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).
In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x y) with the operations.
Additional primitive operations have been suggested. A common mistake when typing text is transposition of two adjacent characters commonly occur, formally characterized by an operation that changes uxyv into uyxv where x, y ∈ Σ. For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice-versa.
The Levenshtein distance between "kitten" and "sitting" is 3. The minimal edit script that transforms the former into the latter is:
- kitten → sitten (substitution of "s" for "k")
- sitten → sittin (substitution of "i" for "e")
- sittin → sitting (insertion of "g" at the end).
String edit distance with unit cost, i.e. wins(x) = wdel(x) = 1, wsub(x,y) = [x = y], satisfies the axioms of a metric:
- d(a, a) = 0, since each string can be transformed to itself using zero-cost substitutions.
- d(a, b) > 0 when a ≠ b
- d(a, b) = d(b, a) by symmetry of the costs.
- Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).
Other useful properties include:
- Edit distance with unit cost is bounded above by the length of the longer of the two strings.
- When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w).
Using Levenshtein's original operations, the edit distance between a = a₁...aₘ and b = b₁...bₙ is given by dₘₙ, defined by the recurrence
- for 1 ≤ i ≤ m, 1 ≤ j ≤ n.
This algorithm can be generalized to handle transpositions by adding an additional term in the recursive clause's minimization.
The usual way to compute this recurrence is by using a dynamic programming algorithm that is usually credited to Wagner and Fisher, although it has a history of multiple invention. After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read of as a backtrace of the DP operations.
This algorithm has a time complexity of Θ(mn). When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. This optimization makes it impossible to read off the minimal series of edit operations, though.
Improving on the Wagner–Fisher algorithm described above, Ukkonen describes a variant that takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s²) or O(s), depending on whether the edit sequence needs to be read off.
Aside from Levenshtein distance and the variants already mentioned, some other well-known string metrics are special cases of edit distance, or strongly related:
- Hamming distance measures the number of transpositions needed to transform a string into another string of the same length. It can be obtained as an edit distance with substitution as the only allowed operation, restricted to equal-length strings, and for such strings is an upper bound on Levenshtein distance.
- Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.
- The longest common substring of two strings can be computed with a dynamic programming algorithm that is similar to the one for edit distance.
Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors. Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.
- Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
- Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string p, called the pattern, and a constant k; it then builds a deterministic finite state automaton that finds, in an arbitrary string s, a substring whose edit distance to p is at most k (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
- Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.
- Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
- Esko Ukkonen (1983). "On approximate string matching". Foundations of Computation Theory. Springer. pp. 487–495.
- Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition 5 (1): 67–85. doi:10.1007/s10032-002-0082-8. CiteSeerX: 10.1.1.16.652.
- Lei Chen; Raymond Ng (2004). "On the marriage of Lₚ-norms and edit distance". Proc. 30th Int'l Conf. on Very Large Databases (VLDB) 30.
- R. Wagner; M. Fisher (1974). "The string-to-string correction problem". J. ACM 21: 168–178.
- Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms 6: 132–137.