String metric

From Wikipedia, the free encyclopedia
Jump to: navigation, search
"String distance" redirects here. For the distance between strings and the fingerboard in musical instruments, see Action (music).

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison and in fuzzy string searching. For example the strings "Sam" and "Samuel" can be considered to be similar. A string metric provides a number indicating an algorithm-specific indication of similarity.

The most widely known string metric is a rudimentary one called the Levenshtein Distance (also known as Edit Distance). It operates between two input strings, returning a score equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

A widespread example of a string metric is DNA sequence analysis and RNA analysis, which are performed by optimised string metrics to identify matching sequences.

String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, Web interfaces, e.g. Ajax-style suggestions as you type, data integration, and semantic knowledge integration.

List of string metrics[edit]

Selected string measures examples[edit]

Name Example
Hamming distance "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance
  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).
Needleman–Wunsch distance or Sellers' algorithm
Sequences    Best Alignments
---------    ----------------------
Block distance or L1 distance or City block distance dist((2,2),(4,5))=abs(4-2)+abs(5-2)=5
Jaro–Winkler distance JaroWinklerDist("MARTHA","MARHTA")=d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-\frac{countTranspositions(MARTHA[3]!=H + MARHTA[3]!=T)}{2}}{6}\right) = 0.944
Soundex distance metric "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150"
Dice's coefficient

We would find the set of bigrams in each word:


Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.

Most frequent k characters MostFreqKeySimilarity('research', 'seeking', 2) = 2
Lee distance LeeDistance("3140","2543") is sum([3-2, 5-1, 4-4, 0-3]): 1 + 2 + 0 + 3 = 6.

See also[edit]

External links[edit]