# Jaro–Winkler distance

Jump to navigation Jump to search

In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro).

The Jaro–Winkler distance uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$ .

The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is the inversion, (1 − Jaro–Winkler distance).

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.

## Definition

### Jaro Similarity

The Jaro Similarity $sim_{j}$ of two given strings $s_{1}$ and $s_{2}$ is

$sim_{j}=\left\{{\begin{array}{l l}0&{\text{if }}m=0\\{\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)&{\text{otherwise}}\end{array}}\right.$ Where:

• $|s_{i}|$ is the length of the string $s_{i}$ ;
• $m$ is the number of matching characters (see below);
• $t$ is half the number of transpositions (see below).

Two characters from $s_{1}$ and $s_{2}$ respectively, are considered matching only if they are the same and not farther than $\left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1$ .

Each character of $s_{1}$ is compared with all its matching characters in $s_{2}$ . The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions. For example, in comparing CRATE with TRACE, only 'R' 'A' 'E' are the matching characters, i.e. m=3. Although 'C', 'T' appear in both strings, they are farther than 1 (the result of $\lfloor {\tfrac {5}{2}}\rfloor -1$ ). Therefore, t=0 . In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed.

### Jaro–Winkler Similarity

Jaro–Winkler similarity uses a prefix scale $p$ which gives more favorable ratings to strings that match from the beginning for a set prefix length $\ell$ . Given two strings $s_{1}$ and $s_{2}$ , their Jaro–Winkler similarity $sim_{w}$ is:

$sim_{w}=sim_{j}+\ell p(1-sim_{j}),$ where:

• $sim_{j}$ is the Jaro similarity for strings $s_{1}$ and $s_{2}$ • $\ell$ is the length of common prefix at the start of the string up to a maximum of four characters
• $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. $p$ should not exceed 0.25, otherwise the similarity could become larger than 1. The standard value for this constant in Winkler's work is $p=0.1$ The Jaro-Winkler distance $d_{w}$ is defined as $d_{w}=1-sim_{w}$ .

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom $d(x,y)=0\leftrightarrow x=y$ .

## Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

## Footnotes

1. ^ "Jaro-Winkler «  Inviting Epiphany". RichardMinerich.com. Retrieved 12 June 2017.