# Jaro–Winkler distance

(Redirected from Jaro distance)

In computer science and statistics, the Jaro–Winkler distance is a string metric for measuring the edit distance between two sequences. It is a variant proposed in 1999 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro). Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro–Winkler distance uses a prefix scale ${\displaystyle p}$ which gives more favourable ratings to strings that match from the beginning for a set prefix length ${\displaystyle \ell }$.

The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is given by 1 − Jaro–Winkler distance.

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.

## Definition

### Jaro distance

The Jaro distance ${\displaystyle d_{j}}$ of two given strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$ is

${\displaystyle d_{j}=\left\{{\begin{array}{l l}0&{\text{if }}m=0\\{\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)&{\text{otherwise}}\end{array}}\right.}$

Where:

• ${\displaystyle |s_{i}|}$ is the length of the string ${\displaystyle s_{i}}$;
• ${\displaystyle m}$ is the number of matching characters (see below);
• ${\displaystyle t}$ is half the number of transpositions (see below).

Two characters from ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$ respectively, are considered matching only if they are the same and not farther than ${\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}$.

Each character of ${\displaystyle s_{1}}$ is compared with all its matching characters in ${\displaystyle s_{2}}$. The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions. For example, in comparing CRATE with TRACE, only 'R' 'A' 'E' are the matching characters, i.e. m=3. Although 'C', 'T' appear in both strings, they are farther than 1, i.e., floor(5/2)-1=1. Therefore, t=0 . In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed.

### Jaro–Winkler distance

Jaro–Winkler distance uses a prefix scale ${\displaystyle p}$ which gives more favourable ratings to strings that match from the beginning for a set prefix length ${\displaystyle \ell }$. Given two strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$, their Jaro–Winkler distance ${\displaystyle d_{w}}$ is:

${\displaystyle d_{w}=d_{j}+(\ell p(1-d_{j})),}$

where:

• ${\displaystyle d_{j}}$ is the Jaro distance for strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$
• ${\displaystyle \ell }$ is the length of common prefix at the start of the string up to a maximum of four characters
• ${\displaystyle p}$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. ${\displaystyle p}$ should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is ${\displaystyle p=0.1}$

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.[1] The Jaro–Winkler distance also does not satisfy that axiom that states that ${\displaystyle d(x,y)=0\leftrightarrow x=y}$.

In some implementations of Jaro–Winkler, the prefix bonus ${\displaystyle \ell p(1-d_{j})}$ is only added when the compared strings have a Jaro distance above a set "boost threshold" ${\displaystyle b_{t}}$. The boost threshold in Winkler's implementation was 0.7.

${\displaystyle d_{w}=\left\{{\begin{array}{l l}d_{j}&{\text{if }}d_{j}

## Example

Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro–Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings.

### Example #1

Given the strings ${\displaystyle s_{1}}$ MARTHA and ${\displaystyle s_{2}}$ MARHTA we find:

 M A R T H A M 1 0 0 0 0 0 A 0 1 0 0 0 0 R 0 0 1 0 0 0 H 0 0 0 0 1 0 T 0 0 0 1 0 0 A 0 0 0 0 0 1

Here, the shaded cells are the match window for each character. A 1 in a cell indicates a match.

• ${\displaystyle m=6}$
• ${\displaystyle |s_{1}|=6}$
• ${\displaystyle |s_{2}|=6}$
• There are mismatched characters T/H and H/T leading to ${\displaystyle t={\frac {2}{2}}=1}$

We find a Jaro score of:

${\displaystyle d_{j}={\frac {1}{3}}\left({\frac {6}{6}}+{\frac {6}{6}}+{\frac {6-1}{6}}\right)=0.944}$

To find the Jaro–Winkler score using the standard weight ${\displaystyle p=0.1}$, we continue to find:

${\displaystyle \ell =3}$

Thus:

${\displaystyle d_{w}=0.944+(3*0.1(1-0.944))=0.961}$

### Example #2

Given the strings ${\displaystyle s_{1}}$ DWAYNE and ${\displaystyle s_{2}}$ DUANE we find:

• ${\displaystyle m=4}$
• ${\displaystyle |s_{1}|=6}$
• ${\displaystyle |s_{2}|=5}$
• ${\displaystyle t=0}$

We find a Jaro score of:

${\displaystyle d_{j}={\frac {1}{3}}\left({\frac {4}{6}}+{\frac {4}{5}}+{\frac {4-0}{4}}\right)=0.822}$

To find the Jaro–Winkler score using the standard weight ${\displaystyle p=0.1}$, we continue to find:

${\displaystyle \ell =1}$

Thus:

${\displaystyle d_{w}=0.822+(1*0.1(1-0.822))=0.84}$

### Example #3

Given the strings ${\displaystyle s_{1}}$ DIXON and ${\displaystyle s_{2}}$ DICKSONX we find:

D I X O N 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0

Here, the shaded cells are the match window for each character. A 1 in a cell indicates a match. Note that the two Xs are not considered matches because they are outside the match window of 3.

• ${\displaystyle m=4}$
• ${\displaystyle |s_{1}|=5}$
• ${\displaystyle |s_{2}|=8}$
• ${\displaystyle t=0}$

We find a Jaro score of:

${\displaystyle d_{j}={\frac {1}{3}}\left({\frac {4}{5}}+{\frac {4}{8}}+{\frac {4-0}{4}}\right)=0.767}$

To find the Jaro–Winkler score using the standard weight ${\displaystyle p=0.1}$, one continues to find:

${\displaystyle \ell =2}$.

Thus:

${\displaystyle d_{w}=0.767+(2*0.1(1-0.767))=0.814}$

## Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.