# Jaro–Winkler distance

In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro).

The Jaro–Winkler distance uses a prefix scale ${\displaystyle p}$ which gives more favourable ratings to strings that match from the beginning for a set prefix length ${\displaystyle \ell }$.

The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is given by 1 − Jaro–Winkler distance.

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.

## Definition

### Jaro Similarity

The Jaro Similarity ${\displaystyle sim_{j}}$ of two given strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$ is

${\displaystyle sim_{j}=\left\{{\begin{array}{l l}0&{\text{if }}m=0\\{\frac {1}{3}}\left({\frac {m}{|s_{1}|}}+{\frac {m}{|s_{2}|}}+{\frac {m-t}{m}}\right)&{\text{otherwise}}\end{array}}\right.}$

Where:

• ${\displaystyle |s_{i}|}$ is the length of the string ${\displaystyle s_{i}}$;
• ${\displaystyle m}$ is the number of matching characters (see below);
• ${\displaystyle t}$ is half the number of transpositions (see below).

Two characters from ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$ respectively, are considered matching only if they are the same and not farther than ${\displaystyle \left\lfloor {\frac {\max(|s_{1}|,|s_{2}|)}{2}}\right\rfloor -1}$.

Each character of ${\displaystyle s_{1}}$ is compared with all its matching characters in ${\displaystyle s_{2}}$. The number of matching (but different sequence order) characters divided by 2 defines the number of transpositions. For example, in comparing CRATE with TRACE, only 'R' 'A' 'E' are the matching characters, i.e. m=3. Although 'C', 'T' appear in both strings, they are farther than 1 (the result of ${\displaystyle \lfloor {\tfrac {5}{2}}\rfloor -1}$). Therefore, t=0 . In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed.

### Jaro–Winkler Similarity

Jaro–Winkler similarity uses a prefix scale ${\displaystyle p}$ which gives more favorable ratings to strings that match from the beginning for a set prefix length ${\displaystyle \ell }$. Given two strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$, their Jaro–Winkler similarity ${\displaystyle sim_{w}}$ is:

${\displaystyle sim_{w}=sim_{j}+\ell p(1-sim_{j}),}$

where:

• ${\displaystyle sim_{j}}$ is the Jaro similarity for strings ${\displaystyle s_{1}}$ and ${\displaystyle s_{2}}$
• ${\displaystyle \ell }$ is the length of common prefix at the start of the string up to a maximum of four characters
• ${\displaystyle p}$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. ${\displaystyle p}$ should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is ${\displaystyle p=0.1}$

The Jaro-Winkler distance ${\displaystyle d_{w}}$ is defined as ${\displaystyle d_{w}=1-sim_{w}}$.

Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality.[1] The Jaro–Winkler distance also does not satisfy the identity axiom ${\displaystyle d(x,y)=0\leftrightarrow x=y}$.

## Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.