# Optimal matching

Not to be confused with maximum matching.

Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

## Algorithm

Let ${\displaystyle S=(s_{1},s_{2},s_{3},\ldots s_{T})}$ be a sequence of states ${\displaystyle s_{i}}$ belonging to a finite set of possible states. Let us denote ${\displaystyle {\mathbf {S} }}$ the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators ${\displaystyle a_{i}:{\mathbf {S} }\rightarrow {\mathbf {S} }}$. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

• one state ${\displaystyle s}$ is inserted in the sequence ${\displaystyle a_{s'}^{\rm {Ins}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s_{1},s_{2},s_{3},\ldots ,s',\ldots s_{T})}$
• one state is deleted from the sequence ${\displaystyle a_{s_{2}}^{\rm {Del}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s_{1},s_{3},\ldots s_{T})}$ and
• a state ${\displaystyle s_{1}}$ is replaced (substituted) by state ${\displaystyle s'_{1}}$, ${\displaystyle a_{s_{1},s'_{1}}^{\rm {Sub}}(s_{1},s_{2},s_{3},\ldots s_{T})=(s'_{1},s_{2},s_{3},\ldots s_{T})}$.

Imagine now that a cost ${\displaystyle c(a_{i})\in {\mathbf {R} }_{0}^{+}}$ is associated to each operator. Given two sequences ${\displaystyle S_{1}}$ and ${\displaystyle S_{2}}$, the idea is to measure the cost of obtaining ${\displaystyle S_{2}}$ from ${\displaystyle S_{1}}$ using operators from the algebra. Let ${\displaystyle A={a_{1},a_{2},\ldots a_{n}}}$ be a sequence of operators such that the application of all the operators of this sequence ${\displaystyle A}$ to the first sequence ${\displaystyle S_{1}}$ gives the second sequence ${\displaystyle S_{2}}$: ${\displaystyle S_{2}=a_{1}\circ a_{2}\circ \ldots \circ a_{n}(S_{1})}$ where ${\displaystyle a_{1}\circ a_{2}}$ denotes the compound operator. To this set we associate the cost ${\displaystyle c(A)=\sum _{i=1}^{n}c(a_{i})}$, that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences ${\displaystyle A}$ that transform ${\displaystyle S_{1}}$ into ${\displaystyle S_{2}}$; a reasonable choice is to select the cheapest of such sequences. We thus call distance
${\displaystyle d(S_{1},S_{2})=\min _{A}\left\{c(A)~{\rm {such~that}}~S_{2}=A(S_{1})\right\}}$
that is, the cost of the least expensive set of transformations that turn ${\displaystyle S_{1}}$ into ${\displaystyle S_{2}}$. Notice that ${\displaystyle d(S_{1},S_{2})}$ is by definition nonnegative since it is the sum of positive costs, and trivially ${\displaystyle d(S_{1},S_{2})=0}$ if and only if ${\displaystyle S_{1}=S_{2}}$, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal ${\displaystyle c(a^{\rm {Ins}})=c(a^{\rm {Del}})}$; the term indel cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

## Criticism

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs ${\displaystyle c(a_{i})}$.

## Optimal matching in causal modelling

Optimal matching is also a term used in statistical modelling of causal effects. In this context it refers to matching "cases" with "controls", and is completely separate from the sequence-analytic sense.

## Software

• TDA is a powerful program, offering access to some of the latest developments in transition data analysis.
• STATA has implemented a package to run optimal matching analysis.
• TraMineR is an open source R-package for analysing and visualizing states and events sequences, including optimal matching analysis.

## References and notes

1. ^ A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research], Vol. 29, 3-33. doi:10.1177/0049124100029001001
2. ^ L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003