# Optimal matching

Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

## Algorithm

Let $S = (s_1, s_2, s_3, \ldots s_T)$ be a sequence of states $s_i$ belonging to a finite set of possible states. Let us denote ${\mathbf S}$ the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators $a_i: {\mathbf S} \rightarrow {\mathbf S}$. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

• one state $s$ is inserted in the sequence $a^{\rm Ins}_{s'} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T)$
• one state is deleted from the sequence $a^{\rm Del}_{s_2} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots s_T)$ and
• a state $s_1$ is replaced (substituted) by state $s'_1$, $a^{\rm Sub}_{s_1,s'_1} (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T)$.

Imagine now that a cost $c(a_i) \in {\mathbf R}^+_0$ is associated to each operator. Given two sequences $S_1$ and $S_2$, the idea is to measure the cost of obtaining $S_2$ from $S_1$ using operators from the algebra. Let $A={a_1, a_2, \ldots a_n}$ be a sequence of operators such that the application of all the operators of this sequence $A$ to the first sequence $S_1$ gives the second sequence $S_2$: $S_2 = a_1 \circ a_2 \circ \ldots \circ a_{n} (S_1)$ where $a_1 \circ a_2$ denotes the compound operator. To this set we associate the cost $c(A) = \sum_{i=1}^n c(a_i)$, that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences $A$ that transform $S_1$ into $S_2$; a reasonable choice is to select the cheapest of such sequences. We thus call distance
$d(S_1,S_2)= \min_A \left \{ c(A)~{\rm such~that}~S_2 = A (S_1) \right \}$
that is, the cost of the least expensive set of transformations that turn $S_1$ into $S_2$. Notice that $d(S_1,S_2)$ is by definition nonnegative since it is the sum of positive costs, and trivially $d(S_1,S_2)=0$ if and only if $S_1=S_2$, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal $c(a^{\rm Ins}) = c(a^{\rm Del})$; the term indel cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

## Criticism

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs $c(a_i)$.

## Optimal matching in causal modelling

Optimal matching is also a term used in statistical modelling of causal effects. In this context it refers to matching "cases" with "controls", and is completely separate from the sequence-analytic sense.

## Software

• TDA is a powerful program, offering access to some of the latest developments in transition data analysis.
• STATA has implemented a package to run optimal matching analysis.
• TraMineR is an open source R-package for analysing and visualizing states and events sequences, including optimal matching analysis.

## References and notes

1. ^ A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research], Vol. 29, 3-33. doi:10.1177/0049124100029001001
2. ^ L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003