Optimal matching

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

Algorithm[edit]

Let S = (s_1, s_2, s_3, \ldots s_T) be a sequence of states s_i belonging to a finite set of possible states. Let us denote {\mathbf S} the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators a_i: {\mathbf S} \rightarrow {\mathbf S}. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

  • one state s is inserted in the sequence a^{\rm Ins}_{s'} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T)
  • one state is deleted from the sequence a^{\rm Del}_{s_2} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots  s_T) and
  • a state s_1 is replaced (substituted) by state s'_1, a^{\rm Sub}_{s_1,s'_1} (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T).

Imagine now that a cost c(a_i) \in {\mathbf R}^+_0 is associated to each operator. Given two sequences S_1 and S_2, the idea is to measure the cost of obtaining S_2 from S_1 using operators from the algebra. Let A={a_1, a_2, \ldots a_n} be a sequence of operators such that the application of all the operators of this sequence A to the first sequence S_1 gives the second sequence S_2: S_2 = a_1 \circ a_2 \circ \ldots \circ a_{n} (S_1) where a_1 \circ a_2 denotes the compound operator. To this set we associate the cost c(A) = \sum_{i=1}^n c(a_i), that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences A that transform S_1 into S_2; a reasonable choice is to select the cheapest of such sequences. We thus call distance
d(S_1,S_2)= \min_A \left \{ c(A)~{\rm such~that}~S_2 = A (S_1)  \right \}
that is, the cost of the least expensive set of transformations that turn S_1 into S_2. Notice that d(S_1,S_2) is by definition nonnegative since it is the sum of positive costs, and trivially d(S_1,S_2)=0 if and only if S_1=S_2, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal c(a^{\rm Ins}) = c(a^{\rm Del}); the term indel cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

Criticism[edit]

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs c(a_i).

Optimal matching in causal modelling[edit]

Optimal matching is also a term used in statistical modelling of causal effects. In this context it refers to matching "cases" with "controls", and is completely separate from the sequence-analytic sense.

Software[edit]

  • TDA is a powerful program, offering access to some of the latest developments in transition data analysis.
  • STATA has implemented a package to run optimal matching analysis.
  • TraMineR is an open source R-package for analysing and visualizing states and events sequences, including optimal matching analysis.

References and notes[edit]

  1. ^ A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research], Vol. 29, 3-33. doi:10.1177/0049124100029001001
  2. ^ L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003