# Shortest common supersequence problem

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X or Y.

A shortest common supersequence (SCS) is a common supersequence of minimal length. In the shortest common supersequence problem, the two sequences X and Y are given and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique.

For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily. For example, if X${\displaystyle [1..m]=abcbdab}$ and Y${\displaystyle [1..n]=bdcaba}$, the lcs is Z${\displaystyle [1..r]=bcba}$. By inserting the non-lcs symbols while preserving the symbol order, we get the SCS: U${\displaystyle [1..t]=abdcabdab}$.

It is quite clear that ${\displaystyle r+t=m+n}$ for two input sequences. However, for three or more input sequences this does not hold. Note also, that the lcs and the SCS problems are not dual problems.

## Shortest common superstring

The closely related problem of finding a string which is a superstring of a finite set of strings S = { s1,s2,...,sn } is NP-Complete.[1] Also, good (constant factor) approximations have been found for the average case but not for the worst case.[2][3] However, it can be formulated as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring S. One can then use the O(log(n))-approximation for weighted set-cover to obtain an O(log(n))-approximation for the shortest superstring (note that this is not a constant factor approximation).

For any string x in this alphabet, define P(x) to be the set of all strings which are substrings of x. The instance I of set cover is formulated as follows:

• Let M be empty.
• For each pair of strings si and sj, if the last k symbols of si are the same as the first k symbols of sj, then add a string to M that consists of the concatenation with maximal overlap of si with sj.
• Define the universe ${\displaystyle {\mathcal {U}}}$ of the set cover instance to be S
• Define the set of subsets of the universe to be { P(x) | xSM }
• Define the cost of each subset P(x) to be |x|, the length of x.

The instance I can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings x for which the weighted set cover algorithm outputs P(x).[4]

### Example

Consider the set S = { abc, cde, fab }, which becomes the universe of the weighted set cover instance. In this case, M = { abcde, fabc }. Then the set of subsets of the universe is

{\displaystyle {\begin{aligned}\{P(x)|x\in S\cup M\}&=\{P(x)|x\in \{abc,cde,fab,abcde,fabc\}\}\\&=\{P(abc),P(cde),P(fab),P(abcde),P(fabc)\}\}\\&=\{\{a,b,c,ab,bc,abc\},\{c,d,e,cd,de,cde\},\ldots ,\{f,a,b,c,fa,ab,bc,fab,abc,fabc\}\}\}\\\end{aligned}}}

which have costs 3, 3, 3, 5, and 4, respectively.

## References

1. ^ Kari-Jouko Räihä, Esko Ukkonen (1981). "The shortest common supersequence problem over binary alphabet is NP-complete". Theoretical Computer Science. 16 (2): 187–198. doi:10.1016/0304-3975(81)90075-x.
2. ^ Tao Jiang and Ming Li (1994). "On the Approximation of Shortest Common Supersequences and Longest Common Subsequences". SIAM Journal on Computing. 24 (5): 1122–1139. doi:10.1137/s009753979223842x.
3. ^ Marek Karpinski and Richard Schmied (2013). "On Improved Inapproximability Results for the Shortest Superstring and Related Problems". Proceedings of 19th CATS CRPIT. 141: 27–36.
4. ^ Vazirani, p. 20.