Shortest common supersequence problem

(Redirected from Shortest common supersequence)

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X or Y.

A shortest common supersequence (SCS) is a common supersequence of minimal length. In the shortest common supersequence problem, the two sequences X and Y are given and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique.

For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily. For example, if X${\displaystyle [1..m]=abcbdab}$ and Y${\displaystyle [1..n]=bdcaba}$, the lcs is Z${\displaystyle [1..r]=bcba}$. By inserting the non-lcs symbols while preserving the symbol order, we get the SCS: U${\displaystyle [1..t]=abdcabdab}$.

It is quite clear that ${\displaystyle r+t=m+n}$ for two input sequences. However, for three or more input sequences this does not hold. Note also, that the LCS and the SCS problems are not dual problems.

Shortest common superstring

The closely related problem of finding a string which is a superstring of a finite set of strings S = { s1,s2,...,sn } is NP-Complete.[1] Also, good (constant factor) approximations have been found for the average case but not for the worst case.[2][3] However, it can be formulated as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring S. One can then use the O(log(n))-approximation for weighted set-cover to obtain an O(log(n))-approximation for the shortest superstring (note that this is not a constant factor approximation).

For any string x in this alphabet, define P(x) to be the set of all strings which are substrings of x. The instance I of set cover is formulated as follows:

• Let M be empty.
• For each pair of strings si and sj, if the last k symbols of si are the same as the first k symbols of sj, then add a string to M that consists of the concatenation with maximal overlap of si with sj.
• Define the universe ${\displaystyle {\mathcal {U}}}$ of the set cover instance to be S
• Define the set of subsets of the universe to be { P(x) | xSM }
• Define the cost of each subset P(x) to be |x|, the length of x.

The instance I can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings x for which the weighted set cover algorithm outputs P(x).[4]

Example

Consider the set S = { abc, cde, fab }, which becomes the universe of the weighted set cover instance. In this case, M = { abcde, fabc }. Then the set of subsets of the universe is

{\displaystyle {\begin{aligned}\{P(x)|x\in S\cup M\}&=\{P(x)|x\in \{abc,cde,fab,abcde,fabc\}\}\\&=\{P(abc),P(cde),P(fab),P(abcde),P(fabc)\}\}\\&=\{\{a,b,c,ab,bc,abc\},\{c,d,e,cd,de,cde\},\ldots ,\{f,a,b,c,fa,ab,bc,fab,abc,fabc\}\}\}\\\end{aligned}}}

which have costs 3, 3, 3, 5, and 4, respectively.

References

1. ^ Kari-Jouko Räihä, Esko Ukkonen (1981). "The shortest common supersequence problem over binary alphabet is NP-complete". Theoretical Computer Science. 16 (2): 187–198. doi:10.1016/0304-3975(81)90075-x.
2. ^ Tao Jiang and Ming Li (1994). "On the Approximation of Shortest Common Supersequences and Longest Common Subsequences". SIAM Journal on Computing. 24 (5): 1122–1139. doi:10.1137/s009753979223842x.
3. ^ Marek Karpinski and Richard Schmied (2013). "On Improved Inapproximability Results for the Shortest Superstring and Related Problems". Proceedings of 19th CATS CRPIT. 141: 27–36.
4. ^ Vazirani, p. 20.