LCP array: Difference between revisions

Content deleted Content added

Inline

Revision as of 14:09, 29 August 2012

In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array, which stores the length of the longest common prefix between every pair of consecutive suffixes in the suffix array.

Augmenting the suffix array with the LCP array allows to efficiently simulate top-down and bottom-up traversals of the suffix tree^[1]^[2], speeds up pattern matching on the suffix array^[3] and is a prerequisite for compressed suffix trees^[4].

History

The LCP array was introduced by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.^[3]

Definition

Let SA be the suffix array of the string $S=s_{1},s_{2},...s_{n}\$$ and let $\operatorname {lcp} (v,w)$ denote the length of the longest common prefix between two strings $v$ and $w$ . Let further denote $S[i,j]$ the substring of $S$ ranging from $i$ to $j$ .

Then the LCP array $H[1,n]$ is an integer array of size $n$ such that $H[1]$ is undefined and $H[i]=\operatorname {lcp} (S[A[i-1],n],S[A[i],n])$ for every $1<i\leq n$ . Thus $H[i]$ stores the length of longest common prefix of the lexicographically $(i-1)$ 'st and $i$ th smallest suffix of $S$ .

Example

Consider the string $S=banana\$$ :

i	1	2	3	4	5	6	7
S[i]	b	a	n	a	n	a	$

and its corresponding suffix array $A$ :

i	1	2	3	4	5	6	7
A[i]	7	6	4	2	1	5	3

Then the LCP array $H$ is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:

i	1	2	3	4	5	6	7
H[i]	$\bot$	0	1	3	0	0	2

So, for example, $H[4]=3$ is the length of the longest common prefix $ana$ shared by the suffixes $A[3]=S[4,7]=ana\$$ and $A[4]=S[2,7]=anana\$$ . Note that $H[1]=\bot$ , since there is not lexicographically smaller suffix.

Efficient Construction Algorithms

LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values.

Manber & Myers (1993) provide an algorithm to compute the LCP array alongside the suffix array in $O(n\log n)$ time. Kärkkäinen & Sanders (2003) show that it is also possible to modify their $O(n)$ time algorithm such that it computes the LCP array as well. Kasai et al. (2001) presents the first $O(n)$ time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.

Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of $13n$ bytes, while the original output (text, suffix array, LCP array) only occupies $9n$ bytes. Therefore Manzini (2004) created a refined version of the algorithm of Kasai et al. (2001) (lcp9) and reduced the space occupancy to $9n$ bytes. Kärkkäinen, Manzini & Puglisi (2009) provide another refinement of Karsai's algorithm ( $\Phi$ -algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the permuted LCP (PLCP) array, in which the values appear in text order rather than lexicographical order.

Ohlebusch & Gog (2011) provide two algorithms that although being theoretically slow ( $O(n^{2})$ ) were faster than the above mentioned algorithms in practice.^[5]

As of 2012, the currently fastest linear-time LCP array construction algorithm is due to Fischer (2011), which in turn is based on one of the fastest suffix array construction algorithms by Nong, Zhang & Chang (2009).

Applications

Suffix Tree Construction

Given the suffix array $A$ and the LCP array $H$ of a string $S=s_{1},s_{2},...s_{n}\$$ of length $n+1$ , its suffix tree $ST$ can be constructed in $O(n)$ time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.

Let $ST_{i}$ be the partial suffix tree for $0\leq i\leq n$ . Further let $d(v)$ be the length of the concatenation of all path labels from the root of $ST_{i}$ to note $v$ .

Start with $S_{0}$ , the tree consisting only of the root. To insert $A[i+1]$ into $ST_{i}$ , walk up the rightmost path beginning at the recently inserted leaf $A[i]$ to the root, until the deepest node $v$ with $d(v)\leq H[i+1]$ is reached.

We need to distinguish two cases:

$d(v)=H[i+1]$ : This means that the concatenation of the labels on the root-to- $v$ path equals the longest common prefix of suffixes $A[i]$ and $A[i+1]$ .
In this case, insert $A[i+1]$ as a new leaf $x$ of node $v$ and label the edge $(v,x)$ with $S[A[i+1]+H[i+1],n]$ . Thus the edge label consists of the remaining characters of suffix $A[i+1]$ that are not already represented by the concatenation of the labels of the root-to- $v$ path.
This creates the partial suffix tree $ST_{i+1}$ .

$d(v)<H[i+1]$ : This means that the concatenation of the labels on the root-to- $v$ path displays less characters than the longest common prefix of suffixes $A[i]$ and $A[i+1]$ and the missing characters are contained in the edge label of $v$ 's rightmost edge. Therefore we have to split up that edge as follows:
Let $w$ be the child of $v$ on $ST_{i}$ 's rightmost path.

Delete the edge $(v,w)$ .
Add a new internal node $y$ and a new edge $(v,y)$ with label $S[A[i]+d(v),A[i]+H[i+1]-1]$ . The new label consists of the missing characters of the longest common prefix of $A[i]$ and $A[i+1]$ . Thus, the concatenation of the labels of the root-to- $y$ path now displays the longest common prefix of $A[i]$ and $A[i+1]$ .
Connect $w$ to the newly created internal node $y$ by an edge $(y,w)$ that is labeled $S[A[i]+H[i+1],A[i]+d(w)-1]$ . The new label consists of the remaining characters of the deleted edge $(v,w)$ that were not used as the label of edge $(v,y)$ .
Add $A[i+1]$ as a new leaf $x$ and connect it to the new internal node $y$ by an edge $(y,x)$ that is labeled $S[A[i+1]+H[i+1],n]$ . Thus the edge label consists of the remaining characters of suffix A[i+1] that are not already represented by the concatenation of the labels of the root-to- $v$ path.
This creates the partial suffix tree $ST_{i+1}$ .

A simple amortization argument shows that the running time of this algorithm is bounded by $O(n)$ :

The nodes that are traversed in step $i$ by walking up the rightmost path of $ST_{i}$ (apart from the last node $v$ ) are removed from the rightmost path, when $A[i+1]$ is added to the tree as a new leaf and these nodes will never be traversed again for all subsequent steps $j>i$ . Therefore, at most $2n$ nodes will be traversed in total.

TODO: add picture that visualizes the above mentioned steps

Simulating top-down traversal

References

Abouelhoda, Mohamed Ibrahim; Ohlebusch, Enno; Ohlebusch, Enno (2004). "Replacing suffix trees with enhanced suffix arrays". J. of Discrete Algorithms. 2 (1): 53–86. doi:10.1016/S1570-8667(03)00065-0. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
Manber, U.; Myers, G. (1993). "Suffix Arrays: A New Method for On-Line String Searches". SIAM Journal on Computing. 22 (5): 935–948. doi:10.1137/0222058. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
Kasai, Toru; Lee, Gunho; Arimura, Hiroki; Arikawa, Setsuo; Park, Kunsoo (2001). "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications". Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching: 181–192. ISBN 3-540-42271-4. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
Ohlebusch, Enno; Fischer, Johannes; Gog, Simon (2010). "CST++". String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings. Lecture Notes in Computer Science. 6393: 322–333. doi:10.1007/978-3-642-16321-0_34. ISBN 978-3-642-16320-3. {{cite journal}}: Invalid |ref=harv (help)
Kärkkäinen, Juha; Sanders, Peter (2003). "Simple linear work suffix array construction". Proceedings of the 30th international conference on Automata, languages and programming: 943–955. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
Fischer, Johannes (2011). "Inducing the LCP-Array". arXiv:1101.3448v1 [cs.DS].
Manzini, Giovanni (2004). "Two space saving tricks for linear time LCP computation". Proc. SWAT. Volume 3111 of Lecture Notes in Computer Science. Lecture Notes in Computer Science. 3111: 372–383. doi:10.1007/978-3-540-27810-8_32. ISBN 978-3-540-22339-9. {{cite journal}}: Invalid |ref=harv (help)
Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1007/978-3-642-02441-2_17, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1007/978-3-642-02441-2_17 instead.
Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1007/978-3-540-92182-0_14, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1007/978-3-540-92182-0_14 instead.
Gog, Simon; Ohlebusch, Enno (2011). "Fast and Lightweight LCP-Array Construction Algorithms" (PDF). Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2011: 25–34. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1109/DCC.2009.42, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1109/DCC.2009.42 instead.

[Kasai2001-1] Kasai et al. (2001)

[2] Abouelhoda & Ohlebusch (2004)

[Myers1993-3] Manber & Myers (1993)

[4] Ohlebusch, Fischer & Gog (2010)

[5] Ohlebusch & Gog (2011)

[1]

[2]

[3]

[4]

[5]

@@ Line 105: / Line 105: @@
 ==References==
+*{{cite journal|ref=harv
-*{{Citation
  | title = Replacing suffix trees with enhanced suffix arrays
  | doi = 10.1016/S1570-8667(03)00065-0
@@ Line 119: / Line 119: @@
  | last3 = Ohlebusch
  | first3 = Enno	}}
+*{{cite journal|ref=harv
-*{{Citation
  | title = Suffix Arrays: A New Method for On-Line String Searches
  | doi = 10.1137/0222058
@@ Line 131: / Line 131: @@
  | last2 =  Myers	 | first2 =  G.
  | accessdate = 2012-08-28	}}
+*{{cite journal|ref=harv
-*{{Citation
  | title = Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
  | url = http://dl.acm.org/citation.cfm?id=647820.736222}
@@ Line 144: / Line 144: @@
  | last5 =  Park	 | first5 =  Kunsoo
  | accessdate = 2012-08-28	}}
+*{{cite journal|ref=harv
-*{{Citation
  | title = CST++
  | year = 2010
@@ Line 156: / Line 156: @@
  | series = Lecture Notes in Computer Science
  | isbn = 978-3-642-16320-3	}}
+*{{cite journal|ref=harv
-*{{Citation
  | title =  Simple linear work suffix array construction
  | url = http://dl.acm.org/citation.cfm?id=1759210.1759301}
@@ Line 165: / Line 165: @@
  | last2 =  Sanders	 | first2 =  Peter
  | accessdate = 2012-08-28	}}
-* {{cite arXiv | eprint=1101.3448v1 | author1=Johannes Fischer | title=Inducing the LCP-Array | class=cs.DS | year=2011 }}
+* {{cite arXiv | eprint=1101.3448v1 | author=Fischer, Johannes | title=Inducing the LCP-Array | class=cs.DS | year=2011 }}
+*{{cite journal|ref=harv
-*{{Citation
  | title = Two space saving tricks for linear time LCP computation
  | year = 2004
@@ Line 179: / Line 179: @@
 *{{cite doi | 10.1007/978-3-642-02441-2_17}}
 *{{cite doi | 10.1007/978-3-540-92182-0_14}}
+*{{cite journal|ref=harv
-*{{Citation
  | title = Fast and Lightweight LCP-Array Construction Algorithms
  | url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf