Jump to content

LCP array: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Line 105: Line 105:


==References==
==References==
*{{cite journal|ref=harv
*{{Citation
| title = Replacing suffix trees with enhanced suffix arrays
| title = Replacing suffix trees with enhanced suffix arrays
| doi = 10.1016/S1570-8667(03)00065-0
| doi = 10.1016/S1570-8667(03)00065-0
Line 119: Line 119:
| last3 = Ohlebusch
| last3 = Ohlebusch
| first3 = Enno }}
| first3 = Enno }}
*{{cite journal|ref=harv
*{{Citation
| title = Suffix Arrays: A New Method for On-Line String Searches
| title = Suffix Arrays: A New Method for On-Line String Searches
| doi = 10.1137/0222058
| doi = 10.1137/0222058
Line 131: Line 131:
| last2 = Myers | first2 = G.
| last2 = Myers | first2 = G.
| accessdate = 2012-08-28 }}
| accessdate = 2012-08-28 }}
*{{cite journal|ref=harv
*{{Citation
| title = Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
| title = Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
| url = http://dl.acm.org/citation.cfm?id=647820.736222}
| url = http://dl.acm.org/citation.cfm?id=647820.736222}
Line 144: Line 144:
| last5 = Park | first5 = Kunsoo
| last5 = Park | first5 = Kunsoo
| accessdate = 2012-08-28 }}
| accessdate = 2012-08-28 }}
*{{cite journal|ref=harv
*{{Citation
| title = CST++
| title = CST++
| year = 2010
| year = 2010
Line 156: Line 156:
| series = Lecture Notes in Computer Science
| series = Lecture Notes in Computer Science
| isbn = 978-3-642-16320-3 }}
| isbn = 978-3-642-16320-3 }}
*{{cite journal|ref=harv
*{{Citation
| title = Simple linear work suffix array construction
| title = Simple linear work suffix array construction
| url = http://dl.acm.org/citation.cfm?id=1759210.1759301}
| url = http://dl.acm.org/citation.cfm?id=1759210.1759301}
Line 165: Line 165:
| last2 = Sanders | first2 = Peter
| last2 = Sanders | first2 = Peter
| accessdate = 2012-08-28 }}
| accessdate = 2012-08-28 }}
* {{cite arXiv | eprint=1101.3448v1 | author1=Johannes Fischer | title=Inducing the LCP-Array | class=cs.DS | year=2011 }}
* {{cite arXiv | eprint=1101.3448v1 | author=Fischer, Johannes | title=Inducing the LCP-Array | class=cs.DS | year=2011 }}
*{{cite journal|ref=harv
*{{Citation
| title = Two space saving tricks for linear time LCP computation
| title = Two space saving tricks for linear time LCP computation
| year = 2004
| year = 2004
Line 179: Line 179:
*{{cite doi | 10.1007/978-3-642-02441-2_17}}
*{{cite doi | 10.1007/978-3-642-02441-2_17}}
*{{cite doi | 10.1007/978-3-540-92182-0_14}}
*{{cite doi | 10.1007/978-3-540-92182-0_14}}
*{{cite journal|ref=harv
*{{Citation
| title = Fast and Lightweight LCP-Array Construction Algorithms
| title = Fast and Lightweight LCP-Array Construction Algorithms
| url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf
| url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf

Revision as of 14:09, 29 August 2012

In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array, which stores the length of the longest common prefix between every pair of consecutive suffixes in the suffix array.

Augmenting the suffix array with the LCP array allows to efficiently simulate top-down and bottom-up traversals of the suffix tree[1][2], speeds up pattern matching on the suffix array[3] and is a prerequisite for compressed suffix trees[4].

History

The LCP array was introduced by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.[3]

Definition

Let SA be the suffix array of the string and let denote the length of the longest common prefix between two strings and . Let further denote the substring of ranging from to .

Then the LCP array is an integer array of size such that is undefined and for every . Thus stores the length of longest common prefix of the lexicographically 'st and th smallest suffix of .

Example

Consider the string :

i 1 2 3 4 5 6 7
S[i] b a n a n a $

and its corresponding suffix array  :

i 1 2 3 4 5 6 7
A[i] 7 6 4 2 1 5 3

Then the LCP array is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:

i 1 2 3 4 5 6 7
H[i] 0 1 3 0 0 2

So, for example, is the length of the longest common prefix shared by the suffixes and . Note that , since there is not lexicographically smaller suffix.

Efficient Construction Algorithms

LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values.

Manber & Myers (1993) provide an algorithm to compute the LCP array alongside the suffix array in time. Kärkkäinen & Sanders (2003) show that it is also possible to modify their time algorithm such that it computes the LCP array as well. Kasai et al. (2001) presents the first time algorithm (FLAAP) that computes the LCP array given the text and the suffix array.

Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of bytes, while the original output (text, suffix array, LCP array) only occupies bytes. Therefore Manzini (2004) created a refined version of the algorithm of Kasai et al. (2001) (lcp9) and reduced the space occupancy to bytes. Kärkkäinen, Manzini & Puglisi (2009) provide another refinement of Karsai's algorithm (-algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the permuted LCP (PLCP) array, in which the values appear in text order rather than lexicographical order.

Ohlebusch & Gog (2011) provide two algorithms that although being theoretically slow () were faster than the above mentioned algorithms in practice.[5]

As of 2012, the currently fastest linear-time LCP array construction algorithm is due to Fischer (2011), which in turn is based on one of the fastest suffix array construction algorithms by Nong, Zhang & Chang (2009).

Applications

Suffix Tree Construction

Given the suffix array and the LCP array of a string of length , its suffix tree can be constructed in time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array.

Let be the partial suffix tree for . Further let be the length of the concatenation of all path labels from the root of to note .

Start with , the tree consisting only of the root. To insert into , walk up the rightmost path beginning at the recently inserted leaf to the root, until the deepest node with is reached.

We need to distinguish two cases:

  • : This means that the concatenation of the labels on the root-to- path equals the longest common prefix of suffixes and .
    In this case, insert as a new leaf of node and label the edge with . Thus the edge label consists of the remaining characters of suffix that are not already represented by the concatenation of the labels of the root-to- path.
    This creates the partial suffix tree .
  • : This means that the concatenation of the labels on the root-to- path displays less characters than the longest common prefix of suffixes and and the missing characters are contained in the edge label of 's rightmost edge. Therefore we have to split up that edge as follows:
    Let be the child of on 's rightmost path.
  1. Delete the edge .
  2. Add a new internal node and a new edge with label . The new label consists of the missing characters of the longest common prefix of and . Thus, the concatenation of the labels of the root-to- path now displays the longest common prefix of and .
  3. Connect to the newly created internal node by an edge that is labeled . The new label consists of the remaining characters of the deleted edge that were not used as the label of edge .
  4. Add as a new leaf and connect it to the new internal node by an edge that is labeled . Thus the edge label consists of the remaining characters of suffix A[i+1] that are not already represented by the concatenation of the labels of the root-to- path.
  5. This creates the partial suffix tree .

A simple amortization argument shows that the running time of this algorithm is bounded by :

The nodes that are traversed in step by walking up the rightmost path of (apart from the last node ) are removed from the rightmost path, when is added to the tree as a new leaf and these nodes will never be traversed again for all subsequent steps . Therefore, at most nodes will be traversed in total.

TODO: add picture that visualizes the above mentioned steps

Simulating top-down traversal

Simulating top-down traversal

Notes

  1. ^ Kasai et al. (2001)
  2. ^ Abouelhoda & Ohlebusch (2004)
  3. ^ a b Manber & Myers (1993)
  4. ^ Ohlebusch, Fischer & Gog (2010)
  5. ^ Ohlebusch & Gog (2011)

References

  • Abouelhoda, Mohamed Ibrahim; Ohlebusch, Enno; Ohlebusch, Enno (2004). "Replacing suffix trees with enhanced suffix arrays". J. of Discrete Algorithms. 2 (1): 53–86. doi:10.1016/S1570-8667(03)00065-0. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
  • Manber, U.; Myers, G. (1993). "Suffix Arrays: A New Method for On-Line String Searches". SIAM Journal on Computing. 22 (5): 935–948. doi:10.1137/0222058. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
  • Kasai, Toru; Lee, Gunho; Arimura, Hiroki; Arikawa, Setsuo; Park, Kunsoo (2001). "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications". Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching: 181–192. ISBN 3-540-42271-4. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
  • Ohlebusch, Enno; Fischer, Johannes; Gog, Simon (2010). "CST++". String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings. Lecture Notes in Computer Science. 6393: 322–333. doi:10.1007/978-3-642-16321-0_34. ISBN 978-3-642-16320-3. {{cite journal}}: Invalid |ref=harv (help)
  • Kärkkäinen, Juha; Sanders, Peter (2003). "Simple linear work suffix array construction". Proceedings of the 30th international conference on Automata, languages and programming: 943–955. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
  • Fischer, Johannes (2011). "Inducing the LCP-Array". arXiv:1101.3448v1 [cs.DS].
  • Manzini, Giovanni (2004). "Two space saving tricks for linear time LCP computation". Proc. SWAT. Volume 3111 of Lecture Notes in Computer Science. Lecture Notes in Computer Science. 3111: 372–383. doi:10.1007/978-3-540-27810-8_32. ISBN 978-3-540-22339-9. {{cite journal}}: Invalid |ref=harv (help)
  • Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1007/978-3-642-02441-2_17, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1007/978-3-642-02441-2_17 instead.
  • Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1007/978-3-540-92182-0_14, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1007/978-3-540-92182-0_14 instead.
  • Gog, Simon; Ohlebusch, Enno (2011). "Fast and Lightweight LCP-Array Construction Algorithms" (PDF). Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2011: 25–34. Retrieved 2012-08-28. {{cite journal}}: Invalid |ref=harv (help)
  • Attention: This template ({{cite doi}}) is deprecated. To cite the publication identified by doi: 10.1109/DCC.2009.42, please use {{cite journal}} (if it was published in a bona fide academic journal, otherwise {{cite report}} with |doi= 10.1109/DCC.2009.42 instead.