Talk:Suffix tree

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated C-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 


[Untited][edit]

Comment in the article history accompanying an edit from [[patricia tree]] to [[radix tree]]

(Terminology: "patricia trie" redirects to "radix tree". Therefore I consider the last term to be the preferred one.)

The edit might be appropriate, but the stated logic is dubious. The most appropriate textual display term should be chosen before interlinking is considered; redirects do not reliably indicate normative usage, rarely distinguish precise usage, and frequently derive from incumbency effects.

This edit should be reviewed for semantic validity. MaxEnt 00:41, 2 June 2006 (UTC)

Important History[edit]

This is considered to be an important historical development in the field of algorithms, so I felt it was important to give a fuller account of the historical developments and major contributions. People coming to the article might know of this construction in association with any of the principle innovations, so mentioning each of them helps to contextualize the developments in terms of their prior reference point. MaxEnt 01:20, 2 June 2006 (UTC)

Should be merged with Suffix trie?[edit]

Is Suffix trie about the same thing as this? Maybe the title of the other article is misspelt?152.66.129.162 10:48, 16 April 2007 (UTC)

In a suffix tree, all internal nodes are branching. In a suffix trie, they are not, but all edges have length 1. A suffix trie has nodes, while a suffix tree has . Klem fra Nils Grimsmo 05:52, 19 April 2007 (UTC)
Do you have a reference for using "suffix trie" to mean this separate concept, and for the use of uncompressed tries as a useful data structure? The data structure you describe is useful only pedagogically as a step towards describing suffix trees, as far as I know. I've redirected that article to this one, because I don't see the value of a separate article on a useless data structure. —David Eppstein (talk) 18:12, 3 April 2008 (UTC)

those dashed(fixme) lines are weird a little[edit]

at first i though it is not a tree at all :P then i realized they just have to help you. but still i do not see how. those dashes are good. —Preceding unsigned comment added by 84.16.123.194 (talk) 03:21, 24 February 2008 (UTC)

This phrase leaves the reader somewhat hanging:[edit]

Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.

This suggests that there are other ways of finding LCS's, maybe discovered even earlier. But the LCS article only lists suffix tries and a matrix oriented solution, which is quadratic. Shinobu (talk) 20:18, 5 October 2008 (UTC)

This is outright incorrect phrase. According to Gusfield, the first linear-time algorithm was proposed by Weiner!! Then this algorithm was improved by MacCreight and Ukkonen.User:itman —Preceding undated comment added 02:42, 22 August 2010 (UTC).

Applications[edit]

Variants of the LZW compression schemes use it (LZSS).

LZSS is not variant of LZW (LZ78), but LZ77. 83.5.92.106 (talk) 23:53, 8 March 2009 (UTC)

format for references with pages[edit]

This article has a number of references of the form "([6] page 90)" where there are multiple references to different pages of the same book. Is this really the way we want it to look? RJFJR (talk) 15:12, 8 August 2012 (UTC)

I don't think it is. I changed them to what I think is a better format. —David Eppstein (talk) 02:54, 9 August 2012 (UTC)

Image[edit]

The image at the top of the article contains internal links as dotted lines. But those links aren't part of the suffix tree, they are part of the method of constructing the tree (and they are used in Ukkonen's linear time construction method while Weiner's method of linear time construction uses a different data structure.) When I first started on the article I couldn't make sense of them even though I spent some time on it (I had to get a book listed in the references to find out what they were for.) Since they aren't part of the tree (once it is constructed they can be discarded even if you are using Ukkonen's algorithm) I think we should get a diagram that doesn't contain the internal links and we can save the current image for discussing Ukkonen's algorithm. RJFJR (talk) 01:12, 26 September 2012 (UTC)

Usage is very unclear[edit]

This article does not explain how to use suffix trees, only that they are of a great benefit. For example - if searching a string of size n for a substring of size m can be done with O(m), a simple algorithm should be given. Otherwise it is left as an exercise for the reader, defeating the purpose of the article. Thanks! --Yurik (talk) 10:07, 6 October 2012 (UTC)

Sounds reasonable to me. Here's a draft of how to search for a target string using a suffix tree:
To search for a particular target string using a suffix tree begin at the root of the tree and follow the path that matches the target. If at any point it is impossible to progress for the target then the target does not exist anywhere in the string represented by the suffix tree and you can stop. If the entire target is found in the tree beginning at the root then it indicates it the target exists in at least one place. However the location information is stored only in the leaves of the tree. From the end of the target string the program must now follow all the children, noting the distance from the end of the target to the leaf. The number of characters from the end of the target to a leaf indicates how far from the end of the string being searched the target ended and the number of leaves is the number of occurences of the target.
RJFJR (talk) 16:05, 24 November 2012 (UTC)
Looks very good! We might also want to have some sort of a pseudo-algorithm in addition to the textual description, or it might be overkill. Thanks!!! --Yurik (talk) 08:38, 30 November 2012 (UTC)
I added a paragraph giving an 'example' of using the chart at the top of the article. How's this?
To search for a particular target string using a suffix tree begin at the root of the tree and follow the path that matches the target. If at any point it is impossible to progress for the target then the target does not exist anywhere in the string represented by the suffix tree and you can stop. If the entire target is found in the tree beginning at the root then it indicates it the target exists in at least one place. However the location information is stored only in the leaves of the tree. From the end of the target string the program must now follow all the children, noting the distance from the end of the target to the leaf. The number of characters from the end of the target to a leaf indicates how far from the end of the string being searched the target ended and the number of leaves is the number of occurences of the target.
For example consider the suffix tree illustrated at the top of the article for the seven character string 'BANANA$'. To search for the string 'ANA' begin at the root. Follow the A link and then the NA link. The entire string being searched for has been followed indicating at least one occurrence in the string being searched. On the other hand, searching for the string ANNA would follow the A link and then not find a link labelled NN indicating the string being search for does not exist. Thus the maximum processing time to determine presence of a string is proportional to the length of the string being searched for, independent of the length of the text being search (after the suffix tree has been built.) Continuing the example of searching for ANA we can recursively follow all the links below the node reached by ANA noting the length of the labels on the links used to reach each leaf. The left branch is labelled with 1 character, '$'. There for one occurrence of ANA occurs at 7-(3+1) or beginning at the fourth character of the text. Following the other link there is a three character label to the leaf so another instance of ANA begins at 7-(3+3) or beginning at the 2nd character.
We may need to clarify that in the calculation of where the two occurrences start the 7 is the length of the text and the 3 is the length of ANA. Let me know how it oooks over all and then we can add it to the article if it looks good. RJFJR (talk) 00:03, 3 March 2013 (UTC)

I believe that some sort of clear and understandable description of how the search algorithm works is a good idea, but the draft above seems a bit too close to WP:NOTHOWTO to me — it might be improved by a little more editing, aimed at focusing less on leading the reader through the steps of the search procedure and more on just describing them. —David Eppstein (talk) 02:38, 3 March 2013 (UTC)

Agree, it is a very good description, but needs to be slightly less wordy. Would a pseudo-code help? Also, why not edit directly in the article, this is not a critical page, its ok for it to be slightly less than perfect while it is getting into a better shape. --Yurik (talk) 13:05, 3 March 2013 (UTC)
I don't think it is really howto, it's intended as an example. Does anyone have an example of pseudocode for some tree algorithm? I can't figure out how to write a reasonable pseudocode to explain this (I keep getting stuck trying to think of a good notation for what I want to describe.) RJFJR (talk) 13:33, 9 March 2013 (UTC)


How's this:

Basic algorithms

Three basic questions about the content of the text the suffix tree that can be answered easily are 1) is this target string anywhere in the text 2) how many times does it occur 3)where does it occur. The question of does it occur can be answered by walking the tree from the root. If at any point the next character of the target string can not be found then the target does not occur; if the entire string can be found from the root of the tree then the target occurs at least one in the text. To determine how many times the target occurs do a depth first search of the tree beginning at tree node the target ends at and count the number of leaves encountered. To determine where the target occurs in the text do a depth first search of the tree keeping track of the number of characters in the tree to a leaf, this is the distance from the end of the text.

RJFJR (talk) 15:17, 9 November 2013 (UTC)

Your methods for the second two problems are wrong (because the whole point of a suffix tree over other methods like KMP is to make queries take time proportional to the query string, and by doing a depth first search you are taking too much time). The answers should be precomputed and stored in the nodes so you can just look them up. Also, your "to determine where" description is vague to the point where I can't tell what you intend by it and can't figure out how someone would go about implementing it. —David Eppstein (talk) 16:31, 9 November 2013 (UTC)
I like your idea of adding a leaf count to internal nodes if 'number of occurrences' is a common action, but it isn't part of the definition of a suffix tree. The previous draft had a lot more detail but was considered too long, to detailed and a bit 'how-to'. RJFJR (talk) 15:47, 10 November 2013 (UTC)


Absolute value of string lengths[edit]

A question about the following paragraph:

Assume that a suffix tree has been built for the string of length , or that a generalised suffix tree has been built for the set of strings of total length .

Why are we taking the absolute value of etc? It seems like the length of a string must be non-negative. Scientific29 (talk) 04:57, 2 April 2014 (UTC)

I think that indicates 'length of' rather than 'absolute value'. RJFJR (talk) 20:21, 12 June 2014 (UTC)

It should be changed to abs(S1), or just n1 Jmichael ll (talk) 05:40, 2 July 2017 (UTC)

Definitely not — that would be a type mismatch. Absolute values are defined for numbers, not strings. would be ok, because the vertical bars are a standard notation for string length. But, because the individual lengths are used later, and sort-of-defined in this sentence, I think would be best. —David Eppstein (talk) 06:35, 2 July 2017 (UTC)