Talk:Ternary search tree
WikiProject Computer science
This article is the subject of an educational assignment at College of Engineering Pune, supported by Wikipedia Ambassadors through the India Education Program during the 2011 Q3 term. Further details are available on the course page.
I would like to dispute the claim that a ternary search tree is faster than hashing "for many typical search problems". I believe this claim is based on the paper located at:
..which the Dr. Dobb's article is also based on. This paper compares against a very slow hashtable implementation. In particular, the authors use a hashtable size that is not prime (thus increasing the chance of collisions) and a load factor of 1.0, whereas most high-performance implementations use around 0.75 (see Java's Hashtable, for instance). A higher load factor increases collisions and decreases performance. They also use chaining, whereas fast hashtables (for example, Google's freely available code) keep part of the chain in the table itself, almost always eliminating an extra cache miss. A typical hashtable lookup may incur only 2-4 cache misses in total, so this is a huge difference.
Finally, the test setup is biased towards TST performance. They do lookups one after another in a tight loop and in sorted order, meaning that most of the ~9 nodes the TST has to traverse for a successful lookup are already in the CPU cache. For unsuccessful lookups they change the first letter of a string, thus ensuring the best possible results for the TST (which always compares parts of the string in order). For an isolated lookup (as when a program runs a stretch of other code and then does a few query/add operations), a hash table will have far fewer cache misses than a TST.
Given the above, perhaps the original article creator, or somebody with in-depth knowledge, can elaborate in the article on which search problems are actually faster with each structure.
I don't know how relevant this is (only found this article by googling 'digital trie'), but people looking to compare hashing to a ternary search tree may find this interesting: http://unthought.net/c++/c_vs_c++.html --MaXiMiUS 22:47, 17 August 2007 (UTC)
-- When storing a ternary tree on disk, I/O time has a large impact and your argument holds true.
I am familiar with the article by Mr. Sedgewick and Mr. Bentley, and believe the reason they claim that the ternary tree is "faster than hashing" is that:
1. they assume the ternary tree will be stored in RAM, and
2. the overhead of calculating a hash can be fairly heavy (depending on the demands placed on the hash), whereas using the bytes of the key itself, as the ternary tree does, is trivial.
Apparently the performance lost to the overhead of the hash calculation outweighs the performance gained by not having any collisions.
Also, because all read/write operations in a ternary tree start at its root and tend to follow the same paths, it is highly suitable for caching even if relatively slow storage is used.
It is also worth keeping in mind that a ternary tree allows traversing its contents in sorted order and performing partial-match key searches, which a hash table (typically) will not allow. Although this seems more an issue of functionality than one of performance, 'many typical search problems' does include partial matches, case insensitivity, and traversing the data set in sorted order. Using a hash table, such operations would be (much) less efficient, even with O(1) performance for full key matches. —Preceding unsigned comment added by 18.104.22.168 (talk) 14:25, 19 March 2010 (UTC)
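To make the sorted-order point concrete, here is a minimal sketch (illustrative Python; the names and keys are my own, not from the article or the paper): an in-order walk of a ternary search tree yields its stored keys in lexicographic order, something a plain hash table cannot do without sorting its keys separately.

```python
# Minimal ternary search tree sketch (illustrative, not the article's code).
# Walking lo, then the current character (and its eq subtree), then hi
# produces the stored keys in sorted order.

class Node:
    def __init__(self, ch):
        self.ch, self.end = ch, False          # end: a key terminates here
        self.lo = self.eq = self.hi = None

def insert(node, s, i=0):
    if node is None:
        node = Node(s[i])
    if s[i] < node.ch:
        node.lo = insert(node.lo, s, i)        # same character position
    elif s[i] > node.ch:
        node.hi = insert(node.hi, s, i)        # same character position
    elif i + 1 < len(s):
        node.eq = insert(node.eq, s, i + 1)    # matched: next character
    else:
        node.end = True
    return node

def walk(node, prefix=""):
    """In-order traversal: yields keys in lexicographic order."""
    if node is None:
        return
    yield from walk(node.lo, prefix)
    if node.end:
        yield prefix + node.ch
    yield from walk(node.eq, prefix + node.ch)
    yield from walk(node.hi, prefix)

root = None
for key in ["us", "at", "by", "as", "he"]:     # inserted unsorted
    root = insert(root, key)
print(list(walk(root)))                        # ['as', 'at', 'by', 'he', 'us']
```

Because the lo and hi children hold smaller and larger characters at the same string position, an in-order visit of each little binary tree falls out sorted for free.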
"consists of a series of binary searches, one for each character in the string" - there can be more than one node visit (comparison) for each character. To see how, look at the u in "us" in the example.
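This can be seen in a sketch of the search loop (illustrative Python; the tree contents here are my own, not the article's example): moves to the lo or hi child re-examine the same character of the query, so matching a single character can take several node visits.

```python
# Illustrative TST search that counts node visits per query character.
# Only the eq move advances the character index; lo/hi moves stay on
# the same character, so visits[i] can exceed 1.

class Node:
    def __init__(self, ch):
        self.ch, self.end = ch, False
        self.lo = self.eq = self.hi = None

def insert(node, s, i=0):
    if node is None:
        node = Node(s[i])
    if s[i] < node.ch:
        node.lo = insert(node.lo, s, i)
    elif s[i] > node.ch:
        node.hi = insert(node.hi, s, i)
    elif i + 1 < len(s):
        node.eq = insert(node.eq, s, i + 1)
    else:
        node.end = True
    return node

def search(node, s):
    """Return (found, visits), where visits[i] counts the nodes examined
    while matching character s[i]."""
    visits = [0] * len(s)
    i = 0
    while node is not None:
        visits[i] += 1
        if s[i] < node.ch:
            node = node.lo          # same character, another node
        elif s[i] > node.ch:
            node = node.hi          # same character, another node
        elif i + 1 < len(s):
            node = node.eq          # matched: advance to next character
            i += 1
        else:
            return node.end, visits
    return False, visits

root = None
for key in ["at", "as", "by", "he", "us"]:
    root = insert(root, key)
found, visits = search(root, "us")
print(found, visits)                # True [4, 1]: four visits just for 'u'
```

With this insertion order the first character of "us" has to walk down a chain of hi links before it finds a matching node, while the second character matches immediately: exactly the "more than one per character" behaviour noted above.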