Jump to content

Linear hashing

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by BattyBot (talk | contribs) at 04:04, 24 October 2023 (Split control: Fixed CS1 errors: extra text: edition and general fixes). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980.[1] [2] It has been analyzed by Baeza-Yates and Soza-Pollman.[3] It is the first in a number of schemes known as dynamic hashing [3] [4] such as Larson's Linear Hashing with Partial Extensions, [5] Linear Hashing with Priority Splitting, [6] Linear Hashing with Partial Expansions and Priority Splitting, [7] or Recursive Linear Hashing. [8]

The file structure of a dynamic hashing data structure adapts itself to changes in the size of the file, so expensive periodic file reorganization is avoided.[4] A Linear Hashing file expands by splitting a pre-determined bucket into two and contracts by merging two predetermined buckets into one. The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or load factor (number of records divided by the number of buckets) moving outside of a predetermined range.[1] In Linear Hashing there are two types of buckets, those that are to be split and those already split. While extendible hashing splits only overflowing buckets, spiral hashing (a.k.a. spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split.[5]

Linear Hashing has also been made into a scalable distributed data structure, LH*. In LH*, each bucket resides at a different server. [9] LH* itself has been expanded to provide data availability in the presence of failed buckets. [10] Key based operations (inserts, deletes, updates, reads) in LH and LH* take maximum constant time independent of the number of buckets and hence of records. [1][10]

Algorithm details

Records in LH or LH* consists of a key and a content, the latter basically all the other attributes of the record.[1][10] They are stored in buckets. For example, in Ellis' implementation, a bucket is a linked list of records.[2] The file allows the key based CRUD operations create or insert, read, update, and delete as well as a scan operations that scans all records, for example to do a database select operation on a non-key attribute.[10] Records are stored in buckets whose numbering starts with 0.[10]

Hash functions

In order to access a record with key , a family of hash functions, called collectively a dynamic hash function is applied to the key . At any time, at most two hash functions and are used. A typical example uses the division modulo x operation. If the original number of buckets is , then the family of hash functions is [10]

File expansion

As the file grows through insertions, it expands gracefully through the splitting of one bucket into two buckets. The sequence of buckets to split is predetermined. This is the fundamental difference to schemes like Fagin's extendible hashing. [11] For the two new buckets, the hash function is replaced with . The number of the bucket to be split is part of the file state and called the split pointer .[10]

Split control

An uncontrolled split occurs when a split is performed whenever a bucket overflows.

Controlled splitting occurs if a split is performed whenever the load factor, which is monitored by the file, exceeds a predetermined threshold.[10] If the hash index uses controlled splitting, the buckets are allowed to overflow by using linked overflow blocks. When the records/buckets ratio surpasses a set threshold, the designated bucket is split. This threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets).[12]

Addressing

Addressing is based on the file state, consisting of the split pointer and the level . If the level is , then the hash functions used are and .

The LH algorithm for hashing key is[10]

if

Splitting

When a bucket is split, split pointer and possibly the level are updated according to

.[10]

For example, if numerical records are inserted into the hash index according to their farthest right binary digits, the buckets are split according to which one they correspond with. Thus, if we have the buckets labelled as 000, 001, 10, 11, 100, 101, we would split the bucket 10 because we are appending and creating the next sequential bucket 110.[12]

File contraction

If under controlled splitting the load factor sinks below a threshold, a merge operation is triggered. The merge operation undoes the last split, also resetting the file state. [10]

File state calculation

The file state consists of split pointer and level . If the original file started with buckets, then the number of buckets and the file state are related via [13]

LH*

The main contribution of LH* is to allow a client of an LH* file to find the bucket where the record resides even if the client does not know the file state. Clients in fact store their version of the file state, which is initially just the knowledge of the first bucket, namely Bucket 0. Based on their file state, a client calculates the address of a key and sends a request to that bucket. At the bucket, the request is checked and if the record is not at the bucket, it is forwarded. In a reasonably stable system, that is, if there is only one split or merge going on while the request is processed, it can be shown that there are at most two forwards. After a forward, the final bucket sends an Image Adjustment Message to the client whose state is now closer to the state of the distributed file.[10] While forwards are reasonably rare for active clients, their number can be even further reduced by additional information exchange between servers and clients [13]

Adoption in language systems

Griswold and Townsend [14] discussed the adoption of linear hashing in the Icon language. They discussed the implementation alternatives of dynamic array algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.

Adoption in database systems

Linear hashing is used in the Berkeley database system (BDB), which in turn is used by many software systems, using a C implementation derived from the CACM article and first published on the Usenet in 1988 by Esmond Pitt.

References

  1. ^ a b c d Litwin, Witold (1980), "Linear hashing: A new tool for file and table addressing" (PDF), Proc. 6th Conference on Very Large Databases: 212–223
  2. ^ a b Ellis, Carla Schlatter (June 1987), "Concurrency in Linear Hashing", ACM Transactions on Database Systems, 12 (2): 195–217, doi:10.1145/22952.22954, S2CID 14260177
  3. ^ a b Baeza-Yates, Ricardo; Soza-Pollman, Hector (1998), "Analysis of Linear Hashing Revised" (PDF), Nordic Journal of Computing: 70–85, S2CID 7497598, archived from the original (PDF) on 2019-03-07
  4. ^ a b Enbody, Richard; Du, HC (June 1988), "Dynamic hashing schemes", ACM Computing Surveys, 20 (2): 85–113, doi:10.1145/46157.330532, S2CID 1437123
  5. ^ a b Larson, Per-Åke (April 1988), "Dynamic Hash Tables", Communications of the ACM, 31 (4): 446–457, doi:10.1145/42404.42410, S2CID 207548097
  6. ^ Ruchte, Willard; Tharp, Alan (Feb 1987), "Linear hashing with Priority Splitting: A method for improving the retrieval performance of linear hashing", IEEE Third International Conference on Data Engineering: 2–9
  7. ^ Manolopoulos, Yannis; Lorentzos, N. (1994), "Performance of linear hashing schemes for primary key retrieval", Information Systems, 19 (5): 433–446, doi:10.1016/0306-4379(94)90005-1
  8. ^ Ramamohanarao, K.; Sacks-Davis, R. (Sep 1984), "Recursive linear hashing", ACM Transactions on Databases, 9 (3): 369–391, doi:10.1145/1270.1285, S2CID 18577730
  9. ^ Litwin, Witold; Neimat, Marie-Anne; Schneider, Donavan A. (1993), "LH: Linear Hashing for distributed files", ACM SIGMOD Record, 22 (2): 327–336, doi:10.1145/170036.170084, S2CID 259938726
  10. ^ a b c d e f g h i j k l Litwin, Witold; Moussa, Rim; Schwarz, Thomas (Sep 2005), "LH*RS - a highly-available scalable distributed data structure", ACM Transactions on Database Systems, 30 (3): 769–811, doi:10.1145/1093382.1093386, S2CID 1802386
  11. ^ Fagin, Ronald; Nievergelt, Jurg; Pippenger, Nicholas; Strong, Raymond (Sep 1979), "Extendible Hashing - A Fast Access Method for Dynamic Files", ACM Transactions on Database Systems, 4 (2): 315–344, doi:10.1145/320083.320092, S2CID 2723596
  12. ^ a b Silberschatz, Abraham; Korth, Henry F.; Sudarshan, S. (2020). Database system concepts (Seventh ed.). New York, NY: McGraw-Hill Education. ISBN 978-1-260-08450-4.
  13. ^ a b Chabkinian, Juan; Schwarz, Thomas (2016), "Fast LH*", International Journal of Parallel Programming, 44 (4): 709–734, doi:10.1007/s10766-015-0371-8, S2CID 7448240
  14. ^ Griswold, William G.; Townsend, Gregg M. (April 1993), "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon", Software: Practice and Experience, 23 (4): 351–367, doi:10.1002/spe.4380230402, S2CID 11595927

See also