Primary clustering

From Wikipedia, the free encyclopedia

In computer programming, primary clustering is one of two major failure modes of hash tables based on open addressing, especially those using linear probing. It occurs after a hash collision, in which two records hash to the same position, forces one of the records to be moved to the next location in its probe sequence. Once this happens, the cluster formed by this pair of records is more likely to grow by the addition of even more colliding records, regardless of whether the new records hash to the same location as the first two. This phenomenon makes searches for keys within the cluster take longer.[1]

For instance, in linear probing, a record involved in a collision is always moved to the next available hash table cell after the position given by its hash function, creating a contiguous cluster of occupied hash table cells. Whenever another record hashes to any cell within the cluster, the cluster grows by at least one cell (more if the newly filled cell joins it to a neighboring cluster). Because of this phenomenon, it is likely that a linear-probing hash table with a constant load factor (that is, with the size of the table proportional to the number of items it stores) will have some clusters of logarithmic length, and searching for keys within such a cluster will take logarithmic time.[2]
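The growth of a cluster under linear probing can be illustrated with a minimal sketch. It assumes integer keys and a deliberately simple hash function (key modulo table size); the names insert_linear and longest_cluster are hypothetical helpers chosen for this example, not part of any library.

```python
def insert_linear(table, key):
    """Insert an integer key using linear probing.

    Assumes the hash function is key % len(table) (an illustrative choice).
    Returns the number of cells examined, including the one finally used.
    """
    m = len(table)
    home = key % m
    for step in range(m):
        i = (home + step) % m
        if table[i] is None:
            table[i] = key
            return step + 1
    raise RuntimeError("table is full")


def longest_cluster(table):
    """Length of the longest run of occupied cells, treating the table as circular."""
    m = len(table)
    if all(cell is not None for cell in table):
        return m
    # Start scanning just after an empty cell so that no run wraps past the start.
    start = next(i for i, cell in enumerate(table) if cell is None)
    best = run = 0
    for k in range(1, m + 1):
        if table[(start + k) % m] is not None:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


table = [None] * 10
for key in (12, 22, 32):      # all three keys hash to cell 2
    insert_linear(table, key)
probes = insert_linear(table, 13)  # hashes to cell 3, inside the cluster
print(probes, longest_cluster(table))
```

In this run the key 13 does not collide with the home cell of the earlier keys, yet it still has to walk to the end of the cluster and extends it by one cell, which is the behavior described above.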

A related phenomenon, secondary clustering, occurs more generally with open addressing schemes, including linear probing and quadratic probing, in which the probe sequence is determined entirely by the initial hash location rather than by the rest of the key, as well as in hash chaining. In this phenomenon, a low-quality hash function may cause many keys to hash to the same location, after which they all follow the same probe sequence or are placed in the same hash chain, causing them to have slow access times.[1]
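As a small illustration of secondary clustering, the following sketch lists the cells visited under quadratic probing. It assumes integer keys, a hash function of key modulo table size, and the common probe offsets 0, 1, 4, 9, ...; the function name quadratic_probes is hypothetical.

```python
def quadratic_probes(key, m, count):
    """First `count` cells probed for `key` under quadratic probing.

    Assumes hash(key) = key % m and probe offsets j*j (illustrative choices).
    """
    home = key % m
    return [(home + j * j) % m for j in range(count)]


# Keys 5 and 16 share the home cell 5 in a table of size 11,
# so every later probe coincides as well.
print(quadratic_probes(5, 11, 4))
print(quadratic_probes(16, 11, 4))
```

Because the sequence depends only on the home cell, any two keys that collide initially keep competing for the same cells on every subsequent probe.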

Both types of clustering may be reduced by using a higher-quality hash function, or by using a hashing method such as double hashing that is less susceptible to clustering.[1]
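A minimal sketch of double hashing shows why it is less susceptible to clustering. It assumes integer keys, a prime table size m, a first hash of key % m, and a second hash of 1 + key % (m - 1) that sets the step size; these are common textbook choices used here for illustration, and the function name double_hash_probes is hypothetical.

```python
def double_hash_probes(key, m, count):
    """First `count` cells probed for `key` under double hashing.

    Assumes m is prime, so every step in 1..m-1 is coprime to m and the
    probe sequence can reach every cell of the table.
    """
    home = key % m
    step = 1 + key % (m - 1)  # second hash; never zero
    return [(home + j * step) % m for j in range(count)]


# Keys 5 and 16 collide on their first probe (cell 5) but then diverge,
# because their step sizes differ (6 and 7 respectively).
print(double_hash_probes(5, 11, 4))
print(double_hash_probes(16, 11, 4))
```

Two keys follow the same probe sequence here only if they agree on both hash values, which is much less likely than agreeing on the first hash alone.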

References

  1. Smith, Peter (2004), Applied Data Structures with C++, Jones & Bartlett Learning, pp. 186–188, ISBN 9780763725624.
  2. Pittel, B. (1987), "Linear probing: the probable largest search time grows logarithmically with the number of records", Journal of Algorithms, 8 (2): 236–249, doi:10.1016/0196-6774(87)90040-X, MR 890874.