Count–min sketch

From Wikipedia, the free encyclopedia
  (Redirected from Count-Min sketch)
Jump to: navigation, search

The Count–min sketch (or CM sketch) is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan.[1]

Count–min sketches are somewhat similar to Bloom filters; the main distinction is that Bloom filters represent sets, while CM sketches represent multisets and frequency tables.

Algorithm[edit]

Setup[edit]

The data structure is parameterized by the constants w and d which determine the time and space needs and the probability of error of the queries. The algorithm needs a two dimensional array, called here count, with w columns and d rows. A series of d hash functions must be randomly drawn from a pairwise independent hash function family, each associated with a row in the array.

For later convenience we assign w = \lceil e/\epsilon \rceil and  d = \lceil \ln{1/\delta} \rceil, where the error in answering a query is within a factor of \epsilon with probability \delta.

Update[edit]

When a new value a arrives we update as follows: \forall j : 1 \leq j \leq d,  count[j,h_j(a)] \leftarrow count[j,h_j(a)] + 1 . That is, for each row we take the corresponding hash function, apply it to the newly received value and add one to the column corresponding to the hash value.

Query[edit]

The array can then be used to estimate any of several different statistics at any point. If we want to estimate, for instance, the number of times  a_i for a specific value i appeared so far in the stream we would compute \hat a_i=\min_j count[j,h_j(i)] (this assumes all added values are positive). This estimate has the guarantee that \hat a_i \leq a_i + \epsilon |a| with probability 1-\delta.

Small modifications to the data structure can be used to sketch other different stream statistics.

External links[edit]

See also[edit]

References[edit]

  1. ^ Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.