# Count–min sketch

The Count–min sketch (or CM sketch) is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan.[1]

Count–min sketches are somewhat similar to Bloom filters; the main distinction is that Bloom filters represent sets, while CM sketches represent multisets and frequency tables. Spectral Bloom filters with multi-set policy, are conceptually isomorphic to the Count-Min Sketch.[2]

## Algorithm

### Setup

The data structure is parameterized by the constants $w$ and $d$ which determine the time and space needs and the probability of error of the queries. The algorithm needs a two dimensional array, called here count, with $w$ columns and $d$ rows. A series of $d$ hash functions must be randomly drawn from a pairwise independent hash function family, each associated with a row in the array.

For later convenience we assign $w = \lceil e/\epsilon \rceil$ and $d = \lceil \ln{1/\delta} \rceil$, where the error in answering a query is within a factor of $\epsilon$ with probability $\delta$.

### Update

When a new value $a$ arrives we update as follows: $\forall j : 1 \leq j \leq d$, $count[j,h_j(a)] \leftarrow count[j,h_j(a)] + 1$. That is, for each row we take the corresponding hash function, apply it to the newly received value and add one to the column corresponding to the hash value.

### Query

The array can then be used to estimate any of several different statistics at any point. If we want to estimate, for instance, the number of times $a_i$ for a specific value $i$ appeared so far in the stream we would compute $\hat a_i=\min_j count[j,h_j(i)]$ (this assumes all added values are positive). This estimate has the guarantee that $\hat a_i \leq a_i + \epsilon |a|$ with probability $1-\delta$.

Small modifications to the data structure can be used to sketch other different stream statistics.