The Count–min sketch (or CM sketch) is a probabilistic sub-linear space streaming algorithm which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan.
The data structure is parameterized by the constants and which determine the time and space needs and the probability of error of the queries. The algorithm needs a two dimensional array, called here count, with columns and rows. A series of hash functions must be randomly drawn from a pairwise independent hash function family, each associated with a row in the array.
For later convenience we assign and , where the error in answering a query is within a factor of with probability .
When a new value arrives we update as follows: , . That is, for each row we take the corresponding hash function, apply it to the newly received value and add one to the column corresponding to the hash value.
The array can then be used to estimate any of several different statistics at any point. If we want to estimate, for instance, the number of times for a specific value appeared so far in the stream we would compute (this assumes all added values are positive). This estimate has the guarantee that with probability .
Small modifications to the data structure can be used to sketch other different stream statistics.
- Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.