Reservoir sampling: Difference between revisions
→Statistical properties: Fix Tillé's accent |
Added random sort based algorithm, cleaned up pseudocode, added section for Algorithm-R |
||
Line 4: | Line 4: | ||
}} |
}} |
||
'''Reservoir sampling''' is a family of [[randomized algorithm]]s for randomly choosing a sample of ''k'' items from a list ''S'' containing ''n'' items, where ''n'' is either a very large or unknown number. Typically ''n'' is large enough that the list doesn't fit into [[main memory]]. |
'''Reservoir sampling''' is a family of [[randomized algorithm]]s for randomly choosing a [[Sampling (statistics)|sample]] of ''k'' items from a list ''S'' containing ''n'' items, where ''n'' is either a very large or unknown number. Typically ''n'' is large enough that the list doesn't fit into [[main memory]]. |
||
== Algorithm R == |
|||
This simple O(''n'') algorithm as described in the ''Dictionary of Algorithms and Data Structures''<ref>[http://xlinux.nist.gov/dads/HTML/reservoirSampling.html Dictionary of Algorithms and Data Structures]</ref> consists of the following steps (assuming that the arrays are one-based, and that the number of items to select, ''k'', is smaller than the size of the source array, ''S''): |
The most common example was labelled ''Algorithm R'' by [[Jeffrey Vitter]] in his paper on the subject.<ref>{{cite journal|last1=Vitter|first1=Jeffrey S.|title=Random sampling with a reservoir|journal=ACM Transactions on Mathematical Software|date=1 March 1985|volume=11|issue=1|pages=37–57|doi=10.1145/3147.3165|url=http://www.cs.umd.edu/~samir/498/vitter.pdf}}</ref>. This simple O(''n'') algorithm as described in the ''Dictionary of Algorithms and Data Structures''<ref>[http://xlinux.nist.gov/dads/HTML/reservoirSampling.html Dictionary of Algorithms and Data Structures]</ref> consists of the following steps (assuming that the arrays are one-based, and that the number of items to select, ''k'', is smaller than the size of the source array, ''S''): |
||
<source lang="pascal"> |
|||
/* |
|||
S has items to sample, R will contain the result |
|||
*/ |
|||
ReservoirSample(S[1..n], R[1..k]) |
|||
⚫ | |||
for i = 1 to k |
|||
⚫ | |||
⚫ | |||
for i = k+1 to n |
|||
⚫ | |||
if j <= k |
|||
⚫ | |||
</source> |
|||
'''array''' ''R''[''k'']; <span style="color:grey;">// result</span> |
|||
'''integer''' ''i'', ''j''; |
|||
⚫ | |||
'''for each''' ''i'' '''in''' 1 '''to''' ''k'' '''do''' |
|||
⚫ | |||
'''done'''; |
|||
⚫ | |||
'''for each''' ''i'' '''in''' ''k''+1 '''to length'''(''S'') '''do''' |
|||
⚫ | |||
'''if''' ''j'' <= ''k'' '''then''' |
|||
⚫ | |||
'''fi''' |
|||
'''done''' |
|||
The algorithm creates a "reservoir" array of size ''k'' and populates it with the first ''k'' items of ''S''. It then iterates through the remaining elements of ''S'' until ''S'' is exhausted. At the ''i''<sup>th</sup> element of ''S'', the algorithm generates a random number ''j'' between 1 and ''i''. If ''j'' is less than or equal to ''k'', the ''j''<sup>th</sup> element of the reservoir array is replaced with the ''i''<sup>th</sup> element of ''S''. In effect, for all ''i'', the ''i''<sup>th</sup> element of ''S'' is chosen to be included in the reservoir with probability ''k/i''. Similarly, at each iteration the ''j''<sup>th</sup> element of the reservoir array is chosen to be replaced with probability ''1/k * k/i'', which simplifies to ''1/i''. It can be shown that when the algorithm has finished executing, each item in ''S'' has equal probability (i.e. ''k/length(S)'') of being chosen for the reservoir. |
The algorithm creates a "reservoir" array of size ''k'' and populates it with the first ''k'' items of ''S''. It then iterates through the remaining elements of ''S'' until ''S'' is exhausted. At the ''i''<sup>th</sup> element of ''S'', the algorithm generates a random number ''j'' between 1 and ''i''. If ''j'' is less than or equal to ''k'', the ''j''<sup>th</sup> element of the reservoir array is replaced with the ''i''<sup>th</sup> element of ''S''. In effect, for all ''i'', the ''i''<sup>th</sup> element of ''S'' is chosen to be included in the reservoir with probability ''k/i''. Similarly, at each iteration the ''j''<sup>th</sup> element of the reservoir array is chosen to be replaced with probability ''1/k * k/i'', which simplifies to ''1/i''. It can be shown that when the algorithm has finished executing, each item in ''S'' has equal probability (i.e. ''k/length(S)'') of being chosen for the reservoir. |
||
To see this, consider the following proof by [[Mathematical induction|induction]]. After the (i-1)<sup>th</sup> round, let us assume, the probability of a number being in the reservoir array is ''k/(i-1)''. Since the probability of the number being replaced in the i<sup>th</sup>round is ''1/i'', the probability that it survives the i<sup>th</sup> round is ''(i-1)/i''. Thus, the probability that a given number is in the reservoir after the i<sup>th</sup> round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)<sup>th</sup> round, and surviving replacement in the i<sup>th</sup> round. This is ''(k/(i-1)) * ((i-1)/i)=k/i''. Hence, the result holds for ''i'', and is therefore true by induction. |
To see this, consider the following proof by [[Mathematical induction|induction]]. After the (i-1)<sup>th</sup> round, let us assume, the probability of a number being in the reservoir array is ''k/(i-1)''. Since the probability of the number being replaced in the i<sup>th</sup>round is ''1/i'', the probability that it survives the i<sup>th</sup> round is ''(i-1)/i''. Thus, the probability that a given number is in the reservoir after the i<sup>th</sup> round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)<sup>th</sup> round, and surviving replacement in the i<sup>th</sup> round. This is ''(k/(i-1)) * ((i-1)/i)=k/i''. Hence, the result holds for ''i'', and is therefore true by induction. |
||
== Reservoir with Random Sort == |
|||
A simple reservoir-based algorithm can be designed using random sort<ref name=Sunter1977>{{cite journal|last1=Sunter|first1=A. B.|title=List Sequential Sampling with Equal or Unequal Probabilities without Replacement|journal=Applied Statistics|date=1977|volume=26|issue=3|pages=261|doi=10.2307/2346966}}</ref> and implemented using priority queue data structure. This algorithm assigns random number as keys to each item and maintain k items with minimum value for keys. In essence, this is equivalent to assigning a random number to each item as key, sorting items using these keys and taking top k items. The run time of the algoritm is <math>O(n \log n)</math> which is worse then Algorithm R however this algorithm can easily be extended to weighted sampling. Note that both algorithms can operate on streams of unspecified lengths. |
|||
<source lang="pascal"> |
|||
/* |
|||
S is a stream of items to sample, R will contain the result |
|||
S.Current returns current item in stream |
|||
S.Next advances stream to next position |
|||
max-priority-queue supports: |
|||
Count -> number of items in priority queue |
|||
Maximum() -> returns maximum key value of all items |
|||
Extract-Max() -> Remove the item with max key |
|||
Insert(key, Item) -> Adds item with specified key |
|||
*/ |
|||
ReservoirSample(S[1..?], R[1..k]) |
|||
H = new max-priority-queue |
|||
while S has data |
|||
r = Random(0,1) // important: inclusive range |
|||
if H.Count < k |
|||
H.Insert(r, S.Current) |
|||
else |
|||
if H.Maximum > r |
|||
H.Extract-Max() |
|||
H.Insert(r, S.Current) |
|||
S.Next |
|||
</source> |
|||
== Relation to Fisher-Yates shuffle == |
== Relation to Fisher-Yates shuffle == |
Revision as of 06:21, 23 July 2015
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
|
Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.
Algorithm R
The most common example was labelled Algorithm R by Jeffrey Vitter in his paper on the subject.[1]. This simple O(n) algorithm as described in the Dictionary of Algorithms and Data Structures[2] consists of the following steps (assuming that the arrays are one-based, and that the number of items to select, k, is smaller than the size of the source array, S):
/*
S has items to sample, R will contain the result
*/
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i); // important: inclusive range
if j <= k
R[j] := S[i]
The algorithm creates a "reservoir" array of size k and populates it with the first k items of S. It then iterates through the remaining elements of S until S is exhausted. At the ith element of S, the algorithm generates a random number j between 1 and i. If j is less than or equal to k, the jth element of the reservoir array is replaced with the ith element of S. In effect, for all i, the ith element of S is chosen to be included in the reservoir with probability k/i. Similarly, at each iteration the jth element of the reservoir array is chosen to be replaced with probability 1/k * k/i, which simplifies to 1/i. It can be shown that when the algorithm has finished executing, each item in S has equal probability (i.e. k/length(S)) of being chosen for the reservoir. To see this, consider the following proof by induction. After the (i-1)th round, let us assume, the probability of a number being in the reservoir array is k/(i-1). Since the probability of the number being replaced in the ithround is 1/i, the probability that it survives the ith round is (i-1)/i. Thus, the probability that a given number is in the reservoir after the ith round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)th round, and surviving replacement in the ith round. This is (k/(i-1)) * ((i-1)/i)=k/i. Hence, the result holds for i, and is therefore true by induction.
Reservoir with Random Sort
A simple reservoir-based algorithm can be designed using random sort[3] and implemented using priority queue data structure. This algorithm assigns random number as keys to each item and maintain k items with minimum value for keys. In essence, this is equivalent to assigning a random number to each item as key, sorting items using these keys and taking top k items. The run time of the algoritm is which is worse then Algorithm R however this algorithm can easily be extended to weighted sampling. Note that both algorithms can operate on streams of unspecified lengths.
/*
S is a stream of items to sample, R will contain the result
S.Current returns current item in stream
S.Next advances stream to next position
max-priority-queue supports:
Count -> number of items in priority queue
Maximum() -> returns maximum key value of all items
Extract-Max() -> Remove the item with max key
Insert(key, Item) -> Adds item with specified key
*/
ReservoirSample(S[1..?], R[1..k])
H = new max-priority-queue
while S has data
r = Random(0,1) // important: inclusive range
if H.Count < k
H.Insert(r, S.Current)
else
if H.Maximum > r
H.Extract-Max()
H.Insert(r, S.Current)
S.Next
Relation to Fisher-Yates shuffle
Suppose one wanted to draw k random cards from a deck of playing cards (i.e., n=52). A natural approach would be to shuffle the deck and then take the top k cards. In the general case, the shuffle also needs to work even if the number of cards in the deck is not known in advance, a condition which is satisfied by the inside-out version of the Fisher-Yates shuffle:
To initialize an array a of n elements to a randomly shuffled copy of S, both 0-based:
a[0] ← S[0]
for i from 1 to n - 1 do
r ← random (0 .. i)
a[i] ← a[r]
a[r] ← S[i]
Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly:
To initialize an array a to k random elements of S (which is of length n), both 0-based:
a[0] ← S[0]
for i from 1 to k - 1 do
r ← random (0 .. i)
a[i] ← a[r]
a[r] ← S[i]
for i from k to n - 1 do
r ← random (0 .. i)
if (r < k) then a[r] ← S[i]
Since the order of the first k cards is immaterial, the first loop can be removed and a can be initialized to be the first k items of S. This yields Algorithm R.
Example implementation
The following is a simple implementation of the algorithm in Python that samples the set of English Wikipedia page titles:
import random
SAMPLE_COUNT = 10
# Force the value of the seed so the results are repeatable
random.seed(12345)
sample_titles = []
for index, line in enumerate(open("enwiki-20091103-all-titles-in-ns0")):
# Generate the reservoir
if index < SAMPLE_COUNT:
sample_titles.append(line)
else:
# Randomly replace elements in the reservoir
# with a decreasing probability.
# Choose an integer between 0 and index (inclusive)
r = random.randint(0, index)
if r < SAMPLE_COUNT:
sample_titles[r] = line
print sample_titles
Statistical properties
Probabilities of selection of the reservoir methods are discussed in Chao (1982)[4] and Tillé (2006).[5] While the first-order selection probabilities are equal to k/n (or, in case of Chao's procedure, to an arbitrary set of unequal probabilities), the second order selection probabilities depend on the order in which the records are sorted in the original reservoir. The problem is overcome by the cube sampling method of Deville and Tillé (2004).[6]
Limitations
Reservoir sampling makes the assumption that the desired sample fits into main memory, often implying that k is a constant independent of n. In applications where we would like to select a large subset of the input list (say a third, i.e. k=n/3), other methods need to be adopted. Distributed implementations for this problem have been proposed.[7]
See also
References
- ^ Vitter, Jeffrey S. (1 March 1985). "Random sampling with a reservoir" (PDF). ACM Transactions on Mathematical Software. 11 (1): 37–57. doi:10.1145/3147.3165.
- ^ Dictionary of Algorithms and Data Structures
- ^ Sunter, A. B. (1977). "List Sequential Sampling with Equal or Unequal Probabilities without Replacement". Applied Statistics. 26 (3): 261. doi:10.2307/2346966.
- ^ Chao, M.T. (1982) A general purpose unequal probability sampling plan. Biometrika, 69 (3): 653-656.
- ^ Tillé, Y. (2006). Sampling Algorithms. Springer
- ^ Deville, J.-C., and Y. Tillé (2004). Efficient balanced sampling: The cube method. Biometrika 91 (4): 893-912.
- ^ Reservoir Sampling in MapReduce