Talk:Bloom filter: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 04:37, 30 October 2013

A Graph

The false positive probability p as a function of the number of elements n in the filter and the filter size m. An optimal number of hash functions k = (m/n) ln 2 has been assumed.

Maybe the following graph could be added to the "Probability of false positives" section. --Jerz4835 (talk) 12:29, 12 March 2008 (UTC)[reply]


this graph is weird. why is there no linear probability axis? i mean, this graph suggests that this is a shitty algorithm? the probability for false positives is really steep... --78.53.219.53 (talk) 13:00, 21 April 2010 (UTC)[reply]

The graph isn't really showing how "good" the bloom filter is, rather how various filter sizes and set sizes relate to the false positive rate. The probability axis emphasizes the area between 0 and 0.01 because differences there are a lot more meaningful than between, say, 0.99 and 1. Unfortunately I agree that it's not very clear at a glance, but I am no good at these types of visualizations and probably couldn't improve on it. Maybe someone else has ideas. Maghnus (talk) 04:11, 13 August 2010 (UTC)[reply]
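For what it's worth, the quantity being plotted can be computed directly, which may help readers interpret the graph; a minimal sketch (the m and n values below are arbitrary illustrations, not from the figure):

```python
import math

def false_positive_rate(n, m):
    """Approximate false positive probability of a Bloom filter holding
    n elements in m bits, assuming the optimal number of hash functions
    k = (m/n) ln 2, rounded to the nearest whole number."""
    k = max(1, round((m / n) * math.log(2)))
    return (1.0 - math.exp(-k * n / m)) ** k

# e.g. 100 elements in a 1000-bit filter (k = 7) gives p of roughly 0.008
```

This also shows why the log-scale probability axis matters: growing m for fixed n drives p down by orders of magnitude, which a linear axis would flatten into invisibility.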

The example

False positives are silly in a spelling checker. I think an example more like [1] should be used, with the secrecy thing mentioned elsewhere. But my English is very bad, so I leave this in the capable hands of brothers and sisters. —Preceding unsigned comment added by 220.233.116.128 (talk) 16:19, 9 January 2008 (UTC)[reply]


Completely agree, if you read the article your mind becomes "stuck" on the example; it is so silly you wonder if you misunderstood the concept. 80.79.35.68 (talk) 17:16, 11 January 2008 (UTC)Lambert[reply]

Operation Names

I propose changing insert to add and test membership to query. I think add better matches what actually happens and query is just more concise. --Pfunk42 22:42, 1 Jun 2005 (UTC)

Discoverer?

I propose changing discoverer in "...named after its discoverer Burton H. Bloom," to inventor. Algorithms like this aren't 'discovered' so much as invented.

I tried "conceived by" whoever in year whatever, a pattern I've seen elsewhere. Deco 01:21, 5 Jun 2005 (UTC)

Removing from a Bloom filter

It feels like you can't. This should probably be emphasized in the article, as this is a distinction from similar data structures. It's mentioned in the intro, but then never discussed...I don't have the technical expertise to say definitively how to handle this, so I can't write it up myself. -- Metahacker 15:32, 21 Jun 2005 (UTC)

Good idea. I added a short explanation. You think it'll do? Deco 19:46, 21 Jun 2005 (UTC)

While strictly speaking practical removal may not be possible in a simple Bloom filter, two facts presented in the article might use some clarification as to why removal is logically impossible. First, the universe can be represented by a "full" array, and second, Bloom arrays can be ANDed. In cases with a finite data universe, these two facts seem to indicate that I could build the Bloom array of everything but the element I want to delete, then AND that with the Bloom array I have. I can't say that I see this never introducing false negatives, but it does seem worthy of mention one way or the other by someone more apt in these things than myself. JonShops 03:06, 15 April 2007 (UTC)[reply]

The problem with that idea is that the Bloom filter corresponding to "universe minus my element" will almost certainly be identical to the Bloom filter corresponding to "entire universe". You'll never introduce a false negative, but you will usually introduce false positives — that is, if you take the Bloom filter for (A,B,C) and try to subtract out (B) using your method, you'll probably end up with the Bloom filter for (A,B,C). The "removal" didn't actually do anything! Hope this helps; also, check out the other variations mentioned on this Talk page. Some of them support removals, with various restrictions. --Quuxplusone 03:26, 15 April 2007 (UTC)[reply]

I know this discussion is a few years old, but you could technically remove items from a bloom filter if you use a counter array rather than a bit array. It kills the efficiency of the table, but at least you can remove items as long as you don't manage to overflow a given counter (but you'd hit false positives first.) --Sigma 7 (talk) 17:36, 17 April 2010 (UTC)[reply]

Yes, that's already in the article under Bloom filter#Counting filters. —David Eppstein (talk) 18:13, 17 April 2010 (UTC)[reply]
What if you have a second Bloom filter, which stores a set of deleted items? So, check the first Bloom filter to see if the item was (at some point) a member of the set, then check the second Bloom filter to see if it was deleted. Of course, there is a downside: once an item is deleted, you can never add that item again, simply because there's still no way to delete something from the second Bloom filter. Well, unless you have *another* Bloom filter for 'undeleted' items, but it's not a practical solution. For applications where deletion is irrevocable (such as with SSL certificates), this may be an extremely desirable feature. 208.81.212.224 (talk) 19:04, 24 October 2013 (UTC)[reply]
False positives in the second filter could be a problem... They would eliminate one of the more useful properties of Bloom filters, that its errors are only one-sided. —David Eppstein (talk) 19:12, 24 October 2013 (UTC)[reply]
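The two-filter scheme proposed above could be sketched as follows (the hash construction and parameters are illustrative assumptions; note the caveat that a false positive in the deletion filter becomes a false negative overall):

```python
import hashlib

class _Bloom:
    """Plain Bloom filter used as a building block (illustrative sketch:
    the k positions are sliced out of one sha512 digest for brevity)."""
    def __init__(self, m, k):
        self.bits = [0] * m
        self.m, self.k = m, k
    def _positions(self, item):
        digest = hashlib.sha512(item.encode()).digest()
        return [int.from_bytes(digest[2*i:2*i+2], 'big') % self.m
                for i in range(self.k)]
    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class DeletableBloom:
    """Membership filter with irrevocable deletion: one Bloom filter for
    items ever added, a second for items deleted. As the discussion notes,
    an item deleted once can never be re-added."""
    def __init__(self, m, k):
        self.present = _Bloom(m, k)
        self.deleted = _Bloom(m, k)
    def add(self, item):
        self.present.add(item)
    def delete(self, item):
        # Irrevocable: there is no way to remove bits from self.deleted.
        self.deleted.add(item)
    def __contains__(self, item):
        return item in self.present and item not in self.deleted
```

For a use case like certificate revocation, where deletion is meant to be permanent, the irrevocability is a feature rather than a bug.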

Adding and removing in a single-bit Bloom filter

You could implement a voting algorithm in a single bit Bloom filter to allow both adds and deletes, at the cost of admitting false negatives, with an add setting k hash-bits, and a delete clearing k hash bits. The difference from an ordinary Bloom filter would be that the decision threshold would be less than requiring all k bits to be set for a rejection: for example, say, k/2, for say k=20. In this way, although both false-positive and false-negative results would be possible, the threshold factor could be tuned to give low error rates for both types of error.
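A minimal sketch of this voting idea (the hash derivation, m, k, and threshold values are illustrative assumptions, not a published design):

```python
import hashlib

class VotingBloomFilter:
    """Single-bit filter where add() sets k bits, remove() clears k bits,
    and membership requires only `threshold` of the k bits to be set.
    Both false positives and false negatives are possible; the threshold
    trades one error rate against the other."""

    def __init__(self, m, k, threshold):
        self.bits = [0] * m
        self.m, self.k, self.threshold = m, k, threshold

    def _positions(self, item):
        # k positions sliced from one wide digest; an illustrative
        # shortcut (requires k <= 32 for a 64-byte sha512 digest).
        digest = hashlib.sha512(item.encode()).digest()
        return [int.from_bytes(digest[2*i:2*i+2], 'big') % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def remove(self, item):
        # Clearing bits may erase bits shared with other items,
        # which is where the false negatives come from.
        for p in self._positions(item):
            self.bits[p] = 0

    def __contains__(self, item):
        votes = sum(self.bits[p] for p in self._positions(item))
        return votes >= self.threshold
```

With k = 20 and a threshold of 10 as suggested above, a few bits stray-set or stray-cleared by other items no longer flip the answer on their own.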

Has anyone done this? Or are multi-bit Bloom filters simply superior? Inquiring minds want to know. -- The Anome 13:54, September 9, 2005 (UTC)

Good question. I can't imagine why nobody's thought about this already, it's pretty obvious considering the implementation of your average BPP algorithm. I shall have to ask some people about it. Dcoetzee 08:03, 28 May 2007 (UTC)[reply]

Adding and removing in a counting Bloom filter

Riffing on The Anome's suggestion above: You could also implement a "counting" Bloom filter, by using b bits per hash instead of only one. For example, using b=2, you'd replace Bloom's array of m bits with an array of m 2-bit counters. Adding an element would increment all k relevant counters; if any counter was already at 3, it would "stick" there instead of wrapping around. Therefore, a count of 0 represents "absence," and a count of 1 or 2 indicates 1 or 2 entries sharing that counter. A count of 3 indicates at least three entries sharing that counter.

To remove an element, you'd decrement all k relevant counters, except for those counters that contained 3 — such a counter you'd leave at 3, since to decrement it would falsely imply that it was shared by exactly two entries.

This kind of "counting" Bloom filter would have all the problems with false positives of a normal Bloom filter, but it would allow a certain number of removals to proceed correctly. Add enough elements, though, and removals would stop working — you'd be unable to remove any element all of whose counters had maxed out.
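A sketch of this saturating-counter scheme (the position derivation is an illustrative assumption):

```python
import hashlib

class SaturatingCountingBloom:
    """Counting Bloom filter with 2-bit saturating counters as described
    above: a counter sticks at 3 once reached, and remove() never
    decrements a saturated counter, so the filter can falsely report
    presence but never absence."""

    SATURATED = 3

    def __init__(self, m, k):
        self.counts = [0] * m
        self.m, self.k = m, k

    def _positions(self, item):
        # k positions sliced from one digest; illustrative, k <= 32.
        digest = hashlib.sha512(item.encode()).digest()
        return [int.from_bytes(digest[2*i:2*i+2], 'big') % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            if self.counts[p] < self.SATURATED:
                self.counts[p] += 1

    def remove(self, item):
        # Leave saturated counters alone: decrementing one would falsely
        # imply it was shared by exactly two entries.
        for p in self._positions(item):
            if self.counts[p] not in (0, self.SATURATED):
                self.counts[p] -= 1

    def __contains__(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))
```

The failure mode described above falls out directly: once an item's counters have all saturated, no number of removals can clear them.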

I suspect that the added complexity of implementation wouldn't make a "counting" Bloom filter worth the trouble, but it's worth a mention on this Talk page, at least. :) --Quuxplusone, 19:06, 18 March 2006 (UTC)[reply]

Bloomier filter insertion

The bloomier filter insertion description is wrong:

"Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into A0 and then test if the key is in B0. If so, this is a false positive for B0, and the key must also be inserted into A1 recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with A and B reversed.".

So any key in A0 that is a false positive in B0 must be present in A1 (and so on). But a key inserted in A0 might BECOME a false positive in B0 after other insertions into B0. Imagine we insert a key into A0 which is "almost present" in B0 - only one of the k bits is not set; the key is not (yet) a false positive in B0. But after this, maybe we insert a key into B0 that also sets that remaining bit to 1; now the initial key is in A0, it is a false positive in B0, and it is not in A1 - violating the data structure invariants. --Raduberinde 08:09, 28 July 2006 (UTC)[reply]

You're right. My mistake. I will reconsider the original paper. Dcoetzee 04:38, 25 May 2007 (UTC)[reply]

I think the paragraph on insertion in the Bloomier filter section should be removed. The described Bloomier filter is static; it doesn't permit insertion or deletion. The Bloomier filter is constructed globally. I have added a section on dynamic Bloomier filters that support insertions and deletions.

Okay, dynamic is certainly better - and surprisingly simple. We ought to still cite the original paper, but we can discuss the more recent result in detail. Dcoetzee 17:40, 24 July 2007 (UTC)[reply]

You should break this into multiple pages

A bloom filter is a bloom filter - a counting bloom filter is a different beast with enough different permutations to discuss for the rest of eternity.

Break it into two - at least

Generalized

Don't mean to be a wank, but has the "Generalized Bloom Filter" thingy been accepted into a peer-reviewed publication?

Regardless, the author should stick in info on how much bigger than a regular BF a GBF that uses ones and zeros needs to be to have the same false positive rate. Oh, I see. You forgot to mention that because it's A LOT (many times) bigger.  ;) --pfunk42 04:35, 28 May 2007 (UTC)[reply]

Proof of false positive probability is wrong

The events of each particular bit being 1 are not independent. Intuitively, if you know the first bit mapped to 1, it suggests the array is more dense. That is, the probability the second bit maps to 1 given that the first did is greater than the probability the second bit maps to 1 given the first mapped to 0 (or say, given no other information). 128.2.16.192 22:49, 27 September 2007 (UTC)[reply]

First of all, I would call it an "analysis", not a "proof". And, in fact, this is an average-case analysis, because it doesn't take into account the fact that the number of bits set in the Bloom filter after n additions follows a probability distribution, and with random hash functions, the number of set bits determines the false positive probability. Thus, adding n things to a Bloom filter doesn't always result in a certain false positive probability--you might get lucky or unlucky. But the variance is small. There are possibly some other subtle, mostly insignificant inaccuracies in that formula. See this paper for what seems to be a more detailed analysis, but it has not, to my knowledge, been accepted under peer review. --pfunk42 05:32, 30 September 2007 (UTC)[reply]

What are they used for?

What are bloom filters used for? Any applications people can think of should be put in the intro. Fresheneesz 02:09, 16 October 2007 (UTC)[reply]

No. Any applications people can source should be put into later parts of the article, and (per WP:LEDE) mentioned briefly in the intro. —David Eppstein 02:47, 16 October 2007 (UTC)[reply]

meaning of bloom filter

If it gives "false positives" and "correct negatives", then it must be used to find whether an element is *not* in a set (and so one would be 100% sure). Maybe this article could be changed accordingly. 84.16.123.194 (talk) 15:47, 28 January 2008 (UTC)[reply]

Purpose of k hash functions?

Why are there k hash functions? Why would you need more than one? —Preceding unsigned comment added by 66.11.82.73 (talk) 15:07, 15 April 2008 (UTC)[reply]

You try doing it with one hash function and let me know what error bound you obtain.
This was confusing for me as well. Couldn't a single function duplicate the functionality of k hash functions? A single function can, for instance, call 10 different functions internally. Isn't the whole point that the single value ends up in k different places with a more or less even distribution? I can see the need for k different algorithms, maybe - although even then a little modular arithmetic would let one use a single algorithm in a loop k times, which again would duplicate the functionality of using k different algorithms perfectly. So yeah, why the heck do you need k different hash functions? To answer your question, the error bound would be exactly the same. — Preceding unsigned comment added by 209.180.36.114 (talk) 20:31, 27 March 2013 (UTC)[reply]
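As a side note on the question being discussed, there is a well-known construction (due to Kirsch and Mitzenmacher) that derives all k indices from just two base hashes, g_i(x) = h1(x) + i·h2(x) mod m, with essentially the same asymptotic false positive rate as k independent hashes; a sketch (the digest-slicing details are illustrative assumptions):

```python
import hashlib

def bloom_indices(item, k, m):
    """Derive k Bloom filter indices from two base hashes via the
    double-hashing construction g_i(x) = (h1(x) + i * h2(x)) mod m,
    so one loop over a single digest stands in for k hash functions."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], 'big')
    h2 = int.from_bytes(digest[8:16], 'big') | 1  # odd stride mixes better
    return [(h1 + i * h2) % m for i in range(k)]
```

So the poster's intuition is partly right: you do not need k unrelated algorithms. What matters is that the k probed positions are spread as if by independent uniform hashes, which a single hash reused naively (e.g. the same index k times) would not achieve.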

I think it's unnecessarily ambiguous that the diagram in the Algorithm description has k=3 and also a set of 3 elements. To the novice like me it's initially confusing. Anyone agree? --Flexdream (talk) 15:35, 16 May 2012 (UTC)[reply]

Decay?

Can anybody provide a description of various techniques for decaying bloom filters? Google has hits for several different terms, "Scope Decay", "Deterministic Decay", "Time Decaying", etc. I conjecture this is a way to stochastically remove no longer relevant members from the set, but the articles I've found all require various subscriptions. nuffin (talk) 20:25, 26 July 2008 (UTC)[reply]

Filter size (m)

It would be helpful to note that the number of bits in the filter (m) should be a multiple of the expected number of elements (usually called n in the papers). Not as obvious as you might think. Could save someone's afternoon, a note like that... —Preceding unsigned comment added by 208.85.112.97 (talk) 15:15, 24 September 2008 (UTC)[reply]

k for a particular p

An interesting consequence of the formulas presented is that if one wants a particular false positive rate, p, and intends to use the required number of bits (m) to get that rate for the number of entries (n), then the optimal number of hashes is

k = log2(1/p)

regardless of the specifics. Although in some sense obvious (adding one more hash halves the false positive rate), it is a clean, simple and useful result. If you want, say, no more than one false positive out of 1024 tests, and you want to use the minimum number of bits in the filter, you will need at least 10 hashes. This might be worth including in the article. Topher Cooper--65.196.61.126 (talk) 22:12, 22 September 2009 (UTC)[reply]
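The observation above can be checked numerically; a small sketch:

```python
import math

def min_hashes_for_fpr(p):
    """Minimum number of hash functions k needed to reach false positive
    rate p at the optimal operating point, where each additional hash
    roughly halves the rate: k = ceil(log2(1/p))."""
    return math.ceil(math.log2(1.0 / p))

# One false positive in 1024 tests requires at least 10 hashes.
```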

Probability of false positives - demand for sources is a rejection of science and the scientific method.

Everything in the section "Probability of false positives" is verifiable from basic principles of mathematics and science by following the author's reasoning, thus the demand for proper authoritative sources is a rejection of science and the scientific method.

This section is not original research; rather, it is replicated research that each and every one of us should replicate.

Roger Bacon was the first scientist because he insisted on observing things for himself and urged others to do the same. If Wikipedia prohibits people from doing the same, it becomes the holy book of a scientistic religion.

The motto of the Royal society was "Nullius in verba" meaning "Take no one's word for it".

Wikipedia is scientifically and morally wrong to demand that people take someone else's word for things. Replicability, not proper authority is the gold standard of truth. James A. Donald (talk) 02:48, 13 January 2010 (UTC)[reply]

Technically, a citation is required for the jump from (1-(1-1/m)^(kn))^k to (1-e^(-kn/m))^k, since that requires some level of calculus. A better reason for removing the tag is that the section echoed information found in an external link (and also has a hard reference), rather than complaining that Wikipedia doesn't follow the scientific method of more formal peer review. --Sigma 7 (talk) 18:51, 17 April 2010 (UTC)[reply]
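For reference, the calculus step in question is the standard exponential limit; a sketch of the derivation:

```latex
% Probability that a given bit is still 0 after inserting n elements
% with k hash functions into an m-bit array:
\left(1 - \frac{1}{m}\right)^{kn}
  = \exp\!\left(kn \ln\!\left(1 - \frac{1}{m}\right)\right)
  \approx \exp\!\left(-\frac{kn}{m}\right)
  \quad \text{for large } m,
% since \ln(1 - x) = -x + O(x^2). Hence the false positive probability
\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}
  \approx \left(1 - e^{-kn/m}\right)^{k}.
```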

Hash Tables are no longer Hash Tables when they are Bloom Filters

Saying that a Bloom filter with k=1 is a special kind of hash table (in the Space & Time Advantages section) seems counterintuitive. Hash tables are deterministic, and a Bloom filter with k=1 is exactly what is described - not a surprising anomaly arising when a hash table stops checking for collisions and stores single bits rather than full entries. Specifically, "Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1." should be removed for conciseness.

Rbkillea (talk) 22:08, 6 December 2011 (UTC)[reply]

No other data structure?

The page says "Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property."

It is misleading to say that no other data structure has a complexity independent of the number of elements in the set: vEB trees, constant-size Fenwick trees, and segment trees also have this property. Their running time is a function of the universe size, not the set size.

Maybe it needs to be made more clear that a constant running time, which is independent of the number of elements in the set is what is being talked about.

Dhruvbird (talk) 18:27, 2 February 2012 (UTC)[reply]

I agree with @Dhruvbird - and I see this part was changed to the following:

"Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized."

First of all, 'time' in computer science is never exact, due to threading and operating system priority. It then becomes an approximation - nothing ever takes constant 'time' in terms of a computer processor. To use the word 'time' is really to count processor operations, not to measure actual processing time (nanoseconds or milliseconds per lookup). Using the word 'time' is therefore misleading to the average computer user or novice programmer, and it also depends on the definition of 'k', or the variables one uses to describe an algorithm's required computational operations. Here, 'k' counts hash functions, each of which takes at least m additional bitwise operations; therefore, to be more exact, it takes O(m*k) operations, when compared against the operation counts noted for a 'trie' on Wikipedia. Using a common base to define 'time' may save some programmers the trouble of figuring out what is actually faster...
Second, I double-checked @Dhruvbird's vEB trees, Fenwick trees, and segment trees, and found that they also provide a calculable fixed 'time' independent of the number of items already in the set. I would assume there are many more structures with this property, lying somewhere in the deeply buried pile of patents.
To say that 'no other constant-space set data structure has this property' is also incorrect. Take a simple array as an example: it provides O(1) fixed operations to add an item, assuming it is a constant-sized set; and neither an array nor a Bloom filter guarantees O(1) or O(3) (in the case of the example Bloom filter in this article, k=3) operations for checking whether an item is in the set. The reason I say the latter is that the Bloom filter only says 'it probably exists' - you then have to go check the underlying data set to see if it does.
This means that a Bloom filter's computational time is an additional operation on top of the underlying data structure/algorithm that refers to in-memory or disk-based pointers - which makes the argument about its running time irrelevant. It doesn't actually point to any object or data item, on disk or in memory. So to say it has O(k) running time is also inherently wrong, given that a secondary structure or algorithm must accompany a Bloom filter to actually perform the real lookup, insertion, or deletion of an item.
The statement is still misleading indeed. We should rather say that "A Bloom filter has the property of detecting whether an item is either NOT in the set or probably in the set in O(k) hash operations. This property is an uncommon one, being independent of the number of items in the underlying set." (You have to keep in perspective that a Bloom filter is not actually a set; that's why it's called a filter.) It's fine to say that its k lookups can be parallelized, as can the Bloom filter's inserts and deletions. Then we should probably add "A Bloom filter can reduce the need to perform additional operations on the underlying data structure..." - which is its original intended usage.
In addition, we have to realize the space and time advantages section is also dependent on the perspective or use of the Bloom filter, and is therefore not a really relevant argument when comparing against actual sets that do point to actual objects and data...
If, for example, the programmer has a list of numbers or strings - and no objects - then we can say the Bloom filter has advantages over other data structures. However, this is almost never the case. The reason is that a Bloom filter never says 'this is definitely in the set' - it only says if (key == all_true) check_underlying_set; else return;... In the usual case, where the Bloom filter is not intended to be used as an actual set of objects - where we need to point to actual objects in memory or on disk - its space and time advantages become irrelevant and only serve to filter the underlying data structure, implying that we are potentially saving more costly operations on the underlying data structure/algorithm...
75.133.175.2 (talk) 03:56, 30 October 2013 (UTC)[reply]