Talk:Bloom filter

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
WikiProject Computing (Rated C-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.
 

A Graph[edit]

The false positive probability as a function of number of elements in the filter and the filter size . An optimal number of hash functions k= m/n ln2 has been assumed.

Maybe the following graph could be added to the "Probability of false positives" section. --Jerz4835 (talk) 12:29, 12 March 2008 (UTC)


this graph is weird. why is there no linear probability axis? i mean, this graph suggests that this is a shitty algorithm? the probability for false positives is really steep... --78.53.219.53 (talk) 13:00, 21 April 2010 (UTC)

The graph isn't really showing how "good" the bloom filter is, rather how various filter sizes and set sizes relate to the false positive rate. The probability axis emphasizes the area between 0 and 0.01 because differences there are a lot more meaningful than between, say, 0.99 and 1. Unfortunately I agree that it's not very clear at a glance, but I am no good at these types of visualizations and probably couldn't improve on it. Maybe someone else has ideas. Maghnus (talk) 04:11, 13 August 2010 (UTC)

The example[edit]

False positives are silly in a spelling checker. I think an example more like [1] should be used, with the secrecy thing mentioned elsewhere. But my english very bad, so I leave in capable hands of brother and sister. —Preceding unsigned comment added by 220.233.116.128 (talk) 16:19, 9 January 2008 (UTC)


Completely agree, if you read the article your mind becomes "stuck" on the example; it is so silly you wonder if you misunderstood the concept. 80.79.35.68 (talk) 17:16, 11 January 2008 (UTC)Lambert

The example image both has k=3 and also includes 3 elements. This makes it less than ideal for explaining how the 'k' thing works. Rsclient (talk) 03:34, 1 May 2018 (UTC)

Operation Names[edit]

I propose changing insert to add and test membership to query. I think add better matches what actually happens and query is just more concise. --Pfunk42 22:42, 1 Jun 2005 (UTC)

Discoverer?[edit]

I propose changing discoverer in "...named after its discoverer Burton H. Bloom," to inventor. Algorithms like this aren't 'discovered' so much as invented.

I tried "conceived by" whoever in year whatever, a pattern I've seen elsewhere. Deco 01:21, 5 Jun 2005 (UTC)

Removing from a Bloom filter[edit]

It feels like you can't. This should probably be emphasized in the article, as this is a distinction from similar data structures. It's mentioned in the intro, but then never discussed...I don't have the technical expertise to say definitively how to handle this, so I can't write it up myself. -- Metahacker 15:32, 21 Jun 2005 (UTC)

Good idea. I added a short explanation. You think it'll do? Deco 19:46, 21 Jun 2005 (UTC)

While strictly speaking practical removal may not be possible in a simple Bloom filter, two facts presented in the article might use some clarification as to why removal is logically impossible. First, the universe can be represented by an "full" array, and second, Bloom arrays can be ANDed. In cases with a finite data universe the two of these seems to idicate that I could build the Bloom array of everything but the element I want to delete then AND that with the Bloom array I have. I can't say that I see this never introducing false negatives, but it does seem worthy of mention one way or the other by someone more apt in these things than myself. JonShops 03:06, 15 April 2007 (UTC)

The problem with that idea is that the Bloom filter corresponding to "universe minus my element" will almost certainly be identical to the Bloom filter corresponding to "entire universe". You'll never introduce a false negative, but you will usually introduce false positives — that is, if you take the Bloom filter for (A,B,C) and try to subtract out (B) using your method, you'll probably end up with the Bloom filter for (A,B,C). The "removal" didn't actually do anything! Hope this helps; also, check out the other variations mentioned on this Talk page. Some of them support removals, with various restrictions. --Quuxplusone 03:26, 15 April 2007 (UTC)

I know this discussion is a few years old, but you could technically remove items from a bloom filter if you use a counter array rather than a bit array. It kills the efficiency of the table, but at least you can remove items as long as you don't manage to overflow a given counter (but you'd hit false positives first.) --Sigma 7 (talk) 17:36, 17 April 2010 (UTC)

Yes, that's already in the article under Bloom filter#Counting filters. —David Eppstein (talk) 18:13, 17 April 2010 (UTC)
What if you have a second Bloom filter, which stores a set of deleted items? So, check the first Bloom filter to see if the item was (at some point) a member of the set, then check the second Bloom filter to see if it was deleted. Of course, there is a downside: once an item is deleted, you can never add that item again, simply because there's still no way to delete something from the second Bloom filter. Well, unless you have *another* Bloom filter for 'undeleted' items, but it's not a practical solution. For applications where deletion is irrevocable (such as with SSL certificates), this may be an extremely desirable feature. 208.81.212.224 (talk) 19:04, 24 October 2013 (UTC)
False positives in the second filter could be a problem... They would eliminate one of the more useful properties of Bloom filters, that its errors are only one-sided. —David Eppstein (talk) 19:12, 24 October 2013 (UTC)

Adding and removing in a single-bit Bloom filter[edit]

You could implement a voting algorithm in a single bit Bloom filter to allow both adds and deletes, at the cost of admitting false negatives, with an add setting k hash-bits, and a delete clearing k hash bits. The difference from an ordinary Bloom filter would be that the decision threshold would be less than requiring all k bits to be set for a rejection: for example, say, k/2, for say k=20. In this way, although both false-positive and false-negative results would be possible, the threshold factor could be tuned to give low error rates for both types of error.

Has anyone done this? Or are multi-bit Bloom filters simply superior? Inquiring minds weant to know. -- The Anome 13:54, September 9, 2005 (UTC)

Good question. I can't imagine why nobody's thought about this already, it's pretty obvious considering the implementation of your average BPP algorithm. I shall have to ask some people about it. Dcoetzee 08:03, 28 May 2007 (UTC)

Adding and removing in a counting Bloom filter[edit]

Riffing on The Anome's suggestion above: You could also implement a "counting" Bloom filter, by using b bits per hash instead of only one. For example, using b=2, you'd replace Bloom's array of m bits with an array of m 2-bit counters. Adding an element would increment all k relevant counters; if any counter was already at 3, it would "stick" there instead of wrapping around. Therefore, a count of 0 represents "absence," and a count of 1 or 2 indicates 1 or 2 entries sharing that counter. A count of 3 indicates at least three entries sharing that counter.

To remove an element, you'd decrement all k relevant counters, except for those counters that contained 3 — such a counter you'd leave at 3, since to decrement it would falsely imply that it was shared by exactly two entries.

This kind of "counting" Bloom filter would have all the problems with false positives of a normal Bloom filter, but it would allow a certain number of removals to proceed correctly. Add enough elements, though, and removals would stop working — you'd be unable to remove any element all of whose counters had maxed out.

I suspect that the added complexity of implementation wouldn't make a "counting" Bloom filter worth the trouble, but it's worth a mention on this Talk page, at least. :) --Quuxplusone, 19:06, 18 March 2006 (UTC)

Bloomier filter insertion[edit]

The bloomier filter insertion description is wrong:

"Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into A0 and then test if the key is in B0. If so, this is a false positive for B0, and the key must also be inserted into A1 recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with A and B reversed.".

So any key in A0 that is a false positive in B0 must be present in A1 (and so on). But a key inserted in A0 might BECOME a false positive key in B0 after other insertions in B0. Imagine we insert a key in A0 which is "almost present" in B0 - only one of the k bits is not set; the key is not (yet) a false positive in B0. But after this, maybe we insert a key in B0 that also sets that kth bit to 1; now the initial key is in A0, it is a false positive in B0, and it is not in A1 - violating the data structure invariants. --Raduberinde 08:09, 28 July 2006 (UTC)

You're right. My mistake. I will reconsider the original paper. Dcoetzee 04:38, 25 May 2007 (UTC)

I think the paragraph on insertion in bloomier filter should be removed. The described bloomier filter is static, it doesnt permit insertion or deletion. The bloomier filter is constructed globally. I have added a section on dynamic bloomier filters that support insertions and deletions.

Okay, dynamic is certainly better - and surprisingly simple. We ought to still cite the original paper, but we can discuss the more recent result in detail. Dcoetzee 17:40, 24 July 2007 (UTC)

You should break this into multiple pages[edit]

A bloom filter is a bloom filter - a counting bloom filter is a different beast with enough different permutations to discuss for the rest of eternity.

Break it into two - at least

Generalized[edit]

Don't mean to be a wank, but has the "Generalized Bloom Filter" thingy been accepted into a peer-reviewed publication?

Regardless, the author should stick in info on how much bigger than a regular BF a GBF that uses ones and zeros needs to be to have the same false positive rate. Oh, I see. You forgot to mention that because it's A LOT (many times) bigger.  ;) --pfunk42 04:35, 28 May 2007 (UTC)

Proof of false positive probability is wrong[edit]

The events of each particular bit being 1 are not independent. Intuitively, if you know the first bit mapped to 1, it suggests the array is more dense. That is, the probability the second bit maps to 1 given that the first did is greater than the probability the second bit maps to 1 given the first mapped to 0 (or say, given no other information). 128.2.16.192 22:49, 27 September 2007 (UTC)

First of all, i would call it an "analysis", not a "proof". And, in fact, this is an average case analysis, because it doesn't take into account the fact that the number of bits set in the Bloom filter from n additions follows a probability distribution, and with random hash functions, the number of set bits determines the false positive probability. Thus, adding n things to a Bloom filter doesn't always result in a certain false positive probability--you might get lucky or unlucky. But the variance is small. There are possibly some other suble, mostly insignificant inaccuracies in that formula. See this paper for what seems to be a more detailed analysis, but it has not, to my knowledge been accepted under peer review. --pfunk42 05:32, 30 September 2007 (UTC)

What are they used for?[edit]

What are bloom filters used for? Any applications people can think of should be put in the intro. Fresheneesz 02:09, 16 October 2007 (UTC)

No. Any applications people can source should be put into later parts of the article, and (per WP:LEDE) mentioned briefly in the intro. —David Eppstein 02:47, 16 October 2007 (UTC)

meaning of bloom filter[edit]

if it gives "false positives" and "correct negatives" then it must be used to find whether element is *not* in a set (and so one would be 100% sure). maybe this article could be change according to it. 84.16.123.194 (talk) 15:47, 28 January 2008 (UTC)

I know this is old but this conclusion is either slightly incorrect or just poorly phrased. Bloom Filters are used to filter things out but not to conclusively determine whether an element is not in a set since it can return a "maybe" result. You can sometimes use it to avoid doing more expensive lookups/calculations if the Bloom Filter first determines your element is NOT in the set. 209.131.62.115 (talk) 05:49, 16 November 2013 (UTC)

Purpose of k hash functions?[edit]

Why are there k hash functions? Why would you need more than one? —Preceding unsigned comment added by 66.11.82.73 (talk) 15:07, 15 April 2008 (UTC)

You try doing it with one hash function and let me know what error bound you obtain.
This was confusing for me as well. Couldn't a single function duplicate the functionality of k hash functions? It's always been the case that a single function can call 10 different functions, for instance. Isn't the whole point that the single value ends up in k different places with a more or less even distribution? I can see the need for k different algorithms, maybe - although even then a little bit of modulus arithmetic would enable one to use a single algorithm in a loop k times and, again, would duplicate the functionality of using k different algorithms perfectly. So yeah, why the heck do you need k different hash functions? To answer your question, the error bound would be exactly the same. — Preceding unsigned comment added by 209.180.36.114 (talk) 20:31, 27 March 2013 (UTC)

I think it's unnecessarily ambiguous that the diagram in the Algorithm description has k=3 and also a set of 3 elements. To the novice like me it's initially confusing. Anyone agree? --Flexdream (talk) 15:35, 16 May 2012 (UTC)

Decay?[edit]

Can anybody provide a description of various techniques for decaying bloom filters? Google has hits for several different terms, "Scope Decay", "Deterministic Decay", "Time Decaying", etc. I conjecture this is a way to stochastically remove no longer relevant members from the set, but the articles I've found all require various subscriptions. nuffin (talk) 20:25, 26 July 2008 (UTC)

Filter size (m)[edit]

It would be helpful to note that the number of bits in the filter (m) should be a multiple of the expected number of elements (usually called n in the papers). Not as obvious as you might think. Could save someone's afternoon, a note like that... —Preceding unsigned comment added by 208.85.112.97 (talk) 15:15, 24 September 2008 (UTC)

k for a particular p[edit]

An interesting consequence of the formulas presented that if one wants a particular false positive rate, p, and intend to use the required number of bits (m) to get that rate for the number of entries (n) then the optimal number of hashes is

regardless of the specifics. Although in some sense obvious (adding one more hash halves the false positive rate) it is a clean clear, simple and useful result. If you want, say, no more than one false positive out of 1024 tests, and you want to use the minimum number of bits in the filter, you will need at least 10 hashes. This might be worth including in the article. Topher Cooper--65.196.61.126 (talk) 22:12, 22 September 2009 (UTC)

I'm new to editing wikipedia, so please bare with me on this. I just made a change which I believe corrects the claim in the article which says that k is proportional to 1/p instead of, as you observe, log(1/p). Of course, it depends on how k is chosen, e.g., if is chosen to be optimal for the given m (size of bit vector) and n (number of elements inserted into the set), then k ~ log(1/p).
I didn't see your comment here until after I posted the change. I would welcome any feedback on this change since I want to make sure my edit is sound and you seem familiar with the math.
Queelius (talk) 09:05, 11 July 2014 (UTC)

Probability of false positives - demand for sources is a rejection of science and the scientific method.[edit]

Everything in the section "Probability of false positives" is verifiable from basic principles of mathematics and science by following the author's reasoning, thus the demand for proper authoritative sources is a rejection of science and the scientific method.

This section not original research, rather it is replicated research that each one and every one should replicate.

Roger Bacon was the first scientist since because insisted on observing things for himself and urged others to do the same. If Wikipedia prohibits people from doing the same, it becomes the holy book of a scientistic religion.

The motto of the Royal society was "Nullius in verba" meaning "Take no one's word for it".

Wikipedia is scientifically and morally wrong to demand that people take someone else's word for things. Replicability, not proper authority is the gold standard of truth. James A. Donald (talk) 02:48, 13 January 2010 (UTC)

Technically, a citation is required for the jump from (1-(1-m)^kn)^k to (1-e^(-kn/m))^k, since that requires some level of calculus. A better reason for removing the tag is that it echoed information found in an external link (and also has a hard reference), rather than complaining that Wikipedia doesn't follow the scietific method of more formal peer review. --Sigma 7 (talk) 18:51, 17 April 2010 (UTC)

Hash Tables are no longer Hash Tables when they are Bloom Filters[edit]

Saying that a Bloom Filter at k=1 is a special kind of Hash Table (in the Space & Time Advantages section) seems counter intuitive. Hash Tables are deterministic and a Bloom Filter with k=1 is exactly what is described, not a surprising anomaly having to do with what happens when a Hash Table no longer checks for collisions and uses single bits rather than full entries. Specifically, "Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1." should be removed for conciseness.

Rbkillea (talk) 22:08, 6 December 2011 (UTC)

No other data structure?[edit]

The page says "Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property."

It is misleading to think that no other data structure has the property of the complexity being independent of the number of elements in the set. However, vEB Trees, constant size Fenwick Trees and Segment Trees have this property. The running time is a function of the universe size and not the set size.

Maybe it needs to be made more clear that a constant running time, which is independent of the number of elements in the set is what is being talked about.

Dhruvbird (talk) 18:27, 2 February 2012 (UTC)

I agree with @Dhruvbird - [edit] There are, indeed, other data structures that are even actual Sets (containing data items or indicating a data item exists in the set) that do have constant "running time"..

[rest of this discussion was moved to "This page needs revision.."

TL;DR but it is indeed a constant-time set data structure (with one-sided errors). If you don't think so then you don't understand it. It is entirely possible that our article mis-states the time bounds for other alternative structures, however. —David Eppstein (talk) 03:56, 31 October 2013 (UTC)

Note: this discussion was edited and moved below for clarity and lack of offensiveness and furthering frustration at the wikipedia community..75.133.175.2 (talk) 15:40, 6 November 2013 (UTC)

This page needs revision - conflicts, clarity, the rest can stay as-is[edit]

I am going to be borrowing from another page, called a 'quotient filter', which is a derivative data structure which came from the idea of a bloom filter. It's usage, design, and purpose are exactly the same as a bloom filter, however, its implementation was performed in a way that 'enhances' what a bloom filter fails to do - where we can see in the bloom filter article that it needs restructured, which is a total restructure, as n changes.

Here are the changes I am proposing:

1. --> Clarify Introductory paragraphs to more accurately represent what the data structure is, what it is intended to be used for, and what it does and does not do.

[note] then add this (an important piece of information borrowed from quotient filters. I borrowed this from quotient filters, which has the exact same properties of a bloom filter only its internal representation is different. Both a bloom filter and a quotient filter are equivalent data structures - probabilistic data structure filters... This is a better description of what a bloom filter is, what it is used for, and why secondary storage is needed. A bloom filter cannot resize without storing keys - if you think it can, I would love to see you try and build a bloom filter that can rehash and resize for optimal m and k as per p and n. In fact, I would take great pleasure at seeing you attempt this impossible programming feat:

"Each Bloom Filter is associated with a more space-consuming set, such as a B-tree and its variants, or other disk-based algorithms, and its contents are reflective of the associated set. As elements – key/value pairs – are added to the set, their keys are also added to the Bloom Filter. However the Bloom Filter stores only a few bits per key, whereas the set stores the entire key, which can be of arbitrary size; therefore, a Bloom Filter can often be memory-resident while the associated set is stored in slower secondary storage. Thus this association can dramatically improve the performance of membership tests, because a test that results in "negative" (definitely not in the set) can be resolved by the Bloom Filter without necessitating any I/Os to access the set itself. The reason for having secondary storage, even though it is never accessed by the lookup operation for the stand-alone case of "negative" only lookups, is due to the fact that a bloom filter needs to resize according to the number of elements in the data set to account for optimally small false positive rates and therefore more definite "negative" results. If the bloom filter is static, the probability suffers where its size and hash usage are constants. A bloom filter cannot re-hash by itself, because it never stores any keys, only bits or numbers. Therefore, in order to resize the bloom filter we must at least store the actual keys somewhere. This gives the bloom filter the ability to re-hash, and therefore resize, without difficulty."

[note] Borrowing from 'Quotient Filter' page,

2. --> Algorithm Description section - careful on wording..

"To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem."

[note] The above paragraph just needs re-wording. This is also a fact. A bloom fiter cannot and never does say 'A contains x', it only says 'A may contain x with standard probability error p', basically meaning the only definitive test is 'x is not in A' - but that isn't even definitive, since as the bloom filter fills up, 'x is not in A' becomes 'A may contain x with standard probability error p' These are facts about the data structure. The more advanced techniques refer to other sections of this article...

"To query for an element (test whether it might be in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set - if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or it actually is not (a collision resulting in a false positive). More advanced techniques work out the problems of a normal bloom filter - such as dynamic resizing, allowing removal, and maintaing a preferable false positive rate."

[note] and the following paragraph

"One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter."

[note] should read - this is factual. A bloom filter cannot be resized without rehashing, and most bloom filters are not constant-sized arrays. Therefore, if you want to resize a bloom filter secondary storage of keys is required. A scalable bloom filter does address the problem of resizing. Cite Scalable Bloom filter for more info, these are borrowed from the respective links to articles.:

"More advanced algorithms resolve the problem of not being able to remove from a bloom filter. Additionally, a bloom filter cannot be resized - in most proposed bloom filter algorithms there is no way to re-hash, since no keys are stored. A Scalable Bloom Filter addresses this problem. Additionally, if the actual keys are stored in secondary storage, this gives the bloom filter the ability to resize and re-hash, therefore implying that we can choose an optimal m and k for any p and n."

[note] The following should be added to algorithm description/merged into it somewhere. This is also fact. A bloom filter only says 'x maybe is in the set' or 'x is not in the set'. The probability of false positives is a probability application. Following probabilistic math, this means that if you have a 1% chance of error, then 1/100 queries, based on that probability, will return a false positive, on average. However, this is not exact - as you can see the probability of flipping a coin is 1/2, but usually by the sixth time you flip it will land on a different side eventually. This is a fact of probability. It is also fact that as the bloom filter fills up, if it doesn't resize itself, p will rise. This means more false positives and also less definitive negatives. That is a fact. The only way to have a constant bloom filter with acceptable p is to set its initial size arbitrarily large - which defeats the purpose of memory efficiency and also increase k, which worsens performance - usually to six hashes, according to your formulas...

"A bloom filter resolves two situations. The first is when a program only wishes to know whether an item is definitely not in the set. In this situation, a bloom filter can be used as a stand-alone data structure. However, if the program wants to know whether an item definitely is in the set, an accompanying data structure must be utilized. The reason for an accompanying data structure in the latter case is because even though a bloom filter employs a 1% false positive rate storing fewer than 10 bits, and can be reduced easily to 0.1% - this still means that 1 out of every 100 or 1 out of every 1000 queries, respectively, will return a false positive in the general case; yielding an item that is not actually in the set when the bloom filter says it is. As the data structure fills up and the bloom filter size in bits m and hash function numbers k are not set to ideal proportions of item numbers n, it becomes the case that the probability p will suffer, meaning higher chances of false positive lookups. This is especially a problem when n is dynamic and the bloom filter cannot resize itself without total reconstruction, if that is possible according to the implementation of the bloom filter. Therefore, care must be taken, unless using Scalable Bloom Filters, in choosing an optimal m and p."

[note] - this is also a factual statement. What would you call hashing to the same bucket? A collision. However, bloom filters do not provide collision resolution - that is how it arrives as a probabilistic data structure. Increasing false positive rates means that you have decreased the amount of definite negatives. What is a false positive? It is an item that is not in the set, but the bloom filter says it is. What would the inverse of that situation be? The bloom filter says it is not in the set. Because false positives exist - a false positive is actually a negative. Let x be 'maybe in Set' and y be 'not in set'. As n increases, and the bloom filter does not resize, p increases. As p increases, x gets bigger and y gets smaller. In fact, false positives only exist because y gets smaller. In other words, the negative members of x in a real data set would be members of y, or to put it more bluntly, false positives are members of y. understood? this is why you have an algorithm page that contradicts itself...

"Because the margin of error is usually 1% to test if an item is in the set, the bloom filter cannot guarantee with 100% certainty that an item actually is in the set. In fact, just like a hashtable; a bloom filter can have collisions. A collision in a bloom filter is when an item lookup results in all 1s but is actually not in the set, meaning a false positive. As the number of items it represents increases, the data structure, if left alone, continues to turn 'not in this set' to 'maybe in this set' - increasing false positive rates; also meaning negative results will dwindle to false positives. An exhausted bloom filter, if it cannot be re-hashed, may get full at which point it considers almost everything 'a part of the set', and almost nothing 'not in this set'. Hence the term probablistic bloom filter: it is a probablistic data structure, intended to be used as a filter."

[note] now the article does not conflict

3. --> Space and Time Advantages

[note] change the entire section to the following. The following paragraph re-iterates the points made above.. It does you all no good to discuss calculations if you are using a constant size m. Given a constant size m, why would you need to calculate m? It would be a constant. So would k. Why do any calculation? The only time this is fine is if n is also constant - which in practice, n is hardly ever constant size. Therefore, if you do not resize, and therefore re-hash, and therefore need to store keys somewhere, then these calculations are meaningless. That's why I propose you state the following:

Bloom filters with 1% error and an optimal value of k requires only about 9.6 bits per element - regardless of the size of the individual elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probablistic nature. The 1% false positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element. However, simple bloom filters and counting filters are static data sets - that also cannot be resized unless actual keys are stored somewhere. This is because items that are inserted do not store their keys - there is no way to re-hash. For this reason, we must either choose ahead of time an optimal, static m and then an appropriate k (meaning an unecessarily large data structure to account for any n items), or store the actual keys in secondary storage so we can re-hash the bloom filter and therefore acheive optimal, calculated m and k in accordance with any p and n.

[note] - see usage examples and actually read them

Its general use is to accompany and filter unecessary disk I/O, or avoid similar more costly operations - but will come at the cost of memory consumption, as well as insertions, removals, and lookups being in addition to the underlying secondary data structure. Filtering is employed to weed out what isn't in the set, but no matter how small the probability of false positives p, a bloom filter never says with absolute certainty whether or not an item is in the set - it just says it might be; and in addition it cannot find a nearest neighbor that definitely is in the set. In this situation, the bloom filter serves to eliminate, sometimes dramatically, unecessary operations simply by ruling out what items aren't members of a set - which results in performance enhancements on the secondary data structure lookup. See use cases below for examples of how this is used.

[note] - I do not see what's wrong with this statement, either.

For the case of lookups where the only concern is what definitely isn't in the set, it can be used as a stand-alone data structure; and in those cases it may be beneficial over other algorithms in terms of lookup speed 0(k) and memory consumption (m bits). Note that a Scalable Bloom Filter may reduce the need for setting m higher than is necessary relative to a dynamic n, but so too can re-hashing and therefore calculating appropriate sizes of m and k. If n is static, then m and k can easily be pre-calculated for optimization. See use cases below for examples of how this is used.

[note] cited from article

An example of the problem in practice with bloom filters using up more memory than desired or is ideal was proven by Rottenstreich, Kanizo & Keslassy (2012): "Counting Bloom Filters (CBFs) are widely used in networking device algorithms. They implement fast set representations to support membership queries with limited error, and support element deletions unlike Bloom Filters. However, they consume significant amounts of memory..."

[note] this is basically what you had, just re-worded slightly..

Bloom filters have the unusual property that the time needed either to add items or to check whether an item is potentially in the underlying data set is a fixed contant, 0(k) - where k is a multiple of hash functions, completely independent of the number of items already in the set. These lookups can also be parallelized, even in a multithreaded environment. This gives a bloom filter its main advantage over or in accordance with other data structures - like hashtables, tries, lists, or trees.


[auth]Thank you - sorry it got heated. All I wanted was for this page to get corrected and clarified. One person actually posted on a site saying it was a form of a trie. That is very incorrect, and I am assuming they got it off of this page..75.133.175.2 (talk) 18:54, 6 November 2013 (UTC)

Re your point #1, please see WP:TECHNICAL. The lead is supposed to be for an easy-to-read introduction to the subject. It is not supposed to be the place to go into this level of detail. And your writing could use copyediting, e.g. the last sentence no verb. And your last sentence in your last point, "A Bloom filter must be accompanied by..." is simply blatantly, false. There are plenty of use cases for Bloom filters that do not accompany it with another data structure. —David Eppstein (talk) 19:09, 6 November 2013 (UTC)

David: I'm assuming you meant the discription of the algorithm, and not the discription of a bloom filter's use as discussed on the very closely related quotient filter page (I would assume since a quotient filter is an equivalent algorithm, that page is at least not wrong - as noted by wikipedia by not saying that the quotient filter page does not contradict itself and has no errors in statement)... So I moved it to 'Algorithm Description' above. I also moved more from 'Space and Time Advantages' above to 'Algorithm Description', since it made more sense for some of it to be moved in order to only discuss actual space and time advantages. If you feel that my last sentence is gramatically incorrect, feel free to modify it to fit your needs, or any other sentences that are gramatically incorrect (I'm not a writer by profession). Also, when you do agree on the wording, please then revise the bloom filter page itself - since you seem to be the most interested in its revision at this time...

Also, 'A bloom filter must be accompanied by...' is not blatently false. Sure, there are situations, where the programmer just wants to know what isn't in the set. Then it can be used as a stand-alone structure. However, the only way to know whether an item is in the set is to have the 'bloom filter accompanied by...' (another data set). We are disagreeing here on a matter of a special case, when both of us are correct when you look at a bloom filter from a certain angle. But I would argue that in order to have optimal sizes of m and k, which is relative to n and desired p, we must at least store the keys in secondary storage so we can re-hash the bloom filter. Otherwise, there is no way to re-hash the bloom filter and we end up with a static set in which p suffers as n increases - meaning more false positives and less definitive negatives. The exact same feature is true of a quotient filter.

I'll make the corrections above, let me know when you have the time to actually read it. When we reach agreement, please update the bloom filter page with the changes.

75.133.175.2 (talk) 06:37, 7 November 2013 (UTC)

I think we are still far apart. You seem to have a very dogmatic point of view about what operations must be supported to call something a set data structure. Dogmatism and opinions are not a good mix for editing here, where you need to back up things with reliable sources. What are your sources? —David Eppstein (talk) 06:58, 7 November 2013 (UTC)

[2]

A set is a gathering together into a whole of definite, distinct objects of our perception [Anschauung] or of our thought – which are called elements of the set.

A set is a well defined collection of distinct objects. The objects that make up a set (also known as the elements or members of a set) can be anything: numbers, people, letters of the alphabet, other sets, and so on.

elements: wikipedia- The relation "is an element of", also called set membership, is denoted by the symbol "∈". Writing

   x \in A 

means that "x is an element of A". Equivalent expressions are "x is a member of A", "x belongs to A", "x is in A" and "x lies in A". The expressions "A includes x" and "A contains x" are also used to mean set membership, however some authors use them to mean instead "x is a subset of A".[1] Logician George Boolos strongly urged that "contains" be used for membership only and "includes" for the subset relation only.[2]

Another possible notation for the same relation is

   A \ni x,

meaning "A contains x", though it is used less often.

The negation of set membership is denoted by the symbol "∉". Writing

   x \notin A

means that "x is not an element of A".

[3]

The definition of a set sounds very vague at first. A set can be defined as a collection of things that are brought together because they obey a certain rule.

These 'things' may be anything you like: numbers, people, shapes, cities, bits of text ..., literally anything.

The key fact about the 'rule' they all obey is that it must be well-defined. In other words, it enables us to say for sure whether or not a given 'thing' belongs to the collection. If the 'things' we're talking about are English words, for example, a well-defined rule might be:

       '... has 5 or more letters'

A rule which is not well-defined (and therefore couldn't be used to define a set) might be:

       '... is hard to spell'


The mathematical requirements for a set means "distinct elements", and a requirement for elements in the set is both "A contains x AND x is not an element of A", and further a requirement for a set is that the set is "well defined, which means it enables us to say for sure whether or not a given 'thing' belongs to the collection".

Since a bloom filter does not have distinct elements, does not contain elements, and never says that for sure 'A contains x'; as such it is not a well-defined set - so according to mathematics, a bloom filter does not meet the requirements of a set - whereas a trie, hashtable, tree, list, etc. all do - because they meet not only the requirement that 'x is not an element of A' but also 'A contains X', contains distinct and definite elements, etc.

You seem to be very dogmatic in your understanding of set theory... I'll look for sources to back up statements and make adjustments since you cannot trust the advise of an expert on the subject of algorithms speaking plainly. That said, it is possible if I were to change this page you would either change it right back or put citations needed notices everywhere. I'll fix the above to only reference cited articles.. That way you can also trust the words of some other expert who is also just speaking their minds... Until then, david..

75.133.175.2 (talk) 06:31, 8 November 2013 (UTC)

I know what a set is. That wasn't the point, and cluttering up this talk page with copied and pasted definitions doesn't seem to be very constructive to me (nor does copying and reversing lines from my comments). The question was, what is your source for the assertions that Bloom filters are not data structures, do not (even approximately) represent sets, and can only used with an exact backing data structure? —David Eppstein (talk) 07:09, 8 November 2013 (UTC)

David: You asked for references - so I gave them. Sorry it cluttered the disussion up. If asking for references, citations, and external sources doesn't seem to be constructive to you, then I will simply stop here.

Your question 'the assertion that bloom bilters are not data structures' - is something I never said. I said they are not data sets - by mathematical definition.

Your next question, 'do not (even approximately) represent sets' is your implication - a twist of my words. I simply said a Bloom filter cannot be a set - because it does not meet the mathematical requirements of a set. I'm sorry it's taking this long to explain it to you. An approximate set is not a set. A bloom filter is more like a null set or not a set at all. There are no 'approximate sets' in mathematics. In fact, a set that is not well-defined or automic cannot be a set and therefore has paradoxes. In this case, the paradox is the false positives, where 'x is in A' becomes 'x may be in A' and 'y is not in A' becomes 'y may be in A', as the set fills up. In other words, it cannot say for certainty, in all truthfulness, the set requirement that 'x is in A' or 'y is not in A'. The reason for this is your false positive paradox. As such, you are defying mathematics by suggesting it is representative of a set. It is not a set, and approximate sets are paradoxes - and therefore an error, as shown.

Your third question, 'can only be used with an exact backing data structure' - is also a twist of my words. I said if n is dynamic, which it usually is - that means if you want an appropriate, calculatable value of m and k to guarantee a certain probability p you want according to the number of items n, then in that case you have to store the keys somewhere. That is because (fact) you cannot ever resize the bit array in a bloom filter because if you do not store the keys somewhere it is impossible to rehash - therefore impossible to resize. If you can't resize the data structure, then you pick a constant m and k, which means as the array fills p becomes larger - as per your page's own calculations. Picking a constant size m and k is a really poor choice...

Therefore, as n is usually not a constant size, where you say yourself a bloom fiter must allow for insertions and deletions - then how, pray tell me - do you suppose you resize the bloom filter? If you need to resize it, you must rehash it - therefore you have to store the keys somewhere in secondary storage. That's what I said, and it is completely correct and factual.

While we are talking about facts, let me go back to my original statement which you flat out refuse, along with any citations, facts, observations, or anything about this data structure - you just flat out refuse to read and accept things, and instead are insistant on just throwing up strawmen, argumentative abuses, and being rather ignorant and close-minded. You do not understand this data structure. Because you do not - and this wikipedia page clearly displays this, the web is full of false statements regarding bloom filters. In fact, it is my understanding that many of you are willing to accept a margin of error in lookups...:

Let me ask you this: Let's say your bloom filter set is your bank account. Would you want it to be 99% accurate? What if that 1% inaccurate thing that happens to you every month is a 1,000-1,000,000 dollar charge on your account - a charge that never really happened? Would you like that scenario? That is the same case as a bloom filter.

What if you went to the doctor, and he was 99% certain that you had cancer? What if the screening test was 99.9% accurate, and you test positive? let's further say you go on to surgery, where they find no tumor. You now owe millions of dollars in hospital and surgeon fees, you did the surgery and radiation for no reason, and you have to declare medical bankruptcy - destroying your credit and ability to purchase anything.

Do these scenarios sound acceptable to you? If they do, feel free to use a bloom filter as a set and accept the margin of error. If you think a data set must be well-defined and more exact, I suggest using a secondary data structure and using a bloom filter as the name implies - to filter out unecessary queries from the underlying data structure. I am done with you here, you are so ignorant...

2. I explain my points in the article and are observations and facts about the data structure - not my own opinion or 'dogmatic view', as you so put it.. Ignorance must be bliss, even on an encyclopedia. People use wikipedia as a source of correct information. Sad you all cannot correct this... I see now that this argument is just going to go on and on, with me pointing to ever more evidence only to be told 'nah - I don't agree with anything'. You must be republican or a member of the tea party :*75.133.175.2 (talk) 01:43, 9 November 2013 (UTC)

Separate derivation from results in the "Probability of false positives" section[edit]

The presentation might be clearer if we were to separate the formulas from their derivations in this section. Anyone agree/disagree? --Doradus (talk) 16:31, 8 November 2013 (UTC)


Imagine A World With An Infinitely Sparse Hash Table[edit]

"With unlimited core memory an error-free hash eliminates all unnecessary disk accesses..." Nice thought experiment, but back here in the 3rd dimension we could probably just say that an in-memory hash table based on some reasonable multiple of the size of your keys would suffice, and then from there go into how those requirements could be made lower with a compact Trie or a Bloom Filter. -- The Arcadian — Preceding unsigned comment added by 98.169.58.20 (talk) 21:34, 6 March 2014 (UTC)

Some anonymous user rewrote all the hypotheticals leading to such sentences. Anyway, I've changed "unlimited" to "sufficient", etc. Shreevatsa (talk) 10:36, 7 March 2014 (UTC)

External links modified[edit]

Hello fellow Wikipedians,

I have just modified 3 external links on Bloom filter. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

As of February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required on behalf of editors regarding these talk page notices, other than regular verification, as with any edit, using the archive tools per instructions below. This message updated dynamically through the template {{sourcecheck}} (last update: 1 May 2018).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.


Cheers.—InternetArchiveBot (Report bug) 09:55, 4 November 2016 (UTC)

Chrome browser problem[edit]

A browser incompatibility is affecting the first sentence of "Algorithm description". The editor content is as follows.

An ''empty Bloom filter'' is a [[bit array]] of {{mvar|m}} bits, all set to 0. There must also be {{mvar|k}} different [[hash function]]s defined,

In MS IE9 the text is correctly shown as:

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined,...

but in Google Chrome the text is incorrectly shown as:

An empty Bloom filter is a bit array of bits, all set to 0. There must also be different hash functions defined,...

Perhaps the problem can be avoided by

 replace {{mvar|m}} by <i>m</i>, replace {{mvar|k}} by <i>k</i>.

Blooteuth (talk) 03:10, 26 December 2016 (UTC)

That would produce a slanted sans-serif letter m, where the result should be an italic serif m to indicate that it is a mathematical variable, not a fragment of emphasized text. —David Eppstein (talk) 04:31, 26 December 2016 (UTC)

External links modified[edit]

Hello fellow Wikipedians,

I have just modified 5 external links on Bloom filter. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

As of February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required on behalf of editors regarding these talk page notices, other than regular verification, as with any edit, using the archive tools per instructions below. This message updated dynamically through the template {{sourcecheck}} (last update: 1 May 2018).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.


Cheers.—InternetArchiveBot (Report bug) 02:14, 22 July 2017 (UTC)