Talk:Naive Bayes spam filtering

  (Redirected from Talk:Bayesian spam filtering)
WikiProject Computing
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's quality scale.
This article has not yet received a rating on the project's importance scale.
 


Errors

This entry is all wrong! The text simply describes the Naive_Bayes_classifier applied to text classification (such an example already exists in the mentioned article). Bayesian filtering is an estimation process, generally called "Sequential Bayesian Filtering", that estimates the state of a system through observations. It relates to particle filters, Kalman filters and hidden Markov models. I strongly suggest reformatting the content (I do not have the free time to do it myself). Could someone please do a little Google search and rewrite this area? As a starting point I suggest "A survey of probabilistic models, using the Bayesian Programming methodology as a unifying framework".


I have rewritten this article to bear some relationship to reality. See the critiques above by some other clueful fellow. The main problem is that the previous version failed to recognize that what they called "Bayesian Filtering"

  • made no reference to what most of the world aside from semi-literate slashdot nerds calls Bayesian filtering
  • was not bayesian except in the most vague sense in which all inference could be considered bayesian
  • was not really filtering except under a *very* loose definition of the word (in the same sense that classification is regression to {0,1})
  • described what is more commonly called classification not filtering
  • was invented in the 60s
  • and not by paul graham

In conclusion, Paul Graham is a smart, entertaining guy, but he has a PhD in computer science and failed to acknowledge the *many* articles on spam classification using naive Bayes (not to mention many more effective methods) which predate him by *years*. The fact that his article received so much publicity is an interesting comment on the gap between academic thought and the general public. But it is the duty of an encyclopedia to bridge that gap, not widen it.

This entry could still use a lot of good work. Someone should incorporate some information on the actual methods of Bayesian filtering. I am not that someone.

PopFile

Do we really need the advert for PopFile?

POPFile isn't the archetypical Bayesian spam filter. Maybe a specific spam filtering program (such as SpamBayes or K9) should be mentioned as a filter, and POPFile cited as an example of the general classification also mentioned in the article. Drangon

I am a POPFile user so I may be biased, but I think that it is a good idea to include examples of Bayesian filters. The POPFile link was just removed with the comment "Don't need the advert at the end". I agree the text was pretty strong in saying it is one of the most popular, but providing the reader with some links is a good idea. Currently the text only mentions Thunderbird as an option. Programs such as K9, SpamBayes, and POPFile let users continue to use their current mail client. Thunderbird is a good program but it is not right for everyone. And following what Drangon said, POPFile is a good example of how Bayesian filtering can be used for more than just spam filtering. JoeChongq 04:45, 16 Dec 2004 (UTC)

Proposal for Move

I suggest that this page be renamed "bayesian spam filtering" and moved. As people have pointed out, this article is not about bayesian filters in general, but rather a specific application in spam filtering. It does have some good info/links for spam filtering, which people are generally interested in, but as an article on "Bayesian Filtering" it could be much broader/technical/better.

  • I totally agree, and have now renamed the article to "Bayesian spam filtering". --Fredrik Orderud 13:04, 2 January 2006 (UTC)

I am really new here and not yet prepared to be bold. The page "Bayesian spam filtering" should address an audience that is short on the prerequisite skills needed to learn from the other pages on, say, Boosting. Thus editors skilled in mathematics and related disciplines should, in my view, keep it simple. It should provide specific information on Bayesian filtering as used in everyday filters that are called Bayesian, even if to an expert they are naive Bayes et al.

I think the article lacks in that readers should be made aware that

  • many 'features', and not just email contents (words), may be useful input to a Bayesian algorithm.
  • 'Features' should link to a page that describes them in a more generalised manner.
  • As the page is more about spam filtering using a Bayesian filter, then to be balanced, the possibility of using other algorithms for spam filtering should be raised, e.g. SVM.
  • The top of the page should also link directly to Bayesian inference (I think); the sentence could, for instance, indicate that this is an application of the Bayesian inference technique to a particular problem. The link to Bayesian statistical methods looks too much like it is just a biography, or addresses the dry metaphysical 'what does probability mean' debate. When I scanned around trying to find out how to get to the real stuff, I thought that was a dead end in the historical section of Wikipedia.

ZuluWarrior 05:42, 28 February 2006 (UTC)

An article needs to be balanced only in reference to its scope, and this here is about Bayesian methods only. The current place for comparing spam filtering techniques is Stopping_e-mail_abuse#Examination_of_anti-spam_methods, for Bayes see Stopping_e-mail_abuse#Statistical_filtering.--84.188.179.95 10:05, 27 June 2006 (UTC)

log filter

In order to understand how it works when filtering a log, I need a code example. Do you have any?

Pronunciation?

Anybody know how this is typically pronounced? Is it BAY-zee-an, or bay-EEZ-yun or by-EEZ-yun, or what?

I work with programmers from the U.S. (both coasts), France, India, and China. They all pronounce it bay-EE-sian. Don't get me started on the French pronouncing SQL as "squirrel". Kainaw 18:05, 21 Apr 2005 (UTC)
This is late to the party, but all the statistics types that I know pronounce it "BAY-zee-an" or "BAY-zhun", as it's named after "Bayes", pronounced "bays", not "BAY-ess", "BUY-ez", or other alternatives. 136.251.12.63 18:28, 13 January 2006 (UTC)

I agree with the last comment. (Keep in mind that I don't have any experience, but the last presents an important point. Plus, it makes the most sense.) Posted by SimpleBeep

Correct, it is pronounced "Bays-e-en", after Rev. Thomas Bayes, who wrote the paper that turned probability science upside down in 1763. I used to pronounce it wrong as well and did nearly two hours of research on this. - Onexdata 11/20/2006

I reformatted the phonetic respelling of the word to Wikipedia standard. - Fenwayguy (talk) 21:43, 5 August 2008 (UTC)

Formula

Is the formula correct? It appears to be ((words_in_spam/spam_count)*(spam_count/total_count))/(words_in_total/total_count). If that is so, then this reduces to words_in_spam/words_in_total. Bayes would have noticed that easily, so there must be a reason for a more complex formula. If so, it is not mentioned in the article. Kainaw 18:05, 21 Apr 2005 (UTC)
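
For reference, the cancellation described above can be written out step by step. This is only a sketch of the algebra, assuming every probability is estimated as a plain count ratio; the symbols a (spam messages containing the words), n_s (spam count), n (total count) and d (all messages containing the words) are introduced here for illustration and do not come from the article.

```latex
\Pr(\text{spam}\mid\text{words})
  = \frac{\dfrac{a}{n_s}\cdot\dfrac{n_s}{n}}{\dfrac{d}{n}}
  = \frac{a/n}{d/n}
  = \frac{a}{d}
```

Under that assumption the reduction is expected: the posterior is simply the fraction of messages containing the words that are spam. The cancellation alone does not show whether the formula is the one spam filters actually use.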

To be Bayesian, the formula ought to be read as follows: the probability of a message being spam given certain words equals the prior probability of a message being spam, times the probability of those words given that the message is spam, all divided by the probability of those words. There is no simple reduction. Remember that the | is meant to be read as "given". **This may or may not relate to actual Bayesian filtering, but this is how typical Bayesian formulations in statistics and induction work**
Yes, but this "formula" is nevertheless unrelated to the way spam filtering software works. Its author probably tried to substitute the values directly into Bayes' theorem, but that is not how it works; it's slightly more complex than that. This "formula" has to be removed.

The "Bayesian" here is because spam filters are typically Naive Bayes classifiers, as noted below. As such, the formula should indicate that we make the naive, incorrect, but often useful assumption that the words all occur independently, given that the message is spam (or not). That is, the final formula that spam filters use has a numerator something like this:

P(word1|spam) P(word2 | spam) .... P(spam)

If we don't make this assumption, we can't use individual word statistics, we need statistics for the entire set of words making up the message. johndburger 01:59, 24 April 2006 (UTC)
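
A minimal Python sketch of the numerator described above, normalized against the matching ham term so that it yields a posterior. The function name, the per-word probability tables and the example numbers are all hypothetical, chosen only to illustrate the independence assumption; they are not taken from any real filter.

```python
from math import exp, log


def naive_bayes_posterior(words, p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Pr(spam | words) under the naive assumption that words are
    independent given the class. Works in log space to avoid underflow."""
    log_spam = log(p_spam) + sum(log(p_word_given_spam[w]) for w in words)
    log_ham = log(1.0 - p_spam) + sum(log(p_word_given_ham[w]) for w in words)
    # Subtract the larger log before exponentiating for numerical stability.
    m = max(log_spam, log_ham)
    spam_term, ham_term = exp(log_spam - m), exp(log_ham - m)
    return spam_term / (spam_term + ham_term)


# Hypothetical per-word statistics, for illustration only.
P_SPAM = {"replica": 0.2, "watches": 0.1}
P_HAM = {"replica": 0.01, "watches": 0.02}

posterior = naive_bayes_posterior(["replica", "watches"], P_SPAM, P_HAM)
print(round(posterior, 4))
```

Only individual word statistics are needed here, which is the practical benefit of the independence assumption.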

There are other ways of combining probabilities than the "naive" approach. One of them is to use the inverse of the chi-square function. I think that the two applications of Bayes' formula should be clearly distinguished. The first application gives the way to compute the probability that a message is spam, knowing that a given word is in the message. The second application is one of the methods of combining the individual probabilities. I think that software using Bayes to compute the individual probabilities should already be qualified as Bayesian, even if it does not use the naive approach for the second part.
In the numerator, it's P(spam|wordN) that is used, not the reverse way ;-). P(spam) term can disappear with some assumptions.
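
One well-known combining scheme of the chi-square kind is Fisher's method, in the form usually attributed to Gary Robinson's spam-filtering work. Whether this is exactly what the comment above has in mind is an assumption. The sketch below uses hypothetical function names, requires every per-word probability to lie strictly between 0 and 1, and is not a copy of any particular program's code.

```python
from math import exp, log


def chi2q(x2, dof):
    """Probability that a chi-square variable with `dof` degrees of freedom
    exceeds x2. This closed form is only valid for even `dof`."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_combine(probs):
    """Combine per-word spam probabilities (each strictly in (0, 1))
    into a single score in [0, 1]; 0.5 means 'no evidence either way'."""
    n = len(probs)
    spam_evidence = 1.0 - chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0


print(fisher_combine([0.99, 0.95, 0.9]))  # strongly spammy words
print(fisher_combine([0.5, 0.5]))         # uninformative words
```

Unlike the naive product, this score tends toward 0.5 when the evidence for spam and for ham is equally strong.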

Revert

BoredAndriod's revision:

Bayesian filtering is the process of using Bayesian statistics to attempt to remove noise from a corrupted signal.

Bayesian statistics is a paradigm of statistics named for the Rev. Thomas Bayes. It treats probabilities as estimates of uncertainty and derives methods of statistical inference from decision theory. Bayesian statistics is a reaction to frequentist statistics, which interprets probabilities as the limiting frequency of events upon infinite repetition of some experiment.

The term Bayesian filtering has lately been used to refer to the naive Bayes algorithm which was invented in the 1960s, but recently made popular by the web posting A Plan for Spam by Paul Graham. Presumably Graham chose to refer to the naive Bayes classifier as "Bayesian Filtering" due to its use of Bayes' theorem. This is somewhat confusing, however, as neither Bayes' theorem nor the naive Bayes classifier is necessarily Bayesian. Both are direct applications of probability theory with no interpretation of what the probabilities mean.

I am reverting the article because a lot of material was taken out, unnecessarily it would seem, and without any explanation or discussion; in short, without a word. This is very bad form, but we can see which changes from this revision we want to add back in. --ShaunMacPherson 13:24, 5 Jun 2005 (UTC)
I began to address the above problem (that the content of this page should be completely different) at User:Samohyl_Jan/Bayesian_filtering. Anyone is invited to help. Samohyl Jan 08:36, 15 November 2005 (UTC)

Useful None the less

My brief reading of the article explained why most of my 'vi-agra' spam contains approximately 300 words in a 'story'. It was therefore useful. I suggest keeping it 'simple, stupid' but correcting any errors in references and math. The in-depth statistics and programming can be handled via links as necessary.

This article needs more technical content. It should be clear enough that one could implement a mostly working Bayesian spam filter based on the information in this article and no other sources.

External Links

I agree that the SpamBully pointer should go, but I put back the SpamBayes pointer. It's not a product, it's an open source project. johndburger 15:04, 4 May 2006 (UTC)

The spam reduction tools are in the article Stopping e-mail abuse. Besides, there already is an article about SpamBayes. The external link is not needed. --Sbluen 22:54, 4 May 2006 (UTC)

Okay, I buy that. johndburger 22:56, 4 May 2006 (UTC)

Advantages and Disadvantages

There is a lengthy section on advantages. Are there any disadvantages? Shouldn't there be a section on the disadvantages? The only one I can think of is the process requires "training" the filter for the filter to work well, right? ~a (usertalkcontribs) 16:35, 16 August 2006 (UTC)

Paul Graham's formula

Perhaps it would be pertinent to mention that Graham's formulas are incorrect, in that they don't properly use Bayes' theorem in spite of acting decently as a spam filter?

-Nathan J. Yoder 22:43, 1 October 2007 (UTC)


Yes, they are incorrect. The correct formula is the "spamicity" or "spaminess" one, without the funny 2 factor.
But the article does not say his formulas are correct. It only acknowledges the historical role of his initial article, which has done a lot to spread Bayesian filtering software.

ifile

I added the part about Jason Rennie's ifile program that was recently deleted. It certainly seems noteworthy to include the first apparent program to implement a Bayesian algorithm for mail filtering. If anybody can source more info on Rennie or this program I think it would help the article. Jkraybill (talk) 15:45, 26 June 2008 (UTC)

I reformatted the word "ifile" to Wikipedia's style standard for software titles, which is plain text. - Fenwayguy (talk) 21:43, 5 August 2008 (UTC)

Wrong formula, proposal for rewriting

The formula presented in this article is simply wrong. The formulas used in Bayesian filtering software derive from Bayes' theorem, but not that immediately. As a matter of fact, Bayes' formula is used TWICE in Bayesian spam filtering: 1 - a first time to compute the probability that a message containing a given word is spam; 2 - a second time to combine the probabilities from every word.

I have edited the article to make the maths correct. Now Paul Graham's formula is mentioned (and it IS correct and can be deduced from Bayes' theorem), and the methods for combining probabilities are spelled out.

For the sake of conciseness, I have not detailed the exact demonstrations of the formulas presented here, but believe me, they are quite straightforward. The Paul Graham formula, for example, is obtained considering that a message is either ham or spam, P(H U S) = 1.0, that's why an addition sign + appears. -—Preceding unsigned comment added by 213.41.243.33 (talkcontribs)

Thank you for your contributions. On Wikipedia, we prefer to respond on article talk pages, rather than on personal user talk pages.
OK, sorry.


I have reviewed the formulas and the original that was on the page is correct, just in a simplified form (P(words|spam) + P(words|ham) = P(words)), so I have added it back.
I am sorry, but this "original formula" \Pr(\mathrm{spam}|\mathrm{words}) = \frac{\Pr(\mathrm{words}|\mathrm{spam})\Pr(\mathrm{spam})}{\Pr(\mathrm{words})} is not used in bayesian filtering software.
One of its biggest problems is using the "words" (plural) event, while bayesian filtering software concentrates first on one word, then takes into account several words, in two very distinct phases.
Here are the reasons why no one would certainly ever use this "simplified formula":
This "simplified formula" is asymmetrical: it privileges spam, and does not take into account the information we have about ham. This information is at least as important as the information on spam. If your best friend sends you a joke about viagra, for example, his message should not be classified as spam, despite the presence of "viagra", because we know it comes from our best friend, whose name is an indicator of ham.
Pr(words|spam) would be the probability that "replica" AND "watches" AND "home" AND all other words appear in an incoming message, which would need a HUGE learning base to be accurate.
In real life, both numerator and denominator would be close or equal to zero. This expression would be very close, if not identical, to what is called in maths the "indeterminate form" zero divided by zero. Asking for trouble...
Oh, but perhaps the vague expression "words" meant an OR, not an AND? In that case, you would run into other problems because you have too little information. P(words|spam) and P(words) would be almost equal, and the whole formula would collapse to P(spam). The probability of this message being spam would be approximated by the general probability of any message being spam, regardless of the contents of the message. There's no way to make an accurate decision from such information.
To summarize, this formula is a very naive attempt to apply Bayes' theorem directly, while the problem is way more complex. It can not even be regarded as a "starting point", because the demonstrations of the two real formulas are not deduced from it.
Speaking about sources, I would love to see a reference to a serious article introducing this "original formula"...


It's preferred that we link to the naive bayes classifier page in the first paragraph, because it's more specific.
But wrong. The spam filtering software described in the Linux Journal article is not a naive Bayesian classifier but is Bayesian spam filtering software. As a matter of fact, the reference to naive classifiers is too specific and only addresses a subset of Bayesian software.
Please notice that I have mentioned the "naive Bayesian classifier" later in my contribution, for the subset of software that makes assumptions about the probabilistic independence of words.


I have checked Paul Graham's A Plan For Spam, as well as multiple sources from other authors of spam filters (some listed under external links), and everything points to his formula being incorrect. If you disagree, please find a source that states that, because I'm not about to go against multiple, independent sources saying he is wrong.
The spamicity formula is correct, is used successfully in many programs, and I can prove it mathematically.
Original article from Paul Graham uses a slightly different formula, and yes, that one is flawed.
I suggest that, from now on, we focus on the spamicity formula instead of Paul Graham's formula. It's spamicity that I am speaking about in my contribution, and it's also spamicity that is used in the software I know about.


Here are my sources:
My first source is [1] and is already mentioned in the article.
My second source is in the Wikipedia itself, under the section "Alternative forms of Bayes' formula" in Bayes' theorem and demonstrates the first formula, from which we deduce the spamicity formula. I have added it to the article.
My third source is [2], and demonstrates the second formula about combining probabilities. I have added it now as a reference. It is worth noticing that this last source is a general article of applied mathematics, and that it has nothing to do with spam or learning procedures, and that fact makes it very general (and beautiful, if I may).
Tim Peters' post might be used for a demonstration of the second formula too, although I prefer the other demonstration, for it makes very evident the high number of assumptions that are made (it's not only a matter of independent events).


Other software has to implement a special version to correspond to Graham's because it is wrong.
Yes. The software, at least the one I examined, uses spamicity, and that is correct.
But, please, let's focus on my contribution, instead of Paul Graham's original article.


In the numerator Graham only uses P(word|spam) and doesn't include the second operand P(spam). Furthermore, he multiplies P(word|ham) by 2. Those two violate Bayes' theorem. -Nathan J. Yoder (talk) 22:06, 5 January 2009 (UTC)
Please look at the demonstration of the first formula under Bayes'_theorem#Alternative_forms_of_Bayes.27_theorem for the correct formula, from which we define spamicity.
P(Spam) factor disappears in the spamicity formula because it is assumed to be 0.5, and later this 0.5 value disappears when the fraction is simplified. I have explained the reason why this assumption can be made in my contribution.
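
The simplification described above can be written out as a short derivation. This is a sketch only; the symbols S (message is spam), H (message is ham) and W (message contains the word) are notation chosen here for illustration, and the equal-prior assumption is the one stated in the comment above.

```latex
\Pr(S \mid W)
  = \frac{\Pr(W \mid S)\,\Pr(S)}
         {\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}
\qquad\text{and, with } \Pr(S) = \Pr(H) = \tfrac{1}{2}:\qquad
\Pr(S \mid W)
  = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}
```

The factor 1/2 appears in the numerator and in both terms of the denominator, so it cancels; no factor of 2 is left anywhere in the result.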


For the factor 2 you are right, and that makes me wrong when I said that Paul Graham's formula was correct. I thank you for opening my eyes on this, I was assuming wrongly that Paul Graham and spamicity formula were the same.
You won't find this (wrong) factor 2 in my contribution. This number probably comes, as suggested in Tim Peters' post, from a need to compensate for a biased set of learned messages. I have added a warning against that problem in my contribution.


If you still have doubts about my contribution, I propose we solve this in a civilized way instead of undoing each other's work. Please write to me at the email address I gave you, and we will discuss all the details.
Please do not respond within someone's response. The practice on Wikipedia is to put your entire response after theirs, rather than what is sometimes done in email. Consider making an account; it is free and easy, and you can sign your responses with "~~~~" which will automatically insert your username. Changes on Wikipedia are discussed on Wikipedia itself, so that everyone can participate and understand what's going on, which is why we avoid personal email communication.
For the numerator, you use an assumption that isn't justified except when you don't have enough information for what P(spam) is. Many people have uneven distributions of spam vs. ham, so it is frequently not 50/50. It's true that you could feed just enough data to make it 50/50 (otherwise it's distorted), but you'd be ignoring extra useful data and there's no reason to assume that an implementation will necessarily do this. How many actually do this other than Graham's, anyway?
There is an important reason to cover this, as he basically started the whole trend and his A Plan for Spam is often cited, so it is important to point out that his is not technically correct.
As I explained before, P(words) = P(words|spam) + P(words|ham), so it does account for ham. P(words|X) is also the combined probability using the same formula from the naive Bayes classifier page, so the two steps are combined. I don't know why you think P(words|spam) ~= P(words), as that assumes P(words|ham) ~= 0. It could possibly be rephrased or clarified, but it is correct as I've defined it. Since you've presented it in a way that is pretty clear, I'll just leave it at that to avoid confusion.
As for whether or not to call these all naive Bayes classifiers, the one model you describe from the Linux Journal is a hybrid of Bayes and something else. Whether or not to include it in this article would depend on how people tend to describe its algorithm. If people generally call it a Bayes spam filtering algorithm, I'd be inclined to include it in the article. Although if we do that, it should be clarified that the implemented methods are usually either naive Bayes, a naive Bayes variation, or a hybrid of Bayes and other principles. As far as I'm aware, most are some form of naive Bayes. -Nathan J. Yoder (talk) 05:57, 10 January 2009 (UTC)
"Please do not respond within someone's response." => ok, I apologize if I don't know Wikipedia's social codes.
"Consider making an account" => OK. I chose the name "MathsPoetry", for there is beauty in the maths, and maths in the rhymes ;-).
About why personal email should be avoided => okay, that makes sense.
For the assumption that P(spam) = 0.5, there's a lot of literature explaining that the "coefficients" in the alternative form of Bayes' formula can be regarded as the degrees of confidence you put into contradictory hypotheses, rather than real probabilities (by the way, this problem is far more general than just the spam filter context; it is one of the big areas of debate about Bayes' formula). Choosing 50 % for this value, despite the evident fact that in reality we nowadays have something like 90 % incoming spam, is simply a way to avoid having a very suspicious filter that would think that any incoming message is very likely to be spam. This assumption is an artificial way to have a filter that is biased towards neither spam nor ham: a neutral filter. Please consider that the alternative form of Bayes' formula is similar to a weighted average, and perhaps that will become clear to you. In my contribution, I did not try to hide the fact that this assumption is a bit hard to swallow. The fact is that most spam filtering software makes this assumption; I am just trying to explain why. And yes, the very same explanations are made in the cited sources too, I did not invent them.
About why this "simplified formula" is just fantasy: before we go any further in this discussion, I would like you to define precisely the "words" event. Is it " 'replica' and 'watches' and 'home' and ... are in the message", or is it " 'replica' or 'watches' or 'home' are in the message" ? "words" event is just too vague.
You still don't cite your sources for this "simplified formula". I strongly suspect it has just been invented by one of Wikipedia's authors. Before we go any further, I'd like you to cite your sources. I did cite mine when you requested it.
About the hybrid methods: it IS clear in my contribution that some software is hybrid. I acknowledge that it is arguable whether such software is really "Bayesian" or not, which is why I used the word "might" in my contribution. I also made very clear that such software is NOT a naive Bayesian classifier. I tried to give a definition of Bayesian filters that leaves room for software using Bayes for phase 1, but not for phase 2. My understanding of being "Bayesian" is that you use Bayes' theorem, not that you make the so-called "naive" assumption about independence of events. This article isn't called "naive Bayesian spam filtering", after all... But yes, this is a matter of definitions, and a definition is nothing more than a conventional choice.
About your idea of mentioning Paul Graham's mistake in the article: do it if you want, but remember that we are speaking about a living person, with feelings. I don't think it's a positive attitude to mention other people's errors. Everyone makes mistakes; should they be marked forever with the seal of infamy because of these mistakes, or do we try to take the best of everyone and mention only the good things?
I apparently did not make my point why the "original formula" is unusable. OK, let's make it less abstract and do it with an example. After all, a formula isn't made to be contemplated, but to compute a resulting value, right? Here we go, starting with the learned messages:
Message 1 : "A rolex is what you need" - marked by user as "spam"
Message 2 : "Cheap rolexes for you to look great" - marked by the user as "spam"
Message 3 : "Hello John, let's a have a beer tonight" - marked by the user as "ham"
Message 4 : "Hello, I am the widow of a an African dictator and I'll give you two millions dollars" - marked by the user as "spam"
Message 5 : "A replica watch makes you look cool" - marked by the user as "spam"
Message 6 : "I'm sorry John, but I won't make it for the beer tonight" - marked by the user as "ham"
Suspected message : "We have cheap replica rolexes for you"
For the sake of simplicity, let's assume that "we", "a", "have", "for", "to", and other similar words are considered "neutral" and ignored.
Assuming "words" implies a "and", the "simplified formula" says : P(spam|"cheap" and "replica" and "rolexes") = P("cheap" and "replica" and "rolexes"|spam) x P(spam) / P("cheap" and "replica" and "rolexes").
According to the learned base, P(spam) = 4 / 6 = 0.67.
There is no spam message in the learning base that contains "cheap" and "replica" and "rolexes". P("cheap" and "replica" and "rolexes"|spam) = 0 / 4 = 0.
There is also no message at all in the learning base that contains "cheap" and "replica" and "rolexes". P("cheap" and "replica" and "rolexes") = 0 / 6 = 0.
Your formula evaluates to 0 x 0.67 / 0. How much is zero divided by zero?
With a bigger learning base, there's a chance that the suspected message was already received and marked manually as spam. In such a case, the "original formula" does not evaluate to 0 / 0, but to (a-very-small-number) / (another-very-small-number). However, there is a lot of inaccuracy implied by such a computation.
And even with a terabyte-sized learning base, there's a non-negligible chance that this formula still evaluates to 0 / 0: it is enough that the spam message is new spam.
The basic problem is that with such a formula, all your filter is able to do is recognize whether it has already received the suspected message, with the same words, and return the manual classification you gave that message at learning time.
If you use an "or" instead of an "and" to define "words", it won't give you any better results. Do I have to prove that too?
Also please notice that the information about ham was not used, as I said in a previous comment. If the suspected message was "John, we have cheap replica rolexes for you", the information that "John" is in all learned ham messages would not have been taken into account.
Try the two formulas I gave in my contribution on the same data set, and you will see they work a lot better ;-).
MathsPoetry (talk) 18:17, 11 January 2009 (UTC)
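
The worked example above can be replayed in a few lines of Python. This is a sketch under stated assumptions rather than anyone's actual filter: the function names are hypothetical, the per-word formula assumes P(spam) = 0.5, the combining step is the naive one, and per-word values are clamped to [0.01, 0.99] (a common practical trick, assumed here so that the combination never becomes 0 / 0).

```python
import re
from math import prod

# The six learned messages from the example above, copied verbatim.
TRAINING = [
    ("A rolex is what you need", "spam"),
    ("Cheap rolexes for you to look great", "spam"),
    ("Hello John, let's a have a beer tonight", "ham"),
    ("Hello, I am the widow of a an African dictator and I'll give you "
     "two millions dollars", "spam"),
    ("A replica watch makes you look cool", "spam"),
    ("I'm sorry John, but I won't make it for the beer tonight", "ham"),
]


def tokens(text):
    """Lower-cased set of words in a message."""
    return set(re.findall(r"[a-z']+", text.lower()))


SPAM = [tokens(t) for t, label in TRAINING if label == "spam"]
HAM = [tokens(t) for t, label in TRAINING if label == "ham"]


def joint_formula(words):
    """Direct Bayes on the event 'all words present'; None means 0 / 0."""
    total = len(SPAM) + len(HAM)
    p_spam = len(SPAM) / total
    p_words_given_spam = sum(words <= m for m in SPAM) / len(SPAM)
    p_words = sum(words <= m for m in SPAM + HAM) / total
    if p_words == 0:
        return None
    return p_words_given_spam * p_spam / p_words


def spamicity(word, lo=0.01, hi=0.99):
    """Per-word Pr(spam | word) with equal priors, clamped (assumption)."""
    in_spam = sum(word in m for m in SPAM) / len(SPAM)
    in_ham = sum(word in m for m in HAM) / len(HAM)
    if in_spam + in_ham == 0:
        return 0.5  # never seen: treat as neutral
    return min(hi, max(lo, in_spam / (in_spam + in_ham)))


def combine(probs):
    """Naive combination of per-word probabilities."""
    num = prod(probs)
    return num / (num + prod(1.0 - p for p in probs))


suspect = {"cheap", "replica", "rolexes"}
print(joint_formula(suspect))                         # None: 0 / 0
print(combine([spamicity(w) for w in suspect]))       # close to 1
print(combine([spamicity(w) for w in suspect | {"john"}]))  # slightly lower
```

With "john" added, the score drops slightly, which shows the ham information being used; on this tiny learning base three spam words still outweigh one ham word.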

Oh, and if you are wondering why the "simplified formula" is wrong, despite looking good, I have two explanations.

The first one is that a precondition of Bayes' theorem is that the denominator P(B) must not vanish. Here, if "words" is defined as '"replica" AND "rolexes" AND "home" AND ...', P(words) vanishes. Therefore you can't use Bayes.

The second reason why it does not work is that Bayes' formula is very general, and you can take any event for A and B. That does not mean that A or B is relevant to your problem. MathsPoetry (talk) 10:09, 12 January 2009 (UTC)

I think that you were right that the generally accepted definition for bayesian spam filtering is that it applies to naive classifiers. I have read again with attention the Linux Journal article, and they describe their "mixed method" only as "statistical", not as "bayesian". I'll fix the article accordingly and I apologize for having expressed my feelings on this point too strongly. MathsPoetry (talk) 12:40, 21 January 2009 (UTC)
Sorry for the slow response, I had been doing other things and forgot about this. P(words) is supposed to be the probability of any of the words occurring (OR), but you're right, that doesn't make much sense mathematically. I don't know who added that formula, but I think they just made it up. Regarding Paul Graham, it's not about insulting him or anything, it's just about informing the reader that this particular popular implementation has a flaw in it. Welcome to Wikipedia!  :) -70.21.21.117 (talk) 04:24, 13 February 2009 (UTC)
For the record, if you use P(word) instead of P(words), this "simplified formula" becomes relevant (and is a starting point for the well-known spamicity formula). All the problem was in one small "s" ...! Funnily enough, some translations of the English article, for example the Portuguese one, detected the problem and used the singular in the formula.
If you use "or", the filter will return the information "have I already met one of these words in a spam message?", which is way too vague (just the opposite of using an "and", where it becomes way too precise).
The conclusion is that, to move from one single word to many words, one needs to apply Bayes again. One single formula doesn't carry enough information; we are facing a whole Bayesian network, where Bayes needs to be applied at each node. Perhaps that could be made clear in the article with a diagram. I like diagrams, they make reading a lot less abstract.
For Paul Graham, add this if you feel it's necessary...
Thanks for your welcoming words! This was my first Wikipedia contribution. This article has taught me to be less categorical: one obvious mistake does not mean that everything was wrong ;-). Thanks to you too for all your remarks! MathsPoetry (talk) 07:14, 13 February 2009 (UTC)

Likelihoods not Posteriors

I am confused by your way of combining probabilities. You suggest that p(spam) = \frac{p(s|w1) p(s|w2)}{p(s|w1) p(s|w2) + (1 - p(s|w1))(1 - p(s|w2))}. This is the independent Bayes formula with no prior information. How have you used posterior probabilities in place of priors? Does p(s|w1) = p(w1|s)? You seem to be suggesting this. —Preceding unsigned comment added by 80.156.46.186 (talk) 10:47, 18 August 2010 (UTC)

To have a better chance of being read, please place your question at the end, and sign it.
A derivation of the formula used to combine probabilities can be read here: [3]. Mathematically, it's rather "dirty cooking": a long derivation that makes a lot of assumptions.
Here, we do not confuse priors with posteriors, i.e. we do not say that p(s|w1) = p(w1|s). The formula that computes the "probability that the message is spam knowing it contains replica" from the "probability that the message contains replica knowing it is spam" is easier to derive and is detailed in a previous section, the one dedicated to the "general formula". Perhaps you missed that section?
I hope that helps. --MathsPoetry (talk) 14:31, 14 July 2011 (UTC)
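For readers following this thread, here is a minimal sketch of why the combining formula needs no explicit prior term. It is not the derivation linked at [3]. It assumes equal priors P(S) = P(H) = 1/2 and conditional independence of the words given the class (the naive assumption), and it writes p_i for p(s|w_i); these symbols are assumptions made for this sketch.

```latex
% Assumption: equal priors, so each per-word posterior is
p_i = P(S \mid w_i) = \frac{P(w_i \mid S)}{P(w_i \mid S) + P(w_i \mid H)}
\quad\Longrightarrow\quad
1 - p_i = \frac{P(w_i \mid H)}{P(w_i \mid S) + P(w_i \mid H)}

% Naive Bayes posterior for N words, still with equal priors:
P(S \mid w_1, \dots, w_N)
  = \frac{\prod_{i=1}^{N} P(w_i \mid S)}
         {\prod_{i=1}^{N} P(w_i \mid S) + \prod_{i=1}^{N} P(w_i \mid H)}

% Divide numerator and denominator by \prod_i \bigl(P(w_i \mid S) + P(w_i \mid H)\bigr):
P(S \mid w_1, \dots, w_N)
  = \frac{\prod_{i=1}^{N} p_i}
         {\prod_{i=1}^{N} p_i + \prod_{i=1}^{N} (1 - p_i)}
```

Under those assumptions the likelihoods are already folded into each p_i and the equal priors cancel, so the combined formula can be written with posteriors only, without claiming that p(s|w1) = p(w1|s).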

How to join the different probabilities?

how is the total spam probability calculated? for example, if pr(spam | "viagra") = 0.9 and pr(spam | "hello") = 0.2 , how is the pr(spam | {"viagra", "hello"} ) calculated? —Preceding unsigned comment added by 89.180.43.219 (talk) 11:52, 17 August 2008 (UTC)

You're very right. It's computed with pr(spam) = 0.9 x 0.2 / (0.9 x 0.2 + 0.1 x 0.8) = 0.69, if the bayesian filtering software uses the naive approach. I have added the combining formulas to the maths section.
P.S. pr(spam | "hello") is very likely to be near 0.5, not 0.2, because "hello" will probably appear in many spams too. "hello" is in no way a word characteristic of ham ;-). Most bayesian software will simply ignore it. —Preceding unsigned comment added by 213.41.243.33 (talk) 21:36, 8 January 2009 (UTC)
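The arithmetic in the reply above can be checked with a short Python sketch. The function name `combine` is made up for this illustration; it only implements the naive combining formula p = p1 p2 ... pN / (p1 p2 ... pN + (1 - p1)(1 - p2) ... (1 - pN)) and is not taken from any particular filter.

```python
from math import prod

def combine(probs):
    """Combine per-word spam probabilities with the naive formula
    p = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    spam_side = prod(probs)
    ham_side = prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)

# Values from the question: pr(spam|"viagra") = 0.9, pr(spam|"hello") = 0.2
print(round(combine([0.9, 0.2]), 2))   # 0.69

# A neutral word (probability 0.5) does not change the result
print(round(combine([0.9, 0.5]), 2))   # 0.9
```

The second call illustrates the P.S.: a word whose probability is 0.5 contributes the same factor to both sides, so ignoring it gives the same answer.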

Shouldn't the formula for multiple words be Pr(Spam|Words) instead of just Pr(Spam)? In the single word calculation the function Pr(Spam) is calculated by {Amount of Spam mail}/{Amount of mail (Spam+Ham)}. For using multiple words, Pr(Spam) should still mean the probability of a spam mail, compared to ham. Markovisch (talk) 11:09, 19 December 2011 (UTC)

I do not understand your question.
If you refer to the paragraph "combining the individual probabilities", the "formula for multiple words" is currently not written as "Pr(Spam)", it is written as "p". "p" is the probability that the message is spam, knowing its content (and hence the words it contains).
If I interpret your question loosely, I suspect that you are trying to apply the formula for a single word to the case with many words. If that is really what you intend to do, don't: it just does not work. The original Wikipedia article described something like that, and it was plain wrong (and did not match any existing external source, btw). I know that some other language editions just copied the English version of that time, so we may still be suffering from that old mess. --MathsPoetry (talk) 21:12, 19 December 2011 (UTC)
I've analysed the whole formula, and my conclusion is that it only works when the spam and ham corpora contain the same number of e-mails. The wiki page doesn't say that the formula only works under this assumption; shouldn't this be added?
To make the formula valid when the volumes of spam and ham differ, a small change is needed: p = \frac{s \cdot h^N \cdot p_1 p_2 \cdots p_N}{s \cdot h^N \cdot p_1 p_2 \cdots p_N + s^N \cdot h \cdot (1 - p_1)(1 - p_2) \cdots (1 - p_N)} where 's' is the number of spam e-mails and 'h' is the number of ham e-mails.
--Markovisch (talk) 12:28, 30 December 2011 (UTC)
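To make the proposal concrete, here is a small Python sketch that evaluates the formula exactly as written in the comment above. It is the commenter's own proposal, not a sourced result; the function name `combine_with_volumes` and the sample numbers are made up for this illustration.

```python
from math import prod

def combine_with_volumes(probs, s, h):
    """Formula as written in the comment above:
    p = s*h^N*prod(p_i) / (s*h^N*prod(p_i) + s^N*h*prod(1 - p_i)),
    with s spam e-mails, h ham e-mails and N = len(probs)."""
    n = len(probs)
    spam_side = s * h**n * prod(probs)
    ham_side = s**n * h * prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)

# Equal volumes: the s and h factors cancel, giving the usual combined value
print(round(combine_with_volumes([0.9, 0.2], s=100, h=100), 4))  # 0.6923

# Unequal volumes (illustrative numbers) shift the result
print(round(combine_with_volumes([0.9, 0.2], s=100, h=300), 4))  # 0.871
```

With s = h the extra factors are identical on both sides of the denominator, so the expression reduces to the unweighted combining formula.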
The article used to contain a sentence saying that the spam corpus and ham corpus should be of the same size. However, it had no external source, and after spending hours searching for one I had to admit that I could not find any, so I finally removed it. If you have an external source for this information, please re-add it with a reference to your source, and thank you in advance. If you don't have a source, I'm afraid you'll have to abstain.
"I've analysed the whole formula..." The problem is that Wikipedia is not the place for original research. So please don't put this information in the article (another reason for not adding it, apart from these methodology reasons, is that AFAIK no existing software uses your formula). I'm sorry if this sounds rigid and dumb, but this is just the rule: no original research on Wikipedia. But you are still free to present your research to one of the open source antispam software projects, or to an academic audience in a university paper.
Best wishes for the New Year. --MathsPoetry (talk) 15:56, 1 January 2012 (UTC)
I am currently working on my graduate internship, in which I'm going to make a small variant of the Bayesian spam filter. It involves a database of digital sources of crimes, which is used by forensic investigators. This database holds all kinds of digital sources, like photographs, videos, chat logs and also e-mails. I'm making an application that separates the important from the non-important e-mails for the investigation using the spam filter mechanics. Investigators mark each mail as important/non-important (spam/ham), which provides the database of word probabilities. In these investigations one can never assume that the important and non-important (spam/ham) mails have the same volume. I used two other sources and analysed their formulas. Both formulas include a weighting for the case where spam and ham have different volumes.
http://www.w2cinformatica.nl/compcrim/filter.html (Dutch website).
Used formula: p(spam | 'aanbieding' \cap '!' \cap 'twee' \cap 'boeken') = \frac{x}{x+y}
With x = p('aanbieding'|spam)*p('!'|spam)*p('twee'|spam)*p('boeken'|spam)*p(spam)
With y = p('aanbieding'|ham)*p('!'|ham)*p('twee'|ham)*p('boeken'|ham)*p(ham)
http://en.wikipedia.org/wiki/Bayesian_inference
Here I used the formula for 'Multiple Observations'
I haven't yet found any resource that says whether or not spam and ham must have the same volume. Happy New Year -- Markovisch (talk) 14:51, 3 January 2012 (UTC)
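The x / (x + y) formula quoted above from the Dutch page can be sketched in Python as follows. The word likelihoods and the prior used here are invented for illustration only (they do not come from that page), and `spam_posterior` is a hypothetical helper name.

```python
from math import prod

def spam_posterior(words, p_word_given_spam, p_word_given_ham, p_spam):
    """Return x / (x + y), where
    x = prod(p(word|spam)) * p(spam) and y = prod(p(word|ham)) * p(ham)."""
    x = prod(p_word_given_spam[w] for w in words) * p_spam
    y = prod(p_word_given_ham[w] for w in words) * (1.0 - p_spam)
    return x / (x + y)

words = ["aanbieding", "!", "twee", "boeken"]
# Made-up likelihoods, for illustration only
given_spam = {"aanbieding": 0.4, "!": 0.5, "twee": 0.1, "boeken": 0.1}
given_ham = {"aanbieding": 0.05, "!": 0.2, "twee": 0.1, "boeken": 0.2}

# Unequal volumes: 30% of the training mail is spam
print(round(spam_posterior(words, given_spam, given_ham, p_spam=0.3), 4))  # 0.8108
# Equal volumes: the priors cancel
print(round(spam_posterior(words, given_spam, given_ham, p_spam=0.5), 4))  # 0.9091
```

Because p(spam) and p(ham) appear explicitly, this form does not need the two corpora to have the same size; with p_spam = 0.5 the priors drop out.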
I'm afraid we can't include any material that would have been derived by you.
Ah, BTW, the formula you propose can be simplified by dividing the numerator and the denominator by s \cdot h.
Best of luck with your forensics research :-). --MathsPoetry (talk) 18:55, 6 January 2012 (UTC)
I think I just found a resource which claims that the amount of spam and ham should be at an equal size.
http://www.process.com/precisemail/bayesian_filtering.htm
"It’s important that you train the filter with approximately equal amounts of spam and non-spam messages, as we’re doing in this example. Many people (including several software vendors that really should know better) tend to train their Bayesian filter only on spam messages. There’s an old saying that goes something like: “If the only tool you have is a hammer, every problem looks like a nail.” With a Bayesian filter, if it’s only been trained with spam messages, every message looks like spam"
With the research I have done, I can confirm what the last sentence says. -- Markovisch (talk) 14:15, 2 February 2012 (UTC)
I just found another resource which suggests filters should be trained with equal amounts of spam and ham.
http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html
"You should aim to train with at least the same amount (or more if possible!) of ham data than spam." -- Markovisch (talk) 14:25, 2 February 2012 (UTC)
Excellent. Thanks for having found that. Now we can add a sentence to the article, and then source it. I just did it. --MathsPoetry (talk) 18:45, 2 February 2012 (UTC)

We can't add our own research, but...

The "get external sources !" recommendation of wikipedia sometimes has strange consequences. For example, it is obvious to me (and to any mathematician, I suppose), that if p equals 0 or 1, ln(p) or ln(1 - p) won't be defined. So I'd like to add a sentence saying that the "logarithms" version of the combining formula (which uses both ln(p) and ln(1 - p)) won't work when "sure ham" and "sure spam" indicators are present. But if I add such a remark, it will be my own work, which is forbidden... --MathsPoetry (talk) 19:06, 2 February 2012 (UTC)

This problem is usually solved by Lidstone smoothing. Qwertyus (talk) 15:35, 12 July 2012 (UTC)
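A minimal Python sketch of both points in this thread: the logarithmic form of the combining formula fails when some p_i is exactly 0 or 1, and additive (Lidstone) smoothing keeps each estimate strictly between 0 and 1. The helper names, the choice alpha = 1 and the counts are assumptions made for this illustration; real filters may smooth differently.

```python
import math

def lidstone(count, total, alpha=1.0, outcomes=2):
    """Lidstone estimate (count + alpha) / (total + alpha * outcomes).
    outcomes=2 assumes a word is either present in or absent from a message."""
    return (count + alpha) / (total + alpha * outcomes)

def spamicity(spam_with_word, n_spam, ham_with_word, n_ham, alpha=1.0):
    """Per-word spam probability from smoothed likelihoods, assuming equal priors."""
    p_w_spam = lidstone(spam_with_word, n_spam, alpha)
    p_w_ham = lidstone(ham_with_word, n_ham, alpha)
    return p_w_spam / (p_w_spam + p_w_ham)

def combine_log(probs):
    """Log-domain combining: eta = sum(ln(1 - p_i) - ln(p_i)), p = 1 / (1 + e^eta).
    Note: math.exp can overflow for very large eta; not guarded in this sketch."""
    eta = sum(math.log(1.0 - p) - math.log(p) for p in probs)
    return 1.0 / (1.0 + math.exp(eta))

# A word seen in 10 of 100 spams and in 0 of 100 hams
unsmoothed = (10 / 100) / (10 / 100 + 0 / 100)   # exactly 1.0, a "sure spam" word
try:
    combine_log([unsmoothed, 0.2])
except ValueError as err:
    print("log form fails:", err)                # ln(1 - 1.0) is undefined

smoothed = spamicity(10, 100, 0, 100)            # 11/12, strictly below 1
print(round(smoothed, 4))                        # 0.9167
print(round(combine_log([0.9, 0.2]), 4))         # 0.6923
```

With smoothing, a word never seen in ham still gets a probability below 1, so both ln(p) and ln(1 - p) stay defined.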

Move parts to Naive Bayes article

A lot of this article actually discusses the Naive Bayes classifier and related techniques. I suggest moving a lot of the material there to prevent confusing duplication; in fact, this article could serve as an example section (or running example) of the NBC, or maybe of document classification. Qwertyus (talk) 15:43, 12 July 2012 (UTC)

Given that no one has replied, I don't think we have consensus to do this. Op47 (talk) 22:45, 27 November 2012 (UTC)

Advantages disputed

The advantages section now states:

One of the main advantages of Bayesian spam filtering is that it can be trained on a per-user basis.

Advantage over what? The same holds true for the linear SVMs that replaced NB in serious spam filters years ago. QVVERTYVS (hm?) 13:07, 29 May 2013 (UTC)