
Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 5:33 UTC (Thu) by fyodor (guest, #3481)
Parent article: A grumpy editor's bayesian followup

Jonathan does a great job with this and his other "grumpy editor" articles, but I cannot agree with his repeated suggestion to increase the BAYES_99 rule score for "allowing the bayesian filter to condemn mail on its own". The long description for that rule is "Bayesian spam probability is 99 to 100%". In other words, you may very well see 1% or more of your legitimate mail falling into that bucket.

Of about 1,850 non-spam mails I felt were important enough to keep in February, 36 of them (2%) were BAYES_99 false positives. And I'm probably missing some that were wrongly moved to my spam folder (which gets tens of thousands of messages per day, so I never check it). The false positives include many mails from non-hacker friends and neighbors, many legitimate sales-related queries, party invitations, Nmap questions, and all sorts of other things. I'm reasonably good about training the filter when I catch mistakes, too. The more diverse your mail spool is, the tougher the filter's job will be. It may work great on Jon's mail, but be sure to test it with your own before blithely following his advice and increasing the BAYES_99 score.
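
One way to run that test is to count how often BAYES_99 shows up in the X-Spam-Status headers of mail you know to be legitimate. The sketch below assumes the header values have already been pulled out as strings (for example with Python's `email` module) and that they carry the usual `tests=` list; `rules_hit` and `rule_rate` are hypothetical helper names, not part of SpamAssassin.

```python
import re


def rules_hit(status_header):
    """Return the set of rule names listed in an X-Spam-Status header value.

    Assumes the usual layout, e.g.
    "No, score=1.2 required=5.0 tests=BAYES_99,HTML_MESSAGE autolearn=no".
    The rule list may be folded across lines, so whitespace is tolerated.
    """
    # Capture everything after "tests=" up to the next "key=" field or the end.
    match = re.search(r"tests=(.*?)(?=\s+\w+=|$)", status_header, re.DOTALL)
    if not match:
        return set()
    names = re.split(r"[,\s]+", match.group(1))
    # "none" is treated as a placeholder for an empty rule list (assumption).
    return {name for name in names if name and name != "none"}


def rule_rate(status_headers, rule="BAYES_99"):
    """Fraction of the given headers whose rule list contains `rule`."""
    headers = list(status_headers)
    if not headers:
        return 0.0
    hits = sum(1 for header in headers if rule in rules_hit(header))
    return hits / len(headers)
```

Run over a folder of known-good mail, `rule_rate` gives the false-positive rate for that rule; 36 hits in 1,850 messages works out to roughly 0.0195.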

-Fyodor



Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 10:44 UTC (Thu) by nix (subscriber, #2304)

Nonetheless, the consensus among SA developers is that the high-probability BAYES_* rules *are* scored too low. They shouldn't be 5.0, but they should be higher than the default.

The likelihood is that the perceptron is being fooled by one simple effect. Before SA is released, its regex rules are especially effective: major spammers definitely tune their messages to avoid these rules in released SA versions, but they don't tune their messages to avoid random rules in the unreleased development tree, and of course new rules in the development tree are effective or they'd never have been added at all. As such, very spammy messages in the mass-check corpus (used for perceptron training) tend to hit a lot more rules than they do after SA has been released for a time. The perceptron can't know this: it assumes that in general messages which hit BAYES_99 hit a lot of other rules too. So it adjusts the score of BAYES_99 downwards to avoid the minuscule number of FPs tripped by that rule, since the real spams are being caught by the regex rules too.

But after that SA version has been released for a time, lots of spam highly-scored by Bayes misses the regex rules (as the spammers have tuned their messages to avoid them), so BAYES_99 suddenly has to carry a lot more weight.
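
A toy calculation makes the effect concrete. The numbers below are illustrative assumptions, not the shipped defaults, and the regex rule names are hypothetical; the only fixed point is the usual required score of 5.0. The sketch just sums the scores of the rules a message hits and compares the total with the threshold.

```python
REQUIRED = 5.0  # the usual default spam threshold

# Illustrative scores only; REGEX_RULE_* are hypothetical rule names.
SCORES = {
    "BAYES_99": 3.5,
    "REGEX_RULE_A": 1.5,
    "REGEX_RULE_B": 2.0,
    "REGEX_RULE_C": 0.5,
}


def total_score(hits, scores=SCORES):
    """Sum the scores of every rule a message hit."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores=SCORES, required=REQUIRED):
    """A message is tagged once its total reaches the required score."""
    return total_score(hits, scores) >= required


# At mass-check time the spam still trips the fresh regex rules.
at_training = ["BAYES_99", "REGEX_RULE_A", "REGEX_RULE_B"]
# After release the spammer has tuned the message to dodge them.
after_release = ["BAYES_99"]

# Same message against a table where BAYES_99 is pinned higher (4.5 here).
PINNED = dict(SCORES, BAYES_99=4.5)
```

With these numbers `is_spam(at_training)` is true, while `is_spam(after_release)` is false because 3.5 alone falls short of 5.0. With `PINNED`, BAYES_99 still does not condemn a message by itself, but a single 0.5-point rule alongside it is enough.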

As a consequence, the high-scoring BAYES_ rules are no longer perceptron-scored in SA-3.2-to-be: they're nailed to specific values (a score of 4.something, I think) and the perceptron must adjust all the regex rules around them.

Let's see if this works better. :)

(disclaimer: I am not an SA dev, just an interested observer. I know some SA devs are LWN subscribers, so presumably they can say if I'm talking rubbish.)

Increasing BAYES_99 score can be very dangerous - or not

Posted Mar 2, 2006 12:54 UTC (Thu) by kitterma (guest, #4448)

I actually increase the scores for BAYES_80 and higher so that each is sufficient to condemn mail on its own, and I still see a very low false positive rate.

I use the SA whitelist facility for domains that I regularly receive important mail from and are not routinely spoofed, so they don't get caught.

For other messages, good messages generally have some other negative scoring rules to save them from the Bayesian filter.

Scott K

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 15:47 UTC (Thu) by corbet (editor, #1)

I suppose I should have emphasized that tweaking the rules always carries risks. That said, it's worth pointing out that every other bayesian filter reviewed will condemn mail with much less than a BAYES_99 level of assurance. Tweaking just BAYES_99 still results in a relatively conservative configuration.

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 16:53 UTC (Thu) by pimlott (guest, #1535)

> The long description for that rule is "Bayesian spam probability is 99 to 100%".

I strongly doubt that this description is statistically justified. I know just enough about Bayesian statistics to know that bogofilter, at least, takes numerous liberties with the math, and I suspect that spamassassin is the same. I'm sure nobody has rigorously analyzed the actual "Bayes-inspired" system that is implemented, so it's probably more honest to say that BAYES_99 is just "a high score".

Increasing BAYES_99 score can be very dangerous

Posted Mar 3, 2006 10:40 UTC (Fri) by Stephen_Beynon (✭ supporter ✭, #4090)

I disagree that an increased BAYES_99 score is dangerous. I set the BAYES_99
rule to 5 points. I have set procmail to place messages between 5 and 10
points into a junk folder which I check weekly.
I have never seen BAYES_99 produce a false positive, and I have only seen
2 false positives from spamassassin in the last year. Both of these were
at the spam-like end of mail I consider legitimate. One of these messages
was hit by the BAYES_90 rule and one by BAYES_80 (both of which are at
default weighting).
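
The routing described above can be sketched as a simple banding function. This is a Python illustration rather than the actual procmail recipe; whether each boundary is inclusive, and what happens to mail scoring 10 or more, are assumptions, and `route` is a hypothetical name.

```python
def route(score, junk_from=5.0, junk_below=10.0):
    """Pick a destination for a message from its SpamAssassin score.

    Assumed bands: below 5 goes to the inbox, 5 up to (but not including)
    10 goes to a junk folder that gets reviewed, and anything higher is
    labelled "high-score" for the reader to handle as they see fit.
    """
    if score < junk_from:
        return "inbox"
    if score < junk_below:
        return "junk"
    return "high-score"
```

Keeping the borderline band in a reviewed folder is what makes a 5-point BAYES_99 tolerable here: a false positive carried by that rule alone lands at the low end of the band, where a weekly check can catch it.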

Stephen

Increasing BAYES_99 score can be very dangerous

Posted Apr 29, 2006 21:01 UTC (Sat) by gvc (guest, #37441)

I disagree. Here's a setup I've used for about 3 years (with no updates at all) and it works very well:

http://plg.uwaterloo.ca/~gvcormac/spamassassin

Note that spamassassin's self-training is broken, so this setup force-feeds it its own judgements (which the user later corrects whenever errors are noticed).

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds