Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 10:44 UTC (Thu) by nix (subscriber, #2304)
In reply to: Increasing BAYES_99 score can be very dangerous by fyodor
Parent article: A grumpy editor's bayesian followup

Nonetheless, the consensus among SA developers is that the high-probability BAYES_* rules *are* scored too low. They shouldn't be 5.0, but they should be higher than the default.

The likelihood is that the perceptron is being fooled by one simple effect. Before SA is released, its regex rules are especially effective: major spammers definitely tune their messages to avoid the rules in released SA versions, but they don't tune them to avoid rules in the unreleased development tree, and of course new rules in the development tree are effective or they'd never have been added at all. As such, very spammy messages in the mass-check corpus (used for perceptron training) tend to hit a lot more rules than they will once SA has been released for a time. The perceptron can't know this: it assumes that messages which hit BAYES_99 generally hit a lot of other rules too. So it adjusts the score of BAYES_99 downwards to avoid the minuscule number of FPs tripped by that rule, since the real spams are being caught by the regex rules anyway.

But after that SA version has been released for a time, lots of spam scored highly by Bayes misses the regex rules (as the spammers have tuned their messages to avoid them), so BAYES_99 suddenly has to carry a lot more weight.
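
To make the arithmetic concrete, here is a rough sketch of the effect. Everything in it is an assumption for illustration: the threshold of 5.0, the BAYES_99 score, and the two regex rule names and their scores are made up and are not taken from any SA release.

```python
# Illustrative only: the threshold, every score, and the two regex rule
# names are hypothetical, not values from a real SpamAssassin release.
THRESHOLD = 5.0

SCORES = {
    "BAYES_99": 3.5,              # assumed perceptron-assigned score
    "HYPOTHETICAL_REGEX_A": 1.8,  # made-up development-tree rule
    "HYPOTHETICAL_REGEX_B": 1.2,  # made-up development-tree rule
}


def total_score(hits, scores=SCORES):
    """Sum the scores of every rule a message hit."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores=SCORES, threshold=THRESHOLD):
    """A message is flagged once its total reaches the threshold."""
    return total_score(hits, scores) >= threshold


# Mass-check corpus: the spam hits Bayes and the fresh regex rules.
before_release = ["BAYES_99", "HYPOTHETICAL_REGEX_A", "HYPOTHETICAL_REGEX_B"]

# Some time after release: the same spam, tuned to dodge the regex rules.
after_release = ["BAYES_99"]

print(is_spam(before_release))  # flagged while the regex rules still hit
print(is_spam(after_release))   # missed once BAYES_99 stands alone
```

With these made-up numbers the spam clears the threshold comfortably during training, so nothing pushes BAYES_99 higher, and the same message slips under the threshold once the regex hits disappear.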

As a consequence, the high-scoring BAYES_ rules are no longer perceptron-scored in SA-3.2-to-be: they're nailed to specific values (a score of 4.something, I think) and the perceptron must adjust all the regex rules around them.
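
As a very rough sketch of what "nailed to specific values" could mean for training, here is a toy perceptron-style loop that simply skips updates for pinned rules. This is not SA's actual perceptron code; the function `train_pinned`, the learning rate, the threshold, the regex rule name, and the value 4.5 are all assumptions made for illustration.

```python
def train_pinned(samples, weights, pinned, threshold=5.0, rate=0.1, epochs=50):
    """Toy perceptron-style training that never updates rules in `pinned`.

    samples: list of (set_of_rule_names, is_spam) pairs
    weights: dict mapping rule name to its starting score
    pinned:  set of rule names whose scores must stay fixed
    """
    weights = dict(weights)  # work on a copy, leave the caller's dict alone
    for _ in range(epochs):
        for hits, spam in samples:
            predicted = sum(weights[r] for r in hits) >= threshold
            if predicted == spam:
                continue  # classified correctly, nothing to adjust
            step = rate if spam else -rate
            for r in hits:
                if r not in pinned:  # pinned scores are left untouched
                    weights[r] += step
    return weights


# Hypothetical corpus: one spam hitting Bayes plus a regex rule,
# one ham that hits only the regex rule.
samples = [
    ({"BAYES_99", "HYPOTHETICAL_REGEX_A"}, True),
    ({"HYPOTHETICAL_REGEX_A"}, False),
]
start = {"BAYES_99": 4.5, "HYPOTHETICAL_REGEX_A": 0.0}

trained = train_pinned(samples, start, pinned={"BAYES_99"})
print(trained)
```

In this toy setup only the regex rule's score moves: it has to rise until the spam sample crosses the threshold, while BAYES_99 stays where it was nailed.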

Let's see if this works better. :)

(disclaimer: I am not an SA dev, just an interested observer. I know some SA devs are LWN subscribers, so presumably they can say if I'm talking rubbish.)



Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds