
The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 2:08 UTC (Thu) by corbet (editor, #1)
In reply to: The best of both worlds - a hybrid approach? by wstearns
Parent article: The Grumpy Editor's guide to bayesian spam filters

The thing is... SpamAssassin is the hybrid approach. I just think it needs to learn to trust its bayesian filter a bit more. Certainly I've never had cause to regret raising the score on BAYES_99. Maybe all SA really needs is a rule that short-circuits all the other tests once the bayesian filter is convinced that it has figured out the message. That alone might make SA a lot faster, at least in the client/server mode.
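To make the short-circuit idea concrete, here is a minimal Python sketch. It is not SpamAssassin code (SA is written in Perl). The names `classify`, `SPAM_THRESHOLD` and `BAYES_SHORTCUT`, and the rule tuples, are all assumptions made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical values, chosen only for illustration.
SPAM_THRESHOLD = 5.0    # total score at which a message counts as spam
BAYES_SHORTCUT = 0.99   # Bayes probability treated as "convinced"

# A rule is (name, score, test function); all three are assumptions.
Rule = Tuple[str, float, Callable[[str], bool]]


def classify(message: str, bayes_probability: float,
             rules: List[Rule]) -> Tuple[bool, List[str]]:
    """Return (is_spam, names of the rules that were actually run)."""
    # Short-circuit: if the Bayesian filter is convinced, skip every other test.
    if bayes_probability >= BAYES_SHORTCUT:
        return True, []

    # Otherwise fall back to the usual "run everything and add it up" scoring.
    total = 0.0
    ran: List[str] = []
    for name, score, test in rules:
        ran.append(name)
        if test(message):
            total += score
    return total >= SPAM_THRESHOLD, ran
```

With a Bayes probability of 0.995 the function returns without running a single rule; with 0.5 it runs them all. Whether skipping the other tests is safe in practice is exactly the question debated in the replies.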



The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 2:23 UTC (Thu) by wstearns (✭ supporter ✭, #4102)

There's the catch - SpamAssassin is all or nothing. I've watched the mailing list for many years now, and every once in a while someone will suggest bypassing the expensive checks if the score from the quick and easy ones is already high enough. The answer always comes back no: when you use SA, all the checks are run, even if the sender is in a whitelisted domain (that just subtracts 100 from the still-expensively-calculated score). The logic is that the expensive tests might reduce a high score, so we need to do all the checks.

Don't get me wrong - I still love, and use, and actively support SA, but the all-or-nothing approach means it needs beefy CPUs, lots of RAM, and a relatively large amount of time to run all the tests.
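The objection that an expensive test might still lower a high score can be illustrated with a small Python sketch. This is not how SpamAssassin works. It is a hypothetical scorer that runs cheap rules first and stops only when the rules not yet run could no longer change the verdict. The `Rule` fields, the cost numbers and the default threshold are assumptions for illustration.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str
    score: float                  # added to the total when the test hits
    cost: int                     # made-up relative cost; cheaper rules run first
    test: Callable[[str], bool]


def score_with_early_exit(message: str, rules: List[Rule],
                          threshold: float = 5.0) -> Tuple[bool, List[str]]:
    """Return (is_spam, names of the rules that were actually run)."""
    ordered = sorted(rules, key=lambda r: r.cost)
    total = 0.0
    ran: List[str] = []
    for i, rule in enumerate(ordered):
        remaining = ordered[i:]
        # Best and worst totals still reachable if every remaining rule hit.
        lowest = total + sum(r.score for r in remaining if r.score < 0)
        highest = total + sum(r.score for r in remaining if r.score > 0)
        # Stop only when no remaining rule could move the verdict either way.
        if lowest >= threshold or highest < threshold:
            break
        ran.append(rule.name)
        if rule.test(message):
            total += rule.score
    return total >= threshold, ran
```

If the only expensive rule left is worth -1.5 and the running total is 7.0, the loop stops early, because the total cannot drop below 5.0. If that rule is worth -3.0, it still has to run, which is the mailing-list argument in miniature: a large negative score anywhere in the remaining rules keeps the early exit from firing.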

Making SA even better

Posted Feb 23, 2006 8:47 UTC (Thu) by zmi (guest, #4829)

One thing to mention when using SA is that there is
http://rulesemporium.com/ with lots of additional rules to filter certain
types of SPAM. With those, classification can be made almost perfect.

And with the additional RulesDuJour script found on
http://www.exit0.us/index.php?pagename=RulesDuJour you can even update
those rules automatically each night, adapting to new spam almost
immediately.

have fun without spam,
zmi

The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 10:06 UTC (Thu) by nix (subscriber, #2304)

The developers agree. As of SA 3.2, the Bayesian scores are fixed and not trained by the perceptron.

The perceptron was persistently choosing overly low scores for the Bayesian rules. *When SA's static regex rules work well*, choosing low scores for the high-probability Bayesian learner hits does indeed minimize FPs, because genuine spams tend to hit large numbers of static regex rules as well. But those rules work less and less well as time passes after an SA release, and the perceptron cannot take that into account.

Hence the hardwiring of the Bayesian scores henceforward.

The best of both worlds - a hybrid approach?

Posted Feb 27, 2006 9:42 UTC (Mon) by Ross (subscriber, #4065)

One thing I've always wondered is why they do it "backwards". Presumably they could do away with a lot of manual tuning if they fed the individual test results into the token stream as words. So you could have "SATEST=TEST_RESULT" incorporated into the Bayesian decision making. Two complicating factors would be how to protect it from spammers inserting those tokens themselves (maybe prepend an installation-specific "password" to the token), and how to chunk the numeric results so that they match for similar inputs (obviously raw floating point numbers won't generally work well). Of course it wouldn't do anything to speed up processing.
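A rough Python sketch of what such tokens might look like, purely as an illustration of the idea. The token layout, the function names `bucket` and `rule_tokens`, the bucket width and the "password" are all invented here; nothing below is an existing SpamAssassin feature.

```python
import math
from typing import Dict, List, Union


def bucket(value: float, width: float = 0.5) -> str:
    """Chunk a numeric result so that nearby values share one label."""
    return f"b{math.floor(value / width)}"


def rule_tokens(results: Dict[str, Union[bool, float]],
                password: str) -> List[str]:
    """Turn rule results into word-like tokens for a Bayesian tokenizer.

    The installation-specific password is prepended so that a spammer who
    writes "BAYES=b0" in a message body does not produce the same token.
    """
    tokens: List[str] = []
    for name, result in sorted(results.items()):
        # bool is checked first because bool is a subclass of int in Python.
        if isinstance(result, bool):
            label = "HIT" if result else "MISS"
        else:
            label = bucket(float(result))
        tokens.append(f"{password}:{name}={label}")
    return tokens
```

For example, `rule_tokens({"BAYES": 0.73, "HTML_ONLY": True}, "s3cret")` gives `["s3cret:BAYES=b1", "s3cret:HTML_ONLY=HIT"]`, and a result of 0.99 lands in the same `b1` bucket as 0.73. These tokens would then be appended to the ordinary word tokens before training or classifying. A keyed hash of the label would avoid storing the password in the token database, but that goes beyond what the comment proposes.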

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds