LWN.net

The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 1:57 UTC (Thu) by wstearns (✭ supporter ✭, #4102)
Parent article: The Grumpy Editor's guide to bayesian spam filters

Good comparison, Jon - thanks.

If you'll allow me to paraphrase, SpamAssassin (bias alert - I
contribute to SpamAssassin) is accurate but slow. Other tools are faster,
but not quite as accurate. How about getting the speed of other tools and
the accuracy of SpamAssassin?

Picture procmail first handing the message off to a fast filter
such as bogofilter, CRM114, or DSPAM. Those are told to only score if
they're certain a message is ham or spam, which would probably mean
adjusting the thresholds for ham and spam and leaving a larger gray area
between those thresholds. When the above bayesian filter is not sure,
procmail then hands the message off to SpamAssassin (with bayes filtering
turned off but all the network checks active) for more in-depth checks.

For the vast majority of the messages you get quick filtering.
When that bayes score is borderline, we check again with a slower but more
accurate tool. Wouldn't that be the best of both worlds - accurate and
almost as fast as the initial filter by itself?
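The cascade described above can be sketched roughly as follows. This is only an illustration, not how procmail, bogofilter, or SpamAssassin are actually invoked: `classify`, `fast_spam_probability`, `slow_is_spam`, and the two cutoff values are hypothetical names and numbers standing in for the fast Bayesian filter, the slower SpamAssassin pass, and the widened gray area.

```python
from typing import Callable, Tuple


def classify(message: str,
             fast_spam_probability: Callable[[str], float],
             slow_is_spam: Callable[[str], bool],
             ham_cutoff: float = 0.05,
             spam_cutoff: float = 0.95) -> Tuple[str, str]:
    """Return (verdict, stage) for a message.

    The fast filter decides only when it is confident; anything that
    falls in the gray area between the two cutoffs is handed to the
    slower, more thorough check. Cutoff values are assumptions.
    """
    p = fast_spam_probability(message)
    if p >= spam_cutoff:
        return "spam", "fast"
    if p <= ham_cutoff:
        return "ham", "fast"
    # Gray area: pay for the expensive check only here.
    return ("spam" if slow_is_spam(message) else "ham"), "slow"
```

Widening the gap between `ham_cutoff` and `spam_cutoff` sends more mail to the slow stage, so the thresholds trade speed against how often the fast filter is trusted on its own.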



The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 2:07 UTC (Thu) by dlang (✭ supporter ✭, #313)

I thought that's exactly what SpamAssassin claims to do (it's the justification for leaving in all the other checks).

The problem with putting filters in series is that they can start paying more attention to the results of the earlier filters than they should (rather than evaluating the message itself).

David Lang

The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 2:08 UTC (Thu) by corbet (editor, #1)

The thing is...SpamAssassin is the hybrid approach. I just think it needs to learn to trust its bayesian filter a bit more. Certainly I've never had cause to regret raising the score on BAYES_99. Maybe all SA really needs is a rule that shorts out all the other tests if the bayesian filter is convinced that it has figured out the message. That, alone, might make SA a lot faster, at least in the client/server mode.

The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 2:23 UTC (Thu) by wstearns (✭ supporter ✭, #4102)

There's the catch - SpamAssassin is all or nothing. I've watched the mailing list for many years now, and every once in a while someone will suggest bypassing the expensive checks if the score from the quick and easy ones is already high enough. The answer always comes back no: when you use SA, all the checks are run, even if the sender is in a whitelisted domain (that just subtracts 100 from the still-expensively-calculated score). The logic is that the expensive tests might reduce a high score, so we need to do all the checks.

Don't get me wrong - I still love, and use, and actively support SA, but the all-or-nothing approach means it needs beefy CPUs, lots of RAM, and a relatively large amount of time to run all the tests.
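A toy illustration of the all-or-nothing argument above: if a later check can subtract points, stopping early can flip the verdict. This is not SpamAssassin code. The rule names and the two positive scores are made up for the example; only the 100-point whitelist adjustment comes from the comment, and the 5.0 threshold is used here simply as a typical spam cutoff.

```python
SPAM_THRESHOLD = 5.0  # illustrative cutoff for calling a message spam


def full_score(hits):
    """Sum every rule's score, as an all-or-nothing scorer would."""
    return sum(score for _name, score in hits)


def short_circuit_score(hits, bail_out=SPAM_THRESHOLD):
    """Add scores in order, but stop once the running total reaches bail_out."""
    total = 0.0
    for _name, score in hits:
        total += score
        if total >= bail_out:
            break  # later (expensive) rules are never consulted
    return total


# Hypothetical rule hits, cheapest first; names and scores are assumptions.
hits = [
    ("CHEAP_BODY_RULE", 3.0),
    ("CHEAP_HEADER_RULE", 2.5),
    ("EXPENSIVE_WHITELIST_CHECK", -100.0),
]

print(full_score(hits) >= SPAM_THRESHOLD)           # False: whitelist wins
print(short_circuit_score(hits) >= SPAM_THRESHOLD)  # True: stopped at 5.5
```

A short-circuit is only safe when none of the skipped rules could pull the total back under the threshold, which is the reason given on the list for always running everything.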

Making SA even better

Posted Feb 23, 2006 8:47 UTC (Thu) by zmi (guest, #4829)

One thing to mention when using SA is that there is
http://rulesemporium.com/ with lots of additional rules to filter certain
types of spam. With those, classification can be made almost perfect.

And with the additional RulesDuJour script found at
http://www.exit0.us/index.php?pagename=RulesDuJour you can even update
those rules automatically each night, adapting to new spam almost
immediately.

have fun without spam,
zmi

The best of both worlds - a hybrid approach?

Posted Feb 23, 2006 10:06 UTC (Thu) by nix (subscriber, #2304)

The developers agree. As of SA 3.2, the Bayesian scores are fixed and not trained by the perceptron.

The perceptron was persistently choosing overly low scores for the Bayesian filters, because *when SA's static regex rules work well*, choosing low scores for the high-probability Bayesian learner hits does indeed minimize FPs, as genuine spams tend to hit large numbers of static regex rules as well --- but those rules work less and less well after SA's release, and the perceptron cannot take that into account.

Hence the hardwiring of the Bayesian scores henceforward.

The best of both worlds - a hybrid approach?

Posted Feb 27, 2006 9:42 UTC (Mon) by Ross (subscriber, #4065)

One thing I've always wondered is why they do it "backwards". Presumably they could do away with a lot of manual tuning if they fed the individual test results into the stream as words. So you could have "SATEST=TEST_RESULT" incorporated into the Bayesian decision making. Two complicating factors would be how to protect it from spammers incorporating those tokens (maybe use an installation-specific "password" prepended to the token), and chunking the numeric results so that they match for similar inputs (obviously floating point numbers won't generally work well). Of course it wouldn't do anything to speed up processing.
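A minimal sketch of the two complicating factors mentioned above, under loud assumptions: `bucket`, `tokens_for`, the 0.5-wide buckets, the token layout, and the rule names are all hypothetical, and nothing here reflects how SpamAssassin's Bayes tokenizer actually works. It only shows numeric results being chunked so similar inputs share a token, and a per-installation secret being prepended so text in a message body cannot reproduce the token.

```python
import math
from typing import Dict, List


def bucket(value: float, width: float = 0.5) -> str:
    """Chunk a numeric test result so nearby values map to the same label."""
    return f"{math.floor(value / width) * width:.1f}"


def tokens_for(secret: str, results: Dict[str, float]) -> List[str]:
    """Turn rule results into tokens for the Bayesian learner.

    The secret prefix never appears in mail, so a spammer who writes
    'RULE_A=1.0' in the body produces a different token.
    """
    return [f"{secret}:{rule}={bucket(value)}"
            for rule, value in sorted(results.items())]


# Hypothetical rule names and values; 1.23 and 1.49 land in the same bucket.
print(tokens_for("s3cret", {"RULE_A": 1.23, "RULE_B": 1.49}))
# ['s3cret:RULE_A=1.0', 's3cret:RULE_B=1.0']
```

As the comment notes, this would change how the scores are learned, not how long the underlying tests take to run.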

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds