A grumpy editor's bayesian followup

Posted Mar 2, 2006 14:59 UTC (Thu) by zmi (subscriber, #4829)
Parent article: A grumpy editor's bayesian followup

Hi, I got this reply on the SpamAssassin users mailing list
(subscribe via users-subscribe@spamassassin.apache.org):

Statistically speaking, 1% of BAYES_99 hits should be nonspam. In reality,
it does a lot better than that.

However, in the SA 3.1.0 set3 mass checks it still managed to match
about 21 messages in the nonspam test set:

OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
176869    123778   53091   0.700  0.00  0.00   (all messages)
60.712    86.7351  0.0396  1.000  0.90  3.50   BAYES_99
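
The "about 21 messages" figure can be checked from the table itself. The following is only a quick arithmetic sketch: the variable names are made up for illustration, and it assumes the "(all messages)" row holds raw message counts while the BAYES_99 row holds percentages of those counts.

```python
# Counts from the "(all messages)" row (assumed to be raw message counts).
total_msgs = 176869
spam_msgs = 123778
ham_msgs = 53091

# Percentages from the BAYES_99 row.
spam_pct = 86.7351
ham_pct = 0.0396

# Estimated number of messages in each set that hit BAYES_99.
spam_hits = spam_msgs * spam_pct / 100.0
ham_hits = ham_msgs * ham_pct / 100.0

# Overall hit rate recomputed from the two per-set estimates.
overall_pct = 100.0 * (spam_hits + ham_hits) / total_msgs

print(f"nonspam hits: {ham_hits:.1f}")
print(f"spam hits:    {spam_hits:.1f}")
print(f"overall:      {overall_pct:.3f}%")
```

The nonspam estimate comes out at roughly 21 messages, and the recomputed overall rate agrees with the OVERALL% column to within rounding.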

SA's scores aren't based on human assumptions about how the rules
behave. They are based on real-world testing and a perceptron
score-fitting system that accounts not only for the hit-rate of the
rule, but also for the combinations of rules that it tends to match
with. Often the reality is a lot more complex than you think.
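
To give a feel for what "score-fitting" means here, the sketch below trains a textbook perceptron on a handful of invented messages. It is a toy under stated assumptions, not SpamAssassin's actual fitting code: the rule names RULE_A, RULE_B and RULE_C, the sample data, and the training loop are all hypothetical.

```python
# Toy perceptron: learn one weight ("score") per rule from which rules
# fired together on labelled messages. All names and data are invented.
RULES = ["RULE_A", "RULE_B", "RULE_C"]

# Each sample: (rule hit flags in RULES order, is_spam label).
SAMPLES = [
    ((1, 0, 0), True),
    ((1, 1, 0), True),
    ((1, 0, 1), True),
    ((0, 1, 0), False),
    ((0, 0, 1), False),
    ((0, 1, 1), False),
]


def activation(weights, bias, hits):
    """Sum the weights of the rules that fired, plus the bias."""
    return bias + sum(w for w, h in zip(weights, hits) if h)


def train(samples, n_rules, epochs=100):
    """Classic perceptron updates with a step size of 1."""
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for hits, is_spam in samples:
            predicted_spam = activation(weights, bias, hits) > 0
            if predicted_spam != is_spam:
                step = 1.0 if is_spam else -1.0
                # Only the rules that fired on this message are adjusted.
                for i, h in enumerate(hits):
                    if h:
                        weights[i] += step
                bias += step
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


weights, bias = train(SAMPLES, len(RULES))
for name, w in zip(RULES, weights):
    print(name, w)
```

Because each update touches only the rules that hit the misclassified message, a rule's final weight depends on which other rules it tends to fire alongside, not just on its own hit rate.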


A grumpy editor's bayesian followup

Posted Mar 2, 2006 15:24 UTC (Thu) by jmason (guest, #13586)

It's important to note that, without good training, BAYES_99 may indeed
fire regularly on nonspam mail -- that's the danger with user-trained
rules. In the *default* scenario, therefore, a score of 3.5 is reasonably
optimal. However, if good training is supplied, it's a good plan to
increase the BAYES_99 score to 5.0, or even more. (I think we might
mention that somewhere in the documentation -- I hope. ;)
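
The practical difference between 3.5 and 5.0 is easy to see with a little arithmetic. This is a rough sketch that assumes the stock required_score of 5.0 and that a message is flagged once its total reaches that threshold; the helper function and dictionaries are hypothetical, not part of SpamAssassin.

```python
# Assumption: SpamAssassin's stock required_score of 5.0.
REQUIRED_SCORE = 5.0


def is_flagged(rule_scores, required=REQUIRED_SCORE):
    """Return True when the summed rule scores reach the threshold."""
    return sum(rule_scores.values()) >= required


# A message on which BAYES_99 is the only rule that fired.
default_hit = {"BAYES_99": 3.5}  # score quoted in the comment above
raised_hit = {"BAYES_99": 5.0}   # score suggested for well-trained setups

print("default score flags the message:", is_flagged(default_hit))
print("raised score flags the message: ", is_flagged(raised_hit))
```

At 3.5 a BAYES_99 hit needs help from other rules before a message is marked as spam; at 5.0 it is enough on its own, which is why the higher score only makes sense once the training is trustworthy.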

Also, it's worth noting that "BAYES_99" doesn't really refer to a 1%
probability. SpamAssassin uses the Fisher Inverse Chi-Square Procedure
described at http://garyrob.blogs.com/whychi90.pdf , and as a result these
are no longer true probability values -- so don't expect to see
probabilistic distributions.
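
For the curious, chi-square combining can be sketched in a few lines. This follows the commonly published formulation of Robinson's method as I understand it and is not SpamAssassin's actual code; the function names are made up, and the per-token values are assumed to lie strictly between 0 and 1.

```python
import math


def chi2q(x2, dof):
    """Return P(chi-square >= x2) for an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(token_probs):
    """Combine per-token spam estimates into one indicator in [0, 1].

    Each value in token_probs must be strictly between 0 and 1.
    """
    n = len(token_probs)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(p) for p in token_probs)
    ln_ham = sum(math.log(1.0 - p) for p in token_probs)
    s = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for nonspam
    return (1.0 + s - h) / 2.0


print(combine([0.99] * 10))              # strongly spammy tokens
print(combine([0.01] * 10))              # strongly hammy tokens
print(combine([0.99] * 5 + [0.01] * 5))  # conflicting evidence
```

Strong evidence in one direction pushes the result hard toward 0 or 1, while conflicting evidence lands near 0.5. The output behaves like an indicator of how one-sided the evidence is, which is why the top bucket should not be read as a literal 1% error rate.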

Great articles btw. The grumpy editor has outdone himself ;)

--j.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds