
A grumpy editor's bayesian followup

This article is part of the LWN Grumpy Editor series.
The Grumpy Editor's guide to bayesian spam filters was published one week ago. As has become traditional, it would seem, LWN readers have pointed out tools which evaded your editor's first pass. So here is the inevitable followup with a couple more filters and an updated table at the end.

SpamAssassin

One commenter complained last week about your editor having run SpamAssassin with the network tests enabled. The original reasoning had been that SpamAssassin, by its nature, comes with a large set of rules, and, for the purpose of the review, selectively disabling some of them was not appropriate. Still, the network tests do have a couple of important effects on the end result. As will be seen below, they make the filter much more effective; in your editor's experience, the source blacklists earn most of the credit there. But they also slow things down.

Your editor re-ran the test with network tests disabled, with the following results:

Batch:               1              2              3              4              5
                  Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T   Size
SpamAssassin        8   0  1.1    3   0  1.1    5   0  1.1    3   0  1.0    2   0  1.0    10
SA untrained       32   0  0.6    9   0  1.0   18   0  1.0   15   0  1.0    7   0  1.0    10
SA local default  181   0  0.3  259   0  0.3  271   0  0.3  226   0  0.3  161   0  0.3    10
SA local tweaked   53   0  0.3   43   0  0.3   50   0  0.3   44   0  0.3   37   0  0.3    10

(Fn: false negatives; Fp: false positives; T: time in seconds per message; Size: database size in MB.)

(Last week's results are included for comparison.) The "default" results are actually a mistake on your editor's part, but, since they illustrate an interesting point, they have been included in the table above.

When SpamAssassin runs its bayesian filter on a message, it encodes the results as if a specific rule had fired. If the filter is absolutely convinced that the message is good, the score is adjusted by the value attached to the BAYES_00 rule. For obviously spam messages, BAYES_99 comes into play; there are several levels between the two as well. SpamAssassin, out of the box, assigns 3.5 points to BAYES_99. Since five points are required, by default, to condemn a message, the bayesian filter can never do that on its own. Any message, to be considered spam, must trigger some tests outside of the bayesian filter.

The "default" results, above, came about because your editor got a little over-zealous when clearing out the bayesian and whitelist databases for a new round of tests; so they use the default scoring for BAYES_99. The "tweaked" results, instead, have the score for that rule raised to 5.0 points, allowing the bayesian filter to condemn mail on its own. The difference in the results can be clearly seen from the table: spam filtering performance is vastly improved, with no false positives. With the default configuration, local-only SpamAssassin had the second-worst false negative rate of all the filters tested. Your editor is at a loss to understand why SpamAssassin comes configured to allow the bayesian filter to be bypassed so easily.
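
For the record, the tweak is a single line in SpamAssassin's configuration (shown here for ~/.spamassassin/user_prefs, though a site-wide local.cf works as well); the 5.0 value simply matches the default required_score threshold:

# Let a BAYES_99 hit condemn a message all by itself
score BAYES_99 5.0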

Back to the original point of running this test: putting SpamAssassin into local-only mode significantly worsens its filtering performance, while also improving its run time.
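
For anybody wishing to reproduce the local-only mode, it is selected with the -L (--local) option to the spamassassin script:

# Classify one message using local tests only - no DNS blacklists
# or other network lookups
spamassassin -L < message > marked-message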

Popfile

A number of people were dismayed at the omission of Popfile, a proxy-based filter written in Perl. Popfile is intended to sit between the mail client and the POP or IMAP server; it filters mail before presenting it to the user. It includes a built-in web server which provides filtering statistics and allows the user to perform training.
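
For reference, hooking a client up to Popfile looks roughly like this; the server name and account are illustrative, and Popfile's convention is to embed the name of the real POP server in the user name:

POP3 server:  127.0.0.1                # the machine running Popfile
Port:         110                      # Popfile's default proxy port
User name:    pop.example.com:myuser   # real server and account, joined with ':'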

Perhaps the most interesting feature in Popfile, however, is its approach to filtering. While the other filters reviewed are very much oriented around filtering spam, Popfile tries to be more general. Instead of filtering into just two categories (plus the "unsure" result provided by a number of filters), Popfile can handle an arbitrary number of categories. So it not only picks out the spam; it can sort the rest of a mail stream based on whatever criteria the user might set. This approach makes Popfile a potentially more useful tool, but it has implications for its spam-filtering performance, as will be seen in the testing results.

Your editor tested Popfile 0.22.4, using its standalone "pipe" and "insert" tools.
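
As your editor understands them, those tools are invoked more or less as follows; the file names are illustrative, and the exact arguments may vary between Popfile versions:

# Train a message into the "spam" bucket
perl insert.pl spam msg0001.eml

# Classify a message: pipe.pl reads mail on standard input and
# writes it back out with Popfile's classification added
perl pipe.pl < msg0002.eml > classified.eml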

Batch:               1              2              3              4              5
                  Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T   Size
Popfile             0  21  1.0    0  16  1.1    1  24  1.0    0  10  1.0    1  12  1.0    10
PF learn all        0  28  2.8    0  28  3.5    0  44  4.2    0  16  5.0    0  18  5.9    40

On one hand, Popfile was the most effective at removing spam of any of the filters reviewed; its false negative rate is almost zero. On the other hand, the false positive rate was high - unacceptably so. Popfile normally uses a "train on errors" approach; your editor ran a second test where the filter was trained on every message just to see if that would help with the false positive rate. Instead, that rate got worse, and the filter slowed down to a glacial pace. Clearly Popfile and comprehensive training were not meant to go together.

Your editor has a hypothesis explaining the behavior seen here. Bayesian filters which concern themselves only with spam have a built-in bias: false positives are bad and must be avoided. Popfile, instead, has no notion of a "false positive"; it only has various "buckets" into which mail can be sorted. The tool does not understand that some types of errors are worse than others. So, while most filters will err on the side of false negatives, Popfile just goes for whatever seems right. As a result, it catches more spam - and more of everything else.

From this experience, your editor has concluded that spam filtering should be done independently of any other sort of mail sorting. If bayesian filters are to be used to sort legitimate mail, it might be best to run two separate filters in series, as in the sketch below.
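
In procmail terms, such a series arrangement might look like the following sketch; bogofilter makes the spam/ham decision (the recipe is the one from its manual page), while the second-stage sorting command is purely hypothetical:

# Stage 1: spam filtering only. -p passes the marked message through,
# -u registers the verdict in the database, -e exits zero unless a
# real error occurs.
:0fw
| bogofilter -uep

:0:
* ^X-Bogosity: Spam
spam

# Stage 2: a separate bayesian sorter sees only the surviving mail
:0fw
| sort-filter --classify     # hypothetical bucket-sorting filter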

SpamOracle

SpamOracle is a straightforward Graham-style bayesian filter. It happens to be written in Caml, leading your editor to go looking for compilers; Fedora Extras came through nicely on that front. Initial training is easy and fast, and SpamOracle works well with procmail.
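
The procmail hookup, essentially as given in SpamOracle's documentation, files mail on the X-Spam header which the "mark" command adds; initial training happens separately, with something like "spamoracle add -good goodmail.mbox -spam spam.mbox" (mailbox names illustrative):

# Let SpamOracle add its X-Spam verdict header
:0fw
| spamoracle mark

# File everything it considers spam
:0:
* ^X-Spam: yes
spam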

As a filter, however, it is not one of the more effective ones. Your editor ran two tests on SpamOracle v1.4, using train-on-errors and comprehensive training strategies.

Batch:               1              2              3              4              5
                  Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T    Fn  Fp   T   Size
SpamOracle TOE    462   0  0.1  546   0  0.1  445   0  0.1  463   0  0.1  343   0  0.1   1.1
SpamOracle comp   461   0  0.2  511   0  0.2  433   0  0.2  420   0  0.2  339   0  0.3   2.6

As can be seen here, SpamOracle is fast, and it manages to avoid false positives altogether. Its filtering rate is poor, however, to the point that your editor would not want to have to depend on it to hold the spam stream at bay. Comprehensive training slowed the process down significantly, but did not improve the results in any appreciable way.

Thunderbird

There were some requests that Thunderbird be included in this evaluation. The problem is that Thunderbird's filter is buried deep within a monolithic graphical application, making it difficult to test in any sort of automated manner. Your editor, being the lazy person that he is, has no inclination to click through 15,000 messages to evaluate how well Thunderbird has classified them.

As it happens, your editor uses Thunderbird for a low-bandwidth mail account which receives a mere 100 spams per day or so. The Thunderbird interface is certainly convenient; there is a nice "junk" button for training the filter (though the way it toggles to "not junk" can be confusing). Thunderbird can be configured to automatically sideline spam into a folder, and to age messages out of that folder after a given time. False positives are rare, in your editor's experience, but the false negative rate is relatively high. It is also impossible, as far as your editor can tell, to get any information on the filter and how it makes its decisions.

Conclusion

Here is the updated table, with the new and old results:

Test                          False neg.      False pos.    Time  Size
bogofilter                     406   5.5%                    0.02     5
bogofilter -u                  268   3.0%                    0.06    32
CRM114                          14   0.1%      16   0.3%     0.06    24
CRM114 pretrain                 14   0.2%      15   0.3%     0.06    24
DSPAM teft                      50   0.6%                    0.1    305
DSPAM toe                       67   0.7%      15   0.3%     0.1    276
DSPAM tum                       83   0.9%                    0.1    305
Popfile                          2   0.02%     83   1.4%     1.0     10
Popfile comp                     0   0%       144   2.4%     4.3     40
SpamAssassin                    21   0.2%                    1.1     10
SpamAssassin untrained          81   0.9%                    0.9     10
SpamAssassin local default    1098  12.2%                    0.3     10
SpamAssassin local tweaked     227   2.5%                    0.3     10
SpamBayes                      185   2.1%       1   0.02%    0.4      4
SpamBayes comp                 294   3.3%                    0.8     16
SpamOracle TOE                2259  25.1%                    0.1    1.1
SpamOracle comp               2164  24.0%                    0.2    2.6
SpamProbe train                222   2.5%       3   0.05%    0.1     81
SpamProbe receive              257   2.9%       4   0.07%    0.7    201

(Counts are totals across all five batches, with the corresponding error rate; an empty false positive column means none were seen. Time is in seconds per message; size is database size in MB.)

There is little in the new results to change the conclusions arrived at last week. The filters which stand out are SpamAssassin (in some modes at least), and DSPAM. Most of the others demonstrated overly high error rates, either with false negatives (annoying) or false positives (unacceptable). Stay tuned, however; there is clearly a great deal of work being done in this area.



A grumpy editor's bayesian followup

Posted Mar 2, 2006 3:59 UTC (Thu) by clump (subscriber, #27801) [Link]

I just upgraded SpamAssassin on a Debian Sarge box using a copy from backports.org. I am amazed at how effective the tool is.

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 5:33 UTC (Thu) by fyodor (guest, #3481) [Link]

Jonathan does a great job with this and his other "grumpy editor" articles, but I cannot agree with his repeated suggestion to increase the BAYES_99 rule score for "allowing the bayesian filter to condemn mail on its own". The long description for that rule is "Bayesian spam probability is 99 to 100%". In other words, you may very well see 1% or more of your legitimate mail falling into that bucket.

Of about 1,850 non-spam mails I felt were important enough to keep in February, 36 of them (2%) were BAYES_99 false positives. And I'm probably missing some that were wrongly moved to my spam folder (which gets tens of thousands of messages per day, so I never check it). This includes many mails from non-hacker friends and neighbors, many legitimate sales-related queries, party invitations, Nmap questions, and all sorts of other things. I'm reasonably good about training it when I catch mistakes, too. The more diverse your mail spool is, the tougher the filter's job will be. It may work great on Jon's mail, but be sure to test it with your own before blithely following his advice and increasing the BAYES_99 score.

-Fyodor

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 10:44 UTC (Thu) by nix (subscriber, #2304) [Link]

Nonetheless, the consensus among SA developers is that the high-probability BAYES_* rules *are* scored too low. They shouldn't be 5.0, but they should be higher than the default.

The likelihood is that the perceptron is being fooled by one simple effect. Before SA is released, its regex rules are especially effective: major spammers definitely tune their messages to avoid these rules in released SA versions, but they don't tune their messages to avoid random rules in the unreleased development tree, and of course new rules in the development tree are effective or they'd never have been added at all. As such, very spammy messages in the mass-check corpus (used for perceptron training) tend to hit a lot more rules than they do after SA has been released for a time. The perceptron can't know this: it assumes that in general messages which hit BAYES_99 hit a lot of other rules too. So it adjusts the score of BAYES_99 downwards to avoid the minuscule number of FPs tripped by that rule, since the real spams are being caught by the regex rules too.

But after that SA version has been released for a time, lots of spam highly-scored by Bayes misses the regex rules (as the spammers have tuned their messages to avoid them), so BAYES_99 suddenly has to carry a lot more weight.

As a consequence, the high-scoring BAYES_ rules are no longer perceptron-scored in SA-3.2-to-be: they're nailed to specific values (a score of 4.something, I think) and the perceptron must adjust all the regex rules around them.

Let's see if this works better. :)

(disclaimer: I am not an SA dev, just an interested observer. I know some SA devs are LWN subscribers, so presumably they can say if I'm talking rubbish.)

Increasing BAYES_99 score can be very dangerous - or not

Posted Mar 2, 2006 12:54 UTC (Thu) by kitterma (guest, #4448) [Link]

I actually increase the scores for BAYES_80 and higher to be sufficient to condemn mail on their own, with a very low false positive rate.

I use the SA whitelist facility for domains that I regularly receive important mail from and are not routinely spoofed, so they don't get caught.

For other senders, legitimate messages generally hit some negative-scoring rules which save them from the bayesian filter.
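
A sketch of the sort of configuration being described, with purely illustrative scores and domain; whitelist_from_rcvd checks the relaying host's reverse DNS, so it is much harder to spoof than a plain whitelist_from:

# Let strong bayesian verdicts condemn mail on their own
score BAYES_80 5.0
score BAYES_95 5.0
score BAYES_99 5.0

# Regular, rarely-spoofed correspondents bypass that entirely
whitelist_from_rcvd *@example.com example.com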

Scott K

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 15:47 UTC (Thu) by corbet (editor, #1) [Link]

I suppose I should have emphasized that tweaking the rules always carries risks. That said, it's worth pointing out that every other bayesian filter reviewed will condemn mail with much less than a BAYES_99 level of assurance. Tweaking just BAYES_99 still results in a relatively conservative configuration.

Increasing BAYES_99 score can be very dangerous

Posted Mar 2, 2006 16:53 UTC (Thu) by pimlott (guest, #1535) [Link]

The long description for that rule is "Bayesian spam probability is 99 to 100%".
I strongly doubt that this description is statistically justified. I know just enough about Bayesian statistics to know that bogofilter, at least, takes numerous liberties with the math, and I suspect that spamassassin is the same. I'm sure nobody has rigorously analyzed the actual "Bayes-inspired" system that is implemented, so it's probably more honest to say that BAYES_99 is just "a high score".

Increasing BAYES_99 score can be very dangerous

Posted Mar 3, 2006 10:40 UTC (Fri) by Stephen_Beynon (subscriber, #4090) [Link]

I disagree that an increased BAYES_99 score is dangerous. I set the BAYES_99 rule to 5 points, and I have set procmail to place messages scoring between 5 and 10 points into a junk folder which I check weekly. I have never seen BAYES_99 produce a false positive, and I have seen only two false positives from SpamAssassin in the last year. Both of these were at the spam-like end of mail I consider legitimate; one was hit by the BAYES_90 rule and one by BAYES_80 (both of which are at their default weighting).
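
A procmail sketch of that arrangement; since procmail cannot compare scores numerically, it keys on SpamAssassin's X-Spam-Level header, which carries one '*' per point. Folder names are illustrative:

# Ten points or more: almost certainly spam
:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
spam-certain

# Five to nine points (ten and up were filed above): the weekly-check folder
:0:
* ^X-Spam-Level: \*\*\*\*\*
junk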

Stephen

Increasing BAYES_99 score can be very dangerous

Posted Apr 29, 2006 21:01 UTC (Sat) by gvc (guest, #37441) [Link]

I disagree. Here's a setup I've used for about 3 years (with no updates at all) and it works very well:

http://plg.uwaterloo.ca/~gvcormac/spamassassin

Note that SpamAssassin's self-training is broken, so this setup force-feeds it its own judgements (which are later corrected by the user when errors are noticed).

A grumpy editor's bayesian followup

Posted Mar 2, 2006 14:59 UTC (Thu) by zmi (guest, #4829) [Link]

Hi, I got this reply on the spamassassin user mailing list
( users-subscribe@spamassassin.apache.org ):

Statistically speaking, 1% of BAYES_99 hits should be nonspam. In reality, it does a lot better than that.

However, in the SA 3.1.0 set3 mass checks it still managed to match
about 21 messages in the nonspam test set:

OVERALL%   SPAM%      HAM%     S/O    RANK   SCORE  NAME
176869     123778     53091    0.700  0.00   0.00   (all messages)
60.712     86.7351    0.0396   1.000  0.90   3.50   BAYES_99

SA's scores aren't based on human assumptions about how the rules
behave. They are based on real-world testing and a perceptron
score-fitting system that accounts not only for the hit-rate of the
rule, but also for the combinations of rules that it tends to match
with. Often the reality is a lot more complex than you think.

A grumpy editor's bayesian followup

Posted Mar 2, 2006 15:24 UTC (Thu) by jmason (guest, #13586) [Link]

It's important to note that, without good training, BAYES_99 may indeed fire regularly on nonspam mail -- that's the danger with user-trained rules. In the *default* scenario, therefore, a score of 3.5 is reasonably optimal. However, if good training is supplied, it's a good plan to increase the BAYES_99 score to 5.0, or even more. (I think we might mention that somewhere in the documentation -- I hope. ;)

Also, it's worth noting that "BAYES_99" doesn't really refer to a 1% probability. SpamAssassin uses the Fisher inverse chi-square procedure described at http://garyrob.blogs.com/whychi90.pdf , and as a result these are no longer true probability values -- so don't expect to see probabilistic distributions.

Great articles btw. The grumpy editor has outdone himself ;)

--j.

Thunderbird

Posted Mar 2, 2006 15:32 UTC (Thu) by thomask (guest, #17985) [Link]

Couldn't you test Thunderbird by setting up a local IMAP server and getting it to sort email on that? For quite a long time I used Thunderbird on a remote IMAP server, and had it set up to shift spam from the inbox to a "spam" folder on the server. Whilst I've never run an IMAP server myself, I guess it should be possible that way to track how messages get filtered.

Thunderbird

Posted Mar 2, 2006 15:43 UTC (Thu) by corbet (editor, #1) [Link]

Yes, working with a local IMAP server would help - but one would still have to manually train Thunderbird on the 2000-message training corpus. It could be done if I had more time...

Thunderbird

Posted Mar 3, 2006 2:57 UTC (Fri) by Mithrandir (guest, #3031) [Link]

If you already have all your spam on an IMAP server sorted into folders, you should be able to train Thunderbird by going into the folder with ham, selecting all the messages and hitting shift-j (not junk). Go get coffee (or probably more accurately, lunch). Then go into the folder with spam, select all the messages, and hit j (junk).

Make sure you do this before you tell Thunderbird to move messages based on its own determination of whether it's spam or not. Otherwise it'll start moving messages around before it's been properly trained.

A grumpy editor's bayesian followup

Posted Mar 2, 2006 16:16 UTC (Thu) by glouis (guest, #526) [Link]

As my remark last week implied, comparison between Bayesian filters as they operate "out of the box" can be misleading. Bogofilter's performance can be improved by an order of magnitude if appropriate training and tuning precede the performance test. The fact that it works moderately well "out of the box" in a wide variety of configurations is convenient for the newbie, but should not encourage the user to remain a newbie.

A comparison among carefully-trained (and if need be, tuned) Bayesian spam filters would obviously be somewhat difficult to organize, but would do your readers a much better service, IMHO. (I'd love to see the results of such a comparison myself!)

Training and tweaking

Posted Mar 2, 2006 16:22 UTC (Thu) by corbet (editor, #1) [Link]

FWIW, the filters *were* "carefully trained." Just over 2000 messages were pulled out of the stream and used only for that purpose. They were well inspected to avoid mistraining the filter. How much more careful does one need to be?

I did avoid tweaking the various knobs exported by some of the filters, with the well-documented SpamAssassin exception. I believe that was the right choice: most users (even those who are not "newbies") are unlikely to mess with them, and the defaults should be reasonable.

Training and tweaking

Posted Mar 2, 2006 17:07 UTC (Thu) by glouis (guest, #526) [Link]

We could discuss this further here, but I'd like to refer you to http://www.bgl.nu/bogofilter/tuning.html which details a process by which bogofilter can be optimized. The process is onerous and requires a considerably larger number of messages than the 2,000 each that you mention. It can also be well worth the effort for a prospective production user. I don't mean in any way to imply that you should have done all that, plus the equivalent for your other tested packages, by yourself; it would have taken you months. I just want to avoid leaving people with the impression that your table reflects the performance of which bogofilter is capable.

SpamAssassin and Thunderbird as united forces

Posted Mar 4, 2006 7:38 UTC (Sat) by dh (subscriber, #153) [Link]

Hi all,

For more than three years, I have used SpamAssassin and Thunderbird as a team to keep all the spam away from me. SpamAssassin investigates all mails while they are delivered; everything marked as spam goes straight into the trash can. Random checks have never found false positives; SpamAssassin is really impressive in this way.

Mails which manage to reach my inbox, however, are checked a second time by Thunderbird. Everything found to be spam is moved into a special directory, which I review from time to time as Thunderbird tends to produce false positives at a low rate. After cleaning that folder, I use it to train SpamAssassin using "sa-learn".
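
That retraining step is a single sa-learn invocation; the folder name and mbox format here are illustrative:

# Feed the reviewed junk folder back into SpamAssassin's bayes database
sa-learn --spam --mbox ~/Mail/junk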

My experience with this setup is very good. Normally I see only two or three spam mails per day, while several hundred of them are filtered out and gone with the wind without bothering me. Furthermore, I am quite confident that I have never missed a non-spam mail (even though, once or twice, one was caught by Thunderbird in the first place).

Best regards,
Dirk

A grumpy editor's bayesian followup

Posted Mar 9, 2006 2:22 UTC (Thu) by stock (guest, #5849) [Link]

DSPAM, the original market leader, rules without a shadow of a doubt:
[hubble:root]:(~)# dspam_stats -H   
stock:   
                TS Total Spam:              36268   
                TI Total Innocent:           5586   
                SM Spam Misclassified:         54   
                IM Innocent Misclassified:     20   
                SC Spam Corpusfed:           4817   
                IC Innocent Corpusfed:       5126   
                TL Training Left:               0   
                SR Spam Catch Rate:        99.85%   
                IR Innocent Catch Rate:    99.64%   
                OR Overall Rate/Accuracy:  99.82%   
   
[hubble:root]:(~)# dspam --version   
   
DSPAM Anti-Spam Suite 3.4.9 (agent/library)   
   
Copyright (c) 2002-2004 Network Dweebs Corporation   
http://www.nuclearelephant.com/projects/dspam/   

A grumpy editor's bayesian followup

Posted Mar 9, 2006 2:30 UTC (Thu) by stock (guest, #5849) [Link]

The database size is not too bad either:
 
[hubble:root]:(/var/lib/mysql/dspamdb)# ll  
total 138712  
-rw-rw---- 1 mysql  mysql     8614 Sep  4  2005 dspam_preferences.frm 
-rw-rw---- 1 mysql  mysql        0 Sep  4  2005 dspam_preferences.MYD 
-rw-rw---- 1 mysql  mysql     1024 Sep  4  2005 dspam_preferences.MYI 
-rw-rw---- 1 mysql  mysql     8674 Sep  4  2005 dspam_signature_data.frm  
-rw-rw---- 1 mysql  mysql 64253584 Mar  9 03:21 dspam_signature_data.MYD  
-rw-rw---- 1 mysql  mysql   348160 Mar  9 03:21 dspam_signature_data.MYI 
-rw-rw---- 1 mysql  mysql     8948 Sep  4  2005 dspam_stats.frm  
-rw-rw---- 1 mysql  mysql      175 Mar  9 03:21 dspam_stats.MYD  
-rw-rw---- 1 mysql  mysql     2048 Dec  4 02:57 dspam_stats.MYI  
-rw-rw---- 1 mysql  mysql     8686 Sep  4  2005 dspam_token_data.frm 
-rw-rw---- 1 mysql  mysql 37192298 Mar  9 03:21 dspam_token_data.MYD 
-rw-rw---- 1 mysql  mysql 40018944 Mar  9 03:21 dspam_token_data.MYI  
[hubble:root]:(/var/lib/mysql/dspamdb)# 

TREC Spam Tests

Posted Apr 29, 2006 20:59 UTC (Sat) by gvc (guest, #37441) [Link]

Could I convince you to prepare your corpus so that it can be used with the TREC spam evaluation toolkit? TREC ran a Spam Track in 2005 and will run another in 2006. It uses a standard interface to test filters so that the data can be kept private.

The toolkit web page has setups for eight open source filters, including most of the ones you tested. If you were to implement the toolkit you could run those filters and post comparative results, and also could test new filters.

I invite you and others to participate in TREC 2006. A letter of intent is needed right away (the official deadline has passed, but that's OK). "Normal" participation involves submitting a filter, and also running the same filter on a public corpus that will be released; the submitted filter is then run on blind datasets. There are therefore also "special" participants, who test the filters on their own private data and send in the results.

A grumpy editor's bayesian followup

Posted Oct 3, 2006 19:52 UTC (Tue) by glardner (guest, #40903) [Link]

I am surprised you did not get a better result with POPFile. I have been using POPFile for about three years. Since November 2004 the stats are:

Messages classified 26,602
False positives 8 = 0.030%
False negatives 15 = 0.056%
Total false classifications 23 = 0.086%
Unclassified 167 = 0.628%

The number of false classifications was lower before some new type of spam started arriving earlier this year; before that, the overall accuracy was 99.98% (excluding unclassified messages); now it is just better than 99.91%.

I use POPFile in conjunction with Outlook; I have the Outclass user interface (from Vargonsoft) installed, which makes adjusting the training of POPFile very simple when it mis-categorises a message. I just wish there was an interface for POPFile similar to Outclass, but which works with Mozilla Thunderbird.

Gerard Lardner


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds