This article is part of the LWN Grumpy Editor series.
One commenter complained last week about your editor having run SpamAssassin with the network tests enabled. The original reasoning had been that SpamAssassin, by its nature, comes with a large set of rules, and, for the purpose of the review, selectively disabling some of them was not appropriate. Still, the network tests do have a couple of important effects on the end result. As will be seen below, they make the filter much more effective; in your editor's experience, the source blacklists earn most of the credit there. But they also slow things down.
Your editor re-ran the test with network tests disabled, with the following results:
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamAssassin | 8 | 0 | 1.1 | 3 | 0 | 1.1 | 5 | 0 | 1.1 | 3 | 0 | 1.0 | 2 | 0 | 1.0 | 10 |
| SA untrained | 32 | 0 | 0.6 | 9 | 0 | 1.0 | 18 | 0 | 1.0 | 15 | 0 | 1.0 | 7 | 0 | 1.0 | 10 |
| SA Local default | 181 | 0 | 0.3 | 259 | 0 | 0.3 | 271 | 0 | 0.3 | 226 | 0 | 0.3 | 161 | 0 | 0.3 | 10 |
| SA Local tweaked | 53 | 0 | 0.3 | 43 | 0 | 0.3 | 50 | 0 | 0.3 | 44 | 0 | 0.3 | 37 | 0 | 0.3 | 10 |
(Last week's results have been included for comparison). The "default" results are actually a mistake on your editor's part, but, since they illustrate an interesting point, they have been included in the above table.
When SpamAssassin runs its bayesian filter on a message, it encodes the results as if a specific rule had fired. If the filter is absolutely convinced that the message is good, the score is adjusted by the value attached to the BAYES_00 rule. For obviously spam messages, BAYES_99 comes into play; there are several levels between the two as well. SpamAssassin, out of the box, assigns 3.5 points to BAYES_99. Since five points are required, by default, to condemn a message, the bayesian filter can never do that on its own. Any message, to be considered spam, must trigger some tests outside of the bayesian filter.
The "default" results, above, came about because your editor got a little over-zealous when clearing out the bayesian and whitelist databases for a new round of tests; so they use the default scoring for BAYES_99. The "tweaked" results, instead, have the score for that rule raised to 5.0 points, allowing the bayesian filter to condemn mail on its own. The difference in the results can be clearly seen from the table: spam filtering performance is vastly improved, with no false positives. With the default configuration, local-only SpamAssassin had the second-worst false negative rate of all the filters tested. Your editor is at a loss to understand why SpamAssassin comes configured to allow the bayesian filter to be bypassed so easily.
Back to the original point of running this test: putting SpamAssassin into the "local tests only" mode clearly worsens performance significantly, while also improving run time.
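For the record, no rule editing is needed to get that mode; the spamassassin command-line tool has a switch for it. A minimal invocation, assuming a single message on standard input:

```
# -L (--local) skips DNS blacklists and other network-based tests
spamassassin -L < message.txt > checked-message.txt
```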
A number of people were dismayed at the omission of Popfile, a proxy-based filter coded in Perl. Popfile is intended to sit between the mail client and the POP or IMAP server; it filters mail before presenting it to the user. It includes a built-in web server which provides filtering statistics and allows the user to perform training.
Perhaps the most interesting feature in Popfile, however, is its approach to filtering. While the other filters reviewed are very much oriented around filtering spam, Popfile tries to be more general. So, instead of filtering into just two categories (plus the "unsure" result provided by a number of filters), Popfile can handle an arbitrary number of categories. So it not only picks out the spam, but it can sort the rest of a mail stream based on whatever criteria the user might set. This approach makes Popfile a potentially more useful tool, but it has implications for its spam-filtering performance, as will be seen from the testing results.
Your editor tested Popfile 0.22.4, using its standalone "pipe" and "insert" tools.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| Popfile | 0 | 21 | 1.0 | 0 | 16 | 1.1 | 1 | 24 | 1.0 | 0 | 10 | 1.0 | 1 | 12 | 1.0 | 10 |
| PF learn all | 0 | 28 | 2.8 | 0 | 28 | 3.5 | 0 | 44 | 4.2 | 0 | 16 | 5.0 | 0 | 18 | 5.9 | 40 |
On one hand, Popfile was the most effective at removing spam of any of the filters reviewed; its false negative rate is almost zero. On the other hand, the false positive rate was high - unacceptably so. Popfile normally uses a "train on errors" approach; your editor ran a second test where the filter was trained on every message just to see if that would help with the false positive rate. Instead, that rate got worse, and the filter slowed down to a glacial pace. Clearly Popfile and comprehensive training were not meant to go together.
Your editor has a hypothesis explaining the behavior seen here. Bayesian filters which concern themselves only with spam have a built-in bias: false positives are bad and must be avoided. Popfile, instead, has no notion of a "false positive"; it only has various "buckets" into which mail can be sorted. The tool does not understand that some types of errors are worse than others. So, while most filters will err on the side of false negatives, Popfile just goes for whatever seems right. As a result, it catches more spam - and more of everything else.
From this experience, your editor has concluded that spam filtering should be done independently from any other sort of mail sorting. If bayesian filters are to be used for sorting of legitimate mail, it might be best to use two separate filters in series.
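To make the "two filters in series" idea concrete, here is a rough procmail sketch: SpamAssassin makes the spam decision first, and only mail that survives is handed to a separate bucket-style sorter. The sort-filter command and its X-Sorted-Into header are placeholders invented for this example, not a real tool:

```
# Stage one: spam filtering.  SpamAssassin adds an X-Spam-Status header.
:0 fw
| spamassassin

:0:
* ^X-Spam-Status: Yes
spam

# Stage two: sort the remaining legitimate mail with a separate classifier.
# "sort-filter" and "X-Sorted-Into" are hypothetical placeholders.
:0 fw
| sort-filter

:0:
* ^X-Sorted-Into: lkml
lists/lkml
```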
SpamOracle is a straightforward Graham-style bayesian filter. It happens to be written in Caml, leading your editor to go looking for compilers; Fedora Extras came through nicely on that front. Initial training is easy and fast, and SpamOracle works well with procmail.
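The procmail hookup follows the usual mark-and-match pattern. A sketch along those lines, assuming that "spamoracle mark" adds an "X-Spam: yes" header to messages it judges to be spam (its documented behavior); the spambox destination is just an example folder:

```
# Have SpamOracle annotate each message with an X-Spam header...
:0 fw
| spamoracle mark

# ...then file the ones it marked as spam into a separate folder.
:0:
* ^X-Spam: yes
spambox
```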
As a filter, however, it is not one of the more effective ones. Your editor ran two tests on SpamOracle v1.4, using train-on-errors and comprehensive training strategies.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamOracle TOE | 462 | 0 | 0.1 | 546 | 0 | 0.1 | 445 | 0 | 0.1 | 463 | 0 | 0.1 | 343 | 0 | 0.1 | 1.1 |
| SpamOracle comp | 461 | 0 | 0.2 | 511 | 0 | 0.2 | 433 | 0 | 0.2 | 420 | 0 | 0.2 | 339 | 0 | 0.3 | 2.6 |
As can be seen here, SpamOracle is fast, and it manages to avoid false positives altogether. Its filtering rate is poor, however, to the point that your editor would not want to have to depend on it to hold the spam stream at bay. Comprehensive training slowed the process down significantly, but did not improve the results in any appreciable way.
There were some requests that Thunderbird be included in this evaluation. The problem is that Thunderbird's filter is buried deep within a monolithic graphical application, making it difficult to test in any sort of automated manner. Your editor, being the lazy person that he is, has no inclination to click through 15,000 messages to evaluate how well Thunderbird has classified them.
As it happens, your editor uses Thunderbird for a low-bandwidth mail account which receives a mere 100 spams per day or so. The Thunderbird interface is certainly convenient; there is a nice "junk" button for training the filter (though the way it toggles to "not junk" can be confusing). Thunderbird can be configured to automatically sideline spam into a folder, and to age messages out of that folder after a given time. False positives are rare, in your editor's experience, but the false negative rate is relatively high. It is also impossible, as far as your editor can tell, to get any information on the filter and how it makes its decisions.
Here is the updated table, with the new and old results:
| Test | False neg. | False pos. | Time | Size |
|---|---|---|---|---|
| bogofilter | 406 (5.5%) | | 0.02 | 5 |
| bogofilter -u | 268 (3.0%) | | 0.06 | 32 |
| CRM114 | 14 (0.1%) | 16 (0.3%) | 0.06 | 24 |
| CRM114 pretrain | 14 (0.2%) | 15 (0.3%) | 0.06 | 24 |
| DSPAM teft | 50 (0.6%) | | 0.1 | 305 |
| DSPAM toe | 67 (0.7%) | 15 (0.3%) | 0.1 | 276 |
| DSPAM tum | 83 (0.9%) | | 0.1 | 305 |
| Popfile | 2 (0.02%) | 83 (1.4%) | 1.0 | 10 |
| Popfile comp | 0 (0%) | 144 (2.4%) | 4.3 | 40 |
| SpamAssassin | 21 (0.2%) | | 1.1 | 10 |
| SpamAssassin untrained | 81 (0.9%) | | 0.9 | 10 |
| SpamAssassin local default | 1098 (12.2%) | | 0.3 | 10 |
| SpamAssassin local tweaked | 227 (2.5%) | | 0.3 | 10 |
| SpamBayes | 185 (2.1%) | 1 (0.02%) | 0.4 | 4 |
| SpamBayes comp | 294 (3.3%) | | 0.8 | 16 |
| SpamOracle TOE | 2259 (25.1%) | | 0.1 | 1.1 |
| SpamOracle comp | 2164 (24.0%) | | 0.2 | 2.6 |
| SpamProbe train | 222 (2.5%) | 3 (0.05%) | 0.1 | 81 |
| SpamProbe receive | 257 (2.9%) | 4 (0.07%) | 0.7 | 201 |
There is little in the new results to change the conclusions arrived at last week. The filters which stand out are SpamAssassin (in some modes at least), and DSPAM. Most of the others demonstrated overly high error rates, either with false negatives (annoying) or false positives (unacceptable). Stay tuned, however; there is clearly a great deal of work being done in this area.
Posted Mar 2, 2006 3:59 UTC (Thu) by clump (subscriber, #27801) [Link]
I just upgraded Spamassassin on a Debian Sarge box using a copy from backports.org. I am amazed at how effective the tool is.
Posted Mar 2, 2006 5:33 UTC (Thu) by fyodor (guest, #3481) [Link]
Jonathan does a great job with this and his other "grumpy editor" articles, but I cannot agree with his repeated suggestion to increase the BAYES_99 rule score for "allowing the bayesian filter to condemn mail on its own". The long description for that rule is "Bayesian spam probability is 99 to 100%". In other words, you may very well see 1% or more of your legitimate mail falling into that bucket.
Of about 1,850 non-spam mails I felt were important enough to keep in February, 36 of them (2%) were BAYES_99 false positives. And I'm probably missing some that were wrongly moved to my spam folder (which gets tens of thousands of messages per day, so I never check it). This includes many mails from non-hacker friends and neighbors, many legitimate sales-related queries, party invitations, Nmap questions, and all sorts of other things. I'm reasonably good about training it when I catch mistakes too. The more diverse your mail spool is, the tougher the filter's job will be. It may work great on Jon's mail, but be sure to test it with your own before blithely following his advice and increasing the BAYES_99 score.
Posted Mar 2, 2006 10:44 UTC (Thu) by nix (subscriber, #2304) [Link]
Nonetheless, the consensus among SA developers is that the high-probability BAYES_* rules *are* scored too low. They shouldn't be 5.0, but they should be higher than the default.
The likelihood is that the perceptron is being fooled by one simple effect. Before SA is released, its regex rules are especially effective: major spammers definitely tune their messages to avoid these rules in released SA versions, but they don't tune their messages to avoid random rules in the unreleased development tree, and of course new rules in the development tree are effective or they'd never have been added at all. As such, very spammy messages in the mass-check corpus (used for perceptron training) tend to hit a lot more rules than they do after SA has been released for a time. The perceptron can't know this: it assumes that in general messages which hit BAYES_99 hit a lot of other rules too. So it adjusts the score of BAYES_99 downwards to avoid the minuscule number of FPs tripped by that rule, since the real spams are being caught by the regex rules too.
But after that SA version has been released for a time, lots of spam highly-scored by Bayes misses the regex rules (as the spammers have tuned their messages to avoid them), so BAYES_99 suddenly has to carry a lot more weight.
As a consequence, the high-scoring BAYES_ rules are no longer perceptron-scored in SA-3.2-to-be: they're nailed to specific values (a score of 4.something, I think) and the perceptron must adjust all the regex rules around them.
Let's see if this works better. :)
(disclaimer: I am not an SA dev, just an interested observer. I know some SA devs are LWN subscribers, so presumably they can say if I'm talking rubbish.)
Posted Mar 2, 2006 12:54 UTC (Thu) by kitterma (guest, #4448) [Link]
I actually increase the scores for BAYES_80 and higher to be sufficient to condemn mail on its own with a very low false positive rate.
I use the SA whitelist facility for domains that I regularly receive important mail from and are not routinely spoofed, so they don't get caught.
For other messages, good messages generally have some other negative scoring rules to save them from the Bayesian filter.
Scott K
Posted Mar 2, 2006 15:47 UTC (Thu) by corbet (editor, #1) [Link]
I suppose I should have emphasized that tweaking the rules always carries risks. That said, it's worth pointing out that every other bayesian filter reviewed will condemn mail with much less than a BAYES_99 level of assurance. Tweaking just BAYES_99 still results in a relatively conservative configuration.
Posted Mar 2, 2006 16:53 UTC (Thu) by pimlott (guest, #1535) [Link]
> The long description for that rule is "Bayesian spam probability is 99 to 100%".

I strongly doubt that this description is statistically justified. I know just enough about Bayesian statistics to know that bogofilter, at least, takes numerous liberties with the math, and I suspect that spamassassin is the same. I'm sure nobody has rigorously analyzed the actual "Bayes-inspired" system that is implemented, so it's probably more honest to say that BAYES_99 is just "a high score".
Posted Mar 3, 2006 10:40 UTC (Fri) by Stephen_Beynon (subscriber, #4090) [Link]
I disagree that increased BAYES_99 score is dangerous. I set the BAYES_99
Posted Apr 29, 2006 21:01 UTC (Sat) by gvc (guest, #37441) [Link]
I disagree. Here's a setup I've used for about 3 years (with no updates at all) and it works very well:
http://plg.uwaterloo.ca/~gvcormac/spamassassin
Note that SpamAssassin's self-training is broken and this setup force-feeds it its own judgements (which are later corrected by the user in case errors are noticed).
Posted Mar 2, 2006 14:59 UTC (Thu) by zmi (guest, #4829) [Link]
Hi, I got this reply on the spamassassin user mailing list
Statistically speaking, 1% of BAYES_99 hits should be nonspam. In reality, it does a lot better than that.
However, in the SA 3.1.0 set3 mass checks it still managed to match
about 21 messages in the nonspam test set:
| OVERALL% | SPAM% | HAM% | S/O | RANK | SCORE | NAME |
|---|---|---|---|---|---|---|
| 176869 | 123778 | 53091 | 0.700 | 0.00 | 0.00 | (all messages) |
| 60.712 | 86.7351 | 0.0396 | 1.000 | 0.90 | 3.50 | BAYES_99 |
SA's scores aren't based on human assumptions about how the rules
behave. They are based on real-world testing and a perceptron
score-fitting system that accounts not only for the hit-rate of the
rule, but also for the combinations of rules that it tends to match
with. Often the reality is a lot more complex than you think.
Posted Mar 2, 2006 15:24 UTC (Thu) by jmason (guest, #13586) [Link]
It's important to note that, without good training, BAYES_99 may indeed
Also, it's worth noting that "BAYES_99" doesn't really refer to a 1% probability. SpamAssassin uses the Fisher Inverse Chi-Square Procedure described at http://garyrob.blogs.com/whychi90.pdf, and as a result these are no longer true probability values -- so don't expect to see probabilistic distributions.
Great articles btw. The grumpy editor has outdone himself ;)
--j.
Posted Mar 2, 2006 15:32 UTC (Thu) by thomask (guest, #17985) [Link]
Couldn't you test Thunderbird by setting up a local IMAP server and getting it to sort email on that? For quite a long time I used Thunderbird on a remote IMAP server, and had it set up to shift spam from the inbox to a "spam" folder on the server. Whilst I've never run an IMAP server myself, I guess it should be possible that way to track how messages get filtered.
Posted Mar 2, 2006 15:43 UTC (Thu) by corbet (editor, #1) [Link]
Yes, working with a local imap server would help - but one would still have to manually train thunderbird on the 2000-message training corpus. It could be done if I had more time...
Posted Mar 3, 2006 2:57 UTC (Fri) by Mithrandir (guest, #3031) [Link]
If you already have all your spam on an IMAP server sorted into folders, you should be able to train Thunderbird by going into the folder with ham, selecting all the messages and hitting shift-j (not junk). Go get coffee (or probably more accurately, lunch). Then go into the folder with spam, select all the messages, and hit j (junk).
Make sure you do this before you tell Thunderbird to move messages based on its own determination of whether it's spam or not. Otherwise it'll start moving messages around before it's been properly trained.
Posted Mar 2, 2006 16:16 UTC (Thu) by glouis (guest, #526) [Link]
As my remark last week implied, comparison between Bayesian filters as they operate "out of the box" can be misleading. Bogofilter's performance can be improved by an order of magnitude if appropriate training and tuning precede the performance test. The fact that it works moderately well "out of the box" in a wide variety of configurations is convenient for the newbie, but should not encourage the user to remain a newbie.
A comparison among carefully-trained (and if need be, tuned) Bayesian spam filters would obviously be somewhat difficult to organize, but would do your readers a much better service, IMHO. (I'd love to see the results of such a comparison myself!)
Posted Mar 2, 2006 16:22 UTC (Thu) by corbet (editor, #1) [Link]
FWIW, the filters *were* "carefully trained." Just over 2000 messages were pulled out of the stream and used only for that purpose. They were well inspected to avoid mistraining the filter. How much more careful does one need to be?
I did avoid tweaking the various knobs exported by some of the filters, with the well-documented SpamAssassin exception. I believe that was the right choice: most users (even those who are not "newbies") are unlikely to mess with them, and the defaults should be reasonable.
Posted Mar 2, 2006 17:07 UTC (Thu) by glouis (guest, #526) [Link]
We could discuss this further here but I'd like to refer you to http://www.bgl.nu/bogofilter/tuning.html
Posted Mar 4, 2006 7:38 UTC (Sat) by dh (subscriber, #153) [Link]
Hi all,
Posted Mar 9, 2006 2:22 UTC (Thu) by stock (guest, #5849) [Link]
DSPAM, the original market leader, rules without a shadow of a doubt:
[hubble:root]:(~)# dspam_stats -H
stock:
TS Total Spam: 36268
TI Total Innocent: 5586
SM Spam Misclassified: 54
IM Innocent Misclassified: 20
SC Spam Corpusfed: 4817
IC Innocent Corpusfed: 5126
TL Training Left: 0
SR Spam Catch Rate: 99.85%
IR Innocent Catch Rate: 99.64%
OR Overall Rate/Accuracy: 99.82%
[hubble:root]:(~)# dspam --version
DSPAM Anti-Spam Suite 3.4.9 (agent/library)
Copyright (c) 2002-2004 Network Dweebs Corporation
http://www.nuclearelephant.com/projects/dspam/
Posted Mar 9, 2006 2:30 UTC (Thu) by stock (guest, #5849) [Link]
database size is not too bad either:
[hubble:root]:(/var/lib/mysql/dspamdb)# ll
total 138712
-rw-rw---- 1 mysql mysql 8614 Sep 4 2005 dspam_preferences.frm
-rw-rw---- 1 mysql mysql 0 Sep 4 2005 dspam_preferences.MYD
-rw-rw---- 1 mysql mysql 1024 Sep 4 2005 dspam_preferences.MYI
-rw-rw---- 1 mysql mysql 8674 Sep 4 2005 dspam_signature_data.frm
-rw-rw---- 1 mysql mysql 64253584 Mar 9 03:21 dspam_signature_data.MYD
-rw-rw---- 1 mysql mysql 348160 Mar 9 03:21 dspam_signature_data.MYI
-rw-rw---- 1 mysql mysql 8948 Sep 4 2005 dspam_stats.frm
-rw-rw---- 1 mysql mysql 175 Mar 9 03:21 dspam_stats.MYD
-rw-rw---- 1 mysql mysql 2048 Dec 4 02:57 dspam_stats.MYI
-rw-rw---- 1 mysql mysql 8686 Sep 4 2005 dspam_token_data.frm
-rw-rw---- 1 mysql mysql 37192298 Mar 9 03:21 dspam_token_data.MYD
-rw-rw---- 1 mysql mysql 40018944 Mar 9 03:21 dspam_token_data.MYI
[hubble:root]:(/var/lib/mysql/dspamdb)#
Posted Apr 29, 2006 20:59 UTC (Sat) by gvc (guest, #37441) [Link]
Could I convince you to prepare your corpus so that it can be used with the TREC spam evaluation toolkit? TREC ran a Spam Track in 2005 and will run another in 2006. It uses a standard interface to test filters so that the data can be kept private.
The toolkit web page has setups for eight open source filters, including most of the ones you tested. If you were to implement the toolkit you could run those filters and post comparative results, and also could test new filters.
I invite you and others to participate in TREC 2006. A letter of intent is needed right away (the official deadline has passed, but that's OK). "Normal" participation involves submitting a filter, and also running the same filter on a public corpus that will be released. The submitted filter is run on blind datasets. Therefore there are "special" participants who actually test the filters on their private data and send in the results.
Posted Oct 3, 2006 19:52 UTC (Tue) by glardner (guest, #40903) [Link]
I am surprised you did not get a better result with POPFile. I have been using POPFile for about three years. Since November 2004 the stats are:
Messages classified 26,602
False positives 8 = 0.030%
False negatives 15 = 0.056%
Total false classifications 23 = 0.086%
Unclassified 167 = 0.628%
The number of false classifications was lower before some new type of spam started arriving earlier this year; before that, the total accuracy (excluding unclassified messages) was 99.98%; now it is just better than 99.91%.
I use POPFile in conjunction with Outlook; I have the Outclass user interface (from Vargonsoft) installed, which makes adjusting the training of POPFile very simple when it mis-categorises a message. I just wish there was an interface for POPFile similar to Outclass, but which works with Mozilla Thunderbird.
Gerard Lardner