The Grumpy Editor's guide to
bayesian spam filters was published one week ago. As has become
traditional, it would seem, LWN readers have pointed out tools which evaded
your editor's first pass. So here is the inevitable followup with a couple
more filters and an updated table at the end.
SpamAssassin
One commenter complained last week about your editor having run
SpamAssassin with the network tests enabled. The original reasoning had
been that SpamAssassin, by its nature, comes with a large set of rules,
and, for the purpose of the review, selectively disabling some of them was
not appropriate. Still, the network tests do have a couple of important
effects on the end result. As will be seen below, they make the filter
much more effective; in your editor's experience, the source blacklists
earn most of the credit there. But they also slow things down.
Your editor re-ran the test with network tests disabled, with the following
results:
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamAssassin | 8 | 0 | 1.1 | 3 | 0 | 1.1 | 5 | 0 | 1.1 | 3 | 0 | 1.0 | 2 | 0 | 1.0 | 10 |
| SA untrained | 32 | 0 | 0.6 | 9 | 0 | 1.0 | 18 | 0 | 1.0 | 15 | 0 | 1.0 | 7 | 0 | 1.0 | 10 |
| SA Local default | 181 | 0 | 0.3 | 259 | 0 | 0.3 | 271 | 0 | 0.3 | 226 | 0 | 0.3 | 161 | 0 | 0.3 | 10 |
| SA Local tweaked | 53 | 0 | 0.3 | 43 | 0 | 0.3 | 50 | 0 | 0.3 | 44 | 0 | 0.3 | 37 | 0 | 0.3 | 10 |
(Last week's results, in the first two rows, have been included for
comparison.) The "default"
results are actually a mistake on your editor's part, but, since they
illustrate an interesting point, they have been included in the above
table.
When SpamAssassin runs its bayesian filter on a message, it encodes the
results as if a specific rule had fired. If the filter is absolutely
convinced that the message is good, the score is adjusted by the value
attached to the BAYES_00 rule. For obviously spam messages,
BAYES_99 comes into play; there are several levels between the two
as well. SpamAssassin, out of the box, assigns 3.5 points to
BAYES_99. Since five points are required, by default, to condemn
a message, the bayesian filter can never do that on its own. Any message,
to be considered spam, must trigger some tests outside of the bayesian
filter.
The "default" results, above, came about because your editor got a little
over-zealous when clearing out the bayesian and whitelist databases for a
new round of tests; so they use the default scoring for BAYES_99.
The "tweaked" results, instead, have the score for that rule raised to 5.0
points, allowing the bayesian filter to condemn mail on its own. The
difference in the results can be clearly seen from the table: spam
filtering performance is vastly improved, with no false positives. With
the default configuration, local-only SpamAssassin had the second-worst
false negative rate of all the filters tested. Your
editor is at a loss to understand why SpamAssassin comes configured to
allow the bayesian filter to be bypassed so easily.
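The arithmetic behind the "default" and "tweaked" rows can be sketched in a few lines of Python. This is only an illustration of additive rule scoring against a threshold, not SpamAssassin's actual code; the function name `is_spam` and the example message (one on which only the bayesian rule fires) are assumptions, while the 3.5, 5.0, and five-point figures come from the text above.

```python
REQUIRED_SCORE = 5.0  # default threshold quoted in the article


def is_spam(rule_scores, fired_rules, required=REQUIRED_SCORE):
    """Sum the scores of the rules that fired and compare to the threshold."""
    total = sum(rule_scores[name] for name in fired_rules)
    return total >= required


default_scores = {"BAYES_99": 3.5}  # out-of-the-box value, per the article
tweaked_scores = {"BAYES_99": 5.0}  # the editor's adjustment

# Hypothetical message on which only the bayesian rule fires.
fired = ["BAYES_99"]

print(is_spam(default_scores, fired))  # False: 3.5 is short of 5.0
print(is_spam(tweaked_scores, fired))  # True: the filter can condemn alone
```

In a real installation the equivalent change would normally be a single `score BAYES_99 5.0` line in the local SpamAssassin configuration, rather than any code.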
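Back to the original point of running this test: putting SpamAssassin into
the "local tests only" mode clearly hurts filtering accuracy (227 false
negatives with the tweaked scoring, against 21 with the network tests
enabled), while cutting the time figure from about 1.1 to 0.3.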
Popfile
A number of people were dismayed at the omission of Popfile, a proxy-based filter
coded in Perl. Popfile is intended to sit between the mail client and the
POP or IMAP server; it filters mail before presenting it to the user. It
includes a built-in web server which provides filtering statistics and
allows the user to perform training.
Perhaps the most interesting feature in Popfile, however, is its approach
to filtering. While the other filters reviewed are very much oriented
around filtering spam, Popfile tries to be more general. So, instead of
filtering into just two categories (plus the "unsure" result provided by a
number of filters), Popfile can handle an arbitrary number of categories.
So it not only picks out the spam, but it can sort the rest of a mail
stream based on whatever criteria the user might set. This approach makes
Popfile a potentially more useful tool, but it has implications for its spam
filtering performance, as will be seen from the testing results.
Your editor tested Popfile 0.22.4, using its standalone "pipe" and "insert"
tools.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| Popfile | 0 | 21 | 1.0 | 0 | 16 | 1.1 | 1 | 24 | 1.0 | 0 | 10 | 1.0 | 1 | 12 | 1.0 | 10 |
| PF learn all | 0 | 28 | 2.8 | 0 | 28 | 3.5 | 0 | 44 | 4.2 | 0 | 16 | 5.0 | 0 | 18 | 5.9 | 40 |
On one hand, Popfile was the most effective at removing spam of any of the
filters reviewed; its false negative rate is almost zero. On the other
hand, the false positive rate was high - unacceptably so. Popfile normally
uses a "train on errors" approach; your editor ran a second test where the
filter was trained on every message just to see if that would help with the
false positive rate. Instead, that rate got worse, and the filter slowed
down to a glacial pace. Clearly Popfile and comprehensive training were
not meant to go together.
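The difference between the two training strategies can be sketched with a toy classifier. Nothing here reflects Popfile's real implementation: `TinyFilter`, its word-count scoring, and the four sample messages are all hypothetical stand-ins, and the `trained` counter is only a rough proxy for how much data a filter accumulates under each strategy.

```python
from collections import Counter


class TinyFilter:
    """Toy word-count classifier; a stand-in, not any real filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = 0  # number of messages fed to train()

    def classify(self, text):
        words = text.lower().split()
        spam = sum(self.counts["spam"][w] for w in words)
        ham = sum(self.counts["ham"][w] for w in words)
        return "spam" if spam > ham else "ham"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.trained += 1


def run(messages, train_on_errors):
    """Classify each message, then train either on errors only or on all."""
    f = TinyFilter()
    for text, label in messages:
        verdict = f.classify(text)
        if not train_on_errors or verdict != label:
            f.train(text, label)
    return f


messages = [
    ("buy pills now", "spam"),
    ("meeting at noon", "ham"),
    ("buy pills cheap", "spam"),
    ("lunch meeting", "ham"),
]

toe = run(messages, train_on_errors=True)
full = run(messages, train_on_errors=False)
print(toe.trained, full.trained)  # 1 4
```

Even in this tiny run the train-on-everything filter stores four times as much training data, which fits the general pattern of comprehensive training producing a larger and slower filter.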
Your editor has a hypothesis explaining the behavior seen here. Bayesian
filters which concern themselves only with spam have a built-in bias: false
positives are bad and must be avoided. Popfile, instead, has no notion of
a "false positive"; it only has various "buckets" into which mail can be
sorted. The tool does not understand that some types of errors are worse
than others. So, while most filters will err on the side of false
negatives, Popfile just goes for whatever seems right. As a result, it
catches more spam - and more of everything else.
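The hypothesis can be made concrete with a small sketch. It contrasts a spam-only filter that demands high confidence before condemning a message with a bucket sorter that simply picks the most probable bucket. The 0.9 threshold, the bucket names, and the probabilities are all invented for illustration; neither function is taken from Popfile or any other filter.

```python
def spam_only_verdict(p_spam, threshold=0.9):
    """Two-class filter biased against false positives (hypothetical threshold)."""
    return "spam" if p_spam >= threshold else "ham"


def bucket_verdict(bucket_probs):
    """Multi-bucket sorter: pick the most probable bucket, no error weighting."""
    return max(bucket_probs, key=bucket_probs.get)


# A legitimate but somewhat spammy-looking message (made-up probabilities).
probs = {"spam": 0.40, "work": 0.35, "lists": 0.25}

print(spam_only_verdict(probs["spam"]))  # ham
print(bucket_verdict(probs))             # spam
```

The same message is passed by the cautious filter and condemned by the sorter, which is the asymmetry described above: fewer false negatives, at the price of more false positives.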
From this experience, your editor has concluded that spam filtering should
be done independently from any other sort of mail sorting. If bayesian
filters are to be used for sorting of legitimate mail, it might be best to
use two separate filters in series.
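A minimal sketch of the "two filters in series" arrangement follows. The two toy functions are hypothetical placeholders for independently trained filters; only the control flow (spam decision first, sorting second) is the point.

```python
def classify_in_series(message, spam_filter, sorter):
    """Run the spam filter first; only non-spam reaches the sorter."""
    if spam_filter(message):
        return "spam"
    return sorter(message)


# Hypothetical stand-ins for two separately trained filters.
def toy_spam_filter(message):
    return "viagra" in message.lower()


def toy_sorter(message):
    return "lists" if "[lwn]" in message.lower() else "inbox"


print(classify_in_series("Cheap VIAGRA here", toy_spam_filter, toy_sorter))
print(classify_in_series("[LWN] Weekly edition", toy_spam_filter, toy_sorter))
print(classify_in_series("Lunch tomorrow?", toy_spam_filter, toy_sorter))
```

With this split, the first stage can keep its bias against false positives while the second stage is free to sort legitimate mail without that concern.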
SpamOracle
SpamOracle is
a straightforward Graham-style bayesian filter. It happens to be written
in Caml, leading your editor to go looking for compilers; Fedora Extras
came through nicely on that front. Initial training is easy and fast, and
SpamOracle works well with procmail.
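For readers unfamiliar with the term, "Graham-style" refers to the approach in Paul Graham's essay "A Plan for Spam": per-token spam probabilities are combined, using only the most "interesting" tokens (those farthest from 0.5). The sketch below follows that essay's combining rule, fifteen-token limit, and clamping to the 0.01-0.99 range; it is not taken from SpamOracle's source, and whether SpamOracle uses these exact constants is an assumption. The function name and sample probabilities are made up.

```python
from math import prod  # Python 3.8 or later


def graham_score(token_probs, max_tokens=15):
    """Combine per-token spam probabilities in the style of Graham's essay."""
    # Clamp so that no single token can force a probability of exactly 0 or 1.
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    # Keep the tokens whose probabilities lie farthest from the neutral 0.5.
    interesting = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:max_tokens]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


# Two strongly spammy tokens outweigh one mildly innocent token.
print(graham_score([0.99, 0.95, 0.20]))
```

A message would then be called spam when the combined score exceeds some cutoff; Graham's essay uses 0.9.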
As a filter, however, it is not one of the more effective ones. Your
editor ran two tests on SpamOracle v1.4, using train-on-errors and
comprehensive training strategies.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamOracle TOE | 462 | 0 | 0.1 | 546 | 0 | 0.1 | 445 | 0 | 0.1 | 463 | 0 | 0.1 | 343 | 0 | 0.1 | 1.1 |
| SpamOracle comp | 461 | 0 | 0.2 | 511 | 0 | 0.2 | 433 | 0 | 0.2 | 420 | 0 | 0.2 | 339 | 0 | 0.3 | 2.6 |
As can be seen here, SpamOracle is fast, and it manages to avoid false
positives altogether. Its filtering rate is poor, however: hundreds of
spams slipped through in every batch, to the point
that your editor would not want to have to depend on it to hold the spam
stream at bay. Comprehensive training roughly doubled the processing time,
but did not improve the results in any appreciable way.
Thunderbird
There were some requests that Thunderbird be included in this evaluation.
The problem is that Thunderbird's filter is buried deep within a
monolithic graphical
application, making it difficult to test in any sort of automated manner.
Your editor, being the lazy person that he is, has no inclination to click
through 15,000 messages to evaluate how well Thunderbird has classified
them.
As it happens, your editor uses Thunderbird for a low-bandwidth mail
account which receives a mere 100 spams per day or so. The Thunderbird
interface is certainly convenient; there is a nice "junk" button for
training the filter (though the way it toggles to "not junk" can be
confusing). Thunderbird can be configured to automatically sideline spam
into a folder, and to age messages out of that folder after a given time.
False positives are rare, in your editor's experience, but the false
negative rate is relatively high. It is also impossible, as far as your
editor can tell, to get any information on the filter and how it makes its
decisions.
Conclusion
Here is the updated table, with the new and old results:
| Test | False neg. | False neg. % | False pos. | False pos. % | Time | Size |
| bogofilter | 406 | 5.5% | | | 0.02 | 5 |
| bogofilter -u | 268 | 3.0% | | | 0.06 | 32 |
| CRM114 | 14 | 0.1% | 16 | 0.3% | 0.06 | 24 |
| CRM114 pretrain | 14 | 0.2% | 15 | 0.3% | 0.06 | 24 |
| DSPAM teft | 50 | 0.6% | | | 0.1 | 305 |
| DSPAM toe | 67 | 0.7% | 15 | 0.3% | 0.1 | 276 |
| DSPAM tum | 83 | 0.9% | | | 0.1 | 305 |
| Popfile | 2 | 0.02% | 83 | 1.4% | 1.0 | 10 |
| Popfile comp | 0 | 0% | 144 | 2.4% | 4.3 | 40 |
| SpamAssassin | 21 | 0.2% | | | 1.1 | 10 |
| SpamAssassin untrained | 81 | 0.9% | | | 0.9 | 10 |
| SpamAssassin local default | 1098 | 12.2% | | | 0.3 | 10 |
| SpamAssassin local tweaked | 227 | 2.5% | | | 0.3 | 10 |
| SpamBayes | 185 | 2.1% | 1 | 0.02% | 0.4 | 4 |
| SpamBayes comp | 294 | 3.3% | | | 0.8 | 16 |
| SpamOracle TOE | 2259 | 25.1% | | | 0.1 | 1.1 |
| SpamOracle comp | 2164 | 24.0% | | | 0.2 | 2.6 |
| SpamProbe train | 222 | 2.5% | 3 | 0.05% | 0.1 | 81 |
| SpamProbe receive | 257 | 2.9% | 4 | 0.07% | 0.7 | 201 |
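The totals and percentages in this table can be cross-checked against the per-batch figures given earlier. The sketch below does so for three of the new rows. The corpus size is not stated in this follow-up: the figure of 9,000 spam messages is an assumption, inferred from the published percentages, and the helper name `summarize` is made up.

```python
# Per-batch false-negative counts copied from the tables earlier in the article.
batches = {
    "SpamAssassin local default": [181, 259, 271, 226, 161],
    "SpamAssassin local tweaked": [53, 43, 50, 44, 37],
    "SpamOracle TOE": [462, 546, 445, 463, 343],
}

SPAM_TOTAL = 9000  # ASSUMPTION: inferred from the percentages, not stated


def summarize(counts, population=SPAM_TOTAL):
    """Return the total count and its percentage of the assumed population."""
    total = sum(counts)
    return total, round(100.0 * total / population, 1)


for name, counts in batches.items():
    total, pct = summarize(counts)
    print(f"{name}: {total} false negatives, {pct}%")
```

Under that assumption the three rows come out at 1098 (12.2%), 227 (2.5%), and 2259 (25.1%), matching the table.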
There is little in the new results to change the conclusions arrived at
last week. The filters which stand out are SpamAssassin (in some modes, at
least) and DSPAM. Most of the others demonstrated overly high error
rates, either with false negatives (annoying) or false positives
(unacceptable). Stay tuned, however; there is clearly a great deal of work
being done in this area.