The Grumpy Editor's guide to
bayesian spam filters was published one week ago. As has become
traditional, it would seem, LWN readers have pointed out tools which evaded
your editor's first pass. So here is the inevitable followup with a couple
more filters and an updated table at the end.
SpamAssassin
One commenter complained last week about your editor having run
SpamAssassin with the network tests enabled. The original reasoning had
been that SpamAssassin, by its nature, comes with a large set of rules,
and, for the purpose of the review, selectively disabling some of them was
not appropriate. Still, the network tests do have a couple of important
effects on the end result. As will be seen below, they make the filter
much more effective; in your editor's experience, the source blacklists
earn most of the credit there. But they also slow things down.
Your editor re-ran the test with network tests disabled, with the following
results:
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamAssassin | 8 | 0 | 1.1 | 3 | 0 | 1.1 | 5 | 0 | 1.1 | 3 | 0 | 1.0 | 2 | 0 | 1.0 | 10 |
| SA untrained | 32 | 0 | 0.6 | 9 | 0 | 1.0 | 18 | 0 | 1.0 | 15 | 0 | 1.0 | 7 | 0 | 1.0 | 10 |
| SA Local default | 181 | 0 | 0.3 | 259 | 0 | 0.3 | 271 | 0 | 0.3 | 226 | 0 | 0.3 | 161 | 0 | 0.3 | 10 |
| SA Local tweaked | 53 | 0 | 0.3 | 43 | 0 | 0.3 | 50 | 0 | 0.3 | 44 | 0 | 0.3 | 37 | 0 | 0.3 | 10 |
(Last week's results, the first two rows, have been included for comparison.) The "default"
results are actually a mistake on your editor's part, but, since they
illustrate an interesting point, they have been included in the above
table.
When SpamAssassin runs its bayesian filter on a message, it encodes the
results as if a specific rule had fired. If the filter is absolutely
convinced that the message is good, the score is adjusted by the value
attached to the BAYES_00 rule. For messages which are obviously spam,
BAYES_99 comes into play; there are several levels between the two
as well. SpamAssassin, out of the box, assigns 3.5 points to
BAYES_99. Since five points are required, by default, to condemn
a message, the bayesian filter can never do that on its own. Any message,
to be considered spam, must trigger some tests outside of the bayesian
filter.
The "default" results, above, came about because your editor got a little
over-zealous when clearing out the bayesian and whitelist databases for a
new round of tests; so they use the default scoring for BAYES_99.
The "tweaked" results, instead, have the score for that rule raised to 5.0
points, allowing the bayesian filter to condemn mail on its own. The
difference in the results can be clearly seen from the table: total false
negatives drop from 1098 to 227, still with no false positives. With
the default configuration, local-only SpamAssassin had the second-worst
false negative rate of all the filters tested, behind only SpamOracle. Your
editor is at a loss to understand why SpamAssassin comes configured to
allow the bayesian filter to be bypassed so easily.
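The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration of threshold scoring, not SpamAssassin's code: the 3.5, 5.0, and five-point figures come from the text, while the `is_spam` helper and the `OTHER_RULE` rule with its 1.6 points are made up for the example.

```python
# Sketch of threshold scoring. Only the 3.5, 5.0, and required-score
# figures come from the text; everything else here is hypothetical.
REQUIRED_SCORE = 5.0  # points needed, by default, to condemn a message


def is_spam(fired_rules, scores, required=REQUIRED_SCORE):
    """Sum the scores of the rules that fired and compare to the threshold."""
    total = sum(scores[rule] for rule in fired_rules)
    return total >= required


default_scores = {"BAYES_99": 3.5, "OTHER_RULE": 1.6}  # OTHER_RULE is invented
tweaked_scores = {"BAYES_99": 5.0, "OTHER_RULE": 1.6}

# A message the bayesian filter is sure about, with no other rule firing.
bayes_only = ["BAYES_99"]
print(is_spam(bayes_only, default_scores))  # False: 3.5 < 5.0
print(is_spam(bayes_only, tweaked_scores))  # True: 5.0 >= 5.0

# With the default score, some other rule has to contribute as well.
print(is_spam(["BAYES_99", "OTHER_RULE"], default_scores))  # True
```

With the default score, a message the bayesian filter is certain about still passes unless some other test adds at least 1.5 points.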
Back to the original point of running this test: putting SpamAssassin into
the "local tests only" mode clearly makes the filter much less effective
(227 false negatives, at best, against 21 with the network tests enabled),
while cutting the run time from about 1.1 to 0.3.
Popfile
A number of people were dismayed at the omission of Popfile, a proxy-based filter
coded in Perl. Popfile is intended to sit between the mail client and the
POP or IMAP server; it filters mail before presenting it to the user. It
includes a built-in web server which provides filtering statistics and
allows the user to perform training.
Perhaps the most interesting feature in Popfile, however, is its approach
to filtering. While the other filters reviewed are very much oriented
around filtering spam, Popfile tries to be more general. So, instead of
filtering into just two categories (plus the "unsure" result provided by a
number of filters), Popfile can handle an arbitrary number of categories.
So it not only picks out the spam, but it can sort the rest of a mail
stream based on whatever criteria the user might set. This approach makes
Popfile a potentially more useful tool, but it has implications for its spam
filtering performance, as will be seen from the testing results.
Your editor tested Popfile 0.22.4, using its standalone "pipe" and "insert"
tools.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| Popfile | 0 | 21 | 1.0 | 0 | 16 | 1.1 | 1 | 24 | 1.0 | 0 | 10 | 1.0 | 1 | 12 | 1.0 | 10 |
| PF learn all | 0 | 28 | 2.8 | 0 | 28 | 3.5 | 0 | 44 | 4.2 | 0 | 16 | 5.0 | 0 | 18 | 5.9 | 40 |
On one hand, Popfile was the most effective at removing spam of any of the
filters reviewed; its false negative rate is almost zero. On the other
hand, the false positive rate was high - unacceptably so. Popfile normally
uses a "train on errors" approach; your editor ran a second test where the
filter was trained on every message just to see if that would help with the
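false positive rate. Instead, that rate got worse (2.4% overall, against
1.4% before), and the filter slowed down to a glacial pace, with the time
figure climbing from 2.8 in the first batch to 5.9 in the last. Clearly
Popfile and comprehensive training were not meant to go together.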
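The difference between the two training strategies can be shown with a toy sketch. Nothing here reflects Popfile's internals: `ToyFilter` is a hypothetical stand-in that merely counts tokens, and the four-message stream is invented. The point is only how often each strategy updates the database.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a bayesian filter; it only counts tokens."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = 0  # number of messages fed to train()

    def classify(self, tokens):
        # Vote by which class has seen these tokens more often; ties go to ham.
        spam = sum(self.counts["spam"][t] for t in tokens)
        ham = sum(self.counts["ham"][t] for t in tokens)
        return "spam" if spam > ham else "ham"

    def train(self, tokens, label):
        self.counts[label].update(tokens)
        self.trained += 1


def run(stream, filt, train_on_errors):
    """Classify each labelled message, then train according to the strategy."""
    errors = 0
    for tokens, label in stream:
        wrong = filt.classify(tokens) != label
        errors += wrong
        # Train-on-errors learns only from mistakes; comprehensive
        # training learns from every message.
        if wrong or not train_on_errors:
            filt.train(tokens, label)
    return errors


stream = [
    (["cheap", "pills"], "spam"),
    (["kernel", "patch"], "ham"),
    (["cheap", "pills", "now"], "spam"),
    (["kernel", "release"], "ham"),
]
toe, comp = ToyFilter(), ToyFilter()
run(stream, toe, train_on_errors=True)
run(stream, comp, train_on_errors=False)
print(toe.trained, comp.trained)  # 1 4
```

Comprehensive training touches the database for every message, which is consistent with the larger size and longer times in the "learn all" row.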
Your editor has a hypothesis explaining the behavior seen here. Bayesian
filters which concern themselves only with spam have a built-in bias: false
positives are bad and must be avoided. Popfile, instead, has no notion of
a "false positive"; it only has various "buckets" into which mail can be
sorted. The tool does not understand that some types of errors are worse
than others. So, while most filters will err on the side of false
negatives, Popfile just goes for whatever seems right. As a result, it
catches more spam - and more of everything else.
From this experience, your editor has concluded that spam filtering should
be done independently from any other sort of mail sorting. If bayesian
filters are to be used for sorting of legitimate mail, it might be best to
use two separate filters in series.
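The hypothesis above, and the suggested remedy, can be sketched as follows. This is a guess at the general shape of the problem rather than a description of how Popfile works; the probabilities, the bucket names, and the 0.9 threshold are all assumptions made for illustration.

```python
def pick_bucket(probs):
    """Multi-bucket sorting: take the likeliest bucket, whatever it is."""
    return max(probs, key=probs.get)


def is_spam(p_spam, threshold=0.9):
    """Spam-only decision: condemn only when quite sure (threshold assumed)."""
    return p_spam >= threshold


def two_stage(p_spam, sort_probs):
    """Two filters in series: spam decision first, then sort what survives."""
    if is_spam(p_spam):
        return "spam"
    return pick_bucket(sort_probs)


# Hypothetical scores for one borderline, legitimate message.
probs = {"spam": 0.40, "work": 0.35, "lists": 0.25}
print(pick_bucket(probs))  # spam: a false positive

# The same message through the two-stage arrangement.
legit_buckets = {"work": 0.35, "lists": 0.25}
print(two_stage(probs["spam"], legit_buckets))  # work
```

In the single-stage case "spam" wins simply by being the largest of several modest scores; the two-stage version demands much stronger evidence before a message is condemned.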
SpamOracle
SpamOracle is
a straightforward Graham-style bayesian filter. It happens to be written
in Caml, leading your editor to go looking for compilers; Fedora Extras
came through nicely on that front. Initial training is easy and fast, and
SpamOracle works well with procmail.
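For readers unfamiliar with the term, a Graham-style filter can be sketched roughly as below. The sketch follows the scheme Paul Graham described in "A Plan for Spam" as best it can be summarized here; it is not taken from SpamOracle's source, and the token counts are invented. Treating rarely seen tokens the same as unknown ones (0.4) is a simplification.

```python
from math import prod


def token_prob(word, good, bad, ngood, nbad):
    """Per-token spam probability, roughly as in Graham's essay."""
    g = 2 * good.get(word, 0)  # ham counts doubled: a bias against false positives
    b = bad.get(word, 0)
    if g + b < 5:
        return 0.4  # simplification: rare and unknown tokens lean innocent
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_probability(tokens, good, bad, ngood, nbad, keep=15):
    """Combine the most 'interesting' token probabilities into one score."""
    probs = [token_prob(t, good, bad, ngood, nbad) for t in set(tokens)]
    # Keep the tokens whose probability lies farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:keep]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# Invented training counts: token -> number of occurrences per corpus.
good = {"kernel": 10, "patch": 8}
bad = {"pills": 12, "cheap": 9}
ngood, nbad = 20, 20  # number of ham and spam messages seen (assumed)

print(spam_probability(["cheap", "pills"], good, bad, ngood, nbad) > 0.9)   # True
print(spam_probability(["kernel", "patch"], good, bad, ngood, nbad) < 0.1)  # True
```

The doubled ham counts and the high cutoff (Graham suggested 0.9) are the sort of built-in bias against false positives that spam-only filters tend to carry.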
As a filter, however, it is not one of the more effective ones. Your
editor ran two tests on SpamOracle v1.4, using train-on-errors and
comprehensive training strategies.
| Batch: | 1 | | | 2 | | | 3 | | | 4 | | | 5 | | | |
| | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Fn | Fp | T | Size |
| SpamOracle TOE | 462 | 0 | 0.1 | 546 | 0 | 0.1 | 445 | 0 | 0.1 | 463 | 0 | 0.1 | 343 | 0 | 0.1 | 1.1 |
| SpamOracle comp | 461 | 0 | 0.2 | 511 | 0 | 0.2 | 433 | 0 | 0.2 | 420 | 0 | 0.2 | 339 | 0 | 0.3 | 2.6 |
As can be seen here, SpamOracle is fast, and it manages to avoid false
positives altogether. Its filtering rate is poor, however, to the point
that your editor would not want to have to depend on it to hold the spam
stream at bay. Comprehensive training slowed the process down
significantly, but did not improve the results in any appreciable way
(2164 false negatives against 2259).
Thunderbird
There were some requests that Thunderbird be included in this evaluation.
The problem is that Thunderbird's filter is buried deep within a
monolithic graphical
application, making it difficult to test in any sort of automated manner.
Your editor, being the lazy person that he is, has no inclination to click
through 15,000 messages to evaluate how well Thunderbird has classified
them.
As it happens, your editor uses Thunderbird for a low-bandwidth mail
account which receives a mere 100 spams per day or so. The Thunderbird
interface is certainly convenient; there is a nice "junk" button for
training the filter (though the way it toggles to "not junk" can be
confusing). Thunderbird can be configured to automatically sideline spam
into a folder, and to age messages out of that folder after a given time.
False positives are rare, in your editor's experience, but the false
negative rate is relatively high. It is also impossible, as far as your
editor can tell, to get any information on the filter and how it makes its
decisions.
Conclusion
Here is the updated table, with the new and old results:
| Test | False neg. | | False pos. | | Time | Size |
| bogofilter | 406 | 5.5% | | | 0.02 | 5 |
| bogofilter -u | 268 | 3.0% | | | 0.06 | 32 |
| CRM114 | 14 | 0.1% | 16 | 0.3% | 0.06 | 24 |
| CRM114 pretrain | 14 | 0.2% | 15 | 0.3% | 0.06 | 24 |
| DSPAM teft | 50 | 0.6% | | | 0.1 | 305 |
| DSPAM toe | 67 | 0.7% | 15 | 0.3% | 0.1 | 276 |
| DSPAM tum | 83 | 0.9% | | | 0.1 | 305 |
| Popfile | 2 | 0.02% | 83 | 1.4% | 1.0 | 10 |
| Popfile comp | 0 | 0% | 144 | 2.4% | 4.3 | 40 |
| SpamAssassin | 21 | 0.2% | | | 1.1 | 10 |
| SpamAssassin untrained | 81 | 0.9% | | | 0.9 | 10 |
| SpamAssassin local default | 1098 | 12.2% | | | 0.3 | 10 |
| SpamAssassin local tweaked | 227 | 2.5% | | | 0.3 | 10 |
| SpamBayes | 185 | 2.1% | 1 | 0.02% | 0.4 | 4 |
| SpamBayes comp | 294 | 3.3% | | | 0.8 | 16 |
| SpamOracle TOE | 2259 | 25.1% | | | 0.1 | 1.1 |
| SpamOracle comp | 2164 | 24.0% | | | 0.2 | 2.6 |
| SpamProbe train | 222 | 2.5% | 3 | 0.05% | 0.1 | 81 |
| SpamProbe receive | 257 | 2.9% | 4 | 0.07% | 0.7 | 201 |
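The summary rows appear to be simple roll-ups of the per-batch tables: false negatives and false positives summed over the five batches, and the time figure averaged. The sketch below checks that reading against two rows. The per-batch numbers are copied from the tables above; treating the Time column as a mean is an inference, not something the text states.

```python
# (Fn, Fp, T) for each of the five batches, copied from the tables above.
batches = {
    "SpamAssassin local default": [
        (181, 0, 0.3), (259, 0, 0.3), (271, 0, 0.3), (226, 0, 0.3), (161, 0, 0.3),
    ],
    "SpamOracle TOE": [
        (462, 0, 0.1), (546, 0, 0.1), (445, 0, 0.1), (463, 0, 0.1), (343, 0, 0.1),
    ],
}


def summarize(rows):
    """Sum the error counts and average the time over all batches."""
    false_neg = sum(fn for fn, _, _ in rows)
    false_pos = sum(fp for _, fp, _ in rows)
    mean_time = round(sum(t for _, _, t in rows) / len(rows), 1)
    return false_neg, false_pos, mean_time


for name, rows in batches.items():
    print(name, summarize(rows))
# SpamAssassin local default (1098, 0, 0.3)
# SpamOracle TOE (2259, 0, 0.1)
```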
There is little in the new results to change the conclusions arrived at
last week. The filters which stand out are SpamAssassin (in some modes at
least) and DSPAM. Most of the others demonstrated overly high error
rates, either with false negatives (annoying) or false positives
(unacceptable). Stay tuned, however; there is clearly a great deal of work
being done in this area.
Back in January, Novell
announced that it was releasing
the "AppArmor" security framework under the GPL. AppArmor was
developed by Immunix, which Novell acquired last year. Novell makes a
number of claims about AppArmor, but the one at the top of the list appears
to be relative simplicity: AppArmor is said to be easier to understand, configure, and
maintain than SELinux.
Dan Walsh, a Red Hat developer working on SELinux, has criticized this move:
Couldn't Novell have spent their money on making SELinux easier to
use? No, [Novell] chooses to split the user and developer community. I
am not sure what their goals are, but I feel this hurts Linux and
the open source movement.
For years, critics have claimed that Linux would fragment much like Unix
did, and that would be the downfall of the system. So far, Linux has
steadfastly refused to fragment in this manner. But now we have a Linux
developer saying that the same thing is happening. Red Hat and Novell also
appear to be taking different approaches to 3D-enhanced window systems.
Novell is pushing Xgl, Red Hat has AIGLX, and Linux users are left
wondering when and how all that activity will yield better graphics support
for them. At this level, too, it looks like Linux might finally be heading
for a breakup.
Or is it? Perhaps we are simply seeing the development community at work.
With regard to SELinux, it is important to note that there is no real
consensus, yet, on how the security problem should be solved. SELinux is a
powerful system, beyond doubt; it allows the capabilities of users and
programs to be specified in great detail. But SELinux is also highly complex,
to the point that a large percentage of system administrators find
themselves unable to cope with it. The fedora-devel list just had a
discussion on how to get administrators to keep SELinux enabled on their
systems. One participant, who teaches administration courses, noted:
By no means is this limited to home users. I would say that the
*vast* majority of corporate admins just turn off SELinux. The story
behind how & why they learned to do that to begin with only vary in
details. It's almost always, "I had problems installing X or doing
Y and I found a document on the Internet that said that SELinux was
in the way and didn't work right anyway and was too complicated and
didn't do me any good and that I couldn't learn enough about it to
even understand what was happening, let alone deal with it, in less
than a month and ... well, so I just turn off SELinux and then I
don't have to deal with it."
The point here is not to criticize SELinux; that has been adequately done
elsewhere. Instead, the real point is that there is not, at this time, any sort
of broad consensus that SELinux is the right tool for everybody's security
problems. It may turn out that the best solution is to put more effort
into making SELinux easier to deal with, but it seems premature to claim
that SELinux will be the answer to security problems on Linux. It
makes sense, in other words, to spend some time considering other
approaches - especially those which are already implemented and relatively
stable.
If SELinux is truly a superior solution, that will eventually become clear
and users will vote with their keyboards. But to claim, at this point,
that SELinux is the only solution and that looking at alternatives hurts
the community would be a mistake. This community thrives on choices, and, to an
extent, it thrives on competition between related projects. Since the
alternatives are all free software, users are able to choose what works
for them, and the best ideas (and code) can move from one project to another.
The process would be helped, however, if Novell would pull together the
AppArmor source and submit it properly for review and eventual merging into
the mainline kernel.
The story with Xgl and AIGLX is the same. There is no real consensus, yet,
on how 3D graphics will be best supported in the X Window System. So two
groups have put together two different implementations, each with its
advantages. It is easy to present this story as a classic developer
flamewar, but that does not seem to match the reality of the situation. A
look at the X.org mailing list, for example, shows Xgl developer David
Reveman agreeing to adopt some interfaces
put forward by the AIGLX group. Over the long term, the development
community will almost certainly coalesce around the approach which seems to
work best, but, for now, it is too early to say which one (if either) will
be most successful.
If there is a problem here at all, it is that the distributors are being
quick to make products out of technology which may not be entirely ready
for prime time. Red Hat has operated this way for a very long time;
anybody who remembers being pushed into, for example, the ELF or glibc2
transitions by Red Hat Linux upgrades knows that some of that code was a
little rough around the edges then. But, by pushing that code out to the
users, Red Hat almost certainly accelerated the stabilization process.
What we are seeing now is that Novell wants to get into the same game and
put more leading technology into the traditionally conservative (by
comparison) SUSE distribution. When things work well, Novell will be able
to claim leading-edge features and the code will get wider testing, sooner.
There is nothing that requires Novell, as it moves SUSE Linux toward the
leading edge, to follow Red Hat's decisions on which approaches to adopt.
The risk is that each
distributor's user base will find itself locked in to a different set of
still-green technologies, making it harder for the development community to
settle on a single choice. In the cases of security policies and 3D
acceleration, however, the potential for lock-in seems low; most users will
not care about which approach they use, as long as the system works well.
So, most likely, those critics who have predicted the death by
fragmentation of Linux will have to wait a while longer yet.
There is nothing like the joy of running a development distribution.
Nowhere else can one find the same combination of huge updates (it's
amazing how often the X bitmap fonts seem to change), unstable software,
broken dependencies, and, for extra spice, the occasional blown-away
configuration file. Whether it's called sid, Dapper, Rawhide, Cooker, or
something else, a development distribution is a sure way to learn - usually
at inopportune times - about what is happening at the leading edge of the
development community.
Development distributions are also a good way to keep track of what
developers and packagers are doing. A development distribution is alive,
forever changing, forever interesting. It is a constant reminder that
Linux and free software are a process, not a product. When compared to the
vitality of a development distribution, stable releases seem flat and boring.
These distributions exist for a reason: having more people test the
system helps in the creation of more stable releases. So developers want
to have outsiders running the development version. But those developers
might, perhaps, prefer to do without users who don't know what they are
getting into. Consider, for example, this
note sent to the fedora-testers list:
I think somewhere along the way netizens appear to think that
Rawhide is stable (or at least for public consumption). I'd think
we need to discuss how we can provide more constructive information
for developers and send a clear message to non-testers that Rawhide
(a.k.a FC5 ) is not for general use.
Another participant responded with this
suggestion:
I can scream that the development tree will eat your children and
destroy not only your data but your neighbor's data until I'm blue
in the face... but for people who don't want to hear the
warning.. they will choose not to hear the warning... and the only
way for them to learn is to actually have rawhide eat their
data. So i say.. every week there should be a deliberate package
update in the development tree which destroys data. Thrown into the
package pool at random, with an appropriate changelog entry so
those of us who read the daily rawhide reports will know exactly
which package to exclude.
One can safely assume that this idea was offered tongue in cheek.
But the discussion as a whole does raise a question: who should be
running development distributions, and for what purposes?
Development releases routinely come with warnings about their explosive
nature and admonitions not to use them for any serious purpose. But the
fact is that the only way to find the problems with these distributions is
to use them for serious purposes. There is little to be learned by putting
the distribution on a test box, noting that the installer works, and
admiring the pretty desktop graphics. It's only through serious use that
one discovers, say, that the web server does not handle load as well as
before, that the compiler produces bogus code in certain situations, that
emacs feels pretty today, or
that the Wesnoth sound effects have stopped working. These are all things
which are best discovered before the release is shipped; having to put
together Wesnoth patches in a hurry to satisfy a service contract to a
large corporation is just a real pain.
So it is important to have "real users" working with development
distributions; those are the users who will come up with many of the
important bug reports. Discouraging them can be a counterproductive thing
to do. On the other hand, these users do need to know what they are
getting into. A development distribution will bite back, sooner or
later, and it's important to be able to put the pieces back together when
that happens. Testers who are not prepared when disaster strikes will not,
in the long run, be helpful to the development process.
This aspect of the free software development process is not often talked
about. But, without widespread testing in real-world environments,
software will not stabilize as well as it should. Proprietary companies
run closed beta programs to obtain this testing; the free software world,
for the most part, has moved away from that mode in favor of open
development repositories. Open development systems are a good thing; they
allow a wide variety of participants to try out the software. But these
development releases are not for everybody; finding the right way to
communicate that fact may be an ongoing challenge.
Page editor: Jonathan Corbet