Leading items

A grumpy editor's bayesian followup

This article is part of the LWN Grumpy Editor series.
The Grumpy Editor's guide to bayesian spam filters was published one week ago. As has become traditional, it would seem, LWN readers have pointed out tools which evaded your editor's first pass. So here is the inevitable followup with a couple more filters and an updated table at the end.

SpamAssassin

One commenter complained last week about your editor having run SpamAssassin with the network tests enabled. The original reasoning had been that SpamAssassin, by its nature, comes with a large set of rules, and, for the purpose of the review, selectively disabling some of them was not appropriate. Still, the network tests do have a couple of important effects on the end result. As will be seen below, they make the filter much more effective; in your editor's experience, the source blacklists earn most of the credit there. But they also slow things down.

Your editor re-ran the test with network tests disabled, with the following results:

                     Batch 1         Batch 2         Batch 3         Batch 4         Batch 5
                    Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T   Size
SpamAssassin         8   0  1.1      3   0  1.1      5   0  1.1      3   0  1.0      2   0  1.0     10
SA untrained        32   0  0.6      9   0  1.0     18   0  1.0     15   0  1.0      7   0  1.0     10
SA Local default   181   0  0.3    259   0  0.3    271   0  0.3    226   0  0.3    161   0  0.3     10
SA Local tweaked    53   0  0.3     43   0  0.3     50   0  0.3     44   0  0.3     37   0  0.3     10

(Last week's results have been included for comparison.) The "default" results are actually a mistake on your editor's part, but, since they illustrate an interesting point, they have been included in the above table.

When SpamAssassin runs its bayesian filter on a message, it encodes the results as if a specific rule had fired. If the filter is absolutely convinced that the message is good, the score is adjusted by the value attached to the BAYES_00 rule. For obviously spam messages, BAYES_99 comes into play; there are several levels between the two as well. SpamAssassin, out of the box, assigns 3.5 points to BAYES_99. Since five points are required, by default, to condemn a message, the bayesian filter can never do that on its own. Any message, to be considered spam, must trigger some tests outside of the bayesian filter.

The "default" results, above, came about because your editor got a little over-zealous when clearing out the bayesian and whitelist databases for a new round of tests; so they use the default scoring for BAYES_99. The "tweaked" results, instead, have the score for that rule raised to 5.0 points, allowing the bayesian filter to condemn mail on its own. The difference in the results can be clearly seen from the table: spam filtering performance is vastly improved, with no false positives. With the default configuration, local-only SpamAssassin had the second-worst false negative rate of all the filters tested. Your editor is at a loss to understand why SpamAssassin comes configured to allow the bayesian filter to be bypassed so easily.
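The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of threshold scoring as described above, not SpamAssassin's code. The 3.5 and 5.0 scores and the five-point threshold come from the discussion above; the OTHER_RULE name and its 1.6-point score are made up for the example, as is the assumption that a total exactly equal to the threshold counts as spam.

```python
REQUIRED_SCORE = 5.0  # default number of points needed to condemn a message


def is_spam(fired_rules, scores, required=REQUIRED_SCORE):
    """Add up the points for every rule that fired and compare to the threshold."""
    total = sum(scores[rule] for rule in fired_rules)
    return total >= required  # assumption: reaching the threshold is enough


# Out-of-the-box score for BAYES_99, plus one hypothetical non-bayesian rule.
DEFAULT_SCORES = {"BAYES_99": 3.5, "OTHER_RULE": 1.6}
# The "tweaked" setup: only the BAYES_99 score is changed.
TWEAKED_SCORES = {**DEFAULT_SCORES, "BAYES_99": 5.0}

# The bayesian verdict alone is not enough with the default score...
print(is_spam(["BAYES_99"], DEFAULT_SCORES))
# ...but it is once BAYES_99 is worth five points.
print(is_spam(["BAYES_99"], TWEAKED_SCORES))
# With the default score, some other rule has to fire as well.
print(is_spam(["BAYES_99", "OTHER_RULE"], DEFAULT_SCORES))
```

In SpamAssassin's own configuration, the tweak amounts to a single line along the lines of "score BAYES_99 5.0" in the local configuration file.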

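Back to the original point of running this test: putting SpamAssassin into the "local tests only" mode clearly makes it a much less effective filter (false negatives rose from 21 to 227 across the five batches, even with the tweaked score), while cutting the time figure from roughly 1.1 to 0.3.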

Popfile

A number of people were dismayed at the omission of popfile, a proxy-based filter coded in Perl. Popfile is intended to sit between the mail client and the POP or IMAP server; it filters mail before presenting it to the user. It includes a built-in web server which provides filtering statistics and allows the user to perform training.

Perhaps the most interesting feature in Popfile, however, is its approach to filtering. While the other filters reviewed are very much oriented around filtering spam, Popfile tries to be more general. Instead of filtering into just two categories (plus the "unsure" result provided by a number of filters), Popfile can handle an arbitrary number of categories. It not only picks out the spam, but can also sort the rest of a mail stream based on whatever criteria the user might set. This approach makes Popfile a potentially more useful tool, but it has implications for its spam filtering performance, as will be seen from the testing results.

Your editor tested Popfile 0.22.4, using its standalone "pipe" and "insert" tools.

                     Batch 1         Batch 2         Batch 3         Batch 4         Batch 5
                    Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T   Size
Popfile              0  21  1.0      0  16  1.1      1  24  1.0      0  10  1.0      1  12  1.0     10
PF learn all         0  28  2.8      0  28  3.5      0  44  4.2      0  16  5.0      0  18  5.9     40

On one hand, Popfile was the most effective at removing spam of any of the filters reviewed; its false negative rate is almost zero. On the other hand, the false positive rate was high - unacceptably so. Popfile normally uses a "train on errors" approach; your editor ran a second test where the filter was trained on every message just to see if that would help with the false positive rate. Instead, that rate got worse, and the filter slowed down to a glacial pace. Clearly Popfile and comprehensive training were not meant to go together.
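The difference between the two training strategies can be shown with a toy example. The classifier below is a deliberately crude word counter invented for this sketch (the ToyFilter class, the run() helper, and the four sample messages are all hypothetical); it is not how Popfile, or any of the filters reviewed here, works internally. The only point is the training loop: train-on-errors feeds a message back to the filter only when the filter got it wrong, while comprehensive training feeds back everything.

```python
from collections import Counter


class ToyFilter:
    """A tiny word-count classifier, for illustration only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.trained = 0  # number of messages stored in the "database"

    def classify(self, text):
        words = text.lower().split()
        spam_hits = sum(self.counts["spam"][w] for w in words)
        ham_hits = sum(self.counts["ham"][w] for w in words)
        # Ties, including the untrained state, fall through to "ham".
        return "spam" if spam_hits > ham_hits else "ham"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.trained += 1


def run(messages, train_on_errors):
    """Feed labelled messages through a fresh filter; return (errors, trained)."""
    toy = ToyFilter()
    errors = 0
    for text, label in messages:
        wrong = toy.classify(text) != label
        errors += wrong
        # Train-on-errors skips messages the filter already got right.
        if wrong or not train_on_errors:
            toy.train(text, label)
    return errors, toy.trained


MESSAGES = [
    ("cheap pills now", "spam"),
    ("kernel patch review", "ham"),
    ("cheap pills today", "spam"),
    ("patch review notes", "ham"),
]

toe = run(MESSAGES, train_on_errors=True)
comprehensive = run(MESSAGES, train_on_errors=False)
print("train on errors:", toe)
print("comprehensive:  ", comprehensive)
```

On this toy data both strategies make the same single mistake, but comprehensive training stores four messages where train-on-errors stores one. That is the sort of growth which shows up in the Size and T columns above.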

Your editor has a hypothesis explaining the behavior seen here. Bayesian filters which concern themselves only with spam have a built-in bias: false positives are bad and must be avoided. Popfile, instead, has no notion of a "false positive"; it only has various "buckets" into which mail can be sorted. The tool does not understand that some types of errors are worse than others. So, while most filters will err on the side of false negatives, Popfile just goes for whatever seems right. As a result, it catches more spam - and more of everything else.
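This hypothesis can be made concrete with a small sketch. Nothing here is taken from Popfile or any other filter; the bucket names, the probabilities, and the 0.9 cutoff are all invented for illustration. The first function simply picks the most likely bucket, as a general-purpose sorter might; the second only says "spam" when it is quite sure, which is the built-in bias described above.

```python
def pick_bucket(probabilities):
    """Multi-bucket sorting: choose whichever bucket looks most likely."""
    return max(probabilities, key=probabilities.get)


def spam_only_verdict(spam_probability, cutoff=0.9):
    """Spam-only filtering: condemn a message only above a high cutoff."""
    return "spam" if spam_probability >= cutoff else "ham"


# A hypothetical borderline message: "spam" is the single most likely
# bucket, but the filter is far from certain about it.
borderline = {"spam": 0.55, "work": 0.30, "personal": 0.15}

print(pick_bucket(borderline))                 # the sorter files it as spam
print(spam_only_verdict(borderline["spam"]))   # the biased filter lets it through
```

If the borderline message is legitimate, the first approach has produced a false positive; if it is spam, the second has produced a false negative. The two designs simply fail in different directions.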

From this experience, your editor has concluded that spam filtering should be done independently from any other sort of mail sorting. If bayesian filters are to be used for sorting of legitimate mail, it might be best to use two separate filters in series.

SpamOracle

SpamOracle is a straightforward Graham-style bayesian filter. It happens to be written in Caml, leading your editor to go looking for compilers; Fedora Extras came through nicely on that front. Initial training is easy and fast, and SpamOracle works well with procmail.
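For readers unfamiliar with the term, "Graham-style" refers to the approach described in Paul Graham's essay "A Plan for Spam": each token gets an estimated spam probability, and the most telling tokens are combined into a single score. The following is a minimal sketch of that combining step only. It is not SpamOracle's code (which is written in Caml), and the function name, sample probabilities, and clamping bounds are assumptions made for the example.

```python
import math


def combine(token_probs, low=0.01, high=0.99):
    """Combine per-token spam probabilities into one message-level score.

    Each probability is clamped to [low, high] so that a single token
    can never force the result to exactly 0 or 1.
    """
    clamped = [min(max(p, low), high) for p in token_probs]
    spam = math.prod(clamped)
    ham = math.prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


# Two strongly spammy tokens outweigh one mildly innocent one.
print(combine([0.99, 0.99, 0.2]))
# Tokens that carry no information leave the score in the middle.
print(combine([0.5, 0.5]))
```

A filter built this way would then compare the combined score against a cutoff to reach its verdict.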

As a filter, however, it is not one of the more effective ones. Your editor ran two tests on SpamOracle v1.4, using train-on-errors and comprehensive training strategies.

                     Batch 1         Batch 2         Batch 3         Batch 4         Batch 5
                    Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T     Fn  Fp    T   Size
SpamOracle TOE     462   0  0.1    546   0  0.1    445   0  0.1    463   0  0.1    343   0  0.1    1.1
SpamOracle comp    461   0  0.2    511   0  0.2    433   0  0.2    420   0  0.2    339   0  0.3    2.6

As can be seen here, SpamOracle is fast, and it manages to avoid false positives altogether. Its filtering rate is poor, however, to the point that your editor would not want to have to depend on it to hold the spam stream at bay. Comprehensive training slowed the process down significantly, but did not improve the results in any appreciable way.

Thunderbird

There were some requests that Thunderbird be included in this evaluation. The problem is that Thunderbird's filter is buried deep within a monolithic graphical application, making it difficult to test in any sort of automated manner. Your editor, being the lazy person that he is, has no inclination to click through 15,000 messages to evaluate how well Thunderbird has classified them.

As it happens, your editor uses Thunderbird for a low-bandwidth mail account which receives a mere 100 spams per day or so. The Thunderbird interface is certainly convenient; there is a nice "junk" button for training the filter (though the way it toggles to "not junk" can be confusing). Thunderbird can be configured to automatically sideline spam into a folder, and to age messages out of that folder after a given time. False positives are rare, in your editor's experience, but the false negative rate is relatively high. It is also impossible, as far as your editor can tell, to get any information on the filter and how it makes its decisions.

Conclusion

Here is the updated table, with the new and old results:

Test                          False neg.    False pos.    Time   Size
bogofilter                    406   5.5%      0     0%    0.02      5
bogofilter -u                 268   3.0%      0     0%    0.06     32
CRM114                         14   0.1%     16   0.3%    0.06     24
CRM114 pretrain                14   0.2%     15   0.3%    0.06     24
DSPAM teft                     50   0.6%      0     0%     0.1    305
DSPAM toe                      67   0.7%     15   0.3%     0.1    276
DSPAM tum                      83   0.9%      0     0%     0.1    305
Popfile                         2  0.02%     83   1.4%     1.0     10
Popfile comp                    0     0%    144   2.4%     4.3     40
SpamAssassin                   21   0.2%      0     0%     1.1     10
SpamAssassin untrained         81   0.9%      0     0%     0.9     10
SpamAssassin local default   1098  12.2%      0     0%     0.3     10
SpamAssassin local tweaked    227   2.5%      0     0%     0.3     10
SpamBayes                     185   2.1%      1  0.02%     0.4      4
SpamBayes comp                294   3.3%      0     0%     0.8     16
SpamOracle TOE               2259  25.1%      0     0%     0.1    1.1
SpamOracle comp              2164  24.0%      0     0%     0.2    2.6
SpamProbe train               222   2.5%      3  0.05%     0.1     81
SpamProbe receive             257   2.9%      4  0.07%     0.7    201

There is little in the new results to change the conclusions arrived at last week. The filters which stand out are SpamAssassin (in some modes at least), and DSPAM. Most of the others demonstrated overly high error rates, either with false negatives (annoying) or false positives (unacceptable). Stay tuned, however; there is clearly a great deal of work being done in this area.

Linux fragmenting at last?

Back in January, Novell announced that it was releasing the "AppArmor" security framework under the GPL. AppArmor had been developed by Immunix, and acquired by Novell last year. Novell makes a number of claims about AppArmor, but the one at the top of the list appears to be relative simplicity: AppArmor is said to be easier to understand, configure, and maintain than SELinux.

Dan Walsh, a Red Hat developer working on SELinux, has criticized this move:

Couldn't Novell have spent their money on making SELinux easier to use? No, [Novell] chooses to split the user and developer community. I am not sure what their goals are, but I feel this hurts Linux and the open source movement.

For years, critics have claimed that Linux would fragment much like Unix did, and that would be the downfall of the system. So far, Linux has steadfastly refused to fragment in this manner. But now we have a Linux developer saying that the same thing is happening. Red Hat and Novell also appear to be taking different approaches to 3D-enhanced window systems. Novell is pushing Xgl, Red Hat has AIGLX, and Linux users are left wondering when and how all that activity will yield better graphics support for them. At this level, too, it looks like Linux might finally be heading for a breakup.

Or is it? Perhaps we are simply seeing the development community at work.

With regard to SELinux, it is important to note that there is no real consensus, yet, on how the security problem should be solved. SELinux is a powerful system, beyond doubt; it allows the capabilities of users and programs to be specified in great detail. But SELinux is also highly complex, to the point that a large percentage of system administrators find themselves unable to cope with it. The fedora-devel list just had a discussion on how to get administrators to keep SELinux enabled on their systems. One participant, who teaches administration courses, noted:

By no means is this limited to home users. I would say that the *vast* majority of corporate admins just turn off SELinux. The story behind how & why they learned to do that to begin with only vary in details. It's almost always, "I had problems installing X or doing Y and I found a document on the Internet that said that SELinux was in the way and didn't work right anyway and was too complicated and didn't do me any good and that I couldn't learn enough about it to even understand what was happening, let alone deal with it, in less than a month and ... well, so I just turn off SELinux and then I don't have to deal with it."

The point here is not to criticize SELinux; that has been adequately done elsewhere. Instead, the real point is that there is not, at this time, any sort of broad consensus that SELinux is the right tool for everybody's security problems. It may turn out that the best solution is to put more effort into making SELinux easier to deal with, but it seems premature to claim that SELinux will be the answer to security problems on Linux. It makes sense, in other words, to spend some time considering other approaches - especially those which are already implemented and relatively stable.

If SELinux is truly a superior solution, that will eventually become clear and users will vote with their keyboards. But to claim, at this point, that SELinux is the only solution and that looking at alternatives hurts the community would be a mistake. This community thrives on choices, and, to an extent, it thrives on competition between related projects. Since the alternatives are all free software, users are able to choose what works for them, and the best ideas (and code) can move from one project to another.

The process would be helped, however, if Novell would pull together the AppArmor source and submit it properly for review and eventual merging into the mainline kernel.

The story with Xgl and AIGLX is the same. There is no real consensus, yet, on how 3D graphics will be best supported in the X window system. So two groups have put together two different implementations, each with its advantages. It is easy to present this story as a classic developer flamewar, but that does not seem to match the reality of the situation. A look at the X.org mailing list, for example, shows Xgl developer David Reveman agreeing to adopt some interfaces put forward by the AIGLX group. Over the long term, the development community will almost certainly coalesce around the approach which seems to work best, but, for now, it is too early to say which one (if either) will be most successful.

If there is a problem here at all, it is that the distributors are being quick to make products out of technology which may not be entirely ready for prime time. Red Hat has operated this way for a very long time; anybody who remembers being pushed into, for example, the ELF or glibc2 transitions by Red Hat Linux upgrades knows that some of that code was a little rough around the edges then. But, by pushing that code out to the users, Red Hat almost certainly accelerated the stabilization process.

What we are seeing now is that Novell wants to get into the same game and put more leading technology into the traditionally conservative (by comparison) SUSE distribution. When things work well, Novell will be able to claim leading-edge features and the code will get wider testing, sooner. There is nothing that requires Novell, as it moves SUSE Linux toward the leading edge, to follow Red Hat's decisions on which approaches to adopt.

The risk is that each distributor's user base will find itself locked in to a different set of still-green technologies, making it harder for the development community to settle on a single choice. In the cases of security policies and 3D acceleration, however, the potential for lock-in seems low; most users will not care about which approach they use, as long as the system works well. So, most likely, those critics who have predicted the death by fragmentation of Linux will have to wait a while longer yet.

Testing the bleeding edge

There is nothing like the joy of running a development distribution. Nowhere else can one find the same combination of huge updates (it's amazing how often the X bitmap fonts seem to change), unstable software, broken dependencies, and, for extra spice, the occasional blown-away configuration file. Whether it's called sid, Dapper, Rawhide, Cooker, or something else, a development distribution is a sure way to learn - usually at inopportune times - about what is happening at the leading edge of the development community.

Development distributions are also a good way to keep track of what developers and packagers are doing. A development distribution is alive, forever changing, forever interesting. It is a constant reminder that Linux and free software are a process, not a product. When compared to the vitality of a development distribution, stable releases seem flat and boring.

These distributions exist for a reason: having more people testing the system will help the creation of more stable releases. So developers want to have outsiders running the development version. But those developers might, perhaps, prefer to do without users who don't know what they are getting into. Consider, for example, this note sent to the fedora-testers list:

I think somewhere along the way netizens appear to think that Rawhide is stable (or at least for public consumption). I'd think we need to discuss how we can provide more constructive information for developers and send a clear message to non-testers that Rawhide (a.k.a FC5 ) is not for general use.

Another participant responded with this suggestion:

I can scream that the development tree will eat your children and destroy not only your data but your neighbor's data until I'm blue in the face... but for people who don't want to hear the warning.. they will choose not to hear the warning... and the only way for them to learn is to actually have rawhide eat their data. So i say.. every week there should be a deliberate package update in the development tree which destroys data. Thrown into the package pool at random, with an appropriate changelog entry so those of us who read the daily rawhide reports will know exactly which package to exclude.

One can safely assume that this idea was offered tongue-in-cheek. But the discussion as a whole does raise a question: who should be running development distributions, and for what purposes?

Development releases routinely come with warnings about their explosive nature and admonitions not to use them for any serious purpose. But the fact is that the only way to find the problems with these distributions is to use them for serious purposes. There is little to be learned by putting the distribution on a test box, noting that the installer works, and admiring the pretty desktop graphics. It's only through serious use that one discovers, say, that the web server does not handle load as well as before, that the compiler produces bogus code in certain situations, that emacs feels pretty today, or that the Wesnoth sound effects have stopped working. These are all things which are best discovered before the release is shipped; having to put together Wesnoth patches in a hurry to satisfy a service contract to a large corporation is just a real pain.

So it is important to have "real users" working with development distributions; those are the users who will come up with many of the important bug reports. Discouraging them can be a counterproductive thing to do. On the other hand, these users do need to know what they are getting into. A development distribution will bite back, sooner or later, and it's important to be able to put the pieces back together when that happens. Testers who are not prepared when disaster strikes will not, in the long run, be helpful to the development process.

This aspect of the free software development process is not often talked about. But, without widespread testing in real-world environments, software will not stabilize as well as it should. Proprietary companies run closed beta programs to obtain this testing; the free software world, for the most part, has moved away from that mode in favor of open development repositories. Open development systems are a good thing; they allow a wide variety of participants to try out the software. But these development releases are not for everybody; finding the right way to communicate that fact may be an ongoing challenge.

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds