
LWN.net Weekly Edition for September 12, 2002

Spam avoidance techniques

It is said that most free software comes about as the result of some developer scratching a personal itch. It's also said that very little innovative free software development is done; free software projects spend their time "chasing taillights" - catching up to the features offered by proprietary code. The field of spam filtering may well confirm one of those stereotypes while refuting the other. After all, if there is anything that truly itches, it's spam. But some of the free software being developed to combat spam is truly innovative.

Most spam filtering work has involved two techniques: testing mail against patterns indicative of spam and blocking mail from known sources of spam (and other likely sources, such as ISP dialup lines). Source-based blocking can be effective, but it also tends to block a fair amount of legitimate mail along with the spam. For example, some blacklists cause the blocking of mail from kernel.org, despite the fact that no spam originates there. Source-based blocking is unreliable enough that quite a few sites are unwilling to use it, despite a strong desire to be rid of spam.

Pattern matching has shown more promise. Early spam filtering was done with complex procmail scripts, but the current champion of pattern-based spam filtering can only be SpamAssassin. Using a detailed set of rules, SpamAssassin cleans out the trash to great effect. LWN has been using it for some months, and it has made life much easier - lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of our time to work on LWN, and it has kept us from accidentally deleting mail from readers that tended to get buried in the spam.

One thing that SpamAssassin users tend to notice, however, is that its effectiveness decreases over time. Each new update blocks more spam - a recent upgrade freed us from a whole unpleasant class of Nigeria spam, for example. But pattern-based matching only works as well as its patterns, and they tend to go stale as spammers move on to new tricks. Keeping SpamAssassin effective requires a number of highly dedicated people to actually read all that spam and come up with new rules. Most SpamAssassin users are unlikely to be able (or willing) to write new rules themselves.

Recently, a new approach to spam filtering has attracted a lot of attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather than try to come up with an endless stream of clever patterns to detect spam, why not just look at the words spammers use? Each word can be assigned a probability that any message that contains it is spam; the probabilities for the words in any specific message are then combined using a Bayesian filter, yielding an overall probability estimate. If that estimate is high enough, the message is classified as spam.
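
The combining step can be sketched in a few lines of Python. This is a minimal illustration of a naive Bayesian combination with equal priors, not the code of any particular filter; the word probabilities, the 0.5 value for unknown words, and the 0.9 threshold are all made-up values chosen for the example.

```python
from math import prod

def combine(probs):
    """Combine per-word spam probabilities into one estimate.

    Treats the words as independent evidence with equal priors.
    Values are clamped so a single 0.0 or 1.0 cannot force a
    division by zero.
    """
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)

def classify(words, word_probs, unknown=0.5, threshold=0.9):
    """Return (score, is_spam) for a tokenized message.

    `unknown` and `threshold` are illustrative assumptions.
    """
    probs = [word_probs.get(w, unknown) for w in set(words)]
    score = combine(probs)
    return score, score > threshold

# Hypothetical per-word probabilities, as if learned from earlier mail.
word_probs = {"viagra": 0.99, "free": 0.90, "kernel": 0.02, "patch": 0.05}

score, is_spam = classify("free viagra now".split(), word_probs)
print(f"{score:.4f} {is_spam}")
```

A message made up of "kernel" and "patch" scores close to zero with the same table, which is why a handful of strongly indicative words is usually enough to settle the classification either way.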

At first glance, going up against a tool as good as SpamAssassin with such a simple technique seems like a losing battle, but this approach has a number of advantages:

  • Development of the word-based rules can be automated - it is just a matter of feeding the filter enough spam and "ham" (legitimate mail) and letting it work out a probability factor for each word.

  • The filter can be made to follow shifting patterns in spam by passing it each message that it misclassifies. Users cannot be expected to master regular expressions and write patterns, but they can be asked to hit a "this is spam" key in their mailer.

  • Each user's spam filter comes to reflect the mail that the user receives. Spam seems like the ultimate in indiscriminate marketing, but the fact is that different people can receive very different spam. An individually derived rule base should prove more effective than a "one size fits all" set of patterns.

  • Classification of mail with a Bayesian filter can be done relatively quickly.
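
The first two points, automated training and correction through a "this is spam" key, might look something like the following sketch. The `WordStats` class and its method names are hypothetical, and the probability estimate is a deliberately simplified one (the fraction of spam containing a word against the fraction of ham containing it); real filters weight and smooth these counts differently.

```python
from collections import Counter

class WordStats:
    """Per-word counts learned from sorted mail (hypothetical helper)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.messages = {True: 0, False: 0}

    def train(self, text, is_spam):
        """Record one message; each distinct word counts once per message."""
        self.counts[is_spam].update(set(text.lower().split()))
        self.messages[is_spam] += 1

    def probability(self, word, unknown=0.5):
        """Estimate the chance that a message containing `word` is spam."""
        spam_n, ham_n = self.messages[True], self.messages[False]
        s = self.counts[True][word]
        h = self.counts[False][word]
        if s + h == 0 or spam_n == 0 or ham_n == 0:
            return unknown  # no evidence either way
        spam_rate = s / spam_n
        ham_rate = h / ham_n
        p = spam_rate / (spam_rate + ham_rate)
        return min(0.99, max(0.01, p))  # never fully certain

stats = WordStats()
stats.train("cheap pills buy now", is_spam=True)
stats.train("kernel patch for review", is_spam=False)

# The user hits "this is spam" on a message the filter let through:
stats.train("review our cheap mortgage", is_spam=True)
print(stats.probability("cheap"), stats.probability("review"))
```

Nobody writes a rule here; the counts shift every time a message is added, which is how the filter can follow changes in the spam a particular user receives.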

All of the above is irrelevant, however, if the Bayesian approach does not succeed in actually filtering spam. To get a sense of the state of the art, we took 3000 messages received at lwn@lwn.net - a little under two weeks' worth. 295 of those messages were real mail, and 2705 were spam. If one were to believe the bulk of our mail, one would conclude that about every part of our anatomy (even those we don't possess) is the wrong size, that we are so honest that people want to extract money from Africa via our bank account, that we're missing out on numerous hot stocks, that we have a strange attraction to domesticated animals, and that the purchase of something called the "TushyClean" would greatly improve our lives. Trust us, this exercise has not been fun, but no sacrifice is too great for our readers.

Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once after training with 15% of the 3000 messages, and once after being trained with the whole set. Then we ran both filters on 5000 recent postings from the linux-kernel list, twelve of which were spam (devfs flames were not counted). The results were:

Filter              False positives  False negatives  Run time (seconds)
-- 3000 lwn@lwn.net messages --
SpamAssassin                      2              250              11,900
Bogofilter (15%)                  0              517                 108
Bogofilter (100%)                 0               94                 134
-- 5000 linux-kernel messages --
SpamAssassin                      0                6              19,600
Bogofilter                        0                4                 251

False positives are legitimate mail classified as spam. These, of course, are bad news, since they can cause the loss of real mail. False negatives are spam that slips through - an annoyance. It is appropriate that spam filters tend to err toward false negatives, and both filters shown here do exactly that.
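
To put the lwn@lwn.net counts in proportion, they can be turned into rates using the totals given earlier (295 legitimate messages, 2705 spam). The sketch below assumes that every row of the table was scored against the full set of 3000 messages; the `rates` helper is just a name chosen for this example.

```python
HAM_TOTAL = 295     # legitimate messages in the lwn@lwn.net set
SPAM_TOTAL = 2705   # spam messages in the same set

def rates(false_pos, false_neg, ham_total=HAM_TOTAL, spam_total=SPAM_TOTAL):
    """Return (false positive rate, false negative rate).

    False positives are measured against legitimate mail,
    false negatives against spam.
    """
    return false_pos / ham_total, false_neg / spam_total

# (false positives, false negatives) from the table above
results = {
    "SpamAssassin": (2, 250),
    "Bogofilter (15%)": (0, 517),
    "Bogofilter (100%)": (0, 94),
}

for name, (fp, fn) in results.items():
    fp_rate, fn_rate = rates(fp, fn)
    print(f"{name:18} FP {fp_rate:6.2%}  FN {fn_rate:6.2%}")
```

By this arithmetic SpamAssassin let a little over 9% of the spam through, while the fully trained bogofilter let through about 3.5% and the lightly trained one about 19%.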

The results indicate that bogofilter requires a substantial amount of training before it reaches the level of effectiveness achieved by SpamAssassin. This training is best done with each individual user's mail, but most users are unlikely to have a few thousand nicely sorted messages sitting around to train their filters with. So bogofilter is likely to be frustrating for many users to adopt - it won't work well until the user has run "about one thousand" (according to Eric Raymond) messages through it.

That said, bogofilter is surprisingly effective for a tool that is so new and very much still in development. And the run time relative to SpamAssassin speaks for itself. Much of the difference there will be explained by the fact that bogofilter is coded in C, while SpamAssassin is in Perl. But bogofilter also owes its speed to a much faster algorithm.
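
The run-time gap can be quantified directly from the table. This is plain arithmetic on the published figures, using the fully trained bogofilter run for the lwn@lwn.net set; nothing here was measured independently.

```python
# (messages, SpamAssassin seconds, bogofilter seconds) from the table
runs = {
    "lwn@lwn.net": (3000, 11_900, 134),
    "linux-kernel": (5000, 19_600, 251),
}

speedups = {}
for name, (messages, sa_secs, bogo_secs) in runs.items():
    speedups[name] = sa_secs / bogo_secs
    print(
        f"{name:13} SpamAssassin {sa_secs / messages:.2f} s/msg, "
        f"bogofilter {bogo_secs / messages * 1000:.0f} ms/msg, "
        f"ratio {speedups[name]:.0f}x"
    )
```

On both message sets bogofilter comes out roughly 80 to 90 times faster: about four seconds per message for SpamAssassin against a few hundredths of a second for bogofilter.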

The Bayesian filter idea is not new - see this 1998 paper on the Microsoft site, for example. But recently a great deal of effort has gone into expressing this approach in free software. Bogofilter is one example; another is the spambayes project, which has been set up as a testbed for variants on the Bayesian filter idea. It will be interesting to see where these projects go; they seem to be off to an interesting start. Taking on a tool as effective as SpamAssassin is a difficult challenge, but the free software world likes challenges.


Where free software should be required by law

RISKS 22.24 includes a detailed article by Rebecca Mercuri on the latest fun with the new voting systems in Florida. That state, of course, was the source of (ongoing) uncertainty in the 2000 U.S. presidential election, due, in part, to its ancient voting equipment. Since then, the voting machines have been upgraded to new, computer-based systems with touchscreen interfaces.

These systems are based on closed source code. There is no external audit trail, no way of verifying that they are recording votes as they were actually cast. Trade secret law forbids the inspection of the code in the systems. One just has to trust the vendor that the results are correct.

A primary election held there recently turned up a whole set of problems, ranging from basic usability issues to outright failure.

There has been a lot of interest recently in laws requiring governments to use free software in many or all situations. It remains unclear, to some people anyway, that such laws are really in the best interest of government, the governed, or the free software community. But, in the case of voting systems, the case seems clear: no part of the system that elects people into positions of power should be opaque. The creation of a free, transparent, verifiable electronic voting system should not be that hard a task for governments or the free software community. There is no excuse for using anything else.


Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Security: Multics security, 30 years later.
  • Kernel: Linus and SpamAssassin; speeding up rmap; other memory management patches
  • Distributions: New distributions: DebianEdu, OpenZaurus, Simply GNUstep; Libranet GNU/Linux 2.7 released
  • Development: Jaberwocky IDE for LISP, AFPL Ghostscript 7.30 developer release, Conexant HSF softmodem driver, WaveSurfer 1.4.4, Mozilla 1.0.1, Pygame-1.5.3, Open Office 1.0.1 Alpha SDK, KOffice 1.2, Xft/Fontconfig release 2.0, Quanta Plus 3.0 PR2, PHP 4.2.3, GDB 5.3 branch.
  • Commerce: Winners of the Australian Open Source Awards Announced; Sun to Offer Alternative to Microsoft Desktop
  • Press: Hardwired PC copy protection, Perens leaves HP, $199 Lindows PCs, memory-resident embedded databases, SuSE 8.0 howto for Xbox, Sony Cocoon video recorder
  • Announcements: archLSB for Itanium, Singapore Linux Conf, PHPCon 2002, Python UK Conf, Ruby Conf 2002, Perl Meetup in London
  • Letters: Releasing software into the public domain; Industrial free software collaboration

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds