Spam avoidance techniques
Most spam filtering work has involved two techniques: testing mail against patterns indicative of spam and blocking mail from known sources of spam (and other likely sources, such as ISP dialup lines). Source-based blocking can be effective, but it also tends to block a fair amount of legitimate mail along with the spam. For example, some blacklists block mail from kernel.org, despite the fact that no spam originates there. Source-based blocking is unreliable enough that quite a few sites are unwilling to use it, despite a strong desire to be rid of spam.
Pattern matching has shown more promise. Early spam filtering was done with complex procmail scripts, but the current champion of pattern-based spam filtering can only be SpamAssassin. Using a detailed set of rules, SpamAssassin cleans out the trash to great effect. LWN has been using it for some months, and it has made life much easier - lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of the time we once spent wading through that spam, and has kept us from accidentally deleting mail from readers that tended to get buried in it.
One thing that SpamAssassin users tend to notice, however, is that its effectiveness decreases over time. Each new update blocks more spam - a recent upgrade freed us from a whole unpleasant class of Nigeria spam, for example. But pattern-based matching only works as well as its patterns, and they tend to go stale as spammers move on to new tricks. Keeping SpamAssassin effective requires a number of highly dedicated people to actually read all that spam and come up with new rules. Most SpamAssassin users are unlikely to be able (or willing) to write new rules themselves.
Recently, a new approach to spam filtering has attracted a lot of attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather than try to come up with an endless stream of clever patterns to detect spam, why not just look at the words spammers use? Each word can be assigned a probability that any message that contains it is spam; the probabilities for the words in any specific message are then combined using a Bayesian filter, yielding an overall probability estimate. If that estimate is high enough, the message is classified as spam.
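The combining step can be made concrete with a few lines of code. The sketch below uses the naive Bayesian combination formula popularized by Graham's paper; the function name and the sample probabilities are illustrative, not taken from any particular filter:

```python
from math import prod

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities into one estimate using
    the naive Bayesian formula from "A Plan for Spam":
    p1*p2*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))."""
    p = prod(word_probs)                     # all words spammy
    q = prod(1.0 - x for x in word_probs)    # all words innocent
    return p / (p + q)

# A message whose interesting words are mostly spam-flavored
# scores close to 1.0, even with one innocent word mixed in:
print(combined_spam_probability([0.99, 0.95, 0.2]))
```

If the combined estimate exceeds a chosen threshold (Graham used 0.9), the message is filed as spam.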
At first glance, going up against a tool as good as SpamAssassin with such a simple technique seems like a losing battle, but this approach has a number of advantages:
- Development of the word-based rules can be automated - it is
just a matter of feeding the filter enough spam and "ham" (legitimate
mail) and letting it work out a probability factor for each word.
- The filter can be made to follow shifting patterns in spam by
passing it each message that it misclassifies. Users cannot be
expected to master regular expressions and write patterns, but they
can be asked to hit a "this is spam" key in their mailer.
- Each user's spam filter comes to reflect the mail that the user
receives. Spam seems like the ultimate in indiscriminate marketing,
but the fact is that different people can receive very different spam.
An individually derived rule base should prove more effective than a
"one size fits all" set of patterns.
- Classification of mail with a Bayesian filter can be done relatively quickly.
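The automated "rule development" in the first point above is just a word-frequency pass over hand-sorted corpora. Here is a simplified rendering of the scheme described in Graham's paper; the tokenizer, the doubling of ham counts, and the clamping constants are illustrative choices, not bogofilter's actual code:

```python
import re
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Build a per-word spam-probability table from sorted mail."""
    def tokenize(text):
        return re.findall(r"[A-Za-z$'-]+", text.lower())

    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    nspam, nham = max(len(spam_msgs), 1), max(len(ham_msgs), 1)

    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word] / nspam
        h = 2 * ham_counts[word] / nham  # Graham doubles ham counts
        p = s / (s + h)
        probs[word] = min(max(p, 0.01), 0.99)  # clamp away from 0 and 1
    return probs
```

Classifying a new message then amounts to looking up its words in this table and combining their probabilities; retraining on each misclassified message is what lets the filter track the spammers.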
All of the above is irrelevant, however, if the Bayesian approach does not succeed in actually filtering spam. To get a sense of the state of the art, we took 3000 messages received at lwn@lwn.net - a little under two weeks' worth. 295 of those messages were real mail; the other 2705 were spam. If one were to believe the bulk of our mail, one would conclude that nearly every part of our anatomy (even those we don't possess) is the wrong size, that we are so honest that people want to extract money from Africa via our bank account, that we're missing out on numerous hot stocks, that we have a strange attraction to domesticated animals, and that the purchase of something called the "TushyClean" would greatly improve our lives. Trust us, this exercise has not been fun, but no sacrifice is too great for our readers.
Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once after training with 15% of the 3000 messages, and once after being trained with the whole set. Then we ran both filters on 5000 recent postings from the linux-kernel list, twelve of which were spam (devfs flames were not counted). The results were:
| Filter | False positives | False negatives | Run time (seconds) |
|---|---|---|---|
| -- 3000 lwn@lwn.net messages -- | | | |
| SpamAssassin | 2 | 250 | 11,900 |
| Bogofilter (15%) | 0 | 517 | 108 |
| Bogofilter (100%) | 0 | 94 | 134 |
| -- 5000 linux-kernel messages -- | | | |
| SpamAssassin | 0 | 6 | 19,600 |
| Bogofilter | 0 | 4 | 251 |
False positives are legitimate mail classified as spam. These, of course, are bad news, since they can cause the loss of real mail. False negatives are spam that slip through - an annoyance. It is appropriate that spam filters tend to err toward false negatives, and both filters shown here do exactly that.
The results indicate that bogofilter requires a substantial amount of training before it reaches the level of effectiveness achieved by SpamAssassin. This training is best done with each individual user's mail, but most users are unlikely to have a few thousand nicely sorted messages sitting around to train their filters with. So bogofilter is likely to be frustrating for many users to adopt - it won't work well until the user has run "about one thousand" (according to Eric Raymond) messages through it.
That said, bogofilter is surprisingly effective for a tool that is so new and very much still in development. And the run time relative to SpamAssassin speaks for itself. Much of that difference is explained by the fact that bogofilter is coded in C, while SpamAssassin is in Perl. But bogofilter also owes its speed to a much faster algorithm.
The Bayesian filter idea is not new - see this 1998
paper on the Microsoft site, for example. But recently a great deal of
effort has gone into expressing this approach in free software. Bogofilter
is one example; another is the spambayes
project, which has been set up as a testbed for variants on the Bayesian
filter idea. It will be interesting to see where these projects go; they
are off to a promising start. Taking on a tool as effective as
SpamAssassin is a difficult challenge, but the free software world likes
challenges.
