Spam avoidance techniques
[Posted September 11, 2002 by corbet]
It is said that most free software comes about as the result of some
developer scratching a personal itch. It's also said that very little
innovative free software development is done; free software projects spend
their time "chasing taillights" - catching up to the features offered by
proprietary code. The field of spam filtering may well confirm one of
those stereotypes while refuting the other. After all, if there is
anything that truly itches, it's spam. But some of the free software being
developed to combat spam is truly innovative.
Most spam filtering work has involved two techniques: testing mail against
patterns indicative of spam and blocking mail from known sources of spam
(and other likely sources, such as ISP dialup lines). Source-based
blocking can be effective, but it also tends to block a fair amount of
legitimate mail along with the spam. For example, some blacklists cause the blocking of mail from kernel.org,
despite the fact that no spam originates there. Source-based blocking is
unreliable enough that quite a few sites are unwilling to use it, despite a
strong desire to be rid of spam.
Pattern matching has shown more promise. Early spam filtering was done
with complex procmail
scripts, but the current champion of pattern-based spam filtering can
only be SpamAssassin. Using a
detailed set of rules, SpamAssassin cleans out the trash to great effect.
LWN has been using it for some months, and it has made life much easier -
lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of
our time to work on LWN, and it keeps us from accidentally deleting mail
from readers that used to get buried in the spam.
One thing that SpamAssassin users tend to notice, however, is that its
effectiveness decreases over time. Each new update blocks more spam - a
recent upgrade freed us from a whole unpleasant class of Nigeria spam, for
example. But pattern-based matching only works as well as its patterns,
and they tend to go stale as spammers move on to new tricks. Keeping
SpamAssassin effective requires a number of highly dedicated people to
actually read all that spam and come up with new rules. Most
SpamAssassin users are unlikely to be able (or willing) to write new rules
themselves.
Recently, a new approach to spam filtering has attracted a lot of
attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather
than try to come up with an endless stream of clever patterns to detect
spam, why not just look at the words spammers use? Each word can be
assigned a probability that any message that contains it is spam; the
probabilities for the words in any specific message are then combined using a
Bayesian filter, yielding an overall probability estimate. If that
estimate is high enough, the message is classified as spam.
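As a rough illustration of the combining step, here is a minimal Python sketch using the formula given in Graham's paper. It assumes that the words are independent and that spam and legitimate mail are equally likely beforehand. The function names and the 0.9 cutoff are illustrative choices for this sketch, not taken from bogofilter or any other particular filter.

```python
from math import prod


def combine(word_probs):
    """Combine per-word spam probabilities into one overall estimate.

    Assumes independent words and equal prior odds of spam and ham,
    so the result is prod(p) / (prod(p) + prod(1 - p)).
    """
    spam = prod(word_probs)
    ham = prod(1.0 - p for p in word_probs)
    return spam / (spam + ham)


def is_spam(word_probs, threshold=0.9):
    """Classify a message as spam if the combined estimate is high enough.

    The threshold value is an assumption made for this example.
    """
    return combine(word_probs) > threshold
```

Note that a single strongly "hammy" word can pull the estimate down sharply, which is part of why this sort of filter tends to err toward letting spam through rather than discarding real mail.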
At first glance, going up against a tool as good as SpamAssassin with
such a simple technique seems like a losing battle, but this approach has a
number of advantages:
- Development of the word-based rules can be automated - it is
just a matter of feeding the filter enough spam and "ham" (legitimate
mail) and letting it work out a probability factor for each word.
- The filter can be made to follow shifting patterns in spam by
passing it each message that it misclassifies. Users cannot be
expected to master regular expressions and write patterns, but they
can be asked to hit a "this is spam" key in their mailer.
- Each user's spam filter comes to reflect the mail that the user
receives. Spam seems like the ultimate in indiscriminate marketing,
but the fact is that different people can receive very different spam.
An individually derived rule base should prove more effective than a
"one size fits all" set of patterns.
- Classification of mail with a Bayesian filter can be done relatively
quickly.
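The first two points can be sketched in a few lines of Python. This is only an illustration of the idea, not bogofilter's actual code: the class name, the tokenizer, and the clamping of probabilities to the range 0.01 to 0.99 are all assumptions made for the example. Training is just counting, and the "this is spam" key amounts to one more call to train().

```python
import re
from collections import Counter


class WordStats:
    """Per-word message counts for spam and ham (hypothetical helper)."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    @staticmethod
    def tokenize(text):
        # Crude tokenizer (an assumption): lowercase runs of letters,
        # digits, and a few symbols; each word counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, spam):
        """Feed one sorted message to the filter."""
        words = self.tokenize(text)
        if spam:
            self.spam_words.update(words)
            self.spam_msgs += 1
        else:
            self.ham_words.update(words)
            self.ham_msgs += 1

    def word_prob(self, word):
        """Estimate the probability that a message containing word is spam."""
        s = self.spam_words[word] / max(self.spam_msgs, 1)
        h = self.ham_words[word] / max(self.ham_msgs, 1)
        if s + h == 0:
            return 0.5  # never-seen word: no evidence either way
        # Clamp so that no single word is treated as absolute proof.
        return min(0.99, max(0.01, s / (s + h)))


stats = WordStats()
stats.train("Hot stocks you are missing out on", spam=True)
stats.train("Patch for the scheduler attached", spam=False)
# Later, when a spam message slips through, the user's keypress
# simply trains the filter on that one message:
stats.train("Extract money from Africa via your bank account", spam=True)
```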
All of the above is irrelevant, however, if the Bayesian approach does not
succeed in actually filtering spam. To get a sense for the state of the
art, we took 3000 messages received at lwn@lwn.net - a little under two
weeks' worth. 295 of those messages were real mail, and 2705 were spam. If
one were to believe the bulk of our mail, one would conclude that about
every part of our anatomy (even those we don't possess) is the wrong size,
that we are so honest that people want to extract money from Africa via our
bank account, that we're missing out on numerous hot stocks, that we have a
strange attraction to domesticated animals, and that the purchase of
something called the "TushyClean" would greatly improve our lives. Trust
us, this exercise has not been fun, but no sacrifice is too great for our
readers.
Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new
Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once
after training with 15% of the 3000 messages, and once after being trained
with the whole set. Then we ran both filters on 5000 recent postings from
the linux-kernel list, twelve of which were spam (devfs flames were not
counted). The results were:
False positives are legitimate mail classified as spam. These, of course,
are bad news, since they can cause the loss of real mail. False negatives
are spam messages that slip through - an annoyance. It is appropriate that spam
filters tend to err toward false negatives, and both filters shown here do
exactly that.
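For readers who want to run a similar comparison on their own sorted mail, the two error counts can be tallied as in the sketch below. The function name and the input format (pairs of "really spam" and "flagged as spam" booleans) are assumptions made for this example.

```python
def error_counts(results):
    """Count false positives and false negatives.

    results is an iterable of (actually_spam, flagged_as_spam) pairs,
    one pair per message.
    """
    results = list(results)
    # Legitimate mail that the filter flagged as spam.
    false_pos = sum(1 for actual, flagged in results if flagged and not actual)
    # Spam that the filter let through.
    false_neg = sum(1 for actual, flagged in results if actual and not flagged)
    return false_pos, false_neg
```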
The results indicate that bogofilter requires a substantial amount of
training before it reaches the level of effectiveness achieved by
SpamAssassin. This training is best done with each individual user's mail,
but most users are unlikely to have a few thousand nicely sorted messages
sitting around to train their filters with. So bogofilter is likely to be
frustrating for many users to adopt - it won't work well until the user has
run "about one thousand" (according to Eric Raymond) messages through it.
That said, bogofilter is surprisingly effective for a tool that
is so new and very much still in development. And the run time relative to
SpamAssassin speaks for itself. Much of the difference there will be
explained by the fact that bogofilter is coded in C, while SpamAssassin is
in Perl. But bogofilter also owes its speed to a much faster algorithm.
The Bayesian filter idea is not new - see this 1998
paper on the Microsoft site, for example. But recently a great deal of
effort has gone into expressing this approach in free software. Bogofilter
is one example; another is the spambayes
project, which has been set up as a testbed for variants on the Bayesian
filter idea. It will be interesting to see where these projects go; they
seem to be off to an interesting start. Taking on a tool as effective as
SpamAssassin is a difficult challenge, but the free software world likes
challenges.