Spam avoidance techniques
It is said that most free software comes about as the result of some
developer scratching a personal itch. It's also said that very little
innovative free software development is done; free software projects spend
their time "chasing taillights" - catching up to the features offered by
proprietary code. The field of spam filtering may well confirm one of
those stereotypes while refuting the other. After all, if there is
anything that truly itches, it's spam. But some of the free software being
developed to combat spam is truly innovative.
Most spam filtering work has involved two techniques: testing mail against
patterns indicative of spam and blocking mail from known sources of spam
(and other likely sources, such as ISP dialup lines). Source-based
blocking can be effective, but it also tends to block a fair amount of
legitimate mail along with the spam. For example, some blacklists cause the blocking of mail from kernel.org,
despite the fact that no spam originates there. Source-based blocking is
unreliable enough that quite a few sites are unwilling to use it, despite a
strong desire to be rid of spam.
Pattern matching has shown more promise. Early spam filtering was done
with complex procmail scripts, but the current champion of pattern-based
spam filtering can only be SpamAssassin. Using a detailed set of rules,
SpamAssassin cleans out the trash to great effect.
LWN has been using it for some months, and it has made life much easier -
lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of
our time to work on LWN, and it keeps us from accidentally deleting mail
from readers that used to get buried in the spam.
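As a rough illustration of rule-based filtering, here is a minimal Python sketch in which each matching pattern contributes a score and the total is compared against a threshold. The patterns, scores, and threshold below are invented for this example and are not SpamAssassin's actual rules; only the general idea of summing rule scores is taken from how SpamAssassin works.

```python
import re

# Hypothetical rules: (compiled pattern, score). These are illustrative
# only and are NOT taken from SpamAssassin's real rule set.
RULES = [
    (re.compile(r"\bact now\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bmillion dollars\b", re.IGNORECASE), 2.5),
    (re.compile(r"!{3,}"), 1.5),  # three or more exclamation marks in a row
]

THRESHOLD = 5.0  # assumed cutoff for this sketch


def spam_score(message):
    """Sum the scores of every rule whose pattern appears in the message."""
    return sum(score for pattern, score in RULES if pattern.search(message))


def is_spam(message):
    """Flag the message when its total score reaches the threshold."""
    return spam_score(message) >= THRESHOLD
```

The weakness described in the text is visible here: the filter is only as good as the hand-written entries in `RULES`, and somebody has to keep adding to them.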
One thing that SpamAssassin users tend to notice, however, is that its
effectiveness decreases over time. Each new update blocks more spam - a
recent upgrade freed us from a whole unpleasant class of Nigeria spam, for
example. But pattern-based matching only works as well as its patterns,
and they tend to go stale as spammers move on to new tricks. Keeping
SpamAssassin effective requires a number of highly dedicated people to
actually read all that spam and come up with new rules. Most
SpamAssassin users are unlikely to be able (or willing) to write new rules
themselves.
Recently, a new approach to spam filtering has attracted a lot of
attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather
than try to come up with an endless stream of clever patterns to detect
spam, why not just look at the words spammers use? Each word can be
assigned a probability that any message that contains it is spam; the
probabilities for the words in any specific message are then combined using a
Bayesian filter, yielding an overall probability estimate. If that
estimate is high enough, the message is classified as spam.
At first glance, going up against a tool as good as SpamAssassin with
such a simple technique seems like a losing battle, but this approach has a
number of advantages:
- Development of the word-based rules can be automated - it is
just a matter of feeding the filter enough spam and "ham" (legitimate
mail) and letting it work out a probability factor for each word.
- The filter can be made to follow shifting patterns in spam by
passing it each message that it misclassifies. Users cannot be
expected to master regular expressions and write patterns, but they
can be asked to hit a "this is spam" key in their mailer.
- Each user's spam filter comes to reflect the mail that the user
receives. Spam seems like the ultimate in indiscriminate marketing,
but the fact is that different people can receive very different spam.
An individually derived rule base should prove more effective than a
"one size fits all" set of patterns.
- Classification of mail with a Bayesian filter can be done relatively
quickly.
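The automated training described in the first two points can be sketched as simple word counting. The Python below is a hypothetical illustration, not the algorithm used by any particular filter: the class name, the whitespace tokenizer, and the clamping values are all assumptions made for the example.

```python
from collections import Counter


class WordStats:
    """Per-word counts learned from sorted spam and ham (hypothetical sketch)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Record one message; each distinct word counts once per message."""
        words = set(text.lower().split())  # naive tokenizer, assumed
        if is_spam:
            self.spam_counts.update(words)
            self.n_spam += 1
        else:
            self.ham_counts.update(words)
            self.n_ham += 1

    def spam_probability(self, word):
        """Estimate how strongly a word indicates spam, clamped to [0.01, 0.99]."""
        s = self.spam_counts[word] / max(self.n_spam, 1)
        h = self.ham_counts[word] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # word never seen: no evidence either way
        return min(0.99, max(0.01, s / (s + h)))
```

A "this is spam" key in a mailer would then amount to calling `train(message, True)` on the misclassified message, so the word table follows whatever the user actually receives.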
All of the above is irrelevant, however, if the Bayesian approach does not
succeed in actually filtering spam. To get a sense for the state of the
art, we took 3000 messages received at lwn@lwn.net - a little under two
weeks' worth. Of those messages, 295 were real mail and 2705 were spam. If
one were to believe the bulk of our mail, one would conclude that about
every part of our anatomy (even those we don't possess) is the wrong size,
that we are so honest that people want to extract money from Africa via our
bank account, that we're missing out on numerous hot stocks, that we have a
strange attraction to domesticated animals, and that the purchase of
something called the "TushyClean" would greatly improve our lives. Trust
us, this exercise has not been fun, but no sacrifice is too great for our
readers.
Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new
Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once
after training with 15% of the 3000 messages, and once after being trained
with the whole set. Then we ran both filters on 5000 recent postings from
the linux-kernel list, twelve of which were spam (devfs flames were not
counted). The results were:
False positives are legitimate mail classified as spam. These, of course,
are bad news, since they can cause the loss of real mail. False negatives
are spam that slips through - an annoyance. It is appropriate that spam
filters tend to err toward false negatives, and both filters shown here do
exactly that.
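For readers who want to run a similar comparison on their own sorted mail, the two error rates can be computed as follows. This is a small Python sketch; the function name and the input format (pairs of "really spam" and "flagged as spam" booleans) are assumptions made for the example, and no figures from our test are reproduced here.

```python
def error_rates(results):
    """Return (false_positive_rate, false_negative_rate).

    results is an iterable of (is_spam, flagged_as_spam) boolean pairs,
    one per message. The false positive rate is the fraction of legitimate
    mail that was flagged; the false negative rate is the fraction of spam
    that got through.
    """
    results = list(results)
    ham_flags = [flagged for is_spam, flagged in results if not is_spam]
    spam_flags = [flagged for is_spam, flagged in results if is_spam]
    false_positives = sum(ham_flags)
    false_negatives = sum(not flagged for flagged in spam_flags)
    fp_rate = false_positives / len(ham_flags) if ham_flags else 0.0
    fn_rate = false_negatives / len(spam_flags) if spam_flags else 0.0
    return fp_rate, fn_rate
```

Keeping the two rates separate matters because they are not equally costly: a filter tuned to err toward false negatives should show a false positive rate near zero even if some spam still slips through.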
The results indicate that bogofilter requires a substantial amount of
training before it reaches the level of effectiveness achieved by
SpamAssassin. This training is best done with each individual user's mail,
but most users are unlikely to have a few thousand nicely sorted messages
sitting around to train their filters with. So bogofilter is likely to be
frustrating for many users to adopt - it won't work well until the user has
run "about one thousand" (according to Eric Raymond) messages through it.
That said, bogofilter is surprisingly effective for a tool that
is so new and very much still in development. And the run time relative to
SpamAssassin speaks for itself. Much of the difference there will be
explained by the fact that bogofilter is coded in C, while SpamAssassin is
in Perl. But bogofilter also owes its speed to a much faster algorithm.
The Bayesian filter idea is not new - see this 1998
paper on the Microsoft site, for example. But recently a great deal of
effort has gone into expressing this approach in free software. Bogofilter
is one example; another is the spambayes
project, which has been set up as a testbed for variants on the Bayesian
filter idea. It will be interesting to see where these projects go; they
seem to be off to an interesting start. Taking on a tool as effective as
SpamAssassin is a difficult challenge, but the free software world likes
challenges.
Where free software should be required by law
RISKS 22.24 includes a detailed article by Rebecca Mercuri on the latest
fun with the new voting systems in
Florida. That state, of course, was the source of (ongoing) uncertainty in
the 2000 U.S. presidential election, due, in part, to its ancient voting
equipment. Since then, the voting machines have been upgraded to new,
computer-based systems with touchscreen interfaces.
These systems are based on closed source code. There is no
external audit trail, no way of verifying that they are recording votes as
they were actually cast. Trade secret law forbids the inspection of the
code in the systems. One just has to trust the vendor that the results are
correct.
A primary election held there recently turned up a whole set of
problems, ranging from basic usability issues to outright failure.
There has been a lot of interest recently in laws requiring governments to
use free software in many or all situations. It remains unclear, to some
people anyway, whether such laws are really in the best interest of
government, the governed, or the free software community. But, in the
case of voting systems, the case seems clear: no part of the system that
elects people into positions of power should be opaque. The creation of a
free, transparent, verifiable electronic voting system should not be that
hard a task for governments or the free software community. There is no
excuse for using anything else.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: Multics security, 30 years later.
- Kernel: Linus and SpamAssassin; speeding up rmap; other memory management patches
- Distributions: New distributions: DebianEdu, OpenZaurus, Simply GNUstep; Libranet GNU/Linux 2.7 released
- Development: Jabberwocky IDE for LISP, AFPL Ghostscript 7.30 developer release,
Conexant HSF softmodem driver, WaveSurfer 1.4.4, Mozilla 1.0.1,
Pygame-1.5.3, Open Office 1.0.1 Alpha SDK, KOffice 1.2,
Xft/Fontconfig release 2.0, Quanta Plus 3.0 PR2, PHP 4.2.3,
GDB 5.3 branch.
- Commerce: Winners of the Australian Open Source Awards Announced; Sun to Offer Alternative to Microsoft Desktop
- Press: Hardwired PC copy protection, Perens leaves HP, $199 Lindows PCs, memory-resident embedded databases, SuSE 8.0 howto for Xbox, Sony Cocoon video recorder
- Announcements: archLSB for Itanium, Singapore Linux Conf, PHPCon 2002, Python UK Conf, Ruby Conf 2002, Perl Meetup in London
- Letters: Releasing software into the public domain; Industrial free software collaboration