If there is one thing which is especially effective at making your editor
grumpy, it is spam. The incoming flood consumes bandwidth, threatens to
drown out the stream of real mail, and creates ongoing system
administration hassles. With current tools, it is possible to keep spam
from destroying the utility of the email system. But it's not always easy.
In recent years, much of the development energy in anti-spam circles has
gone into bayesian filters. The bayesian approach was kicked off in 2002
with the publication of Paul Graham's "A Plan for Spam". In its
simplest form, a bayesian filter keeps track of words found in email
messages and, for each word, a count of how many times that word appeared
in spam and in legitimate mail ("ham"). Over time, these statistics can be used to
examine incoming mail and come up with a probability that each is spam.
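The core of the technique fits in a few lines of Python. This is a minimal, Graham-style sketch for illustration only; none of the filters reviewed below uses exactly this code:

```python
from collections import Counter

class BayesFilter:
    """A toy word-counting spam filter in the spirit of
    "A Plan for Spam"; real filters add many refinements."""

    def __init__(self):
        self.spam = Counter()   # documents containing each word, spam
        self.ham = Counter()    # same, for legitimate mail
        self.nspam = self.nham = 0

    def train(self, words, is_spam):
        # Count each word once per message (document frequency).
        target = self.spam if is_spam else self.ham
        target.update(set(words))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def word_prob(self, w):
        # P(spam | word), smoothed so it never reaches 0 or 1.
        ps = (self.spam[w] + 1) / (self.nspam + 2)
        ph = (self.ham[w] + 1) / (self.nham + 2)
        return ps / (ps + ph)

    def score(self, words):
        # Combine per-word probabilities, naively assuming the
        # words are independent; 0.5 means "no evidence either way".
        p_spam = p_ham = 1.0
        for w in set(words):
            p = self.word_prob(w)
            p_spam *= p
            p_ham *= (1 - p)
        return p_spam / (p_spam + p_ham)
```

A message full of words seen mostly in spam will score near 1.0; one full of words from the ham pile will score near 0.0.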
Bayesian filters have proved to be surprisingly effective. A well-trained
filter can catch a high proportion of incoming spam with a very low false
positive rate. The filter will adapt to each user's particular email
stream, which is an important feature. It should not be surprising that
different people have wildly different legitimate email. It turns out,
though, that the spam stream can also vary quite a bit. An account which
looks like it could belong to a woman, for example, will tend to get
different messages - offering to alter the sizes of different parts of the
recipient's anatomy - than a man's account will. So the ability to tune a
filter to a specific mail stream - ham and spam - will increase its
accuracy.
There is quite a large selection of free bayesian filters out there. Your
editor decided to have a look around to see if there is any reason to
prefer one over the others. To that end, a number of characteristics were
measured for each filter.
To carry out the tests, your editor collected two piles of mail from his
personal stream; one was purely spam, and the other ham. Just over 1,000
messages from each pile were set aside to be used to train the filters.
Then, 6,000 hams and 9,000 spams were fed to each filter, with the filter's
verdict and processing time being recorded. Each mis-classified message
was immediately fed back to the filter to further train it. Multiple runs
were made with different parameters, but, in general, your editor
resisted the urge to go tweaking knobs. Some of these filters offer a vast
number of obscure parameters to set; one can only hope that their default
values are reasonable.
As a side note, your editor was surprised and dismayed at how difficult the
task of producing pure sets of spam and ham was. The process started with
mail sorted by SpamAssassin at delivery time. Your editor then passed over
the entire set, twice, reclassifying each message which was in the wrong
place. It was only after some early tests started reporting "false
positives" that it became clear how much spam still lurked in the "ham"
pile. It took more manual passes, and many passes with multiple filters,
to clean them all out. The developers who claim that their filters do a
better job than a human does are right - when that human is your editor, at
least. It also turns out that a few incorrectly classified messages can
badly skew the results; bayesian
filters are easily confused if you train them badly.
Anyway, the results will be presented in five batches of 1200 hams and 1800
spams. Nothing special was done between these batches; this presentation
is intended to show how the filter's behavior evolves as it is trained on
more messages. All of the results are also pulled together in a summary
table at the end of the article.
Bogofilter was originally
written by Eric Raymond shortly after Paul Graham's article was posted. It
has evolved over time, and has picked up a wider community of contributors
and maintainers. Bogofilter uses a modified version of the bayesian
technique, with a number of knobs to tweak. It is written in C and is
quite fast.
Training for bogofilter is somewhat complex; your editor was unable to
train it into a stable configuration by feeding it hams and spams
directly. The presence of several different training scripts in the source
tree's "contrib" directory suggests that others have had to put some work
into training as well. In the end, the contributed "trainbogo.sh" script
appeared to do a reasonable job, but it required about three runs to get
bogofilter into a stable state.
Bogofilter offers two approaches to ongoing training. By default, the
filter is not trained by new messages as it classifies them. People who
use bogofilter in this mode will set aside bogofilter's mistakes for
later training. When the -u option is provided, however,
bogofilter will train itself on all messages it feels sufficiently strongly
about. Use of -u makes retraining on mistakes even more important;
otherwise, the filter will become increasingly likely to misclassify mail.
In general,
training a bayesian filter on its own output must be done with care. It
can help the filter to keep up with the spam stream as it evolves, but it
also is a positive feedback loop which can go badly wrong if not carefully
managed.
Your editor ran bogofilter (v1.01) in both modes (starting with a freshly trained
database in each case). Here are the results:
Legend: Fn is the number of false negatives (spam which makes
it through the filter); Fp is false positives (legitimate mail
misclassified as spam), and T is the processing time (clock, not CPU), in seconds per
message. The Size value at the end is the final size of the word
database, in MB.
Here we see that bogofilter without the -u option tends toward
around 50 missed spams out of a set of 1800 (a 2.8% error rate), but with
no false positives at all.
If bogofilter is allowed to train itself, the false negative rate drops
closer to 2.2%. As we will see, these results are not as good as some of the
other filters reviewed.
Without self-training, bogofilter requires a roughly constant 0.02 seconds
of time to classify a message; with self-training, that time increases as
the word database grows.
Bogofilter is clearly fast - the fastest of all the filters reviewed here.
One of the ways in which it gains that speed
is to not bother with attachments in the mail. The web page says
"Experience from watching the token streams suggests that spam with
enclosures invariably gives itself away through cues in the headers and...".
Bogofilter is intended to be integrated as a simple mail filter, optimally
invoked via procmail. It can place a header in processed messages (making
life easy for procmail) and also returns the spam status in its exit code
(making life easy for grumpy editor testing scripts). Bogofilter has
options for dumping out the word database and, for a given message, listing
the words which most influenced how that message was classified. Nothing
special has been done to make retraining easy; most users will probably
create folders of mistakes and occasionally feed them to the filter.
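A typical procmail hookup might look like the following sketch; the exact options and header values here are illustrative, and should be checked against the bogofilter man page for the version in use:

```
# Run each incoming message through bogofilter, adding an
# X-Bogosity header (-p passes the message through; -e makes
# the exit code zero for both spam and ham verdicts).
:0fw
| bogofilter -p -e

# File anything marked as spam into a separate folder.
:0:
* ^X-Bogosity: Spam
spam-folder
```

Users who prefer the exit-code interface can drop -p and branch on bogofilter's return status instead.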
An interesting - if intimidating - offering is CRM114, subtitled "the
controllable regex mutilator." While the main use of CRM114 appears to be
spam filtering, it has a wider scope; it can be trained, for example, to
filter interesting lines out of logfiles. According to the project page:
Criteria for categorization of data can be by satisfaction of
regexes, by sparse binary polynomial matching with a Bayesian Chain
Rule evaluator, a Hidden Markov Model, or by other means.
This tool comes with a 275-page book in PDF format, and needs every one of
those pages. Setting up CRM114 is not for the faint of heart; it involves
the manual creation of database files and the editing of a long configuration file
which, while not quite up to sendmail.cf standards, is still one
of the more challenging files your editor has encountered in some time.
Once all of that is done, however, the "crm" executable can be hooked into
a procmail recipe in the usual way without too much trouble.
The CRM114 documentation recommends against any sort of
initial training of the filter. The developers are strong believers in the
"train on errors" approach, saying that there are "mathematically
complicated" reasons why pre-training leads to worse results. For users
who don't get the hint, they do provide a way to perform pre-training:
If you really feel you must start by preloading some sample spam,
copy your most recent 100Kbytes or so of your freshest spam and
nonspam into two files in the current directory. These files MUST
be named "spamtext.txt" and "nonspamtext.txt" They should NOT
contain any base64 encodes or "spammus interruptus", straight ASCII
text is preferred. If they do contain such encodes, decode them by
hand before you execute this procedure.
The prospect of hand-decoding binary spam attachments is likely to put off
most people who were pondering pre-training their filters - and, one
assumes, that is the desired result. Of course, one can also use the normal
training commands to feed messages into the system in a pre-training mode,
but the documentation doesn't say that.
While filter training can be done on the command line, users can also
retrain the filter by forwarding errors back to themselves. The message
must be edited to include a training command and password; the developers
also recommend removing anything which shouldn't be part of the training.
Strangely, users are also told to remove the markup added by CRM114 itself
- something which, one would think, could be handled automatically.
Your editor tested the 20060118 "BlameTheReavers" release of CRM114. The
first test was done without training, as recommended; then, just to be
stubborn, a test was run with a pre-trained filter.
Some things jump out immediately from those numbers. CRM114 is quite
fast. It is also quite effective very early on; the first 3000 messages
were processed with exactly one false positive and one false negative -
starting with an untrained filter. On
the other hand, its performance appears to worsen over time, and, in
particular, the false positive rate grows in a discouraging way. The false
positives varied from vaguely spam-like messages (Netflix updates, for
example) to things like kernel patches. Your editor concludes that CRM114
operates as a very aggressive and quick-learning filter, but that it is
also relatively unstable.
DSPAM is a
GPL-licensed filter written in C. It is clearly aimed at large
installations - places with dedicated administrators and, possibly,
relatively unsophisticated users. As a result, it has a few features not
found in other systems. For example, it has a web interface with
statistics, facilities to allow users to manage filter training, and
"pretty graphs to dazzle CEOs." Users who don't want to train the filter
through a web page can forward mistakes to a special address instead.
There are several ways to hook DSPAM into
the mail system, including a command-line filter, a POP proxy, and an SMTP
front-end which can be put between the net and the mail delivery agent.
There are several choices of backend storage (SQLite, Berkeley DB,
PostgreSQL, MySQL, Oracle, and more), and a number of
different filtering techniques. The filter can also run in a client-server
mode, much like SpamAssassin.
DSPAM is also a package with a dual-license option; companies interested in
shipping the software without providing source can purchase a separate
license from the developer.
The system is intended to require relatively little maintenance. It has a
set of tools, meant to be run from cron, which handle much of the
routine housekeeping. Among other things, DSPAM will automatically trim
its word list - getting rid of terms which have not been seen for a while
and which have little influence on message scoring.
Initial training can be performed using the dspam_train utility;
it uses a train-on-errors approach. Thereafter, DSPAM offers several
training modes. The recommended "teft" mode trains on every message
passing through the system. There is a train-on-errors mode, and a "tum"
mode ("train until mature") which emphasizes relatively new and uncommon
words. Your editor ran DSPAM (in the standalone, command-line mode) using
all three training schemes, with the following results:
So DSPAM shows strong spam detection in all three modes with a mid-range
execution time; it is much slower than bogofilter, but much faster than
some of the alternatives. The comprehensive training mode would appear to
be the most effective; the TUM mode increases the false negative rate
slightly, and the TOE mode introduces false positives. Note that the DSPAM
database is quite large; to a great extent, this volume is taken up by a
directory full of message hashes used to keep track of which messages have
been used to train the filter.
SpamAssassin, which is written in Perl, is unique among
the filters tested in that it combines a bayesian filter with a large set
of heuristic scoring rules. The filter, in essence, is just another rule
which gets mixed in with the rest. The rule database takes a great deal of
effort (on the part of the SpamAssassin developers) to maintain, and
testing messages against all of those rules makes SpamAssassin relatively
slow. There is a huge advantage to this approach, however: SpamAssassin
works well starting with the first message it sees, and it is able to train
its own bayesian filter using the results from the rules.
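The scoring model is easy to picture: every rule that fires contributes a score, the bayesian verdict is just another rule, and the sum is compared against a threshold. Here is a toy sketch with invented rule names and values; SpamAssassin's real rule set is far larger and carefully tuned:

```python
# Invented rule scores, for illustration only.
RULES = {
    "SUBJ_ALL_CAPS": 1.6,   # heuristic: shouting subject line
    "HTML_ONLY": 0.8,       # heuristic: no plain-text part
    "BAYES_99": 5.0,        # bayesian filter is nearly certain
}

def spam_score(hits, threshold=5.0):
    """Sum the scores of the rules that fired and compare the
    total against the spam threshold."""
    total = sum(RULES[r] for r in hits)
    return total, total >= threshold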
Another nice feature in SpamAssassin is its word list maintenance. Most
bayesian filters seem to grow their word lists without bound. Since spam
can contain a great deal of random nonsense (actually, much of your
editor's ham does as well), the word list can quickly fill up with tokens
which are highly unlikely to ever help in classifying messages.
Documentation for some other filters suggests occasionally dumping the word
list and starting over. SpamAssassin, instead, will occasionally (and
automatically) expire tokens which have not been seen for some time. So the
word list stays
within bounds. In general, SpamAssassin is relatively good at minimizing
the need for the user to perform maintenance tasks.
The sa-learn utility is used for most bayesian filter tasks. It
can retrain the filter on mistakes, dump out word information, and force
cleanup operations. SpamAssassin can be run in a client/server mode, which
improves performance on busy systems. The client/server mode can
also help to put a bound on SpamAssassin's memory use, which can be a
little frightening. Standalone SpamAssassin on a small-memory system can
create severe thrashing.
Your editor ran two sets of tests with SpamAssassin 3.1.0, running in the
client/server mode, with network blacklist tests enabled. (Before somebody
asks: the test was run on a standalone system to avoid any possible
contamination by your editor's regular mail stream). Exactly one
scoring tweak was made: the score for BAYES_99 (invoked when the bayesian
filter is absolutely sure that the message is spam) was set to 5.0,
enabling the filter to condemn messages on its own. That change helps to
emphasize the bayesian side of SpamAssassin, and, in your editor's
experience, makes it more effective. The first test
involved a pre-trained database, as was done with the other filters. The
second test, instead, started with an empty bayesian database in an effort
to see how well the tool trains itself. Here are the results:
The results here show that SpamAssassin filters up to 99.9% of incoming
spam, at the cost of significant amounts of CPU time. The
untrained run shows higher error rates, but does eventually converge on
something similar to the pre-trained version. But, at over one second per
message, each testing run (comprising 15,000 messages) took a rather long
time.
SpamAssassin operates as a filter, adding a header to messages as they pass
through. That header can be used in procmail recipes; the thunderbird
mail agent is also set up to optionally use the SpamAssassin header.
SpamBayes is a filter
written in Python. The SpamBayes hackers have, perhaps more than some of
the other filter developers, made tweaks to the bayesian algorithm in an
attempt to improve performance. Those hackers have also put more effort
into mail system integration than some; as a result, SpamBayes comes with
an Outlook plugin, POP and IMAP proxy servers, and a filter for Lotus
Notes. It is still possible to use SpamBayes as a command-line filter with
procmail.
There is a separate script (sb_mboxtrain.py) which is used to
train the filter. Your editor followed the instructions and found it
seemingly easy to use - it nicely understands things like MH and Maildir
folders. However, when used as documented, sb_mboxtrain.py
happily (and silently) puts the resulting word database in an undisclosed
location, and filtering works poorly. Adding a few options to make the
database location explicit took care of the problem.
SpamBayes 1.0.4 was tested in two modes: retraining just on errors, and
training on all messages.
SpamBayes takes a while to truly train itself, but it does eventually get
to a 98.9% filtering rate - better than some, but not truly amazing. The
word database remains relatively small, but processing time is
significant. Everything gets worse with comprehensive training: the spam
detection rate drops while processing time increases. SpamBayes does,
however, manage to avoid false positives in both modes.
SpamProbe is a filter
written in C++ and released under the Q Public License. Unlike most
filters, which record statistics on individual words, SpamProbe is also
able to track pairs of words (DSPAM can do that too). SpamProbe looks at
text attachments, discarding other types of attachments with one exception:
there is a simple parser for GIF images. This parser
creates various words describing images in a message (based on sizes, color
tables, GIF extensions, etc.) and uses them in evaluating each message.
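The idea of turning image structure into pseudo-words is simple to sketch. Here is a rough Python illustration that reads a GIF header and emits tokens; the token names are invented for this example and are not SpamProbe's actual output:

```python
import struct

def gif_tokens(data: bytes):
    """Derive pseudo-words from a GIF header, in the spirit of
    SpamProbe's image parser (token names here are made up)."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        return []   # not a GIF; nothing to say about it
    # Logical screen descriptor: width, height (little-endian
    # 16-bit), then a packed byte describing the color table.
    width, height, packed = struct.unpack("<HHB", data[6:11])
    return [
        f"gif_version_{data[3:6].decode()}",
        f"gif_width_{width}",
        f"gif_height_{height}",
        f"gif_colors_{2 ** ((packed & 0x07) + 1)}",
    ]
```

Banner-ad-sized images with tiny color tables end up producing distinctive tokens, which the bayesian machinery can then weigh like any other word.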
SpamProbe is packaged as a single command with a vast number of options.
There is an "auto-train" mode for getting the filter trained in the first
place. There are two filtering modes which the author calls "train" and
"receive." Both will filter the message; the "train" mode only updates the
word database "if there was insufficient confidence in the message's
score," while "receive" always updates the database. The author recommends
"train" mode; your editor tested SpamProbe 1.4a in both modes:
SpamProbe's "receive" mode demonstrates that, with bayesian filters, more
training is not always better. The added training slows down processing
significantly, to the point that SpamProbe is almost as slow as
SpamAssassin, but the end results are worse than those obtained without
comprehensive training. SpamProbe has
a significant false positive rate in either mode, but the "receive"
mode makes it worse. In operation, SpamProbe generates vast amounts of disk
traffic, rather more than was observed with the other filters.
Unlike most other filters, SpamProbe does not insert a header in filtered
mail. Instead, it emits a single line giving its verdict; the author then
suggests using a tool like formail to create a header using that
score. So integration of SpamProbe is a little harder than with some other
filters.
Here is a summary table combining all of the filter runs described above:
In the above table, the "false positives" columns were left blank for tests
in which there were none. Since false positives will be the bane of any
spam filter, it is good if they stand out.
One should, of course, take all of the above figures with a substantial
grain of salt. They reflect performance on your editor's particular mail
stream; things could be very different with somebody else's mail. Still,
your editor's mail stream is varied enough that, perhaps, a few conclusions
can be drawn.
One of those would be that SpamAssassin is still hard to beat. It is, by
far, the slowest of the filters, but it is highly effective with a minimum
amount of required setup and maintenance on the user's part. For the most
part, it Just Works, and it works quite well. In situations where an
administrator is setting things up for a large group of users, DSPAM may
well be indicated. The broad flexibility of that tool makes it easy to
integrate into just about any mail system, and the web interface makes its
operation relatively transparent to users. Just be sure you have a big
disk for its databases.
CRM114 is an interesting project; its combination of technologies has the
potential to make it the most accurate of all the filters. It has the look
of a hardcore power tool. This tool,
however, is not ready for prime time at this point. It is a major hassle
to set up, and, for your editor at least, keeping the filter stable was a
challenge. The other three filters all have their strong points, but none
of them had the level of spam detection that your editor would like to see.
There are, of course, other filters out there as well. Some of the
graphical mail clients have started to integrate their own filters. There
is a great convenience in having a "junk" button handy, but the integrated
filters sacrifice transparency and, in your editor's (admittedly limited)
experience, they do not seem to develop the same level of accuracy. There
is also ifile,
which is intended to be a more general mail classifier. That tool is no
longer under development, however.
In the end, none of the filters reviewed is perfect - it would be nice to
see no spam at all. But some of them are surprisingly close. Think back,
for a minute, to the days when we were complaining about getting a dozen spams
per day - or per week. Who would have thought that we would be able to
cope with thousands of spams per day and still deal with our mail? The
developers of these filters have, in a significant way, saved the net, and
your editor thanks them.
The growing success of free software has led to a widening of the
culture clash between "open" and "closed" to include other domains. One recent
skirmish, for example, concerned a particularly important kind of digital
code - the sequence of the human genome - and whether it would be
proprietary, owned by companies like Celera, or freely available.
Openness prevailed, but in another arena - scholarly publishing -
advocates of free (as in both beer and freedom) online access to research
papers are still fighting the
battles that open source won years ago. At stake is nothing less than
control of academia's treasure-house of knowledge.
The parallels between this movement - what has come to be known as "open
access" - and open source are striking. For both, the ultimate
wellspring is the Internet, and the new economics of sharing that it
enabled. Just as the early code for the Internet was a kind of proto-open
source, so the early documentation - the RFCs - offered an example
of proto-open access. And for both sets of practitioners, it is recognition
- not recompense - that drives them to participate.
Like all great movements, open access has its visionary - the RMS figure
- who constantly evangelizes the core ideas and ideals. In 1976, the
Hungarian-born cognitive scientist Stevan Harnad founded a
scholarly print journal that offered what he called "open peer
commentary", using an approach remarkably close to the open source
development process. The problem, of course, was that the print medium was
unsuited to this kind of interactive development, so in 1989 he launched a
Usenet/Bitnet magazine called Psycoloquy, where the feedback
process of the open peer commentary could take place in hours rather than
weeks. Routine today, but revolutionary for scholarly studies back then.
Harnad has long had an ambitious vision of a new kind of scholarly sharing
(rather as RMS does with code): one of his early papers is entitled Post-Gutenberg
Galaxy: The Fourth Revolution in the Means of Production of
Knowledge, while a later one is called bluntly: A Subversive Proposal
for Electronic Publishing. Meanwhile, the aims of the person who could be
considered open access's Linus to Harnad's RMS, Paul
Ginsparg, a professor of physics, computing and information science at
Cornell University, were more modest.
At the beginning of the 1990s, Ginsparg wanted a quick and dirty solution
to the problem of putting high-energy physics preprints (early
versions of papers) online. As it turns out, he set up what
became the arXiv.org preprint repository on
16 August, 1991 - nine days before Linus made his fateful "I'm doing
a (free) operating system (just a hobby, won't be big and professional like
gnu) for 386(486) AT clones" posting.
But Ginsparg's links with the free software world go back much further.
Ginsparg was already familiar with the GNU manifesto in 1985, and, through
his brother, an MIT undergraduate, even knew of Stallman in the 1970s.
Although arXiv.org only switched to GNU/Linux in 1997, it has been using
Perl since 1994, and Apache since it came into existence. One of Apache's
founders, Rob Hartill, worked for Ginsparg at the Los Alamos National
Laboratory, where arXiv.org was first set up (as an FTP/email server at
xxx.lanl.org). Other open source programs crucial to arXiv.org include
TeX, GhostScript and MySQL.
In 1994, Harnad espoused the idea of self-archiving in his Subversive
Proposal, whereby academics put a copy of their papers online locally
(originally on FTP servers) as well as publishing them in hardcopy
journals. The spread of repositories soon led to interoperability issues.
The 1999 Open Archives
Initiative (in which Ginsparg was a leading figure) aimed to deal with
this by defining a standard way of exposing an article's metadata so that
it could be harvested efficiently by search engines.
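The protocol which came out of that effort, OAI-PMH, is a simple HTTP/XML affair: a harvester's request is just a URL carrying a verb and a metadata format. A small Python sketch (the endpoint URL is illustrative; the verb and metadataPrefix parameters come from the OAI-PMH specification):

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", **extra):
    """Build an OAI-PMH ListRecords request URL; the server
    answers with an XML stream of article metadata records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    params.update(extra)   # e.g. set="physics:hep-th"
    return base_url + "?" + urlencode(params)
```

Any repository exposing this interface can be harvested by any conforming search engine, which is the whole point of the initiative.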
Beyond self-archiving - later termed "green" open access by Harnad -
lies publishing in fully open online journals ("gold" open
access). The first open access magazine publisher, BioMed Central - a kind
of Red Hat of the field - appeared in 1999. In 2001 the Public Library of
Science (PLoS) was
launched; PLoS is a major publishing initiative inspired by the examples
of arXiv.org, the public genomics databases and open source software, and
which was funded by
the Gordon and Betty Moore Foundation (to the tune of $9 million over five
years).
Just as free software gained the alternative name open source at
the Freeware Summit in
1998, so free open scholarship (FOS), as it was called until then by the
main newsletter that covered it - written by Peter Suber,
professor of philosophy at Earlham College - was renamed open
access as part of the Budapest Open Access
Initiative in December 2001. Suber's newsletter turned into Open Access News
and became one of the earliest blogs; it remains the definitive record of
the open access movement, and Suber has become its semi-official chronicler (the Eric
Raymond of open access - without the guns).
After the Budapest meeting (funded by speculator-turned-philanthropist George Soros, who played
the role taken by Tim O'Reilly at the Freeware Summit), several other major
declarations in support of open access were made, notably those at Bethesda
and Berlin (both 2003). Big research institutions started actively
supporting open access rather as big companies like IBM and HP did
with open source earlier. Key early backers were the Howard Hughes Medical
Institute (2002) in the US and the Wellcome Trust (2003) in the UK, the
largest private funders of medical research in their respective countries.
Both agreed to pay the page charges that "gold" open access titles
need in order to provide the content free to readers - typically $1000
per article. This is not as onerous as it sounds: the annual subscription for
a traditional scientific journal can run to $20,000 (even though the
authors of the papers receive nothing for their work). For a major
research institution, the cumulative cost adds up to millions of dollars a
year in subscriptions. This annual tax is very like the licensing fees in
the proprietary software world. What an institution saves by refusing
to pay these exorbitant subscriptions - as the libraries at Cornell,
Duke, Harvard and Stanford Universities have done in the US - it can use
to fund page charges, just as companies can use monies saved on software
licensing costs to pay for the support and customization they need.
With all this activity, governments started getting interested in open
access, and so did the big publishers, worried by the potential loss of
revenue (the Microsoft of the scientific publishing world, the Anglo-Dutch
giant, has had operating profits of over 30%). The UK House of Commons Science
and Technology committee published a lengthy report recommending obligatory
open access for publicly-funded research: it was ignored by the UK
government because of pressure from British publishing houses. In 2004,
the US NIH issued a draft of its own plans for open access support and
was forced to water them down because of fierce lobbying from science
publishers.
Given the many similarities between the respective aims of open source and
open access, it is hardly surprising that there are direct links between
them. In 2002, MIT released its DSpace digital repository application
under a BSD license, while Eprints, the main archiving
software used for creating institutional repositories, went open source
under the GPL. As the latter's documentation puts it:
The EPrints software has been developed under
GNU/Linux. It is intended to work on any GNU system. It may well work on
other UNIX systems too. Other systems people have got EPrints up and
running on include Solaris and MacOSX. There are no plans for a version to
run under Microsoft Windows.
There is a commercial, supported version
too. Open Journal Systems is another
journal management and publishing system released under the GPL.
As the mainstream open source projects mature, the applications used by the
open access movement could well prove increasingly attractive to coders who
are looking for a challenge and an area where they can make a significant
contribution not just to free software, but also to widening free
access to knowledge itself.
Glyn Moody writes about open source and open access at opendotdotdot.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: A new Linux worm; New vulnerabilities in bluez, CASA, gnupg, heimdal, metamail, tar, ...
- Kernel: The kevent interface; Sysfs and a stable kernel ABI.
- Distributions: Creating a Live CD with Kadischi; Openwall 2.0 and early releases of Fedora Core, SUSE and [K/Edu]Ubuntu; Kaboot
- Development: Urwid, a Console UI library for Python, Democracy Internet TV,
new versions of Sussen, Speex, Midgard, mnoGoSearch, jack_capture,
mp3splt-gtk, Rivendell, RUNA WFE, Accelerated Indirect GL X, GARNOME,
XCircuit, KMyMoney, Wine, OpenEHR, FreeMED, Rosegarden, OO.o, OmegaT,
Ultimate++, flpsed, GNU HC11/HC12.
- Press: RIAA: CD ripping not fair use, USPTO quality initiative, Libre Graphics
Meeting, Jim Starkey moves to MySQL AB, Oracle buys open-source, California
looks at open voting, EU Council and data retention, Asterisk on OpenWrt,
Banshee review, Linux on Mactel.
- Announcements: Open Group announces API sets, Sony BMG Settles, PostgreSQL Resource Directory,
Bleepfest: London, KDE DevRoom at FOSDEM, FOSS Means Business: Belfast,
KDE localization, Annodex Media validation service,
l.o.s.s. open-source sound project.