By Jonathan Corbet
January 6, 2010
SpamAssassin is crucial
infrastructure, at least for some of us. So it was with some dismay that
your editor, while performing a quick New Year's Day disaster check, noted
that SpamAssassin had not made the adjustment to 2010 in good form. The
bug was straightforward and easy to fix, but it merits a closer look for
what it reveals about our infrastructure and how we support it.
The task assigned to SpamAssassin, of course, is to look over incoming
email and assign a score to each message indicating how likely that message
is to be spam. It does this job surprisingly well; your editor currently
receives around 5,000 spams per day - one every 17 seconds or so - but it's
a bad day if two dozen of those get past SpamAssassin and show up in the
inbox. Put simply: without
SpamAssassin, your editor's email address would be unusable. All it
takes is a five-minute window without spamd running to see what life would
be like if the incoming mail stream had to be dealt with in its full,
unfiltered glory. This is mission-critical software, so any faults which
turn up in it tend to be of great concern.
The core of SpamAssassin is a vast set of
rules looking for spammy characteristics in incoming email. The rules
match anything that the developers think might indicate spam; some of
the tests include:
- The presence of a rot13-encoded email address.
- Large numbers of blank lines.
- The originating address is in any of a number of network blacklists.
- Discussion of medication in a number of forms.
- HTML messages with huge fonts.
- The presence of URLs registered to known spammers.
...and so on. Each matching rule adds a numeric score to the message; when
the process is complete, the scores are added up to yield a total
spamminess value. The Bayesian recognizer also gets a chance to look at
the message and add a score of its own. At the conclusion of this process,
any message with a score of 5.0 or higher (by default) is considered to be
spam.
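The scoring process described above can be sketched in a few lines of Python. This is only an illustration: the rule names and scores below are invented for the example and are not SpamAssassin's actual rules or values; only the 5.0 default threshold comes from the text.

```python
# Hypothetical rule names and scores, for illustration only;
# real SpamAssassin rules and their scores differ.
RULE_SCORES = {
    "ROT13_ADDRESS": 1.2,
    "MANY_BLANK_LINES": 0.8,
    "HUGE_HTML_FONT": 1.5,
    "SENDER_IN_BLACKLIST": 2.4,
}

SPAM_THRESHOLD = 5.0  # the default threshold mentioned in the text


def total_score(matched_rules, bayes_score=0.0):
    """Sum the scores of every matching rule, plus the Bayesian score."""
    return sum(RULE_SCORES[name] for name in matched_rules) + bayes_score


def is_spam(matched_rules, bayes_score=0.0, threshold=SPAM_THRESHOLD):
    """A message counts as spam when its total reaches the threshold."""
    return total_score(matched_rules, bayes_score) >= threshold
```

No single rule in this sketch is enough to condemn a message; it takes several matches, or a few matches plus a strong Bayesian score, to cross the threshold.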
Some years ago, a SpamAssassin developer noticed that some unwanted mail
came in with dates far in the future. These messages almost certainly
represent an attempt by spammers to take advantage of mail clients which
sort messages by date; a far-future date should show up at the top of the
list. To deal with these messages, said developer wrote a rule matching
any date from the year 2010 or afterward. At the time, 2010 was some years
in the future, so the rule seemed to make sense. Surely somebody would fix
it long before that distant year arrived.
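A rule of this kind might look something like the following sketch. The regular expression here is an assumption made for illustration (it matches any four-digit year from 2010 through 2099); the actual SpamAssassin rule may well have been written differently.

```python
import re

# Assumed pattern: any four-digit year from 2010 through 2099.
# The actual SpamAssassin rule may have been written differently.
FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def date_looks_far_future(date_header):
    """Return True if the Date header mentions a year of 2010 or later."""
    return FUTURE_YEAR.search(date_header) is not None
```

A header like "Thu, 31 Dec 2009 23:59:59 -0700" does not match, while "Fri, 01 Jan 2010 09:15:00 -0700" does. The pattern has no idea what the current date is, so a year that was "far future" when the rule was written eventually becomes the present.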
The scores assigned to rules in SpamAssassin are not random, but neither
are they assigned by the rule authors. Instead, the project uses a "perceptron"
program to determine which combination of scores performs best against a
large body of spam and "ham" email. When this tool was run, legitimate email from
2010 was indeed a rare thing, so the rule turned out to be a very good
positive indicator for spam. As a result, it was assigned a score which,
in some situations, could be as high as 3.5.
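To see how a training run can hand a large score to such a rule, consider the textbook perceptron update applied to a toy corpus. This is a minimal sketch under stated assumptions, not the project's actual tool: the two rules, the four-message corpus, the learning rate, and the function name are all hypothetical.

```python
def train_perceptron(samples, n_rules, epochs=20, rate=0.1):
    """Learn one weight (score) per rule with the classic perceptron update.

    samples: list of (hits, label) pairs, where hits is a list of 0/1 flags
    (one per rule) and label is +1 for spam or -1 for ham.
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, label in samples:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            predicted = 1 if activation > 0 else -1
            if predicted != label:
                # Nudge the weights of the rules that fired toward the label.
                for i, h in enumerate(hits):
                    weights[i] += rate * label * h
                bias += rate * label
    return weights, bias


# Toy corpus with two hypothetical rules: [date_2010_or_later, many_blank_lines]
CORPUS = [
    ([1, 0], +1),  # spam with a far-future date
    ([1, 1], +1),  # spam with both traits
    ([0, 1], -1),  # ham that merely has blank lines
    ([0, 0], -1),  # ordinary ham
]

weights, bias = train_perceptron(CORPUS, n_rules=2)
```

In this corpus no ham message carries a 2010 date, so the date rule separates spam from ham perfectly and ends up with the largest weight. The learned weight is only as good as the corpus: once ham with 2010 dates exists, the same rule stops being informative, but the old score stays in place until somebody retrains or edits it.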
As of January 1, mail with 2010 dates suddenly became rather more common.
With the year-2010 rule now firing on every message, the SpamAssassin
threshold was, in effect, lowered from 5.0 to as low as
1.5. That, in turn, caused a fair amount of legitimate email to be
classified as spam, a most unwelcome development. Your editor, receiving
5,000 spams every day, has long since stopped scanning the spam folder for
false positives; even if they exist (which they almost never do), they
represent a needle which is almost impossible to find in a haystack that
large. So email classified as spam is, for all practical purposes, simply
lost.
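The arithmetic behind that lowered threshold is simple enough to spell out. The sketch below uses the two figures given in the text (the 5.0 default and the 3.5 maximum rule score); the function name is made up for the example.

```python
DEFAULT_THRESHOLD = 5.0     # default spam threshold from the text
YEAR_2010_RULE_SCORE = 3.5  # highest score the text cites for the date rule

# Score the remaining rules must reach once the date rule always fires.
effective_threshold = DEFAULT_THRESHOLD - YEAR_2010_RULE_SCORE


def misclassified_after_2010(other_rules_score):
    """True for mail that was ham before January 1 but is spam afterward."""
    return effective_threshold <= other_rules_score < DEFAULT_THRESHOLD
```

Any legitimate message that previously scored between 1.5 and 5.0 falls into this window, which is why mail that had been comfortably below the limit suddenly landed in the spam folder.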
As described in Justin
Mason's weblog, the year-2010 problem was noted by a SpamAssassin
developer in 2008. The rule was duly fixed in the project's repository,
and promptly forgotten about. What the SpamAssassin developers did not do
was any of (1) informing the user community of the rule change,
(2) making a new major release with the fixed rule, or
(3) distributing the rule fix through the sa-update
channel, which exists for just this purpose. So everybody was caught
by surprise - users, distributors, Internet service providers, and the
SpamAssassin developers themselves.
All told, the harm caused by this problem was relatively small and mostly
recoverable. It is a very small blot on SpamAssassin's long record of
making email usable for large numbers of people. But it highlights a few
points which are worthy of note:
- Even those of us who are not running financial exchanges have
critical infrastructure based on free software. When something goes
wrong with that infrastructure, it can hurt our businesses, social
lives, and more.
- Software which plays a crucial part in our operations should really
have a mechanism in place to get important fixes to users quickly.
But, just as importantly, the project has to take great care to
ensure that important fixes get routed into that channel.
SpamAssassin developers had fixed the 2010 problem a long time ago, but
that was not helpful for users, who had no way of knowing about the
problem or its fix. In the kernel realm, it has taken
some years to build the discipline of looking over patches and
considering them for stable kernel updates; there's probably still a
fair number of important fixes which do not get to stable kernel users
because nobody thinks to route them to the stable kernel maintainers.
- Important software requires a certain amount of development and review
time. So it's discouraging to read in Justin's weblog that his
SpamAssassin work happens in his scarce spare time, and that the
project is, in general, short of active developers. Your editor
suspects that the truth of the matter is this: SpamAssassin is long
past its period of rapid development. At this point, it works well,
to the point that there's not a lot of work to be done. So the
interested developers have gone on to other projects.
It would appear that what SpamAssassin needs is some dedicated maintenance
talent which is
not dependent on evening hours put in by developers committed to other
projects. Typically that is the sort of work that requires a paying
customer. Given how many people and companies rely on this software, it
seems like it should be possible to find the money to motivate somebody to
put more time into SpamAssassin maintenance. The hard part is collecting
and administering those funds; that is something the free software
community has not yet become reliably good at.