For many of us, SpamAssassin is all that stands between us and an inbox
clogged to the gills with unwanted e-mail. With the much-anticipated 3.0
release just around the corner, we decided to see what spam fighters
would have to work with in the near future. To that end, we touched base
with SpamAssassin developers Theo Van Dinter and Craig Hughes. Hughes left
the project recently, but was heavily involved in the development of 3.0
and still has his finger on the pulse of SpamAssassin development.
What's different from the current release, and why the version jump? Both
Van Dinter and Hughes noted some important technical improvements in the
3.0 release. Hughes said that the most important feature for 3.0 is its
modularity. The 3.0 release is "more modular, easier to write plugins
for...easier to plug in other pieces of functionality that aren't
distributed with the core package," said Hughes. He noted that prior
to 3.0, it was difficult to add in custom code for functions that were not
part of SpamAssassin.
Both Hughes and Van Dinter also noted the replacement of SpamAssassin's
"genetic algorithm" with a "perceptron learner" for score generation. Van
Dinter noted that the new score generation is vastly improved, taking the
average time from "[around] 14 hours to less than five minutes per scoreset
(there are four)." Van Dinter also told LWN that SpamAssassin's message/MIME
parser has been rewritten "essentially from scratch."
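
To give a rough idea of what a perceptron learner does when it generates scores, here is a minimal sketch in Python. It is not SpamAssassin's actual score generator: the function name, the learning rate, the fixed threshold of 5.0, and the tiny training set are all assumptions made up for illustration. Each message is reduced to a list of rule hits (1 or 0), and the learner nudges the per-rule scores whenever a message lands on the wrong side of the threshold.

```python
# Illustrative sketch only: NOT SpamAssassin's real score generator.
# All names, the threshold, and the training data are hypothetical.

def train_perceptron(samples, labels, threshold=5.0, rate=0.5, epochs=100):
    """Learn one score per rule so that spam sums to at least `threshold`."""
    scores = [0.0] * len(samples[0])
    for _ in range(epochs):
        mistakes = 0
        for hits, is_spam in zip(samples, labels):
            total = sum(s * h for s, h in zip(scores, hits))
            predicted = 1 if total >= threshold else 0
            error = is_spam - predicted  # +1: missed spam, -1: false positive
            if error:
                mistakes += 1
                # Adjust only the scores of rules that fired on this message.
                scores = [s + rate * error * h for s, h in zip(scores, hits)]
        if mistakes == 0:  # every training message is classified correctly
            break
    return scores


# Columns are three hypothetical rules; rows are messages.
samples = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 1)]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
scores = train_perceptron(samples, labels)
print(scores)
```

A learner of this kind makes simple, cheap updates on each pass over the training mail, which is consistent with the much shorter run times Van Dinter describes, though the real implementation is certainly more involved than this toy.
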
Another big improvement in 3.0 is scalability. The new version
supports installations with larger numbers of mailboxes, with preferences
stored in an SQL database or LDAP server. The primary focus there,
according to Hughes, was for large ISPs that wanted to use SpamAssassin
without having a Unix login or home directory for every user.
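
The idea of keeping per-user preferences in a database rather than in home directories can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module: the table layout, the "@GLOBAL" fallback user, and the option names shown are assumptions for the example, not a description of SpamAssassin's actual schema.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE userpref (username TEXT, preference TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO userpref VALUES (?, ?, ?)",
    [
        ("@GLOBAL", "required_score", "5.0"),
        ("alice", "required_score", "3.5"),
        ("alice", "whitelist_from", "friend@example.com"),
    ],
)


def load_prefs(conn, username):
    """Return site-wide defaults overlaid with the user's own settings."""
    rows = conn.execute(
        "SELECT preference, value FROM userpref "
        "WHERE username IN ('@GLOBAL', ?) "
        # Global rows come first so that user rows overwrite them below.
        "ORDER BY CASE WHEN username = '@GLOBAL' THEN 0 ELSE 1 END",
        (username,),
    )
    prefs = {}
    for preference, value in rows:
        prefs[preference] = value
    return prefs


print(load_prefs(conn, "alice"))
print(load_prefs(conn, "bob"))
```

Nothing here depends on "alice" or "bob" having a system account, which is the point Hughes makes about large ISPs.
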
While there are plenty of technical improvements in SpamAssassin, Hughes
also noted that there's a non-technical rationale for the bump to
3.0. SpamAssassin is in the process of becoming a top-level project of the
Apache Software Foundation. This also means a licensing change for the
project, which was quite a bit of work according to Hughes:
"It's going to be using the Apache License instead of using Perl's
licensing, and we've gone through a tremendously long, laborious, tedious
even, process of sourcing every line of code...making sure that every
author really did have the rights to publish it."
Hughes said that the project met little resistance in switching from the
former licensing scheme -- which allowed licensing under either the GPL or
the Perl Artistic License -- to the Apache Software License. "Only a
handful" of developers refused to allow their code to be relicensed, he
said, and there were "two or three we couldn't contact." The end result
was that nothing substantial had to be removed due to licensing issues.
Because of the nature of the project, we were also curious how SpamAssassin
manages to stay ahead of spammers. According to Van Dinter, it's not so
much staying ahead as an "arms race" between SpamAssassin and spammers:
"We filter, they mutate, we start filtering the mutation, they mutate
again. Lather, Rinse, Repeat. I'm actually not really involved in the rules
(I work on the back-end code more than anything else,) but it basically
comes down to looking at the spam that's coming in, seeing which ones
aren't caught, and figuring out how to catch them in the future. There are
also other useful data points unrelated to the messages themselves. For
instance, verifying that the sender isn't forged via SPF (Sender Policy
Framework) and utilizing the information provided by SenderBase."
Hughes told LWN that there are two things that help SpamAssassin stay ahead
of spammers:
"One is that you only have to stay ahead of most spammers. There may be one
percent that may be particularly good [at getting by SpamAssassin] but if
you can block 99 percent of it, it doesn't matter that much...we're not
shooting to be perfect, we're shooting to be as good as we can without
trying to squeeze out that last one percent."
"The other thing is the sheer complexity of SpamAssassin. It's not just a
Bayesian filter, it's not just looking up things in RBLs...it's all those
things together. It's actually very, very non-trivial for a human to be
able to craft a message that's a piece of spam and get through...to defeat
all of the system requires a great deal of work, or a lot of luck."
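
Hughes's point about "all those things together" can be illustrated with a small sketch of additive scoring, where several independent tests each contribute to a total that is compared against a threshold. The rule names, scores, blocklisted address, and threshold below are invented for the example and are not taken from SpamAssassin's rule set.

```python
import re

# Hypothetical rules: (name, score, test). None of these are real
# SpamAssassin rules; they only illustrate combining unrelated signals.
BLOCKLISTED_IPS = {"192.0.2.1"}
RULES = [
    ("SUBJECT_ALL_CAPS", 1.5, lambda msg: msg["subject"].isupper()),
    ("MENTIONS_FREE_MONEY", 2.5,
     lambda msg: re.search(r"free money", msg["body"], re.I) is not None),
    ("SENDER_ON_BLOCKLIST", 3.0,
     lambda msg: msg["sender_ip"] in BLOCKLISTED_IPS),
]
THRESHOLD = 5.0  # assumed cut-off for this example


def score_message(msg):
    """Return (total score, is_spam, names of the rules that fired)."""
    hits = [(name, score) for name, score, test in RULES if test(msg)]
    total = sum(score for _, score in hits)
    return total, total >= THRESHOLD, [name for name, _ in hits]


message = {
    "subject": "WIN NOW",
    "body": "Get free money today",
    "sender_ip": "198.51.100.7",
}
print(score_message(message))
print(score_message(dict(message, sender_ip="192.0.2.1")))
```

In this toy version, dodging one test is not enough: the same message passes from a clean address but is flagged once a third, unrelated signal is added. A spammer has to get past every check at once, which is the "great deal of work, or a lot of luck" Hughes refers to.
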
Another piece of good news for SpamAssassin enthusiasts is that upgrading
shouldn't be hard. According to Hughes, it "should be
simple, as long as you're not doing anything really funky" in terms
of tweaking and customizing the SpamAssassin code. He noted that the 3.0
release is designed to recognize file format changes, and to automatically
upgrade user files that are in the old format.
If the SpamAssassin 3.0 meta-bug dependency tree is any indication,
there's not much left to do before the 3.0-final release. Hughes said
that the project "looks like it's on target" to meet the June 30 release
date. Users are encouraged to help test SpamAssassin prior to the final
release.