
A look at SpamAssassin 3.0

June 2, 2004

This article was contributed by Joe 'Zonker' Brockmeier.

For many of us, SpamAssassin is all that stands between us and an inbox clogged to the gills with unwanted e-mail. With the much-anticipated 3.0 release just around the corner, we decided to see what anti-spam fighters would have to work with in the near future. To that end, we touched base with SpamAssassin developers Theo Van Dinter and Craig Hughes. Hughes left the project recently, but was heavily involved in the development of 3.0 and still has his finger on the pulse of SpamAssassin development.

What's different from the current release, and why the version jump? Both Van Dinter and Hughes noted some important technical improvements in the 3.0 release. Hughes said that the most important feature for 3.0 is its modularity. The 3.0 release is "more modular, easier to write plugins for...easier to plug in other pieces of functionality that aren't distributed with the core package," said Hughes. He noted that prior to 3.0, it was difficult to add in custom code for functions that were not part of SpamAssassin.

Both Hughes and Van Dinter also noted the replacement of SpamAssassin's "genetic algorithm" with a "perceptron learner" for score generation. Van Dinter noted that the new score generation is vastly improved, taking the average time from "[around] 14 hours to less than five minutes per scoreset (there are four)." Van Dinter also told LWN that the message/mime parser for SpamAssassin has been rewritten "essentially from scratch".
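To make the idea of a perceptron learner concrete, here is a minimal Python sketch of how per-rule scores could be learned from hand-classified mail. It is an illustration only, not SpamAssassin's score generator: the rule names, the four-message corpus, the learning rate, and the function names are all invented for the example.

```python
# Illustrative perceptron for learning rule scores; NOT SpamAssassin's code.
# Rule names, corpus, and parameters are hypothetical.

RULES = ["HTML_ONLY", "FORGED_HEADER", "ALL_CAPS_SUBJECT"]  # made-up rules


def train_scores(corpus, epochs=20, rate=0.5):
    """Learn one weight per rule plus a bias using the perceptron update.

    corpus: list of (set_of_rule_hits, is_spam) pairs.
    """
    weights = {rule: 0.0 for rule in RULES}
    bias = 0.0
    for _ in range(epochs):
        for hits, is_spam in corpus:
            total = bias + sum(weights[r] for r in hits)
            predicted_spam = total > 0
            if predicted_spam != is_spam:
                # Nudge only the rules that fired, toward the correct label.
                step = rate if is_spam else -rate
                for r in hits:
                    weights[r] += step
                bias += step
    return weights, bias


def classify(hits, weights, bias):
    """Return True if the learned scores mark this set of rule hits as spam."""
    return bias + sum(weights[r] for r in hits) > 0


# Tiny hand-labeled corpus (assumed data, for illustration only).
corpus = [
    ({"HTML_ONLY", "FORGED_HEADER"}, True),
    ({"ALL_CAPS_SUBJECT", "FORGED_HEADER"}, True),
    ({"HTML_ONLY"}, False),
    (set(), False),
]
weights, bias = train_scores(corpus)
```

Each pass over the corpus is cheap, since only misclassified messages change the weights; that is one plausible reason a learner of this kind can finish far faster than a search over candidate score sets.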

Another big improvement for 3.0 is improved scalability. The new version supports installations with larger numbers of mailboxes, with preferences stored in an SQL database or LDAP server. The primary focus there, according to Hughes, was for large ISPs that wanted to use SpamAssassin without having a Unix login or home directory for every user.

While there are plenty of technical improvements in SpamAssassin, Hughes also noted that there's a non-technical rationale for the bump to 3.0. SpamAssassin is in the process of becoming a top-level project of the Apache Software Foundation. This also means a licensing change for the project, which was quite a bit of work according to Hughes:

It's going to be using the Apache License instead of using Perl's licensing, and we've gone through a tremendously long, laborious, tedious even, process of sourcing every line of code...making sure that every author really did have the rights to publish it.

Hughes said that the project met little resistance in switching from the former licensing scheme -- which allowed licensing under either the GPL or the Perl Artistic License -- to the Apache Software License. Hughes said that "only a handful" of developers said they wouldn't allow their code to be relicensed, as well as "two or three we couldn't contact". The end result, he said, was that nothing substantial had to be removed due to licensing issues.

Because of the nature of the project, we were also curious how SpamAssassin manages to stay ahead of spammers. According to Van Dinter, it's not so much staying ahead as an "arms race" between SpamAssassin and spammers:

We filter, they mutate, we start filtering the mutation, they mutate again. Lather, Rinse, Repeat. I'm actually not really involved in the rules (I work on the back-end code more than anything else), but it basically comes down to looking at the spam that's coming in, seeing which ones aren't caught, and figuring out how to catch them in the future. There are also other useful data points unrelated to the messages themselves. For instance, verifying that the sender isn't forged via SPF (Sender Policy Framework) and utilizing the information provided by SenderBase.

Hughes told LWN that there are two things that help SpamAssassin stay ahead of spammers:

One is that you only have to stay ahead of most spammers. There may be one percent that may be particularly good [at getting by SpamAssassin] but if you can block 99 percent of it, it doesn't matter that much...we're not shooting to be perfect, we're shooting to be as good as we can without trying to squeeze out that last one percent.

The other thing is the sheer complexity of SpamAssassin. It's not just a Bayesian filter, it's not just looking up things in RBLs...it's all those things together. It's actually very, very non-trivial for a human to be able to craft a message that's a piece of spam and get through...to defeat all of the system requires a great deal of work, or a lot of luck.
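The point about many tests working together can be sketched in a few lines of Python. This is a simplified illustration of additive scoring, with every rule name, score, and the cutoff value assumed for the example rather than taken from SpamAssassin's rule files.

```python
# Illustrative additive scoring; rule names and scores are invented.
RULE_SCORES = {
    "BAYES_HIGH": 3.5,          # hypothetical: Bayesian filter thinks it is spam
    "LISTED_IN_RBL": 2.0,       # hypothetical: relay appears on a blocklist
    "HTML_OBFUSCATION": 1.5,    # hypothetical: HTML trickery detected
    "KNOWN_GOOD_SENDER": -4.0,  # hypothetical: negative score for trusted mail
}
THRESHOLD = 5.0  # assumed cutoff for this example


def score_message(hits):
    """Sum the scores of every rule that fired on a message."""
    return sum(RULE_SCORES[name] for name in hits)


def is_spam(hits, threshold=THRESHOLD):
    """A message is flagged only when the combined score reaches the cutoff."""
    return score_message(hits) >= threshold
```

In this sketch no single rule is enough to flag a message, so evading one test changes little; a spammer has to get past enough of them at once to keep the total below the cutoff.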

Another piece of good news for SpamAssassin enthusiasts is that it shouldn't be hard to upgrade. According to Hughes, it "should be simple, as long as you're not doing anything really funky" in terms of tweaking and customizing the SpamAssassin code. He noted that the 3.0 release is designed to recognize file format changes and to automatically upgrade user files that are in the old format.

If the SpamAssassin 3.0 meta-bug dependency tree is any indication, there's not much left to do before the 3.0-final release. Hughes said that the project "looks like it's on target" to meet the June 30 release date. Users are encouraged to help test SpamAssassin prior to the final release.

Index entries for this article
GuestArticles: Brockmeier, Joe



A look at SpamAssassin 3.0

Posted Jun 3, 2004 0:50 UTC (Thu) by duncf (guest, #7051) [Link]

FWIW, that 3.0 "meta-bug" is no longer being maintained properly; search for bugs with a milestone of 3.0.0 instead at http://bugzilla.spamassassin.org.

A look at SpamAssassin 3.0

Posted Jun 3, 2004 7:26 UTC (Thu) by Cato (guest, #7643) [Link]

For those who don't use SpamAssassin, it's one of the best tools out there, because it uses such a wide variety of inputs: textual patterns, HTML trickery, forged headers, Bayesian filtering, Razor, and so on. It's also very customisable without touching the core code, even in current versions.

This means that someone I know can use her 15-year old spam-ridden email account and still see virtually no spam.

A look at SpamAssassin 3.0

Posted Jun 3, 2004 8:51 UTC (Thu) by alspnost (guest, #2763) [Link] (4 responses)

I'm really looking forward to 3.0, because for me, SpamAssassin just isn't working any more. It was fantastic for a while, but it looks like 2.63 has lost the arms race for now: it's gone from 95% success down to about 50% on my personal mail.

It sounds like this version should turn the tables again, so congratulations to the developers for what sounds like an excellent release in the making. Until we can defeat spam at the source (er, um...), SpamAssassin is perhaps our biggest hope!

A look at SpamAssassin 3.0

Posted Jun 3, 2004 15:47 UTC (Thu) by einstein (subscriber, #2052) [Link] (3 responses)

I'd found a similar decline in effectiveness, but when I discovered and added the latest supplementary rulesets (backhair, popcorn, chickenpox, weeds etc) the effectiveness went back up to the 98-99% range.

Updating SpamAssassin

Posted Jun 3, 2004 17:03 UTC (Thu) by dowdle (subscriber, #659) [Link] (1 responses)

Where do I get all of those? I must be overlooking it... right in front
of my face kinda thing.

Updating SpamAssassin

Posted Jun 3, 2004 17:09 UTC (Thu) by corbet (editor, #1) [Link]

Have a look at the custom rulesets page. They made a significant difference here as well. Do also consider RulesDuJour; like all anti-spam measures, the effectiveness of these rulesets tends to decline over time and they need to be updated.

A look at SpamAssassin 3.0

Posted Jun 3, 2004 17:12 UTC (Thu) by dowdle (subscriber, #659) [Link]

Ok, found it... in the SpamAssassin Wiki.

http://wiki.apache.org/spamassassin/CustomRulesets

Sender confirmation

Posted Jun 3, 2004 17:34 UTC (Thu) by erwbgy (subscriber, #4104) [Link] (4 responses)

I find that requiring sender confirmation for unknown addresses cuts down on my spam significantly. I use qconfirm, but I'm sure there are others.

Any message from an unknown sender gets an automatic reply requesting that they confirm that they sent the message. Once they reply, the message ends up in my mailbox and the sender gets added to a whitelist. If no reply is received within a week then the message is discarded. Since spammers tend to use throw-away or invalid From addresses, they never reply. Humans generally have no problem in dealing with this though.
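The workflow described above can be sketched roughly as follows in Python. This is not how qconfirm is implemented; the class, method, and attribute names are hypothetical, and sending the actual confirmation request is reduced to recording the address.

```python
# Illustrative sender-confirmation queue; NOT qconfirm's implementation.
from datetime import datetime, timedelta

HOLD_PERIOD = timedelta(days=7)  # "within a week", per the description above


class ConfirmationQueue:
    def __init__(self):
        self.whitelist = set()
        self.pending = {}          # sender -> list of (received_at, message)
        self.inbox = []
        self.challenges_sent = []  # senders who were asked to confirm

    def receive(self, sender, message, now):
        """Deliver mail from known senders; hold and challenge everyone else."""
        if sender in self.whitelist:
            self.inbox.append(message)
            return
        if sender not in self.pending:
            self.challenges_sent.append(sender)  # one request per sender
        self.pending.setdefault(sender, []).append((now, message))

    def confirm(self, sender):
        """On a reply to the challenge, release held mail and whitelist."""
        held = self.pending.pop(sender, [])
        if held:
            self.whitelist.add(sender)
            self.inbox.extend(message for _, message in held)

    def expire(self, now):
        """Discard held messages that were never confirmed in time."""
        for sender in list(self.pending):
            kept = [(t, m) for t, m in self.pending[sender]
                    if now - t < HOLD_PERIOD]
            if kept:
                self.pending[sender] = kept
            else:
                del self.pending[sender]
```

A caller would invoke `receive` for each incoming message, `confirm` when a reply to the request arrives, and `expire` periodically to drop anything older than the hold period.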

There are of course ways to automatically whitelist certain addresses or domains, and you can view and manage the currently queued messages. All in all, a very nice tool and a worthy addition to the anti-spam armoury.

- Keith

Sender confirmation

Posted Jun 3, 2004 18:41 UTC (Thu) by Stephen_Beynon (guest, #4090) [Link] (1 responses)

I hate it when people use sender confirmation. As many spammers forge
email from other "suckers" on their spam list, I have in the past had ~100
requests to confirm spam I never sent. That is to say nothing of
the bounces generated by non-existent accounts.

Sender confirmation

Posted Jun 3, 2004 20:06 UTC (Thu) by erwbgy (subscriber, #4104) [Link]

If you used sender confirmation then you wouldn't have problems with
other people using sender confirmation :-)

Sender confirmation

Posted Jun 3, 2004 20:37 UTC (Thu) by piman (guest, #8957) [Link]

Challenge-Response considered harmful: http://www.linuxmafia.com/faq/Mail/challenge-response.html

Sender confirmation

Posted Jun 7, 2004 17:46 UTC (Mon) by jae (guest, #2369) [Link]

Challenge Response. Great.

Needless to say, I might bother to reply, or I might not... most of the time I probably wouldn't...

Carthago delenda est

Posted Jun 9, 2004 11:28 UTC (Wed) by angdraug (subscriber, #7487) [Link] (7 responses)

Sigh. And what exactly will SpamAssassin gain from ASF membership except a GPL-incompatible license? It is incredible to hear that such a major decision was met with little resistance...

Carthago delenda est

Posted Jun 13, 2004 9:32 UTC (Sun) by stuart_hc (guest, #9737) [Link]

Good point. I wonder how many projects which previously linked with or were derived from SpamAssassin will have to maintain a fork of the old version.

The FSF explain why it is incompatible with the GPL:

This is a free software license but it is incompatible with the GPL. The Apache Software License is incompatible with the GPL because it has a specific requirement that is not in the GPL: it has certain patent termination cases that the GPL does not require. (We don't think those patent termination cases are inherently a bad idea, but nonetheless they are incompatible with the GNU GPL.)

Carthago delenda est sed Apache non Carthago est

Posted Jun 16, 2004 6:02 UTC (Wed) by crankysysadmin (guest, #19449) [Link] (5 responses)

For my part, I'm glad to hear SA is in the good company of the Apache folks. License schmicence. Since when is the Apache group evil? And IMNSHO there are plenty of licenses that are incompatible with the GPL that are quite OK, like the Creative Commons ones, the BSD license, etc. I like them all for different reasons. There's no sense in dogmatically insisting on the GPL when there are actual good reasons under varying circumstances to use other licenses.

Carthago delenda est sed Apache non Carthago est

Posted Jun 17, 2004 10:53 UTC (Thu) by angdraug (subscriber, #7487) [Link] (4 responses)

> For my part, I'm glad to hear SA is in the good company of the Apache folks. License schmicence. Since when is the Apache group evil?

I'm afraid since before I started to pay attention to them. Besides, there are no good and evil folks, there are good and evil deeds, and it is exactly this "schmicence" attitude that is a problem for me.

> And IMNSHO there are plenty of licenses that are incompatible with the GPL that are quite OK, like the Creative Commons ones, the BSD license, etc.

I think your IMNSHO should be humbled down to IMHO: the Creative Commons licenses are far from being OK (they are not even DFSG-compliant), while BSD without the advertising clause is GPL-compatible, and that is why it is OK.

> There's no sense in dogmatically insisting on the GPL when there are actual good reasons under varying circumstances to use other licenses.

Each time the GPL is discussed, someone always has to stick the "dogma" label on it. You know what? It's wrong: there are plenty of practical reasons to fight for GPL-compatibility.

In Red Hat 7.1, released in 2001, 63% of the software (counted by lines of code) was licensed under the GPL and LGPL. In Sisyphus, the current snapshot of ALT Linux, 77% of 5500 packages are licensed under the GPL and LGPL. Now, isn't it important that it remains possible for two thirds of free software to be able to link and exchange code with the remaining third?

Carthago delenda est sed Apache non Carthago est

Posted Jun 17, 2004 11:28 UTC (Thu) by crankysysadmin (guest, #19449) [Link] (3 responses)

> Now, isn't it important that it remains possible for two thirds of free
> software to be able to link and exchange code with the remaining third?

Let me put my opinion this way: I'm not as idealistic as you (which need not be interpreted as a criticism) and therefore I'm glad that there are licenses out there that are not quite as permissive as the GPL but which are still worlds apart from the closed-source proprietary world.

More simply put: I think getting rid of all non-GPL licenses would, at this stage of the OSS struggle, result in less free software, FSVO 'free'.

However, I think the DFSG are great, and I think the GPL is great. I think they incorporate an ideal that still needs to be worked toward, and (important point) with which not everyone is fully comfortable (that's where the "dogma" would come in). Until that time let there be more free software, and let the developers of that software live with the harsh realities of not being able to swap code with absolutely everyone, and perhaps then they'll see the light.

Carthago delenda est sed Apache non Carthago est

Posted Jun 17, 2004 12:43 UTC (Thu) by angdraug (subscriber, #7487) [Link] (2 responses)

> I'm not as idealistic as you

There, now you've put another false label on me, even if less negative than "dogma". Please, don't do that again.

> I think getting rid of all non-GPL licenses would, at this stage of the OSS struggle, result in less free software, FSVO 'free'.

1. Don't put words in my mouth. I am talking GPL-incompatible here, not non-GPL. Can you see the difference?

2. Getting rid of GPL-incompatible licenses would result in one strong community, instead of the fracturing set of smaller communities sharing nothing but name, and confused even about that (OSS vs. free software).

> However, I think the DFSG are great, and I think the GPL is great.

Now that intro brought up my spin-doctor alarm. Sorry, but I'm too used to hearing this exact phrase from folks who don't really mean it...

> I think they incorporate an ideal (...) with which not everyone is fully comfortable (that's where the "dogma" would come in).

There, you did it again, now using both "idealism" and "dogma" labels in one sentence. Stick a label, rinse, repeat?

Of course there are people who are not comfortable with the idea of freedom. Otherwise, you wouldn't have to fight for it, right?

Carthago delenda est sed Apache non Carthago est

Posted Jun 17, 2004 14:07 UTC (Thu) by crankysysadmin (guest, #19449) [Link] (1 responses)

Whatever. No offense meant, I don't want to get in a flame war with you, nor did I intentionally put words in your mouth with my "I" statement. However, you're seemingly deliberately ignoring my point, which merits one more response from me. Let me see if this formulation pleases you:

I think getting rid of all GPL-incompatible licenses would, at this stage of the OSS struggle, result in less open source software, FSVO 'open'.

So in other words, my priority is on getting people who write code to make it somewhat free (in the sense of "open", not in the sense of "cost"), even if it isn't as free as the GPL. In that sense I feel I am also trying to increase freedom, because I currently see the alternative as being more closed-source, proprietary software. Obviously you are welcome to an opposing viewpoint.

Carthago delenda est sed Apache non Carthago est

Posted Jun 17, 2004 16:01 UTC (Thu) by angdraug (subscriber, #7487) [Link]

Ok, ok, apologies accepted.

But you're wrong in saying that I deliberately ignored your point: I dedicated the whole of point "2. (...)" above to addressing it. Now that we have finished with "you said I said" games (I hope), let's get to the core of the matter.

> So in other words, my priority is on getting people who write code to make it somewhat free (in the sense of "open", not in the sense of "cost"), even if it isn't as free as the GPL. In that sense I feel I am also trying to increase freedom, because I currently see the alternative as being more closed-source, proprietary software.

My experience proves otherwise. Once people have decided to release the source under whatever license, they are already over the fence, in the sense that they're not very likely to back away from releasing it at all. If they are told (with polite and convincing arguments) that the license they've chosen is wrong, they are much more likely to release under a better license than not to release at all. OTOH, once a considerable amount of time has passed since the release, it becomes much more difficult to change the license to a better one.

Thus, if the final goal is to get as much software as possible under licenses that are as free as possible, being careful about choosing the right license from the very start is more effective than the "schmicence" attitude.

And finally, please allow me to brainwash you about the words you use. After all, Orwell was right and our words do influence our thoughts. "Open" is the wrong word, as it implies that you can look but cannot necessarily touch (as in patents, which are open just fine). Freedom should also include the right to touch and even to take it and walk away with it.


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds