SpamAssassin 2.50

Last September, we compared SpamAssassin with a new set of spam filtering systems based on Bayesian analysis. The Bayesian approach, while seemingly "dumber," looked poised to depose SpamAssassin's complex set of rules as the most effective filtering system. Bayesian filters are fast, and they can learn from the actual mail received by each potential consumer of anatomical enlargements and Nigerian investment schemes. The combination of speed and adaptability looked like a winning pair.

Of course, SpamAssassin has not stood still in the intervening months. The SpamAssassin team released version 2.50 a couple of weeks ago; this release is described as a "beta version." No true Linux user fears beta-quality software, so we decided to have a look.

The SpamAssassin 2.50 rule set is more impressive than ever. Several hundred rules detect forged headers, cellphone pitches, Nigerian scams, mortgage refinance pitches, mentions of Oprah, porn terms ("various types of feline"), suspicious HTML, "my wife Jody," 100% guarantees, etc. There is a growing set of rules aimed at catching spam in languages other than English. One can only marvel at the noble souls who examine that much spam closely enough to develop these tests.

As is usually the case with a new SpamAssassin release, the new rules are far more effective than the previous set - for a while at least. The normal pattern, though, is that spammers begin to figure out how to evade the latest rules, and the number of false negatives slowly rises over time.

The truly significant new feature of version 2.50 may help change that pattern, however. Included in this release is, of course, a Bayesian filter. SpamAssassin will now remember the terms it finds in both spam and "ham," and assign to each a probability that the message containing it is spam. The Bayesian filter is not actually used to classify mail until a sizeable body of both good and bad mail has been processed. Once that threshold has been reached, the Bayesian filter will simply assign points to a message like any other test.
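To make the idea concrete, here is a minimal sketch of per-term bookkeeping of this kind. It is not SpamAssassin's actual code or formula; the class name, the whitespace tokenizer, the simple frequency ratio, and the `min_messages` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


class TokenStats:
    """Toy per-token spam probabilities (illustrative, not SpamAssassin's algorithm)."""

    def __init__(self, min_messages=200):
        # min_messages is a hypothetical threshold: the filter is not
        # consulted until this many spam AND ham messages have been seen.
        self.min_messages = min_messages
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def ready(self):
        # True once enough of both kinds of mail has been processed.
        return self.n_spam >= self.min_messages and self.n_ham >= self.min_messages

    def token_probability(self, token):
        # Fraction of spam (and of ham) messages that contain the token.
        s = self.spam_tokens[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham_tokens[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5  # never seen: no evidence either way
        return s / (s + h)
```

A term seen only in spam gets a probability of 1.0, a term seen only in ham gets 0.0, and a term that shows up in both lands somewhere in between. A real filter would also have to combine the per-term values into one figure for the whole message, which this sketch leaves out.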

In 2.50, the scoring is set up so that the Bayesian filter, by itself, cannot condemn a message as spam. Even if the filter says the probability of spam is 99%, a maximum of 4.3 points (out of the 5 needed by default) will be assigned. Still, the Bayesian filter is sufficient to tip the balance on many spams that would have otherwise been classified as real mail. We ran a quick test with 5,000 messages from the LWN.net inbox; before training, SpamAssassin flagged 3,935 of them as spam. Afterwards, it caught 4,139 instead. That's 204 spams we would not have to see, and that can only be a good thing.
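The arithmetic of that cap can be sketched as follows. The 5-point threshold and the 4.3-point ceiling come from the text above; the linear mapping from probability to points is purely an assumption for illustration, as are the function names, and the real release may well assign points differently.

```python
SPAM_THRESHOLD = 5.0    # points needed by default, per the article
BAYES_MAX_POINTS = 4.3  # most the Bayesian test can contribute in 2.50


def bayes_points(p_spam):
    """Hypothetical mapping: 0 points at 50%, rising linearly to the cap at 99%."""
    if p_spam <= 0.5:
        return 0.0
    return min(BAYES_MAX_POINTS, BAYES_MAX_POINTS * (p_spam - 0.5) / 0.49)


def is_spam(rule_points, p_spam):
    # The Bayesian result is just one more test added to the rule total.
    return rule_points + bayes_points(p_spam) >= SPAM_THRESHOLD


# Figures from the quick test on the LWN.net inbox.
flagged_before = 3935
flagged_after = 4139
newly_caught = flagged_after - flagged_before  # 204
```

With no help from the rules, even a 99% verdict stays below the threshold, so `is_spam(0.0, 0.99)` is false. One extra point from any other rule is enough to tip the message over, so `is_spam(1.0, 0.99)` is true.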

The combination of SpamAssassin's rule base and the Bayesian filter addresses one of the biggest weaknesses of the Bayesian approach: the filter must be trained. Everybody gets a unique mix of mail, and, believe it or not, a unique mix of spam. There is no set of Bayesian rules that will work for everybody. If you happen to have nicely sorted piles of spam and real mail sitting around, you can quickly train the SpamAssassin filter via the sa-learn tool. But most users will simply want the filter to work without an extra effort on their part.

SpamAssassin's rules can help make that happen. Even without the Bayesian filter, SpamAssassin can flag most of the spam in a user's mail stream. It can, thus, train the Bayesian filter by itself; that filter will eventually start catching spam which sneaks past the regular rules. SpamAssassin has become a tool which learns how to do its job better over time. As a result, we are free to spend our time dealing with more interesting things. It would be hard to find a better example of useful, user-friendly free software.
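The self-training loop described here can be sketched roughly as below. Everything in it is hypothetical: the two stand-in rules, their point values, and the cutoffs deciding when a message is trusted enough to learn from are made up for illustration and are not SpamAssassin's actual rules or settings.

```python
from collections import Counter

# Hypothetical cutoffs: learn only from messages the rules are very sure about.
LEARN_AS_SPAM_AT = 10.0
LEARN_AS_HAM_AT = 0.5


def rule_score(text):
    """Stand-in for the static rule set: two made-up rules."""
    rules = {"100% guaranteed": 6.0, "refinance": 5.0}
    lowered = text.lower()
    return sum(points for phrase, points in rules.items() if phrase in lowered)


def auto_learn(messages):
    """Feed confidently classified messages to the Bayesian token counts."""
    spam_tokens, ham_tokens = Counter(), Counter()
    for text in messages:
        score = rule_score(text)
        tokens = set(text.lower().split())
        if score >= LEARN_AS_SPAM_AT:
            spam_tokens.update(tokens)
        elif score <= LEARN_AS_HAM_AT:
            ham_tokens.update(tokens)
        # Scores in between are too uncertain to learn from.
    return spam_tokens, ham_tokens
```

The useful part is that the counts pick up terms no rule mentions. If a message that trips both rules also contains "approval", that word is recorded as a spam term and can later count against mail the rules alone would have let through.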



Language problem

Posted Mar 6, 2003 10:10 UTC (Thu) by kruemelmo (subscriber, #8279) [Link]

I always wondered if this approach could work for non-English speaking folks at all.

I haven't tried software with a Bayesian filter; however, I suspect that for me it simply couldn't work, because my mail consists of (simplified to illustrate):

- 80% English spam
- 5% German spam
- 13% German "ham"
- 2% English ham.

After training, I'm fairly sure the filter would classify all of the non-German ham as spam, wouldn't it? Or would it be clever enough?

Moritz

Language problem

Posted Mar 6, 2003 10:39 UTC (Thu) by arcticwolf (guest, #8341) [Link]

It probably wouldn't - not because it's clever, but rather because it isn't, and isn't trying to be ("it" being the Bayesian approach in this case, not SpamAssassin as a whole). I can't really back up these claims with numbers, but I think it's safe to say that out of the few messages in German I get, about 95% are spam, while only 5% (at most) are legitimate. bogofilter (which is a purely Bayesian filter, no less, unlike SpamAssassin) has yet to misclassify even a single one of those.

The care and feeding of spam & ham databases

Posted Mar 6, 2003 10:44 UTC (Thu) by arcticwolf (guest, #8341) [Link]

It might be worth noting that bogofilter, probably one of the most well-known and widely used tools for Bayesian mail filtering, can also automatically learn messages as ham or spam as it processes them. Once you get past the initial training stage, it usually requires little (if any) user interaction to stay up-to-date.

Just a shameless plug for bogofilter. :)

SpamAssassin 2.50

Posted Mar 6, 2003 11:00 UTC (Thu) by nan (guest, #710) [Link]

I've been in London for 3 years now, but lived in Barcelona before, where a friend and I have been running an ISP for over 8 years now. My ham comes in Catalan (~10%), Spanish (~10%) and English (~80%). My spam comes in all three plus a few others. We've been running SA for a while now.

We use rules in our particular mail applications to move what should be considered spam into a special folder for review (not yet trusting it fully). So far it picks up spam coming in all languages and we have yet to find ham marked as spam, in any language, in any of our accounts. We've never been so free of spam... except before we started using the net ;)

SpamAssassin 2.50

Posted Mar 6, 2003 11:35 UTC (Thu) by peter_pilgrim (guest, #5448) [Link]

I would love Spam Assassin to attack the spammers.

I wish that it could hover over the [cyberspace] battlefield, gather all the intelligence, track down the spammers by reverse ARP or by linking up with another toolset or program, and blast them to hell.

This is my dream!

Actually, I downloaded and installed SpamAssassin 2.50 a couple of weeks ago and trained it with a list of missed spam messages, and lo and behold, those messages just disappeared. I would also recommend installing Vipul's Razor tool as well; 2.50 also has support for DCC to really tighten things up.

Now we only need a Counterstrike Razor tool. Unfortunately, I am a Java developer, so don't look to me to write it.

Bye

TarProxy

Posted Mar 6, 2003 11:58 UTC (Thu) by alspnost (subscriber, #2763) [Link]

The so-called "tar pit" might be the closest you can get for now:

http://www.martiansoftware.com/articles/spammerpain.html

Not really "blasting them to hell", but perhaps a good start for now. I'd be very interested to see if this a) works and b) takes off!

Learning

Posted Mar 6, 2003 13:26 UTC (Thu) by gerv (subscriber, #3376) [Link]

Does this not run the risk that, initially at least and maybe in the long term, the Bayesian filter merely ends up being an encoding of the SpamAssassin ruleset, but in a different format?

Gerv

Learning

Posted Mar 6, 2003 17:11 UTC (Thu) by pflugstad (subscriber, #224) [Link]

Probably, but the SpamAssassin ruleset is static and requires people to keep it up to date. It's also not specific to each person. The Bayesian filter will learn as you get more spam/ham and adapt to the particular spam you get. SpamAssassin can't keep up with that.

SpamAssassin Feature Suggestion

Posted Mar 10, 2003 7:16 UTC (Mon) by samj (subscriber, #7135) [Link]

Add a link to each message for training the Bayesian filters, so that users can provide feedback to SpamAssassin from the message itself. These filters no doubt need continual monitoring, so the links could be left on indefinitely, and only appended to (a) all spam and (b) questionable negatives.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds