
Keeping spamassassin current

Longtime users of SpamAssassin know that it can do an outstanding job of identifying spam. They also know, however, that the effectiveness of any particular SpamAssassin release tends to decline over time as spammers figure out how to craft messages which get past the rules. The Bayesian filter buried inside SpamAssassin can help a lot; it catches a fair amount of spam which evades the rules, and it evolves over time to keep up with what the spammers are doing - especially if you make a point of training the filter with its mistakes. Even so, frustrating amounts of spam can get through.

The situation is not helped much by the fact that the SpamAssassin rule base seems to be evolving slowly in recent times. The SpamAssassin developers have too many other things to do, perhaps, or maybe they would rather see the work done by the filter. In any case, some users would certainly like to see the rules updated more frequently.

The maintenance of an up-to-the-second set of SpamAssassin rules could well be a business opportunity for somebody, if the licensing issues could be worked out. But SpamAssassin users should also be aware of the custom rulesets page hosted on the SpamAssassin Wiki. This is a place where additional rules can be found to deal with specific problems; some of them might cut your spam load considerably.

Currently available rulesets include:

  • One aimed at "pill spam." Those of us not looking to fill our prescriptions over the net may welcome this one.

  • "Bigevil" simply contains URLs found in spam; it's a sort of content-based blacklist.

  • There is a set of rules for filtering out virus warnings.

  • "Tripwire" looks for combinations of letters which do not normally appear in English text.

Several others exist as well; there is also a "RulesDuJour" script which can be used to automatically keep up to date with the rulesets as they are maintained. The custom rulesets won't solve the spam problem, but they can help to keep a mailbox a bit cleaner.
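To give a feel for the kind of check a "Tripwire"-style ruleset performs, here is a minimal Python sketch. It is not the real ruleset, which consists of SpamAssassin regex rules; the list of letter pairs and the function name are invented for illustration.

```python
import re

# Hypothetical sample of letter pairs that almost never occur in English words.
# This is an assumption for the example, not the contents of Tripwire.
RARE_PAIRS = {"qz", "jx", "vq", "zx", "qk", "wq", "xj", "jq"}

def suspicious_words(text):
    """Return the words in text that contain at least one rare letter pair."""
    words = re.findall(r"[a-z]+", text.lower())
    return [
        w for w in words
        if any(w[i:i + 2] in RARE_PAIRS for i in range(len(w) - 1))
    ]

print(suspicious_words("Cheap meds xjqzt now, order today"))
```

A real rule would assign a score to each hit rather than return the words, but the detection idea is the same.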



Keeping spamassassin current

Posted Mar 4, 2004 2:55 UTC (Thu) by bronson (subscriber, #4806) [Link]

It's true that SpamAssassin effectiveness declines over time after a release. That time can be a matter of hours. SA has such wide adoption that it is a primary target of spammers.

A possible solution? Include NO rules -- only code. This isn't quite as strange as it might sound. You would download SA Perl modules, then install whatever ruleset best fits your frame of mind. Spammers would have a much harder time trying to make silver bullet spams then.

Adaptive weighting?

Posted Mar 4, 2004 3:02 UTC (Thu) by Ross (subscriber, #4065) [Link]

Why can't they do some kind of Bayesian network based not on single words
from the text but on the output from the various SpamAssassin rules? Then
they wouldn't have to fine-tune the weights and individual users would get
weights that matched the spam in their inboxes better. This would also
make it more difficult for spammers to test against the rule base since
they wouldn't know which rules are weighted heavily and which lightly.
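One simple way to read this proposal is a naive Bayes classifier whose features are rule hits rather than words. The Python sketch below is only an illustration of that idea under stated assumptions: the class, the rule names, and the training data are all hypothetical, and nothing here reflects how SpamAssassin itself is implemented.

```python
import math
from collections import Counter

class RuleBayes:
    """Naive Bayes over rule hits instead of words (illustrative only)."""

    def __init__(self, rule_names):
        self.rules = list(rule_names)
        self.hits = {True: Counter(), False: Counter()}
        self.total = {True: 0, False: 0}

    def train(self, fired, is_spam):
        """Record which rules fired on one message of known class."""
        self.total[is_spam] += 1
        for rule in fired:
            self.hits[is_spam][rule] += 1

    def _p(self, rule, is_spam):
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        return (self.hits[is_spam][rule] + 1) / (self.total[is_spam] + 2)

    def log_odds(self, fired):
        """Log-odds that a message is spam; positive means 'more likely spam'."""
        fired = set(fired)
        score = math.log((self.total[True] + 1) / (self.total[False] + 1))
        for rule in self.rules:
            ps, ph = self._p(rule, True), self._p(rule, False)
            if rule in fired:
                score += math.log(ps / ph)
            else:
                score += math.log((1 - ps) / (1 - ph))
        return score

# Hypothetical rule names and a toy training set.
nb = RuleBayes(["HTML_ONLY", "PILL_WORDS", "FROM_LIST"])
nb.train({"HTML_ONLY", "PILL_WORDS"}, True)
nb.train({"PILL_WORDS"}, True)
nb.train({"FROM_LIST"}, False)
nb.train({"HTML_ONLY", "FROM_LIST"}, False)

print(nb.log_odds({"PILL_WORDS"}) > 0)   # spam-leaning
print(nb.log_odds({"FROM_LIST"}) > 0)    # ham-leaning
```

Because the per-rule weights come from each user's own mail, two installations would end up weighting the same rule differently, which is the property the comment is after.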

Adaptive weighting?

Posted Mar 4, 2004 4:34 UTC (Thu) by proski (subscriber, #104) [Link]

Sounds like an excellent idea! I hope you will share it with the SpamAssassin developers. Complex rules are SpamAssassin's strength. The way they are combined (addition) is SpamAssassin's weakness. Predictability of the score is another weakness. Let's get rid of weaknesses.

Adaptive weighting?

Posted Mar 4, 2004 6:09 UTC (Thu) by mkettler (guest, #3933) [Link]

SpamAssassin has had a Bayesian filter, in addition to the rules, for the past 8 releases. The first version with a Bayes subsystem was 2.50, released in February 2003.

No need to share this with the sa-devs.. they clued in a long time ago.

Adaptive weighting?

Posted Mar 4, 2004 20:23 UTC (Thu) by skybrian (subscriber, #365) [Link]

It sounds like you misunderstood the point. Unless something changed since 2.50, the bayesian filter is just a separate set of rules. The weight on each rule (including the Bayesian rules) is static.
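To make the distinction concrete, here is a minimal Python sketch of additive scoring with static weights. The rule names only imitate SpamAssassin's naming style, and the scores and threshold are made-up values for illustration, not taken from any release.

```python
# Hypothetical rule names and scores; every value is fixed ahead of time.
STATIC_SCORES = {
    "HTML_ONLY": 1.2,
    "PILL_WORDS": 3.1,
    "BAYES_99": 4.0,    # the Bayes result is just one more fixed-weight rule
    "FROM_LIST": -1.5,
}
THRESHOLD = 5.0  # assumed cutoff for this example

def additive_score(fired):
    """Sum the static scores of every rule that fired."""
    return sum(STATIC_SCORES.get(rule, 0.0) for rule in fired)

def is_spam(fired):
    return additive_score(fired) >= THRESHOLD

print(is_spam({"PILL_WORDS", "BAYES_99"}))
print(is_spam({"HTML_ONLY", "FROM_LIST"}))
```

Nothing in this scheme adapts to the individual user's mail: the only thing that learns is whatever sits behind the Bayes rule, while its contribution to the total stays fixed.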

Adaptive weighting?

Posted Mar 4, 2004 10:33 UTC (Thu) by nix (subscriber, #2304) [Link]

This has been tried, and wasn't terribly effective.

(Justin posted some test results on this to the sa-dev list maybe a year ago.)

Keeping spamassassin current

Posted Mar 4, 2004 10:37 UTC (Thu) by nix (subscriber, #2304) [Link]

This is 'so bad it's not even wrong', like saying 'include no functionality, only code'. Many of the rules consist of code; many of the rest depend on subtle details of a particular SA implementation or on features only present in recent SA versions; and the whole lot is scored by a GA (genetic algorithm), so it forms an integrated whole. (In SA 3.0, they'll be scored by a perceptron instead, which may actually be fast enough that releasing SA more often will be practical. As it is, the GA run adds over a week to release times.)
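For readers unfamiliar with the term, the following Python sketch shows a plain textbook perceptron learning one weight per rule from labelled mail. It is an assumption-laden illustration, not the SA 3.0 scoring code: the function names, rule names, learning rate, and training data are all hypothetical.

```python
def train_perceptron(samples, rule_names, epochs=20, lr=0.1):
    """Learn one weight per rule.

    samples: list of (set_of_fired_rules, is_spam) pairs.
    Returns (weights, bias).
    """
    weights = {r: 0.0 for r in rule_names}
    bias = 0.0
    for _ in range(epochs):
        for fired, is_spam in samples:
            activation = bias + sum(weights[r] for r in fired if r in weights)
            predicted = activation > 0
            if predicted != is_spam:
                # Nudge the weights of the rules that fired toward the true label.
                step = lr if is_spam else -lr
                bias += step
                for r in fired:
                    if r in weights:
                        weights[r] += step
    return weights, bias

def classify(fired, weights, bias):
    return bias + sum(weights.get(r, 0.0) for r in fired) > 0

# Toy corpus with hypothetical rule names.
samples = [
    ({"PILL_WORDS"}, True),
    ({"PILL_WORDS", "HTML_ONLY"}, True),
    ({"FROM_LIST"}, False),
    ({"HTML_ONLY", "FROM_LIST"}, False),
]
weights, bias = train_perceptron(samples, ["PILL_WORDS", "HTML_ONLY", "FROM_LIST"])
print(weights, bias)
```

Each pass over the corpus is a handful of additions per message, which is why this kind of training can be much quicker than an evolutionary search over whole score sets.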

Most of the ancillary rulesets have arbitrarily assigned scores, so they might actually reduce the effectiveness of SA as a whole. (This is likely if SA is already spotting most or all of your spam, in which case adding large numbers of non-GA-scored rules is likely to increase false positives.)

See the SA Wiki page on independently releasing rules.

Keeping spamassassin current

Posted Mar 4, 2004 18:10 UTC (Thu) by Cato (subscriber, #7643) [Link]

I find that SpamAssassin takes many weeks to decline in effectiveness after a new release - for a long time I was running a very old release and it was fine. If you are able to write your own rules, or just adjust the existing scores to your spam, you should be fine. The rules are still important even when the Bayesian classifier is in use, because messages scoring over (by default) 15 are used to 'autolearn' spam - without useful rules you would need to spend some time telling SA which messages are spam.
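The autolearn mechanism described here can be pictured with a short Python sketch. The spam threshold of 15 is the figure quoted in this comment; the ham threshold and all names are assumptions added for the example and should be checked against your own configuration.

```python
AUTOLEARN_SPAM_THRESHOLD = 15.0  # figure quoted in the comment
AUTOLEARN_HAM_THRESHOLD = 0.0    # assumed value, not from the source

def autolearn_label(rule_score):
    """Decide whether a message is confident enough to train the Bayes filter.

    Returns "spam", "ham", or None when the score is too ambiguous to learn from.
    """
    if rule_score > AUTOLEARN_SPAM_THRESHOLD:
        return "spam"
    if rule_score < AUTOLEARN_HAM_THRESHOLD:
        return "ham"
    return None

for score in (22.4, 6.0, -1.3):
    print(score, autolearn_label(score))
```

With weak rules few messages would ever clear the spam threshold, so the classifier would get little automatic training, which is the point being made above.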

It is very rare that someone manages to get a spam past SA these days, at least with my custom rules.

Keeping spamassassin current

Posted Mar 4, 2004 11:35 UTC (Thu) by larsga (guest, #2801) [Link]

Due to the volume of spam I receive (typically ~800 in one weekend) I've tried out several different solutions, in this order: spamcop, bogofilter, spamassassin, and spambayes. Spamcop turned out not to catch more than ~20%; bogofilter caught a good deal, but not enough. SpamAssassin didn't do very well for me (and also produced quite a few false positives), so I ended up ditching it. SpamBayes, however, has pretty much solved the problem for me. I have to do a little training now and then to keep it up to date, but that's about it.

In short, instead of trying to keep SpamAssassin current you might as well ditch it in favour of SpamBayes. All IMHO, of course.

spambayes (was Keeping spamassassin current)

Posted Mar 4, 2004 12:48 UTC (Thu) by metacircles (guest, #8895) [Link]

I've also been pretty impressed by spambayes. Since I last cleaned out my spam folder (six days ago) I've received 8488 spams (about 43MB), and maybe ten to twenty a day have leaked through it. (I don't keep stats on them; I just train and/or delete them as I see them. It feels like about that many.)

I have no idea about false positives because I lack any kind of motivation to ever check the spam folder: if people really want to get in touch with me I can only suggest they (a) write mail that looks less like spam, or (b) use a reliable contact method.

Is my spam filter false positive the sender's problem?

Posted Mar 6, 2004 1:39 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

It's funny that in discussions about email systems, people quite often talk like it's a privilege to send email to them; "it's OK if it's hard to send me an email because the senders who don't play my game don't deserve to reach me."

I guess that's probably true for some recipients, but for most of us, we are probably hurt just as much by a failed email as the sender. I frequently get requests from people for help with software and documents I distribute, and my reply gets rejected because my mail server is erroneously on some black list or a name server isn't set up exactly according to convention, or something like that. And those are just the ones where the spam filter is courteous enough to send a bounce message; there are probably more I don't know about.

It irritates me if I spend 15 minutes writing an email and the recipient never gets it, but the real loser in this case is the recipient. And no, I don't go out of my way to find a way to get through the spam filter in these cases. There's just not enough in it for me.

Is my spam filter false positive the sender's problem?

Posted Mar 10, 2004 1:35 UTC (Wed) by showell (subscriber, #2929) [Link]

An interesting viewpoint, but I must state the other side of the coin. In my job e-mail is out-dated but unfortunately not yet out-moded. I get no spam sitting behind a very good corporate firewall, but the amount of real mail is so large that I cannot guarantee to read all of it.

I get annoyed with people who consider sending an e-mail as a way of dealing with a subject (i.e., if I have sent an e-mail then I have fulfilled my responsibility and have informed / handed off responsibility). No one should consider a sent e-mail as read, even if we had zero spam on the net.

signed
Frustrated Manager

Keeping spamassassin current

Posted Mar 4, 2004 16:24 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

I'll throw in another recommendation for SpamBayes... I use it on my home
E-mail address, which gets about 200 spams per day, and it works incredibly
well... And, it has a very difficult time there, because I get a lot of
particularly spammy-looking mail, which I actually WANT to receive there...
(Eg: receipts and shipping notices from online ordering places, various
mailing list posts, and even a few actual ads from businesses I really want
to see...) And, it still does a remarkable job of sorting the spam from the
non-spam... Every once in a while, it'll mess-up, but it's really pretty
rare, and on the whole it definitely makes my life a whole lot easier...

However, I have to disagree with you about bogofilter... I use that on my
work E-mail address, which gets a bit less spam (maybe 100-150 per day), and
it works just as great as SpamBayes, in my experience, if not even BETTER...
However, to be fair, it DOES have a much easier job, too: the only legit
messages I ever get at work are actual work-related messages, and never any
borderline spammy-looking messages like SpamBayes has to deal with at home...
So, perhaps that accounts for my perception of it... *shrug* All I can say
is that it seems nearly perfect under these conditions, anyway... I've
never even ONCE had a false-positive with it... And, it doesn't let many
spams ever slip through, either... And, it seems to learn INSTANTLY when
corrected, whereas my experience with SpamBayes is that it takes a while to
learn from mistakes, and may keep repeating them a few times before finally
catching on...

But, in any case, I recommend both SpamBayes and bogofilter quite highly...
People keep talking about the need for all these complex and silly solutions
to the scourge of spam, when really all they need to do is install a Bayesian
filter and train it well, and they'd not have many complaints after that...
(Of course, I wouldn't object to drawing and quartering of spammers, if
that were to be passed into law, either... ;-))

Keeping spamassassin current

Posted Mar 5, 2004 1:24 UTC (Fri) by larsga (guest, #2801) [Link]

I'm not sure whether my experience with bogofilter is generally applicable, since I had to set up my own scripts to get email from my mail client into the bogofilter database. It may be that clues (headers especially) get lost on the way, so it *is* possible that bogofilter actually is better than I made it sound like.

On the other hand, when I used it I *had* to set up my own scripts, whereas with SpamBayes' POP proxy there was no need for that, so SpamBayes gets some extra points anyway. :)

Keeping spamassassin current

Posted Mar 4, 2004 15:10 UTC (Thu) by kh (subscriber, #19413) [Link]

I have wondered if the filters couldn't do a better job if the message was first piped through aspell or some other spell checker. The number of misspellings, the number of unresolvable garbage words (which would then be dropped before the inspection step), and the corrected words would then be sent to SpamAssassin's filter for inspection. I think this would make poisoning of the filter more difficult in any case.

i.e., the message starts as:
Get qwr your V|agra sldfj fdy

Spamassassin receives:
1 misspelling
3 garbage
Get your Viagra
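The preprocessing step proposed above can be sketched in a few lines of Python. This version does not call aspell: it uses the standard library's difflib against a tiny stand-in word list, and the word list, cutoff, and function name are all assumptions for illustration.

```python
import difflib

# Tiny stand-in word list; a real setup would query aspell or a full dictionary.
WORDS = ["get", "your", "viagra", "buy", "now", "cheap"]

def normalize(message, cutoff=0.8):
    """Return (misspellings, garbage, cleaned_text) for a message.

    Tokens close to a known word are corrected and counted as misspellings;
    tokens with no close match are counted as garbage and dropped.
    """
    misspellings = 0
    garbage = 0
    kept = []
    for token in message.split():
        word = token.lower()
        if word in WORDS:
            kept.append(word)
            continue
        close = difflib.get_close_matches(word, WORDS, n=1, cutoff=cutoff)
        if close:
            misspellings += 1
            kept.append(close[0])
        else:
            garbage += 1
    return misspellings, garbage, " ".join(kept)

print(normalize("Get qwr your V|agra sldfj fdy"))
```

The two counts could be handed to the filter as extra features alongside the cleaned text, so that heavy obfuscation becomes evidence in its own right.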

Keeping spamassassin current

Posted Mar 16, 2004 15:31 UTC (Tue) by Chris (guest, #20242) [Link]

Wow, nice to see people find our work!

Greetings, I'm Chris, a member of the Spamassassin Rule writers consortium. I just wanted to say hello. There is a great group of people working on keeping SA up to date. There are some incredible rulesets that will help out any SA user.

We are currently working on a big release of more rulesets. The 3 main places to check out are:

http://wiki.apache.org/spamassassin/CustomRulesets
http://www.exit0.us/

And of course my SARE (SpamAssassin Rule Emporium) is linked from those. It should hopefully be moved to a new server soon, which is why I'm not posting a direct URL.

A typical example: an email scoring 3.5 with stock SA will now score 38.6 with a few rulesets! Pretty cool.

Glad we can help people out.

PS: On the spellcheck thing... the Tripwire.cf ruleset covers a lot for English. We have a lot of obfuscation rules and some for Bayes poison. We are getting better every day ;)

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds