Open spam filtering rules considered harmful?

Readers of LWN know that we have long been fans of SpamAssassin. Your editor, whose personal spam load is approaching 500 messages per day, would long have ceased to function without it. Network life in the 21st century requires either a well-hidden email address, or some sort of effective filtering.

SpamAssassin's extensive arsenal of tests has traditionally included checks for legitimate mail. In the past, mail which identified itself as having been created with certain free email agents or which contained a software patch was given some extra credit in the scoring process. Spammers have often found and exploited those tests; for a while, some of us were receiving mail which had been simultaneously "created" with mutt and evolution. The usual response to such activity has been to remove the tests in question.

Most recently, some spammers have started adding fake PGP signatures (in full HTML glory) to their output, in the hopes of slipping past SpamAssassin. The PGP signature test was removed some time ago, but the exploit was still enough to inspire this News.com article which, among other things, says:

The attack on the software's filtering process highlights the dangers of open-source projects, but it also reinforces the ability of projects with active development teams to quickly respond to such security holes.

The open nature of SpamAssassin's filtering is, thus, a "danger." Lest one become too concerned about the "dangers" involved in using SpamAssassin, however, there are a few things which should be kept in mind:

  • Prospective spam can be tested against any filter, open or closed. It would be surprising if spammers were not trying their products against SpamAssassin in this way. They also, most likely, maintain accounts with large ISPs and try to craft messages that get past the filters those ISPs employ as well.

  • SpamAssassin remains highly effective, even when spammers have had plenty of time to study its tests and work out ways to get around it. Open or not, SpamAssassin's rules are very good at identifying spam, and they appear to be hard to get around. Fighting spam is an arms race; it is surprising, actually, how rarely one has to upgrade SpamAssassin to keep it effective.

  • The bayesian filtering techniques used by SpamAssassin (and many other spam filtering systems) cannot be worked around in any easy way. A quick test on about 6400 messages which had accumulated in your editor's spam folder shows that the bayesian filter is the decisive test which condemns 15-25% of all incoming spam. Bayesian filters are highly individualized, and they are inaccessible to spammers. The algorithm is entirely open, but that is little comfort to those who would bury us in unwanted trash.
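To make the "highly individualized" point concrete, here is a minimal sketch of a per-user Bayesian filter. It is a toy naive Bayes classifier with add-one smoothing, not SpamAssassin's actual implementation; the class name `TinyBayes`, the tokenizer, and the training messages are all assumptions made for illustration. The algorithm is public, but the word counts come only from one user's own mail, which is the part a spammer never sees.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'!-]+")


def tokenize(text):
    """Lower-case the text and split it into crude word tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyBayes:
    """Toy per-user naive Bayes filter (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each token once per message, under the given label.
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Start from the smoothed prior odds, then add one term per token.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in set(tokenize(text)):
            p_spam = (self.counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical training data standing in for one user's mail folders.
mailbox = TinyBayes()
mailbox.train("cheap pills buy now", "spam")
mailbox.train("buy cheap watches now", "spam")
mailbox.train("kernel patch review attached", "ham")
mailbox.train("meeting notes for the kernel summit", "ham")

spammy = mailbox.spam_probability("buy cheap pills")
hammy = mailbox.spam_probability("kernel patch notes")
print(f"{spammy:.3f} {hammy:.3f}")
```

A different user, trained on different ham, would end up with different word weights, so a message tuned to slip past one mailbox need not slip past another.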

The real lesson from the PGP signature "exploit," most likely, is that negative tests will always be relatively easy for spammers to abuse. That will be why SpamAssassin 2.60 contains almost none of these tests.
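The weakness of negative tests can be shown with a small rule-based scorer. This is a sketch under stated assumptions: the rule names, patterns, scores, and threshold below are invented for the example and are not SpamAssassin's real rules. The point is only that a publicly known rule carrying a negative score hands the spammer a discount for the price of a few pasted lines.

```python
import re

# Hypothetical rules: (name, pattern, score). Positive scores indicate spam.
RULES = [
    ("MENTIONS_PILLS", re.compile(r"cheap pills", re.I), 3.0),
    ("SHOUTING_SUBJECT", re.compile(r"^Subject: [^a-z\n]{10,}$", re.M), 2.5),
    ("LOOKS_PGP_SIGNED", re.compile(r"-----BEGIN PGP SIGNATURE-----"), -3.0),
]
THRESHOLD = 5.0  # assumed cutoff: at or above this, the message is spam


def score(message, rules=RULES):
    """Sum the scores of every rule whose pattern appears in the message."""
    return sum(points for _name, pattern, points in rules
               if pattern.search(message))


spam = "Subject: BUY NOW!!! TODAY\n\nCheap pills, no prescription.\n"
forged = (spam + "-----BEGIN PGP SIGNATURE-----\n"
          "not a real signature\n-----END PGP SIGNATURE-----\n")

# The same rule set with the negative-score tests dropped.
positive_only = [rule for rule in RULES if rule[2] > 0]

print(score(spam) >= THRESHOLD)                   # plain spam is caught
print(score(forged) >= THRESHOLD)                 # forged signature slips by
print(score(forged, positive_only) >= THRESHOLD)  # caught once the test is gone
```

Removing the negative test costs a little accuracy on genuinely signed mail, but it takes away the one knob the spammer could turn for free.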

The most important point, however, is entirely different. For many of us, email is a vital connection to the world. It is natural to be concerned about trusting a program to filter our incoming mail for us; mistakes can have real consequences. Would you really want to trust your mail to a hidden, proprietary filtering scheme? Don't you want to know what assumptions and biases have gone into the filtering decisions? Or, at least, don't you want that information to be available to those with the time and interest to check it out? Allowing a black box to pass judgment on one's incoming mail stream poses more dangers than an open, free system ever could.



Open spam filtering rules considered harmful?

Posted Oct 16, 2003 2:11 UTC (Thu) by mgh (subscriber, #5696) [Link]

Personally, I have manually deleted more legitimate email, because it "looked like" spam or got accidentally caught in a multiple select, than I suspect SpamAssassin ever would have.... It will be this that drives me to install and test SpamAssassin, not volume of spam. Maybe I am lucky, or maybe it is because I never release my email address except to people I trust, but I only get 4-5 pieces of spam per day (today).

Everything is relative, after all: manual screening is not foolproof either, and while it might give the user some empowering feelings of "personal control", I don't think it is likely to produce a better result.

Manual false positives

Posted Oct 16, 2003 2:14 UTC (Thu) by corbet (editor, #1) [Link]

That, actually, is what pushed us over the edge and made us start filtering: we kept catching ourselves deleting real mail. When there's ten spams for each legit message, it's way too easy to just lean on the "d" key and blast right through the good stuff. There comes a point where the computer really does do a better job.

Manual false positives

Posted Oct 16, 2003 13:45 UTC (Thu) by dkite (guest, #4577) [Link]

Hmm. I was going to comment that spammers' efforts to work around the filters have made it
easier to manually filter. When I get a header like this:

Re[4]: Re[0]: ;09 5; ,,10 55 l 86,, Aмер3 ика9 нское про3изнош0 ение.

It isn't hard to decide.

Derek

Bayesian filtering isn't a panacea

Posted Oct 16, 2003 3:53 UTC (Thu) by rfunk (subscriber, #4054) [Link]

You say Bayesian filters cannot be worked around in any easy way, but in
my experience that's not entirely true.

I don't use SpamAssassin (yet), but I do use bogofilter, which is a
purely bayesian approach. I get around 100-130 spams a day, and these
days about 2-8 per day actually make it to my inbox.

Those 2-8 messages that bogofilter doesn't catch tend to use techniques
such as putting their payload in the html part of a multipart/alternative
message, and some innocuous book excerpt in the text part. Or they stick
lots of random words (or just random letters) in the message. Or they
use creative misspelling to avoid the words that would definitively flag
the message as spam. Some are just short HTML messages that try to load
their content from elsewhere. And a few actually read like a handcrafted
message, with little indication of their spammy nature.

I probably need to tune my bogofilter setup, and I may move to
spamassassin in order to avoid relying entirely on bayesian filtering.
Bayesian filtering is a wonderful invention and certainly the single most
effective spam content filter, but in a world where spammers creatively
adjust to the filters, it can't do the whole job on its own.

Bayesian filtering is not bad

Posted Oct 16, 2003 7:23 UTC (Thu) by djao (subscriber, #4263) [Link]

I must say, I use spamassassin and I don't have the particular problems you mention.

putting their payload in the html part of a multipart/alternative message, and some innocuous book excerpt in the text part.

A good bayesian filter (e.g. Paul Graham's) ignores innocuous words like book excerpts and only scores based on interesting words. Since a random book excerpt is unlikely to contain interesting negative scoring words (which are different for each person), the positive scoring words would flag this kind of mail as spam.

Or they stick lots of random words (or just random letters) in the message.

Since legit mail never has random letters, a good bayesian filter can instantly flag mail with random letters as spam. The random words problem is addressed above.

Or they use creative misspelling to avoid the words that would definitively flag the message as spam.

These are the easiest mails for spamassassin's bayesian filter to catch: legit mail never has misspelled spam terms, so anything containing lots of misspelled or m1sspe11ed spam words is easily flagged as spam. The only way a spammer could avoid this trap is to exclusively use misspellings that have never appeared in your spam before. The chances of a spammer succeeding at this task with all of his words is almost zero. When he fails, your bayesian filter automatically learns all of the new incoming misspellings as spam terms.

Some are just short HTML messages that try to load their content from elsewhere.

These are the kind of spam best blocked using non-bayesian techniques, I think. Spamassassin has many non-bayesian filters for this purpose. Also, not loading images except on request saves a lot on your eyes -- the number of legit mails containing html img tags is so small that it is no trouble to manually load their images.

And a few actually read like a handcrafted message, with little indication of their spammy nature.

These "dating service" spams, being the only spams that don't use a sales pitch, might well require improved bayesian filters to catch. But I think eliminating the non-dating-service spams is already good progress.
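The "only score interesting words" idea mentioned earlier in this comment can be sketched as follows. This is loosely modeled on Paul Graham's published description, but every number here is an assumption for illustration: the per-token probabilities, the 0.4 value assigned to unseen tokens, and the cutoff of three tokens (chosen small so the demo is easy to follow). It is not the code of SpamAssassin, bogofilter, or any other filter.

```python
from math import prod


def combine(probs):
    """Combine per-token spam probabilities into one message score."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def interesting_score(tokens, token_prob, n_interesting=15, unknown=0.4):
    """Score a message using only the tokens furthest from neutral (0.5)."""
    probs = [token_prob.get(token, unknown) for token in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:n_interesting])


# Hypothetical per-user token probabilities (kept away from 0 and 1).
token_prob = {
    "viagra": 0.99, "unsubscribe": 0.95, "click": 0.90,
    "whale": 0.50, "sea": 0.50,
}

payload = ["viagra", "click", "unsubscribe"]
padding = ["call", "me", "ishmael", "whale", "sea"]  # neutral book excerpt

bare = interesting_score(payload, token_prob, n_interesting=3)
padded = interesting_score(payload + padding, token_prob, n_interesting=3)
print(f"{bare:.5f} {padded:.5f}")
```

In this toy setup the neutral padding changes nothing, because the three strongest tokens are the same with or without it. Padding made of words that are strongly "good" for a particular recipient would be a different matter, since those tokens would count as interesting too.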

Bayesian filtering is not bad

Posted Oct 16, 2003 16:21 UTC (Thu) by rfunk (subscriber, #4054) [Link]

I never said bayesian filtering is bad. In fact I said it's the greatest single technique. It's just not the cure on its own.

I must say, I use spamassassin and I don't have the particular problems you mention.

Thus my speculation about possibly switching from bogofilter to spamassassin.

A good bayesian filter ignores innocuous words like book excerpts and only scores based on interesting words.

By "innocuous" I mean words that tend to appear in my real mail and not in the usual spam. The Bayesian concept as I understand it includes giving bonus points for words in the "good" pile. (Yes, that's way oversimplified.)

Since legit mail never has random letters, a good bayesian filter can instantly flag mail with random letters as spam.

Not bayesian, because the random letters won't (in the general case) already be in the database of spammy words. (If they're already there, then they must not be random!)

These are the easiest mails for spamassassin's bayesian filter to catch: legit mail never has misspelled spam terms, so anything containing lots of misspelled or m1sspe11ed spam words is easily flagged as spam. The only way a spammer could avoid this trap is to exclusively use misspellings that have never appeared in your spam before.

The whole point of the misspelling is to use forms that haven't appeared in spam before. Thus they get through. Just like random letters except that the eye gives them meaning (and the possibilities are more limited).

Also, not loading images except on request saves a lot on your eyes

Of course. Not to mention the "web bugs" that let them know when the message has been read. Never allow mail to load anything from the net.

And yes, of course I retrain the filter every time spam gets through.

Bayesian filtering

Posted Oct 19, 2003 9:56 UTC (Sun) by djao (subscriber, #4263) [Link]

Not bayesian, because the random letters won't (in the general case) already be in the database of spammy words.

There are only 26 letters. Many of these letters do not occur by themselves in natural English. Thus lots of individual random letters (as opposed to random words) is a pretty good spam sign.

The whole point of the misspelling is to use forms that haven't appeared in spam before. Thus they get through.

I have 20000 spams in my spam folder. The probability that a spammer invents a new message consisting entirely of unseen words is pretty small.

Both of the problems I quoted here would be solved if spam filters would learn to recognize that an unusually high number of previously unseen words is in and of itself a very good spam indicator. Maybe some filters already do.
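A rough sketch of that suggestion: measure what fraction of a message's distinct tokens the filter has never seen, and what fraction of its tokens are stray single letters. The function names, the sample vocabulary, and the choice to treat "a" and "i" as normal English are assumptions for illustration; no existing filter is claimed to work this way.

```python
def unseen_fraction(tokens, known_tokens):
    """Fraction of distinct tokens that are absent from the training data."""
    unique = set(tokens)
    if not unique:
        return 0.0
    return len(unique - set(known_tokens)) / len(unique)


def single_letter_fraction(tokens):
    """Fraction of tokens that are lone letters other than 'a' and 'i'."""
    if not tokens:
        return 0.0
    stray = [t for t in tokens
             if len(t) == 1 and t.isalpha() and t.lower() not in ("a", "i")]
    return len(stray) / len(tokens)


# Hypothetical vocabulary already present in a trained filter.
known = {"buy", "cheap", "pills", "meeting", "notes", "kernel"}

misspelled = "xqzv plmk v1agr4 buy".split()
lettered = "x q z buy now".split()

print(unseen_fraction(misspelled, known))   # 3 of 4 distinct tokens are new
print(single_letter_fraction(lettered))     # 3 of 5 tokens are stray letters
```

Either ratio could then be thresholded or fed to the filter as one more piece of evidence; where to set the threshold would have to come from a real mail corpus.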

Bayesian filtering sure SEEMS like a panacea to me...

Posted Oct 16, 2003 14:09 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

Do you retrain it to recognize those messages that make it through as spam?
If not, that's your problem, right there... You have to correct it when it
makes a mistake, so it can learn from it... Otherwise, it's just going to
continue making the same mistake in the future... It's a learning system,
and if you train it properly initially, and KEEP it properly trained by
correcting all mistakes, then it just gets better and better, and ultimately
the mistakes all but disappear... I get between 100 and 200 spams per day
at 2 separate E-mail addresses... I run SpamBayes on one and bogofilter on
the other... Both of them catch all of the spams, almost every single day...
(With no false-positives at all, either...) Maybe ONE single spam per MONTH
will make it through either one and into my mailbox... And, after a simple
retraining to recognize that message as spam, it and all future similar ones
will then be caught, forcing the spammers to try to come up with some NEW
trick to make it past (which will then be learned by the filters as well,
of course)... As near as *I* can see, based on my own personal
experience, Bayesian filtering sure IS about as close to a panacea as one
is ever likely to see in the world of spam-filtering... But, of course,
if you don't keep your Bayesian filter properly trained and correct it when
it makes mistakes, then of course it can't work properly... But, that's
not the fault of the filter, but of the user, in that case...

Bayesian filtering isn't a panacea

Posted Oct 16, 2003 16:55 UTC (Thu) by edgewood (subscriber, #1123) [Link]

I use bogofilter with tristate classification, and feed the result into TMDA. Messages that aren't spam according to bogofilter (X-Bogosity: No) get passed directly into my inbox. Messages that are spam are held by TMDA, and ones that bogofilter is unsure about are either released by my TMDA whitelist or challenged, so that legitimate senders can respond to the challenge and have the message released.

It's been weeks since a single spam showed up in my inbox, and months since I've had to rescue a legitimate email from the TMDA pending queue.
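The three-way flow described in this comment can be sketched as a small routing function. The cutoff values, function names, and addresses below are assumptions for illustration; this does not call bogofilter or TMDA and does not reflect either program's actual configuration.

```python
def classify(spamicity, ham_cutoff=0.1, spam_cutoff=0.9):
    """Map a spam score in [0, 1] onto ham, spam, or unsure."""
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity <= ham_cutoff:
        return "ham"
    return "unsure"


def route(sender, spamicity, whitelist):
    """Decide what happens to a message given its verdict and sender."""
    verdict = classify(spamicity)
    if verdict == "ham":
        return "deliver"      # straight to the inbox
    if verdict == "spam":
        return "hold"         # kept in the pending queue
    # Unsure: trust known senders, challenge everyone else.
    return "deliver" if sender in whitelist else "challenge"


whitelist = {"friend@example.org"}  # hypothetical address

print(route("list@example.net", 0.02, whitelist))
print(route("bulk@example.com", 0.97, whitelist))
print(route("friend@example.org", 0.50, whitelist))
print(route("stranger@example.com", 0.50, whitelist))
```

The attraction of the middle band is that the filter's mistakes land in a place where a legitimate sender can still rescue the message.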

Open spam filtering rules considered harmful?

Posted Oct 16, 2003 3:54 UTC (Thu) by smoogen (subscriber, #97) [Link]

s/PHP/PGP/

I got a little confused and needed to read it again to figure out where a PHP vuln had come in.

Open spam filtering rules considered harmful?

Posted Oct 16, 2003 4:08 UTC (Thu) by arcticwolf (guest, #8341) [Link]

The problem is really not so much SpamAssassin's rules' openness (as we all know well, security by obscurity does not work), but rather that a rule-based approach with static rules will always allow you to craft messages specifically designed to get around those rules.

For now, bayesian spam filtering seems to be the way to go; maybe something better will come along in the future, but it does seem to be an improvement over rule-based approaches. I used SpamAssassin for quite a while before switching to bogofilter, and while I always was happy with SpamAssassin's performance, bogofilter simply is even better (or, rather, has become even better after proper training).

If you want to see for yourself, give it a try - collect all your spam (and ham) over a couple of months and use that to train bogofilter (with a bit of procmail trickery, you can even use spamassassin to train bogofilter automatically). It's simply amazing.

Open spam filtering rules considered harmful?

Posted Oct 16, 2003 4:32 UTC (Thu) by Ross (subscriber, #4065) [Link]

I wish there was some way to include things like "the subject line
contained 20 consecutive spaces" into the Bayesian filtering. If there
was some way to merge SpamAssassin-like checks into an automatically
tuning filter I think we would have a winner.

There are also many ways to fool a Bayesian filter using different
encoding techniques, HTML-comments, extra whitespace, etc. A
preprocessing pass to normalize a message would be a nice addition.
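A normalizing pass of the kind wished for here might look like the sketch below. It assumes the body has already been MIME-decoded into a string, and the list of tags treated as word breaks is an arbitrary choice for the example; real HTML mail would call for a proper parser rather than regular expressions.

```python
import html
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
# Tags that usually separate words; everything else is treated as inline.
BLOCK_TAG_RE = re.compile(r"</?(?:p|br|div|tr|li)\b[^>]*>", re.I)
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(body):
    """Reduce an HTML-ish message body to plain lower-case text."""
    text = COMMENT_RE.sub("", body)      # rejoin words split by comments
    text = BLOCK_TAG_RE.sub(" ", text)   # block-level tags become spaces
    text = TAG_RE.sub("", text)          # inline tags vanish entirely
    text = html.unescape(text)           # decode entities such as &amp;
    return WHITESPACE_RE.sub(" ", text).strip().lower()


obfuscated = "V<!-- x -->ia<b></b>gra &amp;   more<br>stuff"
print(normalize(obfuscated))
```

Entities are decoded only after the tags are stripped, so that an encoded `&lt;` in the text is not mistaken for markup.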

Proposed addition to Bayes

Posted Oct 16, 2003 6:32 UTC (Thu) by eru (subscriber, #2753) [Link]

I wish there was some way to include things like "the subject line contained 20 consecutive spaces" into the Bayesian filtering.

Why not assign spamminess weights to general features like this, and inject them into the Bayesian analysis exactly like the presence of certain words? I.e. if a feature-detecting rule fires, it is like the presence of a word.

The actual features to be weighted need to be invented by humans, after looking at actual spam and imagining other plausible spam techniques. Some probable spam indicators I have seen:
- Is the message in HTML with lots of short HTML comments and/or invalid tags? (I would say a message with this feature is spam with 100% certainty)
- Is the message contents just a single image?
- Is the message in HTML with invisible text? (like white text on white background)
- Is there lots of consecutive white space in subject? (the feature you mentioned)
- Is there a space-separated sequence of several individual characters in the subject? (like "V I A G R A").
...

This is probably an obvious idea implemented somewhere, but here it is, if anyone is interested in trying it.
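One way to try the idea is to have each feature detector emit a synthetic token that sits alongside the ordinary words, so that any token-based Bayesian trainer can learn its weight. The sketch below covers three of the indicators listed above; the detector names, patterns, and thresholds are assumptions for illustration, and whether this helps in practice is an open question.

```python
import re

WORD_RE = re.compile(r"[a-z0-9']+")


def many_html_comments(subject, body):
    """Three or more HTML comments in the body (assumed threshold)."""
    return len(re.findall(r"<!--.*?-->", body, re.S)) >= 3


def subject_whitespace_run(subject, body):
    """Twenty or more consecutive whitespace characters in the subject."""
    return re.search(r"\s{20,}", subject) is not None


def subject_spaced_letters(subject, body):
    """At least five single characters separated by spaces, as in 'V I A G R A'."""
    return re.search(r"(?:\b\w ){4,}\w\b", subject) is not None


# Upper case and ':' keep these names from colliding with real word tokens.
FEATURES = {
    "FEATURE:many_html_comments": many_html_comments,
    "FEATURE:subject_whitespace_run": subject_whitespace_run,
    "FEATURE:subject_spaced_letters": subject_spaced_letters,
}


def extract_tokens(subject, body):
    """Word tokens plus one pseudo-token for every feature that fires."""
    tokens = WORD_RE.findall((subject + " " + body).lower())
    tokens += [name for name, test in FEATURES.items() if test(subject, body)]
    return tokens


spam_subject = "V I A G R A" + " " * 25 + "deal"
spam_body = "hello <!--a--> <!--b--> <!--c--> world"
spam_tokens = extract_tokens(spam_subject, spam_body)
ham_tokens = extract_tokens("Kernel patch review", "See attached.")
print([t for t in spam_tokens if t.startswith("FEATURE:")])
```

Because the pseudo-tokens are trained like any other word, a feature that turns out to be unreliable for a given mailbox simply ends up with a weight near neutral.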

Proposed addition to Bayes

Posted Oct 16, 2003 14:49 UTC (Thu) by climent (subscriber, #7232) [Link]

This has been discussed on the mailing list at some point, and it was shown to cause more problems than benefits, since the rules are a deterministic way of providing info, while words are random tokens (from a Bayesian point of view).

Anyway, check the archives to read the real reasons. I might have invented mine in this post, out of memory... ;)

Open spam filtering rules considered harmful?

Posted Oct 16, 2003 13:58 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

In my experience (running SpamBayes at home and bogofilter at work), none
of these things "fool" a Bayesian filter very well... Sure, the very first
such spam that employs such a technique MIGHT get through (if it manages to
avoid other big tip-offs), but that's about it... All you have to do is
retrain your filter to recognize that message as spam, and in the future
similar messages should also start getting automatically caught... I don't
think anyone is arguing that you can have a purely static wordlist database
for your Bayesian filter, and expect it to always work with 100% accuracy...
But, if you set it to self-learning mode (so it trains on all messages as
it classifies them), and you correct ALL mistakes it makes (of which there
should be VERY few, in my experience), it tends to be about the best you can
really possibly hope for... And, I don't think adding any sort of fixed
content rule checks will ever possibly help things; in fact, they'll only
make things worse, because such things can be exploited since they're known
and not dynamically learned based on actual seen content... I personally see
maybe ONE spam get through either SpamBayes or bogofilter per month (out of
between 100 to 200 spams caught per DAY, by EACH one!)... And, after
properly retraining it to recognize that one as spam, and retesting it
again, it always properly handles it, as well as any future ones that use
the same trick that let it through... It's a learning system, and it's
really the ONLY way to go, as far as I can see...

SpamAssassin has Bayesian filtering!

Posted Oct 16, 2003 15:13 UTC (Thu) by mceesay (guest, #2806) [Link]

SpamAssassin has had Bayesian filtering for a while. sa-learn is the command you use to train the filter.

I found that SpamAssassin's effectiveness increased enormously after training.

There is great synergy between SA's rules and its Bayesian filter, giving us the best of both worlds.

Open spam filtering rules considered harmful?

Posted Oct 16, 2003 20:27 UTC (Thu) by iabervon (subscriber, #722) [Link]

You'd think that a fake PGP signature, while not easy for SpamAssassin to identify, would be easy for PGP to identify. Obviously, you'd want to have SpamAssassin ignore PGP signatures unless it had PGP to verify with, but PGP should do a pretty good job of taking care of signed messages with forged signatures or signatures that cannot be verified.

Dynamic rules update

Posted Oct 17, 2003 15:40 UTC (Fri) by jschrod (subscriber, #1646) [Link]

What we need, IMO, is a dynamic rules update service for SpamAssassin.

I don't want to check out the CVS version every so often, and when I do, I don't know if I need new module versions for new rules. But intermediate ruleset snapshots, updated at shorter intervals, would be very nice.

Cheers, Joachim

Open spam filtering rules considered harmful?

Posted Oct 23, 2003 8:22 UTC (Thu) by akukula (guest, #3862) [Link]

I found the article at http://freshmeat.net/articles/view/964/ very interesting. A whole bunch of spam filters are tested and compared there. Just worth noting.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds