Bayesian filtering isn't a panacea

Posted Oct 16, 2003 3:53 UTC (Thu) by rfunk (subscriber, #4054)
Parent article: Open spam filtering rules considered harmful?

You say Bayesian filters cannot be worked around in any easy way, but in
my experience that's not entirely true.

I don't use SpamAssassin (yet), but I do use bogofilter, which is a
purely bayesian approach. I get around 100-130 spams a day, and these
days about 2-8 per day actually make it to my inbox.

Those 2-8 messages that bogofilter doesn't catch tend to use techniques
such as putting their payload in the html part of a multipart/alternative
message, and some innocuous book excerpt in the text part. Or they stick
lots of random words (or just random letters) in the message. Or they
use creative misspelling to avoid the words that would definitively flag
the message as spam. Some are just short HTML messages that try to load
their content from elsewhere. And a few actually read like a handcrafted
message, with little indication of their spammy nature.

I probably need to tune my bogofilter setup, and I may move to
spamassassin in order to avoid relying entirely on bayesian filtering.
Bayesian filtering is a wonderful invention and certainly the single most
effective spam content filter, but in a world where spammers creatively
adjust to the filters, it can't do the whole job on its own.


Bayesian filtering is not bad

Posted Oct 16, 2003 7:23 UTC (Thu) by djao (subscriber, #4263)

I must say, I use spamassassin and I don't have the particular problems you mention.

putting their payload in the html part of a multipart/alternative message, and some innocuous book excerpt in the text part.

A good bayesian filter (e.g. Paul Graham's) ignores innocuous words like book excerpts and only scores based on interesting words. Since a random book excerpt is unlikely to contain interesting negative scoring words (which are different for each person), the positive scoring words would flag this kind of mail as spam.

Or they stick lots of random words (or just random letters) in the message.

Since legit mail never has random letters, a good bayesian filter can instantly flag mail with random letters as spam. The random words problem is addressed above.

Or they use creative misspelling to avoid the words that would definitively flag the message as spam.

These are the easiest mails for spamassassin's bayesian filter to catch: legit mail never has misspelled spam terms, so anything containing lots of misspelled or m1sspe11ed spam words is easily flagged as spam. The only way a spammer could avoid this trap is to exclusively use misspellings that have never appeared in your spam before. The chances of a spammer succeeding at this task with all of his words is almost zero. When he fails, your bayesian filter automatically learns all of the new incoming misspellings as spam terms.

Some are just short HTML messages that try to load their content from elsewhere.

These are the kind of spam best blocked using non-bayesian techniques, I think. Spamassassin has many non-bayesian filters for this purpose. Also, not loading images except on request saves a lot on your eyes -- the number of legit mails containing html img tags is so small that it is no trouble to manually load their images.

And a few actually read like a handcrafted message, with little indication of their spammy nature.

These "dating service" spams, being the only spams that don't use a sales pitch, might well require improved bayesian filters to catch. But I think eliminating the non-dating-service spams is already good progress.

Bayesian filtering is not bad

Posted Oct 16, 2003 16:21 UTC (Thu) by rfunk (subscriber, #4054)

I never said bayesian filtering is bad. In fact I said it's the greatest single technique. It's just not the cure on its own.

I must say, I use spamassassin and I don't have the particular problems you mention.

Thus my speculation about possibly switching from bogofilter to spamassassin.

A good bayesian filter ignores innocuous words like book excerpts and only scores based on interesting words.

By "innocuous" I mean words that tend to appear in my real mail and not in the usual spam. The Bayesian concept as I understand it includes giving bonus points for words in the "good" pile. (Yes, that's way oversimplified.)

Since legit mail never has random letters, a good bayesian filter can instantly flag mail with random letters as spam.

Not bayesian, because the random letters won't (in the general case) already be in the database of spammy words. (If they're already there, then they must not be random!)

These are the easiest mails for spamassassin's bayesian filter to catch: legit mail never has misspelled spam terms, so anything containing lots of misspelled or m1sspe11ed spam words is easily flagged as spam. The only way a spammer could avoid this trap is to exclusively use misspellings that have never appeared in your spam before.

The whole point of the misspelling is to use forms that haven't appeared in spam before. Thus they get through. Just like random letters except that the eye gives them meaning (and the possibilities are more limited).

Also, not loading images except on request saves a lot on your eyes

Of course. Not to mention the "web bugs" that let them know when the message has been read. Never allow mail to load anything from the net.

And yes, of course I retrain the filter every time spam gets through.

Bayesian filtering

Posted Oct 19, 2003 9:56 UTC (Sun) by djao (subscriber, #4263)

Not bayesian, because the random letters won't (in the general case) already be in the database of spammy words.

There are only 26 letters. Many of these letters do not occur by themselves in natural English. Thus lots of individual random letters (as opposed to random words) is a pretty good spam sign.

The whole point of the misspelling is to use forms that haven't appeared in spam before. Thus they get through.

I have 20000 spams in my spam folder. The probability that a spammer invents a new message consisting entirely of unseen words is pretty small.

Both of the problems I quoted here would be solved if spam filters would learn to recognize that an unusually high number of previously unseen words is in and of itself a very good spam indicator. Maybe some filters already do.
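The unseen-word heuristic suggested above could be sketched roughly as follows. The tokenizer, the function names, and the 0.5 threshold are all assumptions chosen for illustration, not the behaviour of any particular filter.

```python
import re


def unseen_ratio(message, known_tokens):
    """Fraction of tokens in the message that the filter has never seen."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in known_tokens)
    return unseen / len(tokens)


def looks_suspicious(message, known_tokens, threshold=0.5):
    """Flag a message when most of its tokens are new to the filter."""
    return unseen_ratio(message, known_tokens) > threshold


# Hypothetical vocabulary built from previously trained mail.
known = {"the", "meeting", "is", "at", "noon"}
flagged = looks_suspicious("xqzv kjhw meeting", known)
```

In practice such a ratio would more likely be fed into the score as one more signal than used as a hard cutoff, since legitimate mail from a new correspondent can also contain many unfamiliar words.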

Bayesian filtering sure SEEMS like a panacea to me...

Posted Oct 16, 2003 14:09 UTC (Thu) by RobSeace (subscriber, #4435)

Do you retrain it to recognize those messages that make it through as spam?
If not, that's your problem, right there... You have to correct it when it
makes a mistake, so it can learn from it... Otherwise, it's just going to
continue making the same mistake in the future... It's a learning system,
and if you train it properly initially, and KEEP it properly trained by
correcting all mistakes, then it just gets better and better, and ultimately
the mistakes all but disappear... I get between 100 and 200 spams per day
at 2 separate E-mail addresses... I run SpamBayes on one and bogofilter on
the other... Both of them catch all of the spams, almost every single day...
(With no false-positives at all, either...) Maybe ONE single spam per MONTH
will make it through either one and into my mailbox... And, after a simple
retraining to recognize that message as spam, it and all future similar ones
will then be caught, forcing the spammers to try to come up with some NEW
trick to make it past (which will then be learned by the filters as well,
of course)... As near as *I* can see, based on my own personal
experience, Bayesian filtering sure IS about as close to a panacea as one
is ever likely to see in the world of spam-filtering... But, of course,
if you don't keep your Bayesian filter properly trained and correct it when
it makes mistakes, then of course it can't work properly... But, that's
not the fault of the filter, but of the user, in that case...
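The retrain-on-mistake loop described above can be pictured with a toy token-count store. This is only a sketch: the class, its method names, and the sample tokens are hypothetical, and real filters such as bogofilter or SpamBayes keep their own databases and have their own training commands.

```python
from collections import Counter


class TokenCounts:
    """Per-token message counts for the spam and ham piles."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        pile = self.spam if is_spam else self.ham
        pile.update(set(tokens))  # count each token once per message

    def seen_in_spam(self, token):
        return self.spam[token] > 0


counts = TokenCounts()
counts.learn(["cheap", "pills", "now"], is_spam=True)

# A misspelled variant slips through; the user corrects the filter by hand.
missed = ["che4p", "p1lls", "now"]
counts.learn(missed, is_spam=True)
```

After the correction the misspelled forms are ordinary spam tokens, which is why the same trick tends to work only once against a filter that is kept trained.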

Bayesian filtering isn't a panacea

Posted Oct 16, 2003 16:55 UTC (Thu) by edgewood (subscriber, #1123)

I use bogofilter with tristate classification, and feed the result into TMDA. Messages that aren't spam according to bogofilter (X-Bogosity: No) get passed directly into my inbox. Messages that are spam are held by TMDA, and ones that bogofilter is unsure about are either released by my TMDA whitelist or challenged, so that legitimate senders can respond to the challenge and have the message released.

It's been weeks since a single spam showed up in my inbox, and months since I've had to rescue a legitimate email from the TMDA pending queue.
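The three-way routing described in this comment can be summarised in a short sketch. The cutoff values, function names, and addresses below are assumptions for illustration only; they are not bogofilter's or TMDA's actual configuration or API.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a spam score in [0, 1] to a tristate verdict."""
    if score >= spam_cutoff:
        return "Yes"
    if score <= ham_cutoff:
        return "No"
    return "Unsure"


def route(score, sender, whitelist):
    """Decide what happens to a message given its score and sender."""
    verdict = classify(score)
    if verdict == "No":
        return "inbox"      # not spam: deliver directly
    if verdict == "Yes":
        return "hold"       # spam: held in the pending queue
    # Unsure: whitelisted senders are released, everyone else is challenged.
    return "inbox" if sender in whitelist else "challenge"


# Hypothetical whitelist and message.
whitelist = {"friend@example.org"}
action = route(0.70, "stranger@example.org", whitelist)
```

The appeal of the design is that the challenge step is only ever applied to the narrow band of messages the statistical filter cannot decide on its own.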

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds