
The Grumpy Editor's guide to bayesian spam filters

Posted Feb 23, 2006 12:47 UTC (Thu) by shane (subscriber, #3335)
In reply to: The Grumpy Editor's guide to bayesian spam filters by tzafrir
Parent article: The Grumpy Editor's guide to bayesian spam filters

It's only a matter of time until spammers learn to retry three times.

Sure, but until then you drop 90% of spam without having to filter.

Even if spammers do start to retry three times, they can be quickly identified unless the sender keeps state about its sessions and retries only after a retry message from the server. For a spammer this means increased memory/disk usage and more complicated spambots.
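To make the state-keeping point concrete, here is a minimal sketch of the server side of greylisting. It is not any particular daemon's implementation: the `Greylist` class, the (IP, sender, recipient) key, and the 300-second `min_delay` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Greylist:
    # Assumed value: seconds a sender must wait before a retry is accepted.
    min_delay: float = 300.0
    # Maps (ip, sender, recipient) -> time of the first delivery attempt.
    first_seen: dict = field(default_factory=dict)

    def check(self, ip: str, sender: str, recipient: str, now: float) -> str:
        """Return 'defer' (temporary failure) or 'accept' for one attempt."""
        key = (ip, sender, recipient)
        if key not in self.first_seen:
            # Never seen before: remember it and ask the sender to retry.
            self.first_seen[key] = now
            return "defer"
        if now - self.first_seen[key] < self.min_delay:
            # Retried too soon, as a blind rapid-fire resend would.
            return "defer"
        return "accept"


greylist = Greylist()
attempt = ("192.0.2.1", "a@example.com", "b@example.org")
print(greylist.check(*attempt, now=0))    # first attempt
print(greylist.check(*attempt, now=10))   # hasty retry
print(greylist.check(*attempt, now=400))  # retry after the delay
```

In this sketch a sender that fires off retries without waiting gains nothing; only one that remembers the deferral and comes back later gets through, which is the extra state the spammer would have to carry.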

Plus compromised hosts (which send a lot of spam) tend to get discovered and taken off-line. If this happens before the re-send, the mail never gets delivered.

In theory there is no difference between theory and practice. But in practice, there is.

- Jan L. A. van de Snepscheut

The bottom line is that today, it works great.



The Grumpy Editor's guide to bayesian spam filters

Posted Feb 24, 2006 1:06 UTC (Fri) by njhurst (guest, #6022)

And when they are resending 3 times there is a high probability that they hit one of your honeypots. (SPEWS)

Actually they don't bother with state

Posted Feb 24, 2006 2:32 UTC (Fri) by AnswerGuy (subscriber, #1256)

... they just retry everything three or four times at five minute intervals.

It's only an incremental extra effort to deliver multiple times regardless of whether the earlier copies reached you or not.

I get over 1000 pieces of spam per day through my personal inbox ... even after greylisting and some blacklisting (very limited blacklisting). SpamAssassin gets over 99% of that, but I still end up with 20 or so, per day, that make their way through SA. Often I get about 5 copies of any that do get through.

Periodically I go through the spam folder (and one "longlines" folder which contains spam with very long lines --- basically no standards compliant linefeeds at all, which have broken my procmail recipes for SA and YAVR in the past). (YAVR is an anti-virus recipe since SA normally doesn't catch those as "spam" per se).

I've only noticed a tiny handful of false positives dumped into my spam folder by SA. (Fewer than half a dozen in almost two years). This is a very unscientific measure since I am pretty heavy handed with the delete key when I do spot check the spam folder; and I'm only human with a limited amount of time and energy to spend on rescuing mail sent by strangers whose content looks too "spammy."

I don't keep metrics on rejections ... nor on greylisting delivery deferrals that never get delivered. I only have one confirmed case of a bit of e-mail that was not spam but which got greylisted for 49000 seconds (my wife was attempting a PayPal password change and their mail server didn't respect the conventional retry delays --- the Postfix greylisting daemon we're using punishes apparent attackers with an exponential back-off). (She resolved that by simply whitelisting them and forcing another PW change, while also teaching the silly tech support person there all about proper MTA behavior and greylisting over the phone).
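For anyone wondering how an impatient but legitimate server can end up deferred for hours, here is a rough sketch of an exponential back-off policy. This is not the code of the Postfix greylisting daemon mentioned above; the `BackoffGreylist` class, the doubling rule, and the 300-second base delay are assumptions made up for the illustration.

```python
class BackoffGreylist:
    """Greylist whose required wait doubles on every premature retry (assumed policy)."""

    def __init__(self, base_delay: float = 300.0):
        self.base_delay = base_delay
        # Maps a sender key -> (time first seen, currently required delay).
        self.state = {}

    def check(self, key: str, now: float) -> str:
        if key not in self.state:
            self.state[key] = (now, self.base_delay)
            return "defer"
        first_seen, required = self.state[key]
        if now - first_seen >= required:
            return "accept"
        # Retried before the required delay had passed: double the wait.
        self.state[key] = (first_seen, required * 2)
        return "defer"


policy = BackoffGreylist()
# A server that ignores the conventional delay and retries every minute.
for t in (0, 60, 120, 180):
    print(t, policy.check("impatient.example", now=t))
print("required delay is now", policy.state["impatient.example"][1], "seconds")
```

Under these assumptions, three hasty retries already push the wait from 300 to 2400 seconds, so a server that keeps hammering can dig itself into a very long deferral, while one that waits out the first delay is accepted on its second attempt.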

So, greylisting helps a little ... but too many spammers have adapted and now simply, blindly try everything multiple times. (Also anyone that does successfully dump their spam on an open relay gets the delivery retries for free ... all standards compliant MTAs do that, and most open relays are just old, unpatched, poorly configured copies of sendmail).

(I do NOT use ORBS type blocking ... I refuse to configure my MTAs to implement a set of dynamically changing policies that are set by strangers ... so I only add my own connection blocking sparingly ... so far).

The main thing that seems to limit my spam load is my own paltry bandwidth. My mailservers share bandwidth with DNS and web traffic (from my clients and servers) over a little old 144Kbps IDSL line. I have about the equivalent of two bonded 56K leased lines. Apparently a significant number of spam cannons time out and drop connections on such a slow link (they've got millions of other targets to get to).

JimD
(The Linux Gazette "Answer Guy" --- no I didn't pick the name; yes, my e-mail address is still the same: jimd@starshine.org --- and published monthly in several languages and countries around the world for several years).

reject on black, filter on grey

Posted Feb 24, 2006 13:53 UTC (Fri) by copsewood (subscriber, #199)

While greylisting makes sense for the reasons other contributors suggest, you might spend less time going through your spam folder, and make fewer manual false discards, if you reject more of the very high probability spam at the MTA level. Checking the spam folder manually is quicker and more accurate when it holds only the doubtful messages; otherwise there will be too many in there to do the job properly.

Also, when an occasional genuine message is rejected by the MTA that accepts mail from across admin boundaries, the sender gets a bounce, and no bounces go to innocent victims, which is what happens if you reject at an internal incoming MTA instead. I set spamassassin score > 10 to MTA reject and > 7.5 to go to my spam folder via procmail. I also use the spamhaus DNSBL, which currently rejects about 900 spams a week on my server and with which I have never seen a single false positive in about 2 years of use, and I use less reliable DNSBLs, e.g. spews, for spam folder filtering.
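The two thresholds amount to a three-way decision, sketched below. In the real setup the reject happens in the MTA and the folder sorting in procmail, so this single function is only an illustration; the name `route` and the constant names are made up, and only the 10 and 7.5 cut-offs come from the setup described above.

```python
REJECT_ABOVE = 10.0       # scores above this are rejected at the MTA
SPAM_FOLDER_ABOVE = 7.5   # scores above this (up to 10) go to the spam folder


def route(score: float) -> str:
    """Decide what happens to a message given its SpamAssassin score."""
    if score > REJECT_ABOVE:
        return "reject"        # refused during the SMTP session
    if score > SPAM_FOLDER_ABOVE:
        return "spam-folder"   # delivered, but filed for manual checking
    return "inbox"


for score in (1.2, 8.0, 12.3):
    print(score, route(score))
```

Only the band between the two thresholds lands in the spam folder, which keeps the pile that needs manual checking small.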

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds