LWN.net

E-mail filters not fooled by signed spam (News.com)

Posted Oct 11, 2003 22:13 UTC (Sat) by arcticwolf (guest, #8341)
In reply to: E-mail filters not fooled by signed spam (News.com) by proski
Parent article: E-mail filters not fooled by signed spam (News.com)

Actually, if you really want a good spam filter, try bogofilter (http://bogofilter.sf.net), and train it a lot, with lots of ham and spam. I've been using it for almost a year, and over the last three months I could count the false positives on one hand and the false negatives on one hand as well, while getting around 40 real mails and 70 pieces of spam each day.

HTML mail is no problem either (even from people who send *only* HTML and no plain text), and there's no need for kludges like "if it contains words like penis, it's (probably) spam". It also sorts out worm emails and the like, and it trains itself while categorizing mail, so I only have to step in when it gets something wrong.

I can only recommend giving it a try. It needs an initial training period before it gives good results, and a while longer before it gives great ones, but it's worth it.
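
For anyone wondering what "training with ham and spam" amounts to, here is a minimal Python sketch of a naive-Bayes-style token filter. It is a toy under stated assumptions, not bogofilter's actual algorithm: the class name `TokenFilter`, the tokenizer, and the add-one smoothing are all made up for illustration.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Toy per-user filter: counts tokens seen in spam and in ham.

    Hypothetical illustration only; not bogofilter's real scoring method.
    """

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        """Register one message as 'spam' or 'ham'."""
        self.token_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Combine per-token evidence as log-odds, with add-one smoothing."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in self.tokenize(text):
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


mail_filter = TokenFilter()
mail_filter.train("buy cheap pills now", "spam")
mail_filter.train("cheap pills online", "spam")
mail_filter.train("meeting notes attached", "ham")
mail_filter.train("lunch tomorrow at noon", "ham")
print(mail_filter.spam_probability("cheap pills today"))       # well above 0.5
print(mail_filter.spam_probability("notes from the meeting"))  # well below 0.5
```

With only four training messages the scores are crude, which is the point about the training period: the per-token counts need a lot of your own ham and spam before they mean much.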



E-mail filters not fooled by signed spam (News.com)

Posted Oct 12, 2003 17:40 UTC (Sun) by RobSeace (subscriber, #4435)

Indeed Bayesian filters are definitely the way to go... I use bogofilter at
work, and SpamBayes at home, and both work wonderfully... I see almost NO
spam at all anymore... And, the only things that have ever gotten tagged
as spam that I actually wanted to see have been spammy-looking messages from
businesses which I've ordered things from, or certain spammy-looking mailing
list posts, etc... And, after a bit of retraining, all is well even there...
(In the case of a couple mailing lists, I set up explicit pass-throughs for
the addresses in my ".procmailrc", to just skip the filtering completely,
and always let them through... But, I'm sure if I retrained enough, I
wouldn't have even needed to bother with that...) That's the only downside
to Bayesian filters: the initial training time you have to put in before
they become fully effective... But, it's definitely more than worth it...
Because, once you're over that initial hump, they just train themselves, and
you don't have to do much of anything, other than correct the very rare
mistakes it makes... The main problem I've found is people who don't save
all their legit E-mail, so they don't have a good sized corpus of legit
mail to train it on... (You can get large amounts of spam from several
sources, and all spam is pretty much alike, so no need for that to be
personalized... But, legit mail really does differ quite a bit from person
to person...) In that case, they have to just train as they go for a
while, until enough messages have been received... (Or, you CAN start them
off with an initial database trained from someone else's legit E-mail,
which I've found does work relatively ok, for the most part... But, it
definitely requires a bit of tweaking work, and is far more likely to lead
to some false-positives at least early on... But, it's probably better
than starting from scratch, at least from the user's perspective, as it'll
wipe out most of the spam, right from the start...)
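
The pass-through idea mentioned above is normally done with a procmail recipe; the same logic can be sketched in Python for illustration. Everything named here is an assumption: the address in `PASSTHROUGH`, the `route` function, and the `is_spam` callback standing in for the Bayesian filter.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical list address that should always be delivered unfiltered.
PASSTHROUGH = frozenset({"dev-list@example.org"})


def message_addresses(msg, headers=("From", "To", "Cc")):
    """Collect the lowercase addresses found in the given headers."""
    values = []
    for name in headers:
        values.extend(msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(values) if addr}


def route(raw_message, is_spam, passthrough=PASSTHROUGH):
    """Return 'inbox' or 'spam'; pass-through addresses skip the filter."""
    msg = message_from_string(raw_message)
    if message_addresses(msg) & passthrough:
        return "inbox"          # never even consult the filter
    return "spam" if is_spam(msg) else "inbox"


raw = (
    "From: Poster <someone@example.net>\n"
    "To: dev-list@example.org\n"
    "Subject: spammy-looking list post\n"
    "\n"
    "FREE patch review!!!\n"
)
# Even a filter that flags everything cannot touch the list mail.
print(route(raw, is_spam=lambda msg: True))
```

The trade-off is the one described above: the pass-through is reliable immediately, but the filter never learns from the mail that bypasses it.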

E-mail filters not fooled by signed spam (News.com)

Posted Oct 16, 2003 4:59 UTC (Thu) by arcticwolf (guest, #8341)

You're right, the initial training period is a bit of a problem with Bayesian approaches. However, I think it can hardly be avoided: the reason a Bayesian filter works well, after all, is that it learns to distinguish between what the *user* considers spam and what they consider legitimate email. And that, obviously, means that pretraining is not possible; if you started distributing tools like bogofilter with premade token databases, for example, you'd just create another weak link in the chain that spammers could attack, similar to SpamAssassin rules etc.

Getting an initial database from a friend might work; however, I, personally, would be reluctant to give anyone my token databases. Maybe it's just paranoia, but I prefer to keep them just as "secret" as my email.

It might be an idea to use a distributed token database instead of per-user ones (P2P-based?), but I personally do not think this would work: not only would it allow spammers to pollute the database, it would also take away the individuality of users' databases that actually makes the Bayesian filtering approach more effective.

The best way to train a bayesian filter is probably to just grit one's teeth and do it the hard way - put up with the spam and manually classify it until the filter starts working reasonably well, or - if you get too much spam to do this - use a tool like SpamAssassin to create an initial token database.

As far as setting up procmail rules to bypass filtering for messages you know will get misclassified is concerned - that works, of course, but the more elegant approach is still to train the filter, and I am happy to be able to say that it has worked in my case, too. I have one friend whose emails were notorious for being classified as spam; since he rarely ever sends one, the filter didn't get much exposure to them, either, so it wouldn't learn much about them, but by now, it classifies them correctly and leaves me with no known false positives.

It's amazing, really.

E-mail filters not fooled by signed spam (News.com)

Posted Oct 12, 2003 19:32 UTC (Sun) by nix (subscriber, #2304)

Er, SA's Bayesian algorithm is (an enhancement of) bogofilter's.

I think that any single-method attack is likely to fail; to catch things you really need every method you can find: so body-content heuristics and statistical methods and network checks and header analysis combined will be stronger than any one on its own.

(The immune system uses the same approach; strength in depth.)
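
One way to picture combining methods is an additive score: each independent check contributes a weight when it fires, and the total is compared against a threshold. The Python sketch below is only an illustration; the checks, the weights, the threshold, and the message fields are all invented and are not SpamAssassin's actual rules.

```python
def spam_score(message, checks):
    """Sum the weights of every check that fires on the message."""
    return sum(weight for check, weight in checks if check(message))


def is_spam(message, checks, threshold=5.0):
    """Flag the message once the combined evidence reaches the threshold."""
    return spam_score(message, checks) >= threshold


# Hypothetical checks, one per method family mentioned above.
CHECKS = [
    (lambda m: "click here" in m["body"].lower(), 2.0),  # body-content heuristic
    (lambda m: m["bayes_probability"] > 0.9, 3.5),       # statistical method
    (lambda m: m["sender_ip_listed"], 3.0),              # network check
    (lambda m: not m["headers"].get("Date"), 1.5),       # header analysis
]

message = {
    "body": "Click here for a great offer",
    "bayes_probability": 0.95,
    "sender_ip_listed": False,
    "headers": {"Date": "Sun, 12 Oct 2003 19:32:00 +0000"},
}
print(spam_score(message, CHECKS), is_spam(message, CHECKS))
```

No single check reaches the threshold on its own here, so a spammer who defeats one method still has to get past the others.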

E-mail filters not fooled by signed spam (News.com)

Posted Oct 16, 2003 5:14 UTC (Thu) by arcticwolf (guest, #8341)

Actually, I think that body-content heuristics and header analysis can be viewed as being included in statistical analysis, at least as far as bayesian filtering is concerned. Outside of that, I agree that having both depth and breadth in your approach to spam is a good thing; but for now, bayesian filtering (as implemented by bogofilter - I don't have experience with other tools) seems to do the job so well that there's no need to worry, and with the filter training itself automatically as it classifies messages, only requiring user interaction for false positives or negatives, it seems that there is little that spammers can do, either.
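
The "trains itself, and I only step in on mistakes" loop can be sketched as bookkeeping on the token counts: every verdict is registered automatically, and a correction moves the message's tokens from the wrong side to the right one. This is a hypothetical Python illustration; `TokenStore` and its methods are invented names, not bogofilter's interface.

```python
from collections import Counter


class TokenStore:
    """Hypothetical token database with automatic registration and correction."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, tokens, label):
        """Record a message under the label the filter (or user) gave it."""
        self.counts[label].update(set(tokens))
        self.messages[label] += 1

    def unregister(self, tokens, label):
        """Undo an earlier registration of the same message."""
        self.counts[label].subtract(set(tokens))
        # Unary plus drops entries whose count fell to zero or below.
        self.counts[label] = +self.counts[label]
        self.messages[label] = max(0, self.messages[label] - 1)

    def correct(self, tokens, wrong_label):
        """User says the verdict was wrong: move the message to the other side."""
        right_label = "ham" if wrong_label == "spam" else "spam"
        self.unregister(tokens, wrong_label)
        self.register(tokens, right_label)


store = TokenStore()
tokens = ["invoice", "order", "free", "shipping"]
store.register(tokens, "spam")   # filter's own verdict, registered automatically
store.correct(tokens, "spam")    # user: that was a false positive
print(store.messages)            # the message now counts as ham only
```

Because the wrong registration is undone rather than merely outvoted, a single correction shifts the evidence by two messages' worth, which is why rare corrections are enough to keep the filter on track.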

In fact, more or less the only approach I can think of right now would be to change spam characteristics so drastically that the (bayesian) filters wouldn't catch them anymore; however, this would require not only a concerted action in which most spammers participate (otherwise, only a few pieces of spam would get through), it would also be effective only for a very short amount of time, until the filters' token databases have been updated.

What else could a spammer do? Try to make messages look as much like legitimate email as possible, I assume, but then again, this likely won't be effective - spam is, after all, ultimately about advertising, and a message that no longer advertises products in any way does not justify being sent. The filters *will* catch on, and the fact that they are generated completely dynamically (no static rules) and are specific to each user means you can't just attack them.

Or at least that's what common sense tells me. Maybe the future will show that there is a fundamental flaw not only in the existing tools, but in the bayesian approach in general, but I can't see it right now; and even if there is, a better technique will follow. Ultimately, the war against spam can only be won.

(and I probably shouldn't post comments this early in the morning - or, rather, this late at night -; I seem to get a bit overdramatic. oh well.)

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds