
E-mail filters not fooled by signed spam (News.com)


Posted Oct 16, 2003 4:59 UTC (Thu) by arcticwolf (guest, #8341)
In reply to: E-mail filters not fooled by signed spam (News.com) by RobSeace
Parent article: E-mail filters not fooled by signed spam (News.com)

You're right, the initial training period is a bit of a problem with Bayesian approaches. However, I think it can hardly be avoided; the reason a Bayesian filter works well, after all, is that it learns to distinguish between what the *user* considers spam and what he/she/shi considers legitimate email. And that - obviously - means that pretraining is not really possible; if you started distributing tools like bogofilter, for example, with premade token databases, then you'd just create another weak link in the chain that spammers could attack, similar to SpamAssassin rules etc.
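
To make the "learns from the user" point concrete, here is a minimal sketch of a per-user token classifier. It is a toy naive Bayes model with add-one smoothing, not bogofilter's actual algorithm; the class name TokenFilter, its methods, and the tokenizer are all hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


class TokenFilter:
    """Toy per-user Bayesian filter (illustrative only, not bogofilter)."""

    def __init__(self):
        # Number of messages of each class in which a token appeared.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of messages the user has classified so far.
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase word-like runs, counted once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, label):
        # label is "spam" or "ham", as decided by the user.
        self.messages[label] += 1
        self.counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        # Start from smoothed prior odds, then add one log-likelihood
        # ratio per token (add-one smoothing avoids division by zero).
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for token in self.tokenize(text):
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on extreme scores.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Everything the filter knows comes from train() calls, so two users who classify the same message differently end up with different token counts and different verdicts. A shared or premade database would flatten exactly that difference.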

Getting an initial database from a friend might work; however, I, personally, would be reluctant to give anyone my token databases. Maybe it's just paranoia, but I prefer to keep them just as "secret" as my email.

One idea might be to use a distributed token database instead of per-user ones (P2P-based?), but I personally do not think this would work: not only would it allow spammers to pollute the database, it would also take away the individuality of users' databases, which is what actually makes the Bayesian filtering approach more effective.

The best way to train a Bayesian filter is probably to just grit one's teeth and do it the hard way - put up with the spam and manually classify it until the filter starts working reasonably well, or - if you get too much spam to do this - use a tool like SpamAssassin to create an initial token database.
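
The second option can be sketched roughly like this. The rules below are a made-up stand-in for a rule-based tool (they are not SpamAssassin's rules, and this does not call SpamAssassin); the names RULES, rule_score, and bootstrap and both thresholds are assumptions for illustration. Only messages the rules are confident about are used to seed the token database, and the rest are left for manual classification.

```python
import re
from collections import Counter

# Hypothetical stand-in rules: (pattern, score). A real rule-based
# filter has a far larger, maintained rule set.
RULES = [
    (re.compile(r"\bviagra\b", re.I), 3.0),
    (re.compile(r"\bfree money\b", re.I), 2.5),
    (re.compile(r"!{3,}"), 1.5),
]


def rule_score(text):
    # Sum the scores of every rule that matches the message.
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def bootstrap(messages, spam_at=3.0, ham_at=0.0):
    """Seed per-class token counts from confident rule verdicts only."""
    counts = {"spam": Counter(), "ham": Counter()}
    undecided = []
    for text in messages:
        score = rule_score(text)
        if score >= spam_at:
            label = "spam"
        elif score <= ham_at:
            label = "ham"
        else:
            # Not confident either way: leave it for the user.
            undecided.append(text)
            continue
        # Count each token once per message.
        counts[label].update(set(text.lower().split()))
    return counts, undecided
```

Note the weak spot: any spam the rules miss scores zero and gets trained as legitimate mail, so a database seeded this way still needs manual corrections afterwards.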

As for setting up procmail rules to bypass filtering for messages you know will get misclassified - that works, of course, but the more elegant approach is still to train the filter, and I am happy to say that it has worked in my case, too. I have one friend whose emails were notorious for being classified as spam; since he rarely sends any, the filter didn't get much exposure to them and couldn't learn much about them. By now, though, it classifies them correctly, which leaves me with no known false positives.

It's amazing, really.



Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds