LWN.net

E-mail filters not fooled by signed spam (News.com)

Posted Oct 12, 2003 17:40 UTC (Sun) by RobSeace (subscriber, #4435)
In reply to: E-mail filters not fooled by signed spam (News.com) by arcticwolf
Parent article: E-mail filters not fooled by signed spam (News.com)

Indeed, Bayesian filters are definitely the way to go... I use bogofilter at
work, and SpamBayes at home, and both work wonderfully... I see almost NO
spam at all anymore... And, the only things that have ever gotten tagged
as spam that I actually wanted to see have been spammy-looking messages from
businesses which I've ordered things from, or certain spammy-looking mailing
list posts, etc... And, after a bit of retraining, all is well even there...
(In the case of a couple mailing lists, I set up explicit pass-throughs for
the addresses in my ".procmailrc", to just skip the filtering completely,
and always let them through... But, I'm sure if I retrained enough, I
wouldn't have even needed to bother with that...) That's the only downside
to Bayesian filters: the initial training time you have to put in before
they become fully effective... But, it's definitely more than worth it...
Because, once you're over that initial hump, they just train themselves, and
you don't have to do much of anything, other than correct the very rare
mistakes they make... The main problem I've found is people who don't save
all their legit E-mail, so they don't have a good-sized corpus of legit
mail to train it on... (You can get large amounts of spam from several
sources, and all spam is pretty much alike, so no need for that to be
personalized... But, legit mail really does differ quite a bit from person
to person...) In that case, they have to just train as they go for a
while, until enough messages have been received... (Or, you CAN start them
off with an initial database trained from someone else's legit E-mail,
which I've found does work relatively ok, for the most part... But, it
definitely requires a bit of tweaking work, and is far more likely to lead
to some false-positives at least early on... But, it's probably better
than starting from scratch, at least from the user's perspective, as it'll
wipe out most of the spam, right from the start...)
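
To make the training idea above concrete, here is a minimal sketch of a word-count Bayesian classifier. It is a toy under stated assumptions (equal priors, add-one smoothing, a naive tokenizer) and is not how bogofilter or SpamBayes work internally; the class and method names are made up for illustration.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy per-user Bayesian filter (hypothetical; for illustration only)."""

    def __init__(self):
        # Number of messages of each class that contained a given token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of messages trained per class.
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Naive tokenizer: the set of lowercase word-like strings.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        # label is "spam" or "ham" (legit mail).
        self.counts[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        log_odds = 0.0  # assumes spam and ham are equally likely a priori
        for token in self.tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing a class out.
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("buy cheap watches now", "spam")
f.train("meeting notes attached", "ham")
f.train("patch review for the kernel list", "ham")
print(f.spam_probability("buy cheap pills"))
print(f.spam_probability("kernel patch review"))
```

The ham counts come entirely from the user's own mail, which is why a good-sized personal corpus matters so much: with only a handful of legit messages, almost every token in a new legit mail is unseen and contributes nothing.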


E-mail filters not fooled by signed spam (News.com)

Posted Oct 16, 2003 4:59 UTC (Thu) by arcticwolf (guest, #8341)

You're right, the initial training period is a bit of a problem with Bayesian approaches. However, I think it can hardly be avoided; the reason a Bayesian filter works well, after all, is that it learns to distinguish between what the *user* considers spam and what he/she/shi considers legitimate email. And that - obviously - means that generic pretraining is not really possible; if you started distributing tools like bogofilter, for example, with premade token databases, you'd just create another weak link in the chain that spammers could attack, similar to SpamAssassin rules etc.

Getting an initial database from a friend might work; however, I, personally, would be reluctant to give anyone my token databases. Maybe it's just paranoia, but I prefer to keep them just as "secret" as my email.

It might be an idea to use a distributed token database instead of per-user ones (P2P-based?), but I personally do not think this would work: not only would it allow spammers to pollute the database, it would also take away the individuality of users' databases, which is what actually makes the Bayesian filtering approach so effective.

The best way to train a Bayesian filter is probably to just grit one's teeth and do it the hard way - put up with the spam and manually classify it until the filter starts working reasonably well, or - if you get too much spam to do this - use a tool like SpamAssassin to create an initial token database.
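
The bootstrapping idea above might be sketched roughly as follows: let a rule-based scorer label only the messages it is confident about, and keep the rest for manual classification. Everything here is an assumption for illustration; `toy_rule_score` is a made-up stand-in for a rule-based tool, not SpamAssassin's actual rules or API, and the thresholds are arbitrary.

```python
def bootstrap_corpus(messages, rule_score, spam_at=5.0, ham_at=1.0):
    """Split messages into (spam, ham, unsure) using a rule-based score.

    Only confident verdicts are used as training labels; everything in
    between is left for the user to classify by hand.
    """
    spam, ham, unsure = [], [], []
    for text in messages:
        score = rule_score(text)
        if score >= spam_at:
            spam.append(text)
        elif score <= ham_at:
            ham.append(text)
        else:
            unsure.append(text)
    return spam, ham, unsure


def toy_rule_score(text):
    """Hypothetical rule set: add points for each suspicious phrase found."""
    rules = {"viagra": 4.0, "free money": 3.0, "click here": 2.0}
    lowered = text.lower()
    return sum(points for phrase, points in rules.items() if phrase in lowered)


mail = [
    "Free money! Click here for viagra",
    "Lunch on Friday?",
    "Click here to view the patch",
]
spam, ham, unsure = bootstrap_corpus(mail, toy_rule_score)
print(len(spam), len(ham), len(unsure))
```

The spam and ham lists would then be fed to the Bayesian filter as its initial training set. Any mistakes the rules make get baked into the token database, so the early results still need watching.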

As for setting up procmail rules to bypass filtering for messages you know will get misclassified: that works, of course, but the more elegant approach is still to train the filter, and I am happy to say that it has worked in my case, too. I have one friend whose emails were notorious for being classified as spam; since he rarely sends any, the filter got little exposure to them and learned slowly, but by now it classifies them correctly and leaves me with no known false positives.
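
For comparison, the bypass approach can be sketched as below. This is a Python illustration of the idea, not procmail syntax; the address in `ALLOWLIST` is a made-up placeholder, and `classify` is assumed to be any function that returns True for spam.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical address, standing in for a trusted mailing list.
ALLOWLIST = {"some-list@lists.example.org"}


def route(raw_message, classify, allowlist=ALLOWLIST):
    """Return "inbox" or "spam"; allowlisted senders never reach the filter."""
    msg = message_from_string(raw_message)
    # parseaddr returns (display name, address); only the address is compared.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in allowlist:
        return "inbox"
    return "spam" if classify(raw_message) else "inbox"


raw = "From: Some List <some-list@lists.example.org>\nSubject: BUY NOW\n\nbody"
print(route(raw, classify=lambda text: True))
```

The trade-off is that the filter never sees the bypassed messages, so it never learns from them, and a From header is easy to forge. Training avoids both problems, at the cost of some patience.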

It's amazing, really.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds