Mail filtering in Thunderbird 1.5
Posted Oct 14, 2005 16:06 UTC (Fri) by zblaxell (subscriber, #26385)
Parent article: Mail filtering in Thunderbird 1.5
Bayesian filters learn to reject some class of input and accept some other class of input, so in theory they can separate "spam+phishing" and "ham" automatically. Unfortunately, Bayesian filters assume that the input classes use and reuse distinct sets of tokens.
When the input contains only "ham" tokens and "new" tokens--which would seem to describe a phishing scam fairly well--the Bayesian algorithm can't ever return a "spam" result. A competent phisher will send you an email that you would legitimately receive anyway (e.g. a notice from eBay, using the words that eBay uses in its routine customer email) except for a few new hapaxes (tokens the filter has never seen before, which appear for the first time in that message), such as the URL for the phisher's site. Bayesian algorithms cannot identify such messages as spam, since the mail contains only known non-spam tokens and unknown tokens, which result in either a "probably ham" response or an "unknown" response.
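To make that concrete, here is a rough sketch of a naive Bayes token scorer in Python. It is not the code of Thunderbird or of any particular filter; the function names, the 0.5 score for unknown tokens, the 0.01/0.99 clamps, and the toy training messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, how many messages of one class contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token); a never-seen token is neutral (0.5)."""
    s = spam_counts[token] / n_spam
    h = ham_counts[token] / n_ham
    if s + h == 0:
        return 0.5  # assumption: unknown tokens carry no evidence
    # Clamp so that one token can never force a result of exactly 0 or 1.
    return min(0.99, max(0.01, s / (s + h)))

def spam_score(message, spam_counts, ham_counts, n_spam, n_ham):
    """Combine the per-token probabilities into one score in [0, 1]."""
    probs = [token_spam_prob(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in set(message.lower().split())]
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    # Cap the exponent so very long messages do not overflow math.exp.
    return 1.0 / (1.0 + math.exp(min(log_ham - log_spam, 700.0)))

# Toy training data (hypothetical).
ham = ["your ebay auction has ended", "ebay invoice for your recent purchase"]
spam = ["cheap pills buy now", "buy cheap watches now"]
ham_counts, spam_counts = train(ham), train(spam)

# A "phish": familiar eBay wording plus one never-seen token (the URL).
phish = "your ebay invoice http://phisher.example/login"
phish_score = spam_score(phish, spam_counts, ham_counts, len(spam), len(ham))
junk_score = spam_score("buy cheap pills", spam_counts, ham_counts, len(spam), len(ham))
```

In this sketch every token contributes a probability of at most 0.5 unless it has been seen in spam, so a message made only of ham tokens and unknown tokens can never score above 0.5, whatever the new URL points to.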
In some cases the Bayesian filter works on phishing scams anyway. For example, I don't have an eBay or PayPal account, and I get virtually zero mail from anyone who does, so at the moment my Bayesian filter thinks that you're a spammer or phisher if you merely mention those names. It's hard to tell exactly what a Bayesian filter (or any learning filter for that matter) has learned without taking it apart and looking at its data tables.
Some non-Bayesian approaches might work better on phishing. Actually it would be really nice to have a Grumpy Editor's guide to spam filters. There are a few out there--crm114 is a treasure trove of esoteric algorithms (and can be used for things like syslog analysis too), there is at least one project using Markov 2-word chains (good for those spams that quote lots of random text to look more like legitimate email), and all of the Bayesian classifiers have slightly different implementation details.
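As a rough illustration of why 2-word chains help against random quoted text, here is a small Python sketch that turns a message into word-pair tokens and measures how many of those pairs were seen in training. It is not taken from any of the projects mentioned above; the function names and the sample sentences are made up for the example.

```python
def word_pairs(text):
    """Split text into overlapping 2-word tokens."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def known_pair_fraction(message, known_pairs):
    """Fraction of a message's word pairs that appeared in training mail."""
    pairs = word_pairs(message)
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in known_pairs) / len(pairs)

# Hypothetical legitimate mail used as training data.
known = set(word_pairs("please review the attached report before the meeting"))

# Same vocabulary, but shuffled the way randomly quoted filler text would be.
shuffled = "meeting the report attached before review please the"
```

Every single word in the shuffled text is a known "ham" word, yet none of its word pairs occur in the training sentence, so a filter keyed on pairs has something to notice where a single-word filter does not.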