, we compared SpamAssassin
with a new set of spam
filtering systems based on Bayesian
analysis. The Bayesian approach, while seemingly "dumber," looked poised
to depose SpamAssassin's complex set of rules as the most effective
filtering system. Bayesian filters are fast, and they can learn from the
actual mail received by each potential consumer of anatomical enlargements
and Nigerian investment schemes. The combination of speed and adaptability
looked like a winning pair.
Of course, SpamAssassin has not stood still in the intervening months. The
SpamAssassin team released version 2.50 a couple of weeks ago; this release
is described as a "beta version."
No true Linux
user fears beta-quality software, so we decided to have a look.
The SpamAssassin 2.50 rule
set is more impressive than ever. Several hundred rules detect forged
headers, cellphone pitches, Nigerian scams, mortgage refinance pitches,
mentions of Oprah, porn terms ("various types of feline"), suspicious HTML,
"my wife Jody," 100% guarantees, etc. There is
a growing set of rules aimed at catching spam in languages other than
English. One can only marvel at the noble souls who examine that much spam
closely enough to develop these tests.
As is usually the case with a new SpamAssassin release, the new
rules are far more effective than the previous set - for a while at least.
The normal pattern, though, is that spammers begin to figure out how to
evade the latest rules, and the number of false negatives slowly rises over
time.
The truly significant new feature of version 2.50 may help change that
pattern, however. Included in this release is, of course, a Bayesian
filter. SpamAssassin will now remember the terms it finds in both spam and
"ham," and assign to each a probability that the message containing it is
spam. The Bayesian filter is not actually used to classify mail until a
sizeable body of both good and bad mail has been processed. Once that
threshold has been reached, the Bayesian filter will simply assign points
to a message like any other test.
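
To make the idea concrete, here is a minimal Python sketch of per-token bookkeeping of the kind described above. It is not SpamAssassin's actual code: the class name `TokenStats`, the `min_messages` threshold, the whitespace tokenizer, and the simple relative-frequency formula are all assumptions chosen for illustration.

```python
from collections import Counter


class TokenStats:
    """Toy per-token spam statistics; an illustration, not SpamAssassin's code."""

    def __init__(self, min_messages=200):
        # min_messages is a hypothetical training threshold.
        self.min_messages = min_messages
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_count = 0
        self.ham_count = 0

    def learn(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def is_trained(self):
        # Stay silent until enough of both kinds of mail has been seen.
        return (self.spam_count >= self.min_messages
                and self.ham_count >= self.min_messages)

    def token_probability(self, token):
        # Compare how often the token shows up in spam versus ham.
        if self.spam_count == 0 or self.ham_count == 0:
            return 0.5
        spam_freq = self.spam_tokens[token] / self.spam_count
        ham_freq = self.ham_tokens[token] / self.ham_count
        if spam_freq + ham_freq == 0:
            return 0.5  # never-seen token: no evidence either way
        return spam_freq / (spam_freq + ham_freq)


stats = TokenStats(min_messages=2)
stats.learn("cheap mortgage refinance now", is_spam=True)
stats.learn("mortgage guaranteed 100%", is_spam=True)
stats.learn("kernel patch review", is_spam=False)
stats.learn("mortgage paperwork for the kernel summit", is_spam=False)
print(stats.is_trained(), stats.token_probability("mortgage"))
```

In this toy corpus "mortgage" appears in every spam and half of the ham, so it leans toward spam without being damning, while a token that has never been seen stays at a neutral 0.5.
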
In 2.50, the scoring is set up so that the Bayesian filter, by itself,
cannot condemn a message as spam. Even if the filter says the probability
of spam is 99%, a maximum of 4.3 points (out of the 5 needed by default)
will be assigned. Still, the Bayesian filter is sufficient to tip the
balance on many spams that would have otherwise been classified as real
mail. We ran a quick test with 5,000 messages from the LWN.net inbox;
before training, SpamAssassin flagged 3,935 of them as spam. Afterwards,
it caught 4,139 instead. That's 204 spams we would not have to see, and
that can only be a good thing.
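
The effect of the cap can be sketched in a few lines of Python. The 5-point threshold and the 4.3-point maximum come from the figures above; the function names and the linear mapping from probability to points are assumptions made for illustration, not the way SpamAssassin actually assigns its Bayesian scores.

```python
SPAM_THRESHOLD = 5.0    # default points needed to flag a message
BAYES_MAX_POINTS = 4.3  # most the Bayesian test can contribute in 2.50


def bayes_points(probability):
    # Hypothetical mapping: 0 points at 50%, rising linearly to the cap at 99%.
    scaled = (probability - 0.5) / (0.99 - 0.5) * BAYES_MAX_POINTS
    return min(BAYES_MAX_POINTS, max(0.0, scaled))


def classify(rule_points, bayes_probability):
    # The Bayesian result is just one more test added to the rule total.
    total = rule_points + bayes_points(bayes_probability)
    return total >= SPAM_THRESHOLD, total


# The Bayesian test alone cannot condemn a message...
print(classify(0.0, 0.99))
# ...but it can tip the balance for one that already has a point from the rules.
print(classify(1.0, 0.99))
```
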
The combination of SpamAssassin's rule base and the Bayesian filter
addresses one of the biggest weaknesses of the Bayesian approach: the
filter must be trained. Everybody gets a unique mix of mail, and, believe
it or not, a unique mix of spam. There is no set of Bayesian rules that
will work for everybody. If you happen to have nicely sorted piles of spam
and real mail sitting around, you can quickly train the SpamAssassin filter
via the sa-learn tool. But most users will simply want the filter
to work without any extra effort on their part.
SpamAssassin's rules can help make that happen. Even without the Bayesian
filter, SpamAssassin can flag most of the spam in a user's mail stream. It
can, thus, train the Bayesian filter by itself; that filter will eventually
start catching spam which sneaks past the regular rules. SpamAssassin has become a
tool which learns how to do its job better over time. As a result, we are
free to spend our time dealing with more interesting things. It would be
hard to find a better example of useful, user-friendly free software.
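
For the curious, the self-training idea described above can be sketched as a short Python loop. Everything here is a stand-in: `auto_train`, `toy_rule_score`, and the two thresholds are hypothetical names and values, and the sketch assumes that only confidently scored messages are fed back to the learner.

```python
def auto_train(messages, rule_score, learn, spam_threshold=5.0, ham_threshold=1.0):
    """Feed confidently scored mail back into a learner; return how many were used."""
    trained = 0
    for text in messages:
        score = rule_score(text)
        if score >= spam_threshold:
            learn(text, True)
            trained += 1
        elif score <= ham_threshold:
            learn(text, False)
            trained += 1
        # Scores in between are too uncertain to learn from, so they are skipped.
    return trained


def toy_rule_score(text):
    # Hypothetical stand-in for the rule set: 3 points per suspicious phrase.
    phrases = ("nigerian", "mortgage", "100% guarantee")
    return sum(3.0 for phrase in phrases if phrase in text.lower())


learned = []
count = auto_train(
    ["Nigerian mortgage offer", "Kernel patch review", "mortgage question"],
    toy_rule_score,
    lambda text, is_spam: learned.append((text, is_spam)),
)
print(count, learned)
```

The third message scores between the two thresholds, so it is left out of training rather than risk teaching the filter the wrong lesson.
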