SpamAssassin 3.4.1 released

By Jonathan Corbet
May 6, 2015

One occasionally sees articles suggesting that the volume of spam on the net is in decline, but nobody would be foolish enough to argue that the spam problem has gone away. Industrial-strength spam-filtering tools are still a necessity for anybody whose email address is known by more than about two other people. For the minority of us who have not given in and moved to Gmail, SpamAssassin tends to be the spam-filtering tool of choice. In recent years it sometimes seems like the spammers are moving more quickly than the SpamAssassin project, so the announcement of the Apache SpamAssassin 3.4.1 release on April 30 — the first in over a year — is naturally of interest. A version-number bump from 3.4.0 to 3.4.1 would not seem to indicate major changes, but, in truth, the SpamAssassin developers have been busy.

The "auto whitelist" (AWL) feature of SpamAssassin has long been one of that program's more annoying aspects. In theory it tracks the emails from each sender to get an overall sense of whether they are trustworthy; email from a trusted source will get a bonus score, while messages from apparent spammers will be penalized. The sad truth of the matter, in your editor's experience, is that a spammer need only get a small number of messages through to convince the AWL that everything else should be whitelisted. If SpamAssassin's other scoring mechanisms were perfect, this kind of AWL corruption would not be a problem — but then the AWL would not be needed at all. In a world where scoring is imperfect, the AWL often seems to make things worse.

In 3.4.1, the SpamAssassin developers have tried to address some of the problems with the AWL by replacing it with a new mechanism called TxRep. The basic idea remains the same: track each sender's activity and adjust the score of new messages toward the mean of what has been seen in the past. But a number of useful changes have been made in how this tracking is done, starting with an expansion of the set of data that is used. TxRep maintains reputation scores for the sending email address (as did the AWL), but also the sending domain name, the IP addresses of the originating system and the server that transferred the message, and the "HELO" string used by the last server. For any given message, each of these quantities is mixed in with its own (user-configurable, naturally) weight.
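The weighted mixing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not TxRep's actual code: the `Reputation` class, the identifier names, the weight values, and the `factor` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reputation:
    """Running record of the scores seen for one identifier (hypothetical)."""
    total: float = 0.0
    count: int = 0

    def add(self, score: float) -> None:
        self.total += score
        self.count += 1

    def mean(self) -> Optional[float]:
        return self.total / self.count if self.count else None


# Illustrative weights only; the real ones are user-configurable.
WEIGHTS = {"email": 3.0, "domain": 2.0, "origin_ip": 4.0,
           "relay_ip": 4.0, "helo": 1.0}

History = Dict[Tuple[str, str], Reputation]


def combined_reputation(history: History,
                        identifiers: Dict[str, str],
                        weights: Dict[str, float] = WEIGHTS) -> Optional[float]:
    """Weighted average of the mean scores of every identifier seen before."""
    weighted_sum = 0.0
    weight_total = 0.0
    for kind, value in identifiers.items():
        rep = history.get((kind, value))
        mean = rep.mean() if rep else None
        if mean is None:
            continue  # no history for this identifier yet
        weighted_sum += weights[kind] * mean
        weight_total += weights[kind]
    return weighted_sum / weight_total if weight_total else None


def adjust_score(raw_score: float, reputation: Optional[float],
                 factor: float = 0.5) -> float:
    """Move the raw score part of the way toward the historical mean."""
    if reputation is None:
        return raw_score
    return raw_score + factor * (reputation - raw_score)
```

With `factor` set to zero the history is ignored entirely; with it set to one, the new message simply inherits the sender's historical mean.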

Another useful change is that the sa-learn utility (until now used only with the Bayesian filter) can be used to train TxRep, so the same command now works to update both filters. There is a "dilution" mechanism that causes newer messages to have more influence on a sender's score than older ones, making the system more responsive should, say, a spammer repent and start actually sending useful stuff (or should TxRep initially misjudge a sender). TxRep can be used to whitelist (or blacklist) senders or IP addresses outright — something that might be worth doing automatically for the most obvious of spam or for messages that have been explicitly classified by the recipient. There is also a mechanism to automatically whitelist the recipients of outgoing mail — though that could have undesired effects if one is prone to sending irate responses to spammers.
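One simple way to get the "dilution" effect is a decayed running mean, where every stored contribution is scaled down each time a new message arrives. The sketch below assumes that approach; the class name and the `dilution` parameter are hypothetical, and TxRep's own formula may well differ.

```python
from typing import Optional


class DilutedReputation:
    """Running mean in which older scores count for progressively less."""

    def __init__(self, dilution: float = 0.9) -> None:
        if not 0.0 < dilution <= 1.0:
            raise ValueError("dilution must be in (0, 1]")
        self.dilution = dilution
        self.weighted_total = 0.0
        self.weighted_count = 0.0

    def add(self, score: float) -> None:
        # Shrink the influence of everything seen so far, then add the
        # new score at full weight.
        self.weighted_total = self.weighted_total * self.dilution + score
        self.weighted_count = self.weighted_count * self.dilution + 1.0

    def mean(self) -> Optional[float]:
        if self.weighted_count == 0.0:
            return None
        return self.weighted_total / self.weighted_count
```

A `dilution` of 1.0 reduces this to an ordinary mean, which is roughly how the AWL behaved; smaller values let a sender's recent behavior dominate more quickly.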

With these changes, TxRep should be able to avoid some of the worst AWL pitfalls, though the documentation still recommends against turning on auto-learning until SpamAssassin as a whole has been tuned well. But the whole thing still seems to be built around the idea that people can be spammers part of the time and senders of legitimate email at others. Perhaps your editor is an excessively unforgiving character, but it seems like the sender of known spam should not get off lightly with a gradual tweaking of a reputation score; once a spammer, always a spammer. Trust is hard to earn but easy to lose; the TxRep mechanism still doesn't quite reflect that fact.

The PDFInfo module, which has long existed outside of the SpamAssassin mainline, has now been merged; PDFInfo, as its name suggests, looks for spammy PDF attachments. There is one other new module, URILocalBL, which allows blacklisting of spammy links using a local database.

SpamAssassin 3.4.1 can do a more thorough and careful job of normalizing all messages to the UTF-8 character set before applying rules. That should help to eliminate various tricks using strange character sets to get around the spam-checking rules.

An interesting addition to the Bayesian filter is the ability to hash MIME attachments and use the result as a filter token. If it works well, it should allow the filter to recognize often-repeated spam payloads as a whole. But, as the manual page notes, "not much experience has yet been gathered regarding its usefulness". It seems worth a try, in any case.
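The general technique is easy to illustrate with Python's standard `email` and `hashlib` modules. What follows is a sketch under assumptions, not SpamAssassin's implementation: the choice of SHA-1, the token prefix, and the `attachment_tokens()` helper are all invented for the example.

```python
import hashlib
from email import message_from_bytes
from typing import List


def attachment_tokens(raw_message: bytes) -> List[str]:
    """Return one hash-based token per attachment in a raw email message."""
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        if part.get_content_disposition() != "attachment":
            continue
        # decode=True undoes base64 or quoted-printable, so identical
        # payloads hash identically regardless of transfer encoding.
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha1(payload).hexdigest()
        tokens.append("attachment-sha1:" + digest)
    return tokens
```

Each token can then be fed to a Bayesian classifier just like a word from the message body, so a payload that shows up repeatedly in spam acquires a strong spam probability of its own.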

Beyond all of this work, of course, is the constant challenge of maintaining the rule base in the face of a changing spammer landscape. Spammers may now be more concerned with getting past Gmail's filters than SpamAssassin, but there are still signs that a subset of spam has been tested against SpamAssassin until the rules are unable to stop it. The Bayesian filter helps with that problem, but so does an ongoing effort to keep those rules current. It is thus unsurprising that a new SpamAssassin release contains a long list of rule changes that should help to keep its effectiveness up — until the spammers work around those as well.

Your editor has often heard the complaint that email is reaching a point of complete uselessness. Such claims overstate the reality — one need only watch how email keeps our development communities going to see that. But email has been under attack for many years, making life harder for both email users and those who are charged with running email systems. It is fair to say that SpamAssassin is one of a small set of tools that has helped email to survive the ongoing spammer onslaught, so it is good to see this tool continuing to evolve.

Posted May 7, 2015 11:15 UTC (Thu) by hickinbottoms (subscriber, #14798)

"But the whole thing still seems to be built around the idea that people can be spammers part of the time and senders of legitimate email at others. Perhaps your editor is an excessively unforgiving character, but it seems like the sender of known spam should not get off lightly with a gradual tweaking of a reputation score; once a spammer, always a spammer. Trust is hard to earn but easy to lose; the TxRep mechanism still doesn't quite reflect that fact."

I wonder whether this is designed around the case where malware that has harvested an address book then poses as that user to send mail to their known contacts. I've certainly seen this (fortunately rarely) amongst my contacts with the result that I occasionally get malware spam from friends. In such a case I wouldn't expect to blacklist them forever on a first-strike basis.

Great news to hear about the update, though, and I second that SA is helping to keep email very much alive and well.

Posted May 7, 2015 17:17 UTC (Thu) by smoogen (subscriber, #97)

I expected that this was designed around the nature of mailing lists, where the spam filter sometimes decides that all email from a list is spam because some portion of it was.

Posted May 7, 2015 12:48 UTC (Thu) by hmh (subscriber, #3838)

"An interesting addition to the Bayesian filter is the ability to hash MIME attachments and use the result as a filter token. If it works well, it should allow the filter to recognize often-repeated spam payloads as a whole. But, as the manual page notes, 'not much experience has yet been gathered regarding its usefulness.' It seems worth a try, in any case."

I've long been using clamav plus several long-term and fast-response spam/phishing/malware signature databases to score MIME attachments that match known signatures, and it is extremely useful, as well as very fast.

This is not done directly in SpamAssassin, though. Amavisd-new is used, as I want to score based on the clamav results, as well as test each attachment against the signature database (and not just the whole email). It calls SA should the score (mapped using a table, keyed on regex matches on the "virus name" returned by clamav) not be enough to determine the message's fate.

Posted May 7, 2015 14:07 UTC (Thu) by madscientist (subscriber, #16861)

I don't know if I'm an outlier for some reason, but I've never had much luck with SA; it lets through a lot of spam, catches a lot of ham, and it seems generally kind of slow and doesn't seem to react much to training, which I would think would be the most important aspect of determining spamminess.

On the other hand, I've used bogofilter to absolutely amazing effect: after a day or two of training I get very little spam let through and even less ham caught. Really bogofilter is amazing.

Posted May 7, 2015 16:18 UTC (Thu) by RobSeace (subscriber, #4435)

I agree completely... That's been exactly my experience with both SA and bogofilter, as well... Bogofilter is damn near perfect, and when it does rarely screw up, usually just one manual correction is all it takes to train it...

Posted May 14, 2015 10:05 UTC (Thu) by ssokolow (guest, #94568)

"Industrial-strength spam-filtering tools are still a necessity for anybody whose email address is known by more than about two other people."

...or you could just install an instance of SpamGourmet and hand out a different revokable alias to each person. I get maybe 2 or 3 spam a year by treating e-mail addresses as revokable API keys.