SpamAssassin is back

By Jonathan Corbet
November 2, 2018

The SpamAssassin 3.4.2 release was the first from that project in well over three years. At the 2018 Open Source Summit Europe, Giovanni Bechis talked about that release and those that will be coming in the near future. It would seem that, after an extended period of quiet, the SpamAssassin project is back and has rededicated itself to the task of keeping junk out of our inboxes.

Bechis started by noting that spam filtering is hard because everybody's spam is different. It varies depending on which languages you speak, what your personal interests are, which social networks you use, and so on. People vary, so results vary; he knows a lot of Gmail users who say that its spam filtering works well, but his Gmail account is full of spam. Since Google knows little about him, it is unable to train itself to properly filter his mail.

Just like Gmail, SpamAssassin isn't the perfect filter for everybody right out of the box; it's really a framework that can be used to create that filter. Getting the best out of it can involve spending some time to write rules, for example. Most of the current rule base is aimed at English-language spam, which isn't helpful for people whose spam comes in other languages. Another useful thing to do is to participate in the MassCheck project, which can quickly evaluate the effectiveness of new rules on a large body of spam. In particular, MassCheck performs a nightly run to check the hit rate of rules to determine how those rules are performing in real installations. It can also check for overlap; if two rules always trigger on the same messages, there isn't really a need for both of them. This information feeds into the RuleQA database to give a picture of how the rules are working overall.
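The overlap check mentioned above can be illustrated with a small sketch. This is not MassCheck's actual code: the function name `overlapping_rules`, the rule names, and the hit data are all made up for illustration. Each rule is treated as the set of message IDs it fired on, and pairs of rules whose hit sets coincide are reported.

```python
from itertools import combinations


def overlapping_rules(hits, min_overlap=1.0):
    """Return rule pairs whose hit sets overlap at least `min_overlap`.

    `hits` maps a rule name to the set of message IDs it fired on.
    Overlap is measured as Jaccard similarity: |A & B| / |A | B|.
    """
    pairs = []
    for (name_a, set_a), (name_b, set_b) in combinations(sorted(hits.items()), 2):
        union = set_a | set_b
        if not union:
            continue  # neither rule ever fired; nothing to compare
        overlap = len(set_a & set_b) / len(union)
        if overlap >= min_overlap:
            pairs.append((name_a, name_b, overlap))
    return pairs


# Hypothetical hit data: rule name -> IDs of corpus messages it matched.
hits = {
    "RULE_A": {1, 2, 3, 4},
    "RULE_B": {1, 2, 3, 4},  # fires on exactly the same messages as RULE_A
    "RULE_C": {2, 5},
}

print(overlapping_rules(hits))  # [('RULE_A', 'RULE_B', 1.0)]
```

With the default threshold of 1.0, only rules that always trigger together are reported; lowering `min_overlap` would also surface rules that are merely close to redundant.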

SpamAssassin is not just for email filtering, Bechis said; some sites are using it to detect spam submitted in web forms, for example.

So what is new in SpamAssassin? There has been a lot of work by the project's system administration team, he said, to update the infrastructure. That has resulted in the rebuilding of the MassCheck implementation from scratch. The 3.4.2 release contained fixes for four security bugs, and also an important workaround for a Perl bug that was only triggered on Red-Hat-based distributions. Startup time has been improved, and SSLv3 support has been removed. The "freemail antiforge" mechanism, which seeks to detect forged Gmail messages, has been improved. The geo-aware scoring system can adjust scores based on which continent the mail came from. The URILocalBL plugin, which can blacklist URLs based on information like where they are hosted, has seen a number of improvements.

The 3.4.2 release also saw the addition of the HashBL plugin, which can be used to block email addresses from domains that cannot be blocked wholesale. There is a new anti-phishing plugin that can filter on URLs commonly found in phishing emails. The new ResourceLimits plugin can put limits on the amount of CPU and memory used by SpamAssassin. And the FromNameSpoof plugin tries to detect attempts to confuse users about the source of an email using the full-name field.
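To give a sense of what a full-name spoofing check might look for, here is a minimal sketch. It is not the FromNameSpoof plugin's logic; the function name `looks_like_name_spoof` and the heuristic itself (an email address in the display name whose domain differs from the real sender's domain) are assumptions chosen for illustration.

```python
import re
from email.utils import parseaddr

# Loose pattern for something that looks like an email address;
# group 1 captures the domain part.
ADDRESS_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")


def looks_like_name_spoof(from_header):
    """Flag a From: header whose display name shows a different domain.

    Returns True when the display name contains an email address whose
    domain differs from the domain of the actual sender address.
    """
    display_name, address = parseaddr(from_header)
    match = ADDRESS_RE.search(display_name)
    if not match or "@" not in address:
        return False
    shown_domain = match.group(1).lower()
    real_domain = address.rsplit("@", 1)[1].lower()
    return shown_domain != real_domain


# The display name claims one domain while the mail comes from another.
print(looks_like_name_spoof('"support@bank.example" <x@evil.example>'))
# An ordinary header with a plain name.
print(looks_like_name_spoof("Alice Example <alice@example.org>"))
```

A mail client that shows only the display name would present the first header as coming from the bank's address, which is the kind of confusion such a check is meant to catch.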

Some future plugins include a couple that are aimed at detecting Microsoft Office attachments containing macros. There is one for checking URLs from URL-shortening services; it will filter based on the final destination of those URLs. The KAM.cf ruleset is an unofficial addition that can allow sites to respond more quickly to new spam campaigns, but at a cost of more false positive results. Also coming is a set of international channels that will carry signed rulesets designed for different parts of the planet.

The SpamAssassin 4.0 release can be expected around January, Bechis said. It will include full UTF-8 support that has been completely rewritten, with better detection of east-Asian languages. The TxRep plugin, which applies scores to messages depending on the reputation of the sender, is being improved and will be able to use PostgreSQL 10. The Office macro and URL shortener plugins will be in this release, but another new plugin to check for suspicious URLs inside attachments will have to wait until 4.1.

Further in the future, the project plans to update its approach to machine learning. The current code is getting old, and there is interest in applying deep-learning techniques to the spam-detection problem. There was a Google Summer of Code project that attempted to make progress in that area but it didn't succeed, so more work is needed.

When asked whether the SpamAssassin project had really slowed down as much as its release history suggests, Bechis conceded that it had. A number of people had left the project, and there were infrastructure problems that blocked the rule-generation process. But the situation has since improved, he said. The project has picked up a new set of developers and is moving forward again. Certainly the world can only benefit from better spam filtering.

The slides from this talk [PDF] are available.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]


SpamAssassin is back

Posted Nov 3, 2018 14:44 UTC (Sat) by tome (subscriber, #3171) [Link]

Thanks for this report, Jonathan. I've had a nagging background anxiety about SA that was not strong enough to gain my direct attention but was annoying nonetheless. You and Bechis have dispelled it now.

SpamAssassin is back

Posted Nov 3, 2018 18:04 UTC (Sat) by rsidd (subscriber, #2582) [Link] (9 responses)

«he knows a lot of Gmail users who say that its spam filtering works well, but his Gmail account is full of spam. Since Google knows little about him, it is unable to train itself to properly filter his mail.»

I use Gmail and have the opposite problem: a non-trivial amount of non-spam ends up in spam. It has given me grief. I don't particularly care about spam reaching my inbox (my work mail has no filter) but I do care about the opposite. But I doubt SpamAssassin or anyone else can truly nail that problem.

SpamAssassin is back

Posted Nov 4, 2018 1:44 UTC (Sun) by jkingweb (subscriber, #113039) [Link] (6 responses)

Wouldn't disabling heuristic checks and relying entirely on DNS blocklists avoid false positives while catching a non-trivial amount of spam? With my MTA, the most spammy spam is stuff that is present in multiple blocklists.
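For readers unfamiliar with how a DNS blocklist is queried, a rough sketch follows. The zone names are placeholders rather than real lists, the function names are hypothetical, and treating every lookup failure as "not listed" is a simplifying assumption. The conventional scheme reverses the octets of the IPv4 address, appends the list's zone, and checks whether an A record exists for that name.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNSBL."""
    octets = ipaddress.IPv4Address(ip).exploded.split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the blocklist publishes an A record for the address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no record, or the lookup failed: treated as not listed
    return True


def listed_count(ip, zones):
    """Count how many of the given blocklists include the address."""
    return sum(is_listed(ip, zone) for zone in zones)


# Placeholder zone; a real deployment would name the lists it trusts.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example"))  # 99.2.0.192.dnsbl.example
```

Rejecting only when `listed_count` reaches two or more lists is one way to act on the "present in multiple blocklists" observation.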

SpamAssassin is back

Posted Nov 4, 2018 3:50 UTC (Sun) by wahern (subscriber, #37304) [Link]

I don't get a 0% false positive rate, but (1) I use relatively aggressive blackholes, and (2) I can't remember the last time a false positive happened--over the many years it has happened enough that I must remember to check my spam folder when expecting a message.

I rely on greylisting and DNS RBLs exclusively and am quite content. The junk that hits my inbox is, strictly speaking, mostly legitimate--opt-out noise (e-delivery notices, marketing, etc) that I'm too lazy to deal with regularly but which will stop when I request--while the vast majority of spam either never hits my inbox or goes directly to my spam folder. I use spamd for greylisting, which fronts the SMTP server, and perform RBL checks from my ~/.forward pipeline.

I should mention that I use mutt as my MUA for personal e-mail. Suffice it to say that I find webmail interfaces, including GMail, obtuse and unwieldy. I don't want to go into a huge rant, but my contentment with the state of my inbox is partly due to my MUA. For work e-mail I prefer using Mail.app when permitted, which I also immensely prefer over GMail. I'm not a mutt or Mail.app fiend with an insane configuration--both are quite vanilla. But if you can't quickly and reliably *manually* winnow your inbox I don't think anything short of real AI will ever suffice.

Also I maintain a large IP address whitelist (~1000 entries ATM) for greylisting by flattening SPF records for large senders (top X% culled from inboxes). I used to use my own tool but OpenSMTPD recently added `smtpctl walk`. That makes greylisting tolerable since companies began using relays hosted on cloud services where sequential retransmits may only rarely occur from the same IP address, which occasionally resulted in mail being postponed for hours or days (though never bouncing AFAIK).
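The interaction between greylisting, rotating relay addresses, and an IP whitelist can be sketched roughly as follows. This is a simplified model and not how spamd is implemented; the `Greylist` class, its 300-second delay, and the example addresses are all assumptions for illustration.

```python
import ipaddress


class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, delay=300, whitelist=()):
        self.delay = delay  # seconds a new triplet must wait before acceptance
        self.whitelist = [ipaddress.ip_network(net) for net in whitelist]
        self.first_seen = {}  # triplet -> time of the first delivery attempt

    def check(self, ip, sender, recipient, now):
        """Return "accept" or "tempfail" for a delivery attempt at time `now`."""
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in self.whitelist):
            return "accept"  # whitelisted senders skip greylisting entirely
        key = (ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"  # the sender retried after the delay
        return "tempfail"  # ask the sender to try again later


gl = Greylist(delay=300, whitelist=["192.0.2.0/24"])
print(gl.check("198.51.100.7", "a@example.org", "b@example.net", now=0))    # tempfail
print(gl.check("198.51.100.7", "a@example.org", "b@example.net", now=400))  # accept
# A retry from a different relay address starts the clock over again.
print(gl.check("198.51.100.8", "a@example.org", "b@example.net", now=400))  # tempfail
# Addresses inside a whitelisted network are accepted on the first attempt.
print(gl.check("192.0.2.5", "a@example.org", "b@example.net", now=0))       # accept
```

The third call shows why mail from large cloud-hosted senders can be delayed for a long time: each retry from a new address looks like a first attempt, which is the problem the whitelist works around.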

SpamAssassin is back

Posted Nov 4, 2018 14:07 UTC (Sun) by dskoll (subscriber, #1630) [Link] (4 responses)

No. Because a lot of spam comes from Hotmail and Gmail servers that are impractical to block based solely on IP address.

SpamAssassin is back

Posted Nov 4, 2018 14:57 UTC (Sun) by jkingweb (subscriber, #113039) [Link] (2 responses)

Interesting. That hasn't been my experience---which highlights just how true it is that everyone's spam is different.

SpamAssassin is back

Posted Nov 4, 2018 17:17 UTC (Sun) by dskoll (subscriber, #1630) [Link] (1 responses)

I used to run a service filtering mail for a couple of hundred thousand mailboxes, so I'm sure I saw a lot of stuff that people wouldn't see if they're just looking at their personal email.

Filtering just your own personal email is easy because you can do whatever you want and tune things exactly to your taste. Filtering for lots of other people is hard.

SpamAssassin is back

Posted Nov 7, 2018 2:48 UTC (Wed) by k8to (guest, #15413) [Link]

Someone I know who worked at postini had a lot of data showing gmail as the largest producer of spam at the time. Then Google bought them and refused to believe the data. *sigh*

SpamAssassin is back

Posted Nov 5, 2018 17:28 UTC (Mon) by wahern (subscriber, #37304) [Link]

https://msbl.org/ebl.html - DNSBL for drop box accounts, queried by canonicalizing and hashing sender, such as from Reply-To header.
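A rough sketch of the "canonicalize and hash" idea follows. The precise canonicalization rules and hash function are defined by the list operator; the ones used here (lowercasing, dropping a "+tag" suffix, SHA-1) and the zone name `hashbl.example` are assumptions for illustration only, as are the function names.

```python
import hashlib


def canonicalize(address):
    """Reduce an email address to one canonical form (assumed rules)."""
    local, at, domain = address.strip().lower().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    local = local.split("+", 1)[0]  # drop any "+tag" suffix
    return f"{local}@{domain}"


def hashbl_query_name(address, zone):
    """Build the DNS name to query for a hashed email address."""
    digest = hashlib.sha1(canonicalize(address).encode("utf-8")).hexdigest()
    return f"{digest}.{zone}"


# Two spellings of the same mailbox map to the same query name.
a = hashbl_query_name("Drop.Box+promo@Example.COM", "hashbl.example")
b = hashbl_query_name("drop.box@example.com", "hashbl.example")
print(a == b)  # True
```

Because only a hash appears in the DNS query, the list can be published and queried without exposing the addresses it contains.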

SpamAssassin is back

Posted Nov 4, 2018 20:41 UTC (Sun) by ballombe (subscriber, #9523) [Link]

In my experience, a well-configured spamassassin has fewer false positives than manual filtering, as soon as the spam/ham ratio is large enough.

SpamAssassin is back

Posted Nov 7, 2018 2:49 UTC (Wed) by k8to (guest, #15413) [Link]

Amusingly, 100% of my mail at gmail is spam, and they still can't figure it out.
You'd think that would be an easy one.

SpamAssassin is back

Posted Nov 8, 2018 14:04 UTC (Thu) by anarcat (subscriber, #66354) [Link]

«some sites are using it to detect spam submitted in web forms»

I've been using spamassassin for over a decade now, but somehow that never crossed my mind. Yet it's brilliant and simple. Has anyone good examples of how that's done? plugins for common CMS? I asked around for ikiwiki but it doesn't seem anyone did that before.
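One possible approach, sketched here under stated assumptions, is to wrap the form submission in a synthetic email message and hand it to `spamc` for a check. The function names, the `webform@localhost` recipient, and the subject line are hypothetical, and the second function only works where a spamd daemon is running and `spamc` is installed.

```python
import subprocess
from email.message import EmailMessage
from email.utils import formataddr, formatdate, make_msgid


def form_to_message(name, email, comment):
    """Wrap web-form fields in a minimal email message."""
    msg = EmailMessage()
    msg["From"] = formataddr((name, email))
    msg["To"] = "webform@localhost"  # hypothetical placeholder recipient
    msg["Subject"] = "Web form submission"
    # Headers a real mail would carry, so their absence is not what gets scored.
    msg["Date"] = formatdate(localtime=True)
    msg["Message-ID"] = make_msgid()
    msg.set_content(comment)
    return msg


def is_spam(msg):
    """Ask spamd, via `spamc -c`, whether the message is spam.

    `spamc -c` exits with status 1 when the message is judged to be spam.
    """
    result = subprocess.run(["spamc", "-c"], input=msg.as_bytes(),
                            capture_output=True)
    return result.returncode == 1


msg = form_to_message("Alice", "alice@example.org", "Nice article, thanks!")
print(msg["Subject"])  # Web form submission
```

A form handler would call `is_spam(form_to_message(...))` before storing the submission. Many of the stock rules target mail headers and relays, so results on form content would likely depend on adding rules suited to that content.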

SpamAssassin is back

Posted Apr 3, 2019 17:07 UTC (Wed) by troubleshootme (guest, #131266) [Link]

SpamAssassin is extremely useful and effective when configured correctly. I had some issues finding tips when I was tweaking a SpamAssassin rule list so I put a little guide together. Check it out at https://troubleshootme.com/configuring-cpanel-with-spamas...
It's still a work in progress but may be useful.