Spam filtering with Rspamd
SpamAssassin is a highly effective tool; its developers could be forgiven for thinking that they have solved the spam problem and can move on. Which is good, because they would appear to have concluded exactly that. The "latest news" on the project's page reveals that the last release was 3.4.1, which came out in April 2015. Stability in a core communications tool is good but, still, it is worth asking whether there is really nothing more to be done in the area of spam filtering.
The Rspamd developers appear to believe that there is; this project is moving quickly with several releases over the past year, the last being 1.6.3 at the end of July. The project's repository shows 2,545 commits since the 1.3.5 release on September 1, 2016; 32 developers contributed to the project in that time, though one of them (Vsevolod Stakhov) was the source of 71% of the commits. The project is distributed under the Apache License v2.
The Rspamd developers clearly see processing speed as one of their selling
points. SpamAssassin, written in Perl, is known to be a bit of a resource
hog. Rspamd is written in C (with rules and extensions in Lua), and claims
to be able to "process up to 100 emails per second using a single CPU core".
That should be
sufficiently fast for most small-to-medium sites, though it is probably
advisable to dedicate another CPU to the task if there are any linux-kernel
subscribers in the mix.
One of the nice things about SpamAssassin is that it's relatively easy to set up; in the extreme case, it can be run from a nonprivileged account using a procmail incantation with no daemon process required. Rspamd is not so simple; it really wants to run as a separate daemon that is tightly tied into the mail transport agent (MTA). That means, for example, configuring Postfix to pass messages to the Rspamd server; the configuration of Rspamd itself can also be fairly involved. As a result, experimenting with Rspamd takes rather more effort. But, in return, one gets a number of useful features.
Perhaps foremost, the direct integration with the MTA means that spam filtering takes place while the SMTP conversation is ongoing. That makes techniques like greylisting possible. It also enables the rejection of overt spam outright, before it has been accepted from the remote server; this has a couple of advantages: there is no need to store the spam locally, and the sender will get a bounce — assuming there is a real sender who cares about such things. Yes, one can configure things to use SpamAssassin in this way, but it involves a rather larger amount of duct tape.
Rspamd offers many of the same filtering mechanisms that SpamAssassin supports, including regular-expression matching, DKIM and SPF checks, and online blacklists. It has a bayesian engine that, the project claims, is more sophisticated and effective than SpamAssassin's; it looks at groups of words, rather than just single words. There is a "fuzzy hash" mechanism that is meant to catch messages that look like previous spam with trivial changes. As with SpamAssassin, each classification mechanism has a score associated with it; the sum of all the scores gives the overall spam score for a given message.
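The additive scoring described above can be illustrated with a minimal Python sketch. The rule names and weights here are invented for illustration; they are not Rspamd's actual symbols or default scores.

```python
# Hypothetical rule weights, chosen only for illustration.
# Negative scores represent evidence that a message is legitimate.
RULE_SCORES = {
    "BAYES_SPAM": 4.0,
    "FUZZY_MATCH": 5.0,
    "DKIM_VALID": -1.0,
    "BLACKLISTED_SENDER": 6.5,
}


def total_score(matched_rules):
    """Return the sum of the scores of every rule that fired for a message."""
    return sum(RULE_SCORES[name] for name in matched_rules)


if __name__ == "__main__":
    # A message that looks spammy to the bayesian filter but has a valid DKIM signature.
    print(total_score(["BAYES_SPAM", "DKIM_VALID"]))
```

A single heavily weighted rule can dominate the sum, which is why the choice of per-rule scores matters so much in practice.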
While it doesn't have to be this way, SpamAssassin is normally used in a binary mode: a message is either determined to be spam or it is not. Rspamd classifies messages into several groups, depending on how obvious each message's nature is. At different scores, a message might be greylisted, have its subject line marked, have an X-Spam header added, or be rejected outright. Implementing all of these actions requires cooperation from the MTA, of course.
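The graded response can be sketched as a lookup from score to action. The thresholds and their ordering below are assumptions made for the example; real values are a matter of site configuration.

```python
# Hypothetical thresholds, listed from most to least severe.
# These numbers are illustrative assumptions, not taken from the article.
ACTIONS = [
    (15.0, "reject"),
    (8.0, "rewrite subject"),
    (6.0, "add header"),
    (4.0, "greylist"),
]


def action_for(score):
    """Return the most severe action whose threshold the score reaches."""
    for threshold, action in ACTIONS:
        if score >= threshold:
            return action
    return "accept"


if __name__ == "__main__":
    for score in (1.0, 4.5, 7.0, 20.0):
        print(score, action_for(score))
```

Since the list is ordered from the highest threshold down, the first match is always the strongest applicable action.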
Rspamd comes with its own built-in web server which, by default, is only
available through the loopback interface. It can present various types of
plots describing the traffic it has processed, as can be seen on the
right. The server can also be used to alter the configuration on the fly,
changing the scores associated with various tests, and more. These changes
do not appear to be saved permanently, though, so the system administrator
still has to edit the (numerous) configuration files to make a change that
will stick.
Your editor set up and ran Rspamd with a copy of his email stream. What followed was an unpleasant exercise in going carefully through the spam folder to see what the results were — a task that resembles cleaning up after the family pet with one's bare hands and which quickly reduces one's faith in humanity as a whole. The initial results were a little discouraging, in that Rspamd filtered spam less effectively than SpamAssassin. More discouraging was a fair number of false positives. When the number of incoming spam messages reaches into the thousands per day, one tends not to spend much time looking for messages that were erroneously classified as spam, especially as confidence in the filter grows. So false positives are legitimate email that will probably never be seen; avoiding false positives thus tends to be a high priority for developers of spam filters.
At this point, though, the comparison was somewhat unfair: a fresh Rspamd was pitted against a SpamAssassin with a well-trained bayesian filter. Like SpamAssassin, Rspamd provides a tool that can be used to feed messages for filter training. Your editor happened to have both a mail archive and a massive folder full of spam sitting around. Training the filter with both of those yielded considerably better results and, in particular, an apparent end to false positives, with one exception. And yes, the rspamc tool, used to train the filter, runs far more quickly than sa-learn does.
The one exception regarding false positives is significant. The documentation of Rspamd's pattern-matching rules is poor relative to SpamAssassin's, so it took a while to find out what MULTIPLE_UNIQUE_HEADERS is looking for. In short, it is checking the message for multiple instances of headers that should appear only once (References: or In-Reply-To:, for example). The penalty for this infraction is severe: ten points, enough to condemn a message on its own, even if, say, the bayesian filter gives a 100% probability that the message is legitimate. Unfortunately, git send-email is prone to duplicating just those headers at times, with the result that patches end up in the spam folder.
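A check of this kind is easy to approximate with Python's standard email module, which can help when hunting for the messages that trip the rule. This is only a sketch of the idea, not Rspamd's implementation; the set of headers treated as single-instance is an assumption, and the real rule's list may differ.

```python
from collections import Counter
from email import message_from_string

# Headers assumed to be single-instance for this sketch.
UNIQUE_HEADERS = {
    "references", "in-reply-to", "message-id",
    "subject", "from", "to", "date",
}


def duplicated_unique_headers(raw_message):
    """Return the sorted names of single-instance headers that appear more than once."""
    msg = message_from_string(raw_message)
    # Message.keys() lists every header field name, duplicates included.
    counts = Counter(name.lower() for name in msg.keys())
    return sorted(h for h in UNIQUE_HEADERS if counts[h] > 1)


# A made-up message with a doubled In-Reply-To: header.
SAMPLE = (
    "From: dev@example.com\n"
    "Subject: [PATCH] fix\n"
    "In-Reply-To: <a@example.com>\n"
    "In-Reply-To: <a@example.com>\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(duplicated_unique_headers(SAMPLE))
```

Running something like this over a folder of misfiled patches would show quickly whether duplicated headers are the common factor.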
SpamAssassin has an interesting mechanism for automatically computing what the score for each rule should be. Rspamd does not appear to have anything equivalent; how its scores have been determined is not entirely clear. The overall feel of the results suggests a relative lack of maturity that has the potential to create the occasional surprise.
After a few days of use, the overall subjective impression is that Rspamd is nearly — but not quite — as effective as SpamAssassin. It seems especially likely to miss the current crop of "your receipt" spams containing nothing but a hostile attachment. That said, training has improved its performance quickly and may well continue to do so. The experiment will be allowed to run for a while yet.
So is moving from SpamAssassin to Rspamd a reasonable thing to do? A site with a working SpamAssassin setup may well want to stay with it if the users are happy with the results. There might also be value in staying put for anybody who fears the security implications of a program written in C that is fully exposed to a steady stream of hostile input. The project does not appear to have ever called out an update with security implications; it seems unlikely that there have never been any security-relevant bugs fixed in a tool of this complexity.
But, for anybody who sees the benefit of
a more active development community, better performance, better MTA
integration, newer filtering mechanisms, and a web interface with cute
pie charts, changing over might make sense. There is even a module to import
custom SpamAssassin rules to make the task easier (but there is no way
to import an existing SpamAssassin bayesian database). In any case, it is
good to see that development on spam filters continues, even if the
SpamAssassin community has mostly moved on to other things.
