The SAY2K10 bug
The task assigned to SpamAssassin, of course, is to look over incoming email and assign a score to each message indicating how likely that message is to be spam. It does this job surprisingly well; your editor currently receives around 5,000 spams per day - one every 17 seconds or so - but it's a bad day if two dozen of those get past SpamAssassin and show up in the inbox. Put simply: without SpamAssassin, your editor's email address would be unusable. All it takes is a five-minute window without spamd running to see what life would be like if the incoming mail stream had to be dealt with in its full, unfiltered glory. This is mission-critical software, so any faults which turn up in it tend to be of great concern.
The core of SpamAssassin is a vast set of rules looking for spammy characteristics in incoming email. The rules match anything that the developers think might indicate spam; some of the tests include:
- The presence of a rot13-encoded email address.
- Large numbers of blank lines.
- The originating address is in any of a number of network blacklists.
- Discussion of medication in a number of forms.
- HTML messages with huge fonts.
- The presence of URLs registered to known spammers.
...and so on. Each matching rule adds a numeric score to the message; when the process is complete, the scores are added up to yield a total spamminess value. The Bayesian recognizer also gets a chance to look at the message and add a score of its own. At the conclusion of this process, any message with a score of 5.0 or higher (by default) is considered to be spam.
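The scoring process described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SpamAssassin's own code: the rule names and the individual scores are invented for the example, and only the 5.0 default threshold comes from the text.

```python
# Hypothetical rule names and scores, invented for this sketch.
RULE_SCORES = {
    "ROT13_ADDRESS": 1.2,
    "MANY_BLANK_LINES": 0.8,
    "SENDER_IN_BLACKLIST": 2.5,
    "HUGE_HTML_FONT": 1.0,
}

SPAM_THRESHOLD = 5.0  # the default threshold mentioned in the text


def total_score(matched_rules, bayes_score=0.0):
    """Add up the scores of every matching rule plus the Bayesian score."""
    return sum(RULE_SCORES[name] for name in matched_rules) + bayes_score


def is_spam(matched_rules, bayes_score=0.0, threshold=SPAM_THRESHOLD):
    """A message is treated as spam once its total reaches the threshold."""
    return total_score(matched_rules, bayes_score) >= threshold
```

No single rule in this sketch is enough to condemn a message; it takes several matches, or a few matches plus a suspicious Bayesian score, to reach the threshold.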
Some years ago, a SpamAssassin developer noticed that some unwanted mail came in with dates far in the future. These messages almost certainly represent an attempt by spammers to take advantage of mail clients which sort messages by date; a far-future date should show up at the top of the list. To deal with these messages, said developer wrote a rule matching any date from the year 2010 or afterward. At the time, 2010 was some years in the future, so the rule seemed to make sense. Surely somebody would fix it long before that distant year arrived.
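A rule of this kind might look roughly like the sketch below. The regular expression and the function name are assumptions made for illustration, not a quotation of the project's rule; the point is only that the cutoff year is baked into fixed text.

```python
import re

# Assumed pattern for illustration: any four-digit year from 2010 to 2099.
FAR_FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def date_in_far_future(date_header):
    """Return True if the Date header text contains a year of 2010 or later."""
    return FAR_FUTURE_YEAR.search(date_header) is not None
```

Nothing in such a pattern tracks the calendar. A header like "Tue, 14 Mar 2006 09:30:00 +0000" does not match, but "Fri, 01 Jan 2010 00:00:05 +0000" does, whether or not 2010 has actually arrived.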
The scores assigned to rules in SpamAssassin are not random, but neither are they assigned by the rule authors. Instead, the project uses a "perceptron" program to determine which combination of scores performs best against a large body of spam and "ham" email. When this tool was run, legitimate email from 2010 was indeed a rare thing, so the rule turned out to be a very good positive indicator for spam. As a result, it was assigned a score which, in some situations, could be as high as 3.5.
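To give a feel for how scores can be learned rather than hand-assigned, here is a toy perceptron. It is not the project's tool, and the four-message corpus, the learning rate, and all names are assumptions invented for the sketch. Each message is a tuple of rule hits (future-date rule, huge-font rule) plus a spam/ham label.

```python
def train_scores(corpus, n_rules, epochs=20, rate=0.1):
    """Learn one weight per rule with the classic perceptron update."""
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, is_spam in corpus:
            total = bias + sum(w for w, h in zip(weights, hits) if h)
            predicted_spam = total > 0
            if predicted_spam != is_spam:
                # Nudge the bias and every rule that fired toward the label.
                step = rate if is_spam else -rate
                bias += step
                weights = [w + step if h else w for w, h in zip(weights, hits)]
    return weights, bias


# Toy corpus: the future-date rule fires only on spam, while the
# huge-font rule fires on one spam and one ham message.
CORPUS = [
    ((1, 1), True),
    ((1, 0), True),
    ((0, 1), False),
    ((0, 0), False),
]

weights, bias = train_scores(CORPUS, n_rules=2)
```

In this toy corpus the future-date rule never fires on ham, so it ends up with the larger weight. That mirrors the situation in the text: a rule that matched almost no legitimate mail at training time earns a high score.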
As of January 1, mail with 2010 dates suddenly became rather more common. With the year-2010 rule now firing on every message, the SpamAssassin threshold was, in effect, lowered from 5.0 to as low as 1.5. That, in turn, caused a fair amount of legitimate email to be classified as spam, a most unwelcome development. Your editor, receiving 5,000 spams every day, has long since stopped scanning the spam folder for false positives; even if they exist (which they almost never do), they represent a needle which is almost impossible to find in a haystack that large. So email classified as spam is, for all practical purposes, simply lost.
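The arithmetic behind the lowered threshold can be made explicit. This sketch takes the 5.0 threshold and the 3.5 maximum score from the text; the function names are made up for illustration.

```python
SPAM_THRESHOLD = 5.0
DATE_RULE_SCORE = 3.5  # upper end of the score quoted in the text


def effective_threshold(threshold, always_firing_score):
    """Score the remaining rules must reach once one rule always fires."""
    return threshold - always_firing_score


def newly_misclassified(other_rules_score):
    """True if mail that used to pass is pushed over the threshold."""
    was_ham = other_rules_score < SPAM_THRESHOLD
    now_spam = other_rules_score + DATE_RULE_SCORE >= SPAM_THRESHOLD
    return was_ham and now_spam
```

With these numbers, effective_threshold(5.0, 3.5) is 1.5, so a legitimate message that picked up a modest 2.0 points from other rules would now be filed as spam.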
As described in Justin Mason's weblog, the year-2010 problem was noted by a SpamAssassin developer in 2008. The rule was duly fixed in the project's repository, and promptly forgotten about. What the SpamAssassin developers did not do was any of (1) informing the user community of the rule change, (2) making a new major release with the fixed rule, or (3) distributing the rule fix through the sa-update channel, which exists for just this purpose. So everybody was caught by surprise - users, distributors, Internet service providers, and the SpamAssassin developers themselves.
All told, the harm caused by this problem was relatively small and mostly recoverable. It is a very small blot on SpamAssassin's long record of making email usable for large numbers of people. But it highlights a few points which are worthy of note:
- Even those of us who are not running financial exchanges have
critical infrastructure based on free software. When something goes
wrong with that infrastructure, it can hurt our businesses, social
lives, and more.
- Software which plays a crucial part in our operations should really
have a mechanism in place to get important fixes to users quickly.
But, just as importantly, the project has to take great care to
ensure that important fixes get routed into that channel.
SpamAssassin developers had fixed the 2010 problem a long time ago, but
that was not helpful for users, who had no way of knowing about the
problem or its fix. In the kernel realm, it has taken
some years to build the discipline of looking over patches and
considering them for stable kernel updates; there's probably still a
fair number of important fixes which do not get to stable kernel users
because nobody thinks to route them to the stable kernel maintainers.
- Important software requires a certain amount of development and review time. So it's discouraging to read in Justin's weblog that his SpamAssassin work happens in his scarce spare time, and that the project is, in general, short of active developers. Your editor suspects that the truth of the matter is this: SpamAssassin is long past its period of rapid development. At this point, it works well, to the point that there's not a lot of work to be done. So the interested developers have gone on to other projects.
It would appear that what SpamAssassin needs is some dedicated maintenance
talent which is not dependent on evening hours put in by developers
committed to other projects. Typically that is the sort of work that
requires a paying customer. Given how many people and companies rely on
this software, it seems like it should be possible to find the money to
motivate somebody to put more time into SpamAssassin maintenance. The hard
part is collecting and administering those funds; that's not something that
the free software community has yet become reliably good at doing.
Posted Jan 7, 2010 2:21 UTC (Thu)
by ncm (guest, #165)
[Link] (9 responses)
We should be calling these "MMX" bugs.
Posted Jan 7, 2010 4:56 UTC (Thu)
by Bayes (guest, #52258)
[Link] (8 responses)
Posted Jan 7, 2010 7:27 UTC (Thu)
by ncm (guest, #165)
[Link] (1 responses)
Posted Jan 7, 2010 8:41 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Jan 8, 2010 18:31 UTC (Fri)
by MattPerry (guest, #46341)
[Link] (5 responses)
Posted Jan 9, 2010 0:26 UTC (Sat)
by sfeam (subscriber, #2841)
[Link] (4 responses)
Not if you have correspondents on the other side of the date line.
Posted Jan 9, 2010 3:22 UTC (Sat)
by MattPerry (guest, #46341)
[Link] (3 responses)
How so? Those correspondents will be, at most, only one day ahead. No timezone is more than 24 hours ahead of any other timezone. Therefore 'current date + 1 day' is sufficient.
Posted Jan 9, 2010 8:02 UTC (Sat)
by mp (subscriber, #5615)
[Link] (2 responses)
Posted Jan 9, 2010 18:11 UTC (Sat)
by MattPerry (guest, #46341)
[Link] (1 responses)
Posted Jan 14, 2010 0:53 UTC (Thu)
by Kissaki (guest, #61848)
[Link]
Egregious errors (wrong decade) are one thing, but 26 hours is probably cutting it too close.
I assume the reason it isn't dynamic is to be able to do static string matching / regular expression compilation, rather than recalculate the string to match every time. You could get past that by making it re-compute the date string on restart, assuming you restart frequently enough.
Posted Jan 7, 2010 7:35 UTC (Thu)
by lab (guest, #51153)
[Link] (3 responses)
Posted Jan 7, 2010 16:11 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jan 7, 2010 16:50 UTC (Thu)
by lab (guest, #51153)
[Link]
Posted Apr 14, 2010 5:36 UTC (Wed)
by Zenith (guest, #24899)
[Link]
But yeah, I read the same thing from it as well ;)
Posted Jan 7, 2010 10:56 UTC (Thu)
by dwmw2 (subscriber, #2063)
[Link] (8 responses)
Posted Jan 7, 2010 11:32 UTC (Thu)
by fluke571 (guest, #57515)
[Link] (6 responses)
Posted Jan 7, 2010 11:43 UTC (Thu)
by dwmw2 (subscriber, #2063)
[Link]
Posted Jan 7, 2010 11:48 UTC (Thu)
by jschrod (subscriber, #1646)
[Link] (4 responses)
Huh, why not? Failure codes 552, 554, 451, and 452 are valid after <CRLF>.<CRLF>, according to RFC 821, section 4.3.
Posted Jan 7, 2010 12:46 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (3 responses)
By way of clarification, I think the upstream comment meant »once the mail
has entered the local queue«.
It is possible to reject a message while it is being submitted, but once
the local MTA has accepted responsibility for it it can only be bounced,
which as has been noted will in most cases inconvenience those people
whose addresses the spam claims it is being sent from.
To reject spam rather than bounce it, one needs to run anti-spam software
while the message is still in the process of being read, where the more
common setup is to run the anti-spam software after the message has
been accepted locally but before it is delivered to the addressee's
mailbox. Depending on the checks the anti-spam software performs
(especially ones that access the network), pre-queue checking may be a
resource-intensive process, so it requires careful configuration.
Posted Jan 7, 2010 13:07 UTC (Thu)
by jschrod (subscriber, #1646)
[Link] (2 responses)
Posted Jan 7, 2010 13:19 UTC (Thu)
by dwmw2 (subscriber, #2063)
[Link] (1 responses)
Backscatterer.org is definitely best avoided, because it deliberately includes these false positives.
Besides, there are much better ways (PRVS/BATV/etc.) to avoid unwanted bounces.
My setup for that is documented here, although it can be done more simply now that Exim has built-in PRVS support. In short, the way it works is that I never send MAIL FROM:<dwmw2@infradead.org> and thus I never accept bounces to that address. And anyone who does sender verification callouts doesn't accept mail that's faked from my address either.
But we digress...
Posted Jan 7, 2010 14:23 UTC (Thu)
by jschrod (subscriber, #1646)
[Link]
Posted Jan 7, 2010 14:54 UTC (Thu)
by ballombe (subscriber, #9523)
[Link]
Even if you decide to reject spam, keeping a copy in a spam folder allows you to assess the behavior of the spam filtering system and to recover from failures.
Posted Jan 7, 2010 14:54 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (3 responses)
So what's this 2K10 I see?
/cranky, need coffee.
Posted Jan 7, 2010 19:35 UTC (Thu)
by RobSeace (subscriber, #4435)
[Link] (2 responses)
Posted Jan 7, 2010 19:47 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (1 responses)
Posted Jan 8, 2010 0:24 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Jan 7, 2010 15:33 UTC (Thu)
by Baylink (guest, #755)
[Link]
/opt/zimbra/conf/spamassassin/72_active.cf
the fix is in that bug.
Next year we'll have MMXI bugs. And by the way, it's time to start using that in copyright notices again: "Copyright © MMX by Megacorp Inc., all rights reversed."
even published...
'current date + 1 day' would be even better.
Actually there are 26 hours between some places.
Spam folders considered harmful
"Your editor, receiving 5,000 spams every day, has long since stopped scanning the spam folder for false positives; even if they exist (which they almost never do), they represent a needle which is almost impossible to find in a haystack that large. So email classified as spam is, for all practical purposes, simply lost."
This is why having a spam folder is often a bad idea. It's much better just to reject the offending mail so that when false positives happen, the sender gets a bounce and knows that the mail wasn't received.
You are mistaken. You can quite happily give a 5xx rejection message after DATA — or a 4xx temporary rejection, if you've decided that the mail is suspicious enough to warrant greylisting, but not bad enough that you want to reject it outright.
Well, with the exception that it seems to be suggesting that people use backscatterer.org. It does admit that that list includes servers which only do sender verification callouts and don't actually send bounces, but then in the very next sentence says "That list can be used to reject just unwanted NDNs.", which is obviously false.
watergate-gate *with a lowercase w*.
Zimbra ops need to fix this too