
The SAY2K10 bug

By Jonathan Corbet
January 6, 2010
SpamAssassin is crucial infrastructure, at least for some of us. So it was with some dismay that your editor, while performing a quick New Year's Day disaster check, noted that SpamAssassin had not made the adjustment to 2010 in good form. The bug was straightforward and easy to fix, but it merits a closer look for what it reveals about our infrastructure and how we support it.

The task assigned to SpamAssassin, of course, is to look over incoming email and assign a score to each message indicating how likely that message is to be spam. It does this job surprisingly well; your editor currently receives around 5,000 spams per day - one every 17 seconds or so - but it's a bad day if two dozen of those get past SpamAssassin and show up in the inbox. Put simply, without SpamAssassin, your editor's email address would be unusable. All it takes is a five-minute window without spamd running to see what life would be like if the incoming mail stream had to be dealt with in its full, unfiltered glory. This is mission-critical software, so any faults which turn up in it tend to be of great concern.

The core of SpamAssassin is a vast set of rules looking for spammy characteristics in incoming email. The rules match anything that the developers think might indicate spam; some of the tests include:

  • The presence of a rot13-encoded email address.

  • Large numbers of blank lines.

  • The originating address is in any of a number of network blacklists.

  • Discussion of medication in a number of forms.

  • HTML messages with huge fonts.

  • The presence of URLs registered to known spammers.

...and so on. Each matching rule adds a numeric score to the message; when the process is complete, the scores are added up to yield a total spamminess value. The bayesian recognizer also gets a chance to look at the message and add a score of its own. At the conclusion of this process, any message with a score of 5.0 or higher (by default) is considered to be spam.
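
The additive scoring described above can be sketched in a few lines of Python. This is only an illustration: the rule names, scores, and matching functions below are made up for the example and are not SpamAssassin's real rules; only the 5.0 default threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

SPAM_THRESHOLD = 5.0  # default cutoff mentioned in the article


@dataclass(frozen=True)
class Rule:
    name: str                       # hypothetical rule name
    score: float                    # hypothetical score added when the rule matches
    matches: Callable[[str], bool]  # predicate applied to the raw message text


# Made-up rules, loosely modeled on the kinds of tests listed above.
RULES: List[Rule] = [
    Rule("MANY_BLANK_LINES", 1.2, lambda msg: "\n\n\n\n" in msg),
    Rule("MENTIONS_PILLS", 2.5, lambda msg: "pills" in msg.lower()),
    Rule("HUGE_FONT", 1.8, lambda msg: 'size="7"' in msg.lower()),
]


def classify(message: str, bayes_score: float = 0.0,
             threshold: float = SPAM_THRESHOLD) -> Tuple[float, bool, List[str]]:
    """Sum the scores of all matching rules, add the Bayesian score,
    and compare the total against the threshold."""
    hits = [rule for rule in RULES if rule.matches(message)]
    total = sum(rule.score for rule in hits) + bayes_score
    return total, total >= threshold, [rule.name for rule in hits]
```

With these invented numbers, a message that trips two rules can stay under the threshold on its own and still be pushed over it by a modest Bayesian score.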

Some years ago, a SpamAssassin developer noticed that some unwanted mail came in with dates far in the future. These messages almost certainly represent an attempt by spammers to take advantage of mail clients which sort messages by date; a far-future date should show up at the top of the list. To deal with these messages, said developer wrote a rule matching any date from the year 2010 or afterward. At the time, 2010 was some years in the future, so the rule seemed to make sense. Surely somebody would fix it long before that distant year arrived.
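
To see why a hard-coded year test ages badly, consider the following Python sketch. The article does not quote the real rule, so the regular expression here is an assumption chosen to match "any date from the year 2010 or afterward"; the function name is likewise hypothetical.

```python
import re

# Hypothetical hard-coded pattern: any four-digit year from 2010 through 2099.
# This is an illustration, not the text of the actual SpamAssassin rule.
FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def date_looks_far_future(date_header: str) -> bool:
    """Return True if the Date header contains a year the pattern treats as 'future'."""
    return FUTURE_YEAR.search(date_header) is not None
```

A header dated December 31, 2009 passes this check untouched, while one dated a second into January 1, 2010 is flagged, even though both are perfectly ordinary on the day they are sent.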

The scores assigned to rules in SpamAssassin are not random, but neither are they assigned by the rule authors. Instead, the project uses a "perceptron" program to determine which combination of scores performs best against a large body of spam and "ham" email. When this tool was run, legitimate email from 2010 was indeed a rare thing, so the rule turned out to be a very good positive indicator for spam. As a result, it was assigned a score which, in some situations, could be as high as 3.5.

As of January 1, mail with 2010 dates suddenly became rather more common. With the year-2010 rule now firing on every message, the SpamAssassin threshold was, in effect, lowered from 5.0 to as low as 1.5. That, in turn, caused a fair amount of legitimate email to be classified as spam, a most unwelcome development. Your editor, receiving 5,000 spams every day, has long since stopped scanning the spam folder for false positives; even if they exist (which they almost never do), they represent a needle which is almost impossible to find in a haystack that large. So email classified as spam is, for all practical purposes, simply lost.
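
The arithmetic behind that effective threshold can be spelled out as a small Python sketch. The 5.0 and 3.5 figures are the ones given above; the function names and the sample scores are invented for illustration.

```python
DEFAULT_THRESHOLD = 5.0    # default spam threshold from the article
DATE_RULE_MAX_SCORE = 3.5  # highest score the year-2010 rule could carry


def remaining_headroom(threshold: float = DEFAULT_THRESHOLD,
                       always_firing_score: float = DATE_RULE_MAX_SCORE) -> float:
    """Score that all other rules may contribute before a message is flagged,
    given a rule that now fires on every message."""
    return threshold - always_firing_score


def is_flagged(other_rules_score: float, date_rule_fires: bool,
               date_rule_score: float = DATE_RULE_MAX_SCORE,
               threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Apply the threshold to the combined score."""
    total = other_rules_score + (date_rule_score if date_rule_fires else 0.0)
    return total >= threshold
```

A legitimate message scoring a harmless 2.0 from other rules is delivered when the date rule stays silent, but lands in the spam folder once the rule adds its 3.5.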

As described in Justin Mason's weblog, the year-2010 problem was noted by a SpamAssassin developer in 2008. The rule was duly fixed in the project's repository, and promptly forgotten about. What the SpamAssassin developers did not do was any of (1) informing the user community of the rule change, (2) making a new major release with the fixed rule, or (3) distributing the rule fix through the sa-update channel, which exists for just this purpose. So everybody was caught by surprise - users, distributors, Internet service providers, and the SpamAssassin developers themselves.

All told, the harm caused by this problem was relatively small and mostly recoverable. It is a very small blot on SpamAssassin's long record of making email usable for large numbers of people. But it highlights a few points which are worthy of note:

  • Even those of us who are not running financial exchanges have critical infrastructure based on free software. When something goes wrong with that infrastructure, it can hurt our businesses, social lives, and more.

  • Software which plays a crucial part in our operations should really have a mechanism in place to get important fixes to users quickly. But, just as importantly, that project has to take great care to ensure that important fixes get routed into that channel. SpamAssassin developers had fixed the 2010 problem a long time ago, but that was not helpful for users, who had no way of knowing about the problem or its fix. In the kernel realm, it has taken some years to build the discipline of looking over patches and considering them for stable kernel updates; there's probably still a fair number of important fixes which do not get to stable kernel users because nobody thinks to route them to the stable kernel maintainers.

  • Important software requires a certain amount of development and review time. So it's discouraging to read in Justin's weblog that his SpamAssassin work happens in his scarce spare time, and that the project is, in general, short of active developers. Your editor suspects that the truth of the matter is this: SpamAssassin is long past its period of rapid development. At this point, it works well, to the point that there's not a lot of work to be done. So the interested developers have gone on to other projects.

It would appear that what SpamAssassin needs is some dedicated maintenance talent which is not dependent on evening hours put in by developers committed to other projects. Typically that is the sort of work that requires a paying customer. Given how many people and companies rely on this software, it seems like it should be possible to find the money to motivate somebody to put more time into SpamAssassin maintenance. The hard part is collecting and administering those funds; that's not something that the free software community has yet become reliably good at doing.



The SAY2K10 bug

Posted Jan 7, 2010 2:21 UTC (Thu) by ncm (guest, #165) [Link] (9 responses)

I gather the electric bank machines in Germany also failed.

We should be calling these "MMX" bugs.

The SAY2K10 bug

Posted Jan 7, 2010 4:56 UTC (Thu) by Bayes (guest, #52258) [Link] (8 responses)

Why couldn't the rule be 'current date + 1 year'? No way to get current date in the rule body?

The SAY2K10 bug

Posted Jan 7, 2010 7:27 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

Next year we'll have MMXI bugs. And by the way, it's time to start using that in copyright notices again: "Copyright © MMX by Megacorp Inc., all rights reversed."

The SAY2K10 bug

Posted Jan 7, 2010 8:41 UTC (Thu) by nix (subscriber, #2304) [Link]

And, of course, last year we had MMIX bugs. And this before Volume IV was
even published...

The SAY2K10 bug

Posted Jan 8, 2010 18:31 UTC (Fri) by MattPerry (guest, #46341) [Link] (5 responses)

'current date + 1 day' would be even better.

The SAY2K10 bug

Posted Jan 9, 2010 0:26 UTC (Sat) by sfeam (subscriber, #2841) [Link] (4 responses)

> 'current date + 1 day' would be even better.

Not if you have correspondents on the other side of the date line.

The SAY2K10 bug

Posted Jan 9, 2010 3:22 UTC (Sat) by MattPerry (guest, #46341) [Link] (3 responses)

> Not if you have correspondents on the other side of the date line.

How so? Those correspondents will be, at most, only one day ahead. No timezone is more than 24 hours ahead of any other timezone. Therefore 'current date + 1 day' is sufficient.

The SAY2K10 bug

Posted Jan 9, 2010 8:02 UTC (Sat) by mp (subscriber, #5615) [Link] (2 responses)

Actually there are 26 hours between some places.

The SAY2K10 bug

Posted Jan 9, 2010 18:11 UTC (Sat) by MattPerry (guest, #46341) [Link] (1 responses)

Thanks for the links. I didn't know about those. Then 'current date + 26 hours' is sufficient. No need for 'current date + 1 year' since that would leave plenty of room for spammers to still set dates into the future.

The SAY2K10 bug

Posted Jan 14, 2010 0:53 UTC (Thu) by Kissaki (guest, #61848) [Link]

Not to pile on, but you don't want to be overly specific because sometimes people just make mistakes. If someone accidentally gets the month wrong when setting up or configuring their computer, and doesn't turn on NTP, you could be affecting their ability to send you email and (depending on what the email contains) give you money.

Egregious errors (wrong decade) are one thing, but 26 hours is probably cutting it too close.

I assume the reason it isn't dynamic is to be able to do static string matching / regular expression compilation, rather than recalculate the string to match every time. You could get past that by making it re-compute the date string on restart, assuming you restart frequently enough.
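
The relative-date idea discussed in this thread could look something like the Python sketch below. It is not how SpamAssassin evaluates rules; the function name and the one-year default tolerance are assumptions, and the tolerance is a parameter precisely because the thread disagrees on how tight it should be.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def is_far_future(date_header: str,
                  tolerance: timedelta = timedelta(days=365),
                  now: Optional[datetime] = None) -> bool:
    """Return True if the Date header lies more than `tolerance` past `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable dates are not this check's business; leave them to other rules.
        return False
    if sent.tzinfo is None:
        # A "-0000" zone is parsed as a naive datetime; treat it as UTC here.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent > now + tolerance
```

Passing `tolerance=timedelta(hours=26)` gives the strict variant proposed above, while the one-year default leaves room for senders with a misconfigured clock.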

The SAY2K10 bug

Posted Jan 7, 2010 7:35 UTC (Thu) by lab (guest, #51153) [Link] (3 responses)

In Denmark, the new sophisticated parking meter machines, accepting card payments, also failed due to this bug, along with other assorted equipment.

The SAY2K10 bug

Posted Jan 7, 2010 16:11 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

The Danish parking meter machines depended on SpamAssassin?!

The SAY2K10 bug

Posted Jan 7, 2010 16:50 UTC (Thu) by lab (guest, #51153) [Link]

No, sorry. Inspired by the first commenter, I was jumping to a more general category of bugs, apparently triggered by the shift from 2009->2010. The explanations I've briefly read say something about sloppy hex->decimal conversions, or some such.

The SAY2K10 bug

Posted Apr 14, 2010 5:36 UTC (Wed) by Zenith (guest, #24899) [Link]

Seriously LOL'ed at work from this. Got a few stares ;)

But yeah, I read the same thing from it as well ;)

Spam folders considered harmful

Posted Jan 7, 2010 10:56 UTC (Thu) by dwmw2 (subscriber, #2063) [Link] (8 responses)

"Your editor, receiving 5,000 spams every day, has long since stopped scanning the spam folder for false positives; even if they exist (which they almost never do), they represent a needle which is almost impossible to find in a haystack that large. So email classified as spam is, for all practical purposes, simply lost."

This is why having a spam folder is often a bad idea. It's much better just to reject the offending mail so that when false positives happen, the sender gets a bounce and knows that the mail wasn't received.

Spam folders considered harmful

Posted Jan 7, 2010 11:32 UTC (Thu) by fluke571 (guest, #57515) [Link] (6 responses)

You cannot just reject mail once the body has been sent. You can only send a bounce afterwards, but since 99.9% of From: fields are fake/random, you effectively become a spammer if you do this.

Spam folders considered harmful

Posted Jan 7, 2010 11:43 UTC (Thu) by dwmw2 (subscriber, #2063) [Link]

You are mistaken. You can quite happily give a 5xx rejection message after DATA — or a 4xx temporary rejection, if you've decided that the mail is suspicious enough to warrant greylisting, but not bad enough that you want to reject it outright.

Spam folders considered harmful

Posted Jan 7, 2010 11:48 UTC (Thu) by jschrod (subscriber, #1646) [Link] (4 responses)

> You cannot just reject mail, once body was sent.

Huh, why not? Failure codes 552, 554, 451, and 452 are valid after <CRLF>.<CRLF>, according to RFC 821, section 4.3.

Spam folders considered harmful

Posted Jan 7, 2010 12:46 UTC (Thu) by anselm (subscriber, #2796) [Link] (3 responses)

By way of clarification, I think the upstream comment meant »once the mail has entered the local queue«. It is possible to reject a message while it is being submitted, but once the local MTA has accepted responsibility for it it can only be bounced, which as has been noted will in most cases inconvenience those people whose addresses the spam claims it is being sent from.

To reject spam rather than bounce it, one needs to run anti-spam software while the message is still in the process of being read, where the more common setup is to run the anti-spam software after the message has been accepted locally but before it is delivered to the addressee's mailbox. Depending on the checks the anti-spam software performs (especially ones that access the network), pre-queue checking may be a resource-intensive process, so it requires careful configuration.
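
As a rough Python sketch of the decision a pre-queue filter makes at the end of DATA: the reply codes are the ones cited in this thread, while the function name, the thresholds, and the reply texts are assumptions for illustration, not any particular MTA's configuration.

```python
from typing import Optional


def smtp_reply_after_data(score: float,
                          reject_at: float = 5.0,
                          defer_at: Optional[float] = None) -> str:
    """Choose the SMTP reply to give after the message body has been received.

    A 5xx reply rejects the message outright, so the sending server (not us)
    is responsible for telling the real sender. A 4xx reply asks the sender
    to retry later, which suits mail that is merely suspicious.
    """
    if score >= reject_at:
        return "554 Message rejected as spam"
    if defer_at is not None and score >= defer_at:
        return "451 Temporary failure, please try again later"
    return "250 OK"
```

Because the reply is given while the connection is still open, no separate bounce message is ever generated toward a possibly forged sender address.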

Spam folders considered harmful

Posted Jan 7, 2010 13:07 UTC (Thu) by jschrod (subscriber, #1646) [Link] (2 responses)

Well, http://www.dontbouncespam.org/ says it best.

Spam folders considered harmful

Posted Jan 7, 2010 13:19 UTC (Thu) by dwmw2 (subscriber, #2063) [Link] (1 responses)

Well, with the exception that it seems to be suggesting that people use backscatterer.org. It does admit that that list includes servers which only do sender verification callouts and don't actually send bounces, but then in the very next sentence says "That list can be used to reject just unwanted NDNs.", which is obviously false.

Backscatterer.org is definitely best avoided, because it deliberately includes these false positives.

Besides, there are much better ways (PRVS/BATV/etc.) to avoid unwanted bounces.

My setup for that is documented here, although it can be done more simply now that Exim has built-in PRVS support. In short, the way it works is that I never send MAIL FROM:<dwmw2@infradead.org> and thus I never accept bounces to that address. And anyone who does sender verification callouts doesn't accept mail that's faked from my address either.

But we digress...

Spam folders considered harmful

Posted Jan 7, 2010 14:23 UTC (Thu) by jschrod (subscriber, #1646) [Link]

I didn't mean to imply that backscatter handling is best described there; rather, the site best explains (a) why one should reject spam, not bounce it, and (b) that spam rejection after DATA is explicitly allowed by RFC 5321, contrary to the statement by fluke571.

Spam folders considered harmful

Posted Jan 7, 2010 14:54 UTC (Thu) by ballombe (subscriber, #9523) [Link]

In the case at hand, it is possible to grep the spam folder for messages with score <= 8.5 or even to rescan them with a fixed spamassassin.

Even if you decide to reject spam, keeping a copy in a spam folder allows you to assess the behavior of the spam filtering system and to recover from failure.
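
A minimal Python sketch of that search, assuming each stored message carries an X-Spam-Status header with a `score=` (or older `hits=`) field; the function names are hypothetical, and the 8.5 cutoff is simply the 5.0 threshold plus the 3.5 the date rule could add.

```python
import re
from typing import Iterable, List, Optional

# Assumed header shape, e.g. "Yes, score=7.2 required=5.0 tests=..."
SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def spam_score(status_header: Optional[str]) -> Optional[float]:
    """Extract the numeric score from an X-Spam-Status value, if present."""
    match = SCORE_RE.search(status_header or "")
    return float(match.group(1)) if match else None


def rescue_candidates(status_headers: Iterable[Optional[str]],
                      max_score: float = 8.5) -> List[int]:
    """Return the positions of messages whose score is at or below max_score,
    i.e. those that might not have been flagged without the date rule."""
    candidates = []
    for index, header in enumerate(status_headers):
        score = spam_score(header)
        if score is not None and score <= max_score:
            candidates.append(index)
    return candidates
```

The header values could be read from an mbox-format spam folder with the standard `mailbox` module; the candidates are only worth a manual look or a rescan, since many of them will still be spam.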

The SAY2K10 bug

Posted Jan 7, 2010 14:54 UTC (Thu) by jzbiciak (guest, #5246) [Link] (3 responses)

Ok, I understood "2K"... that saves two whole characters. 2K1 was cute for about 10 minutes, but 2K2 through 2K9 seemed like needless optimization.

So what's this 2K10 I see?

/cranky, need coffee.

The SAY2K10 bug

Posted Jan 7, 2010 19:35 UTC (Thu) by RobSeace (subscriber, #4435) [Link] (2 responses)

I'm afraid "Y2K-*" is destined to be the IT equivalent of politics' "*-gate" for all future date/time-related computer problems... In 2100, they'll be talking about the "Y2K100" bugs (or maybe "Y2.1K", if we're lucky)... ;-)

The SAY2K10 bug

Posted Jan 7, 2010 19:47 UTC (Thu) by jzbiciak (guest, #5246) [Link] (1 responses)

Lovely. On the politics side, I'm still waiting for there to be some water-related scandal, so we can have "Water-gate", and some new scandal involving the Watergate Hotel, so we can have "Watergate-gate".

The SAY2K10 bug

Posted Jan 8, 2010 0:24 UTC (Fri) by nix (subscriber, #2304) [Link]

You could also have a scandal involving maintenance of sluices,
watergate-gate *with a lowercase w*.

Zimbra ops need to fix this too

Posted Jan 7, 2010 15:33 UTC (Thu) by Baylink (guest, #755) [Link]

See http://bugzilla.zimbra.com/show_bug.cgi?id=43766, the file is

/opt/zimbra/conf/spamassassin/72_active.cf

the fix is in that bug.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds