
LWN.net Weekly Edition for January 7, 2010

The SAY2K10 bug

By Jonathan Corbet
January 6, 2010
SpamAssassin is crucial infrastructure, at least for some of us. So it was with some dismay that your editor, while performing a quick New Year's Day disaster check, noted that SpamAssassin had not made the adjustment to 2010 in good form. The bug was straightforward and easy to fix, but it merits a closer look for what it reveals about our infrastructure and how we support it.

The task assigned to SpamAssassin, of course, is to look over incoming email and assign a score to each message indicating how likely that message is to be spam. It does this job surprisingly well; your editor currently receives around 5,000 spams per day - one every 17 seconds or so - but it's a bad day if two dozen of those get past SpamAssassin and show up in the inbox. Put simply: without SpamAssassin, your editor's email address would be unusable. All it takes is a five-minute window without spamd running to see what life would be like if the incoming mail stream had to be dealt with in its full, unfiltered glory. This is mission-critical software, so any faults which turn up in it tend to be of great concern.

The core of SpamAssassin is a vast set of rules looking for spammy characteristics in incoming email. The rules match anything that the developers think might indicate spam; some of the tests include:

  • The presence of a rot13-encoded email address.

  • Large numbers of blank lines.

  • The originating address is in any of a number of network blacklists.

  • Discussion of medication in a number of forms.

  • HTML messages with huge fonts.

  • The presence of URLs registered to known spammers.

...and so on. Each matching rule adds a numeric score to the message; when the process is complete, the scores are added up to yield a total spamminess value. The bayesian recognizer also gets a chance to look at the message and add a score of its own. At the conclusion of this process, any message with a score of 5.0 or higher (by default) is considered to be spam.
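
The additive scheme described above can be sketched in a few lines of Python. This is only an illustration: the rule names, patterns, and per-rule scores are invented for the example and are not taken from SpamAssassin's rule set, and the bayesian component is left out. Only the 5.0 default threshold comes from the description above.

```python
import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str                       # hypothetical rule name
    score: float                    # hypothetical score added when the rule fires
    matches: Callable[[str], bool]  # test applied to the raw message text


# Invented stand-ins for the kinds of tests listed above.
RULES = [
    Rule("MANY_BLANK_LINES", 1.2, lambda msg: "\n\n\n\n" in msg),
    Rule("HUGE_FONT", 2.0,
         lambda msg: re.search(r"font-size:\s*\d{3,}", msg) is not None),
    Rule("MENTIONS_MEDS", 2.5,
         lambda msg: re.search(r"\bpills?\b", msg, re.I) is not None),
]

THRESHOLD = 5.0  # the default threshold mentioned in the text


def spam_score(message: str) -> float:
    """Add up the scores of every rule that matches the message."""
    return sum(rule.score for rule in RULES if rule.matches(message))


def is_spam(message: str, threshold: float = THRESHOLD) -> bool:
    """A message at or above the threshold is treated as spam."""
    return spam_score(message) >= threshold
```

No single rule here reaches the threshold by itself; a message is only flagged when several indicators fire together.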

Some years ago, a SpamAssassin developer noticed that some unwanted mail came in with dates far in the future. These messages almost certainly represent an attempt by spammers to take advantage of mail clients which sort messages by date; a far-future date should show up at the top of the list. To deal with these messages, said developer wrote a rule matching any date from the year 2010 or afterward. At the time, 2010 was some years in the future, so the rule seemed to make sense. Surely somebody would fix it long before that distant year arrived.
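
The rule itself is not shown here, so what follows is a rough sketch under the assumption that it amounted to a pattern match on the year in the Date header. The pattern and both function names are hypothetical. The second function shows one alternative that does not age: comparing the header against the current time with some slack for clock skew.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed shape of the rule: any four-digit year from 2010 through 2099.
HARDCODED_FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def hardcoded_rule(date_header: str) -> bool:
    """Fires on any Date header containing a year of 2010 or later."""
    return HARDCODED_FUTURE_YEAR.search(date_header) is not None


def relative_rule(date_header: str, now: datetime,
                  slack: timedelta = timedelta(days=2)) -> bool:
    """Fires only when the date is later than 'now' plus some slack."""
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable dates are left to other rules
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    return sent > now + slack


new_years_day = datetime(2010, 1, 1, 12, 0, tzinfo=timezone.utc)
legit = "Fri, 01 Jan 2010 09:30:00 +0000"
far_future = "Mon, 01 Jan 2035 00:00:00 +0000"
```

With these inputs, `hardcoded_rule(legit)` fires once the calendar reaches 2010, while `relative_rule(legit, new_years_day)` does not; both still catch `far_future`.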

The scores assigned to rules in SpamAssassin are not random, but neither are they assigned by the rule authors. Instead, the project uses a "perceptron" program to determine which combination of scores performs best against a large body of spam and "ham" email. When this tool was run, legitimate email from 2010 was indeed a rare thing, so the rule turned out to be a very good positive indicator for spam. As a result, it was assigned a score which, in some situations, could be as high as 3.5.
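
To see why such a tool would reward the year-2010 rule, consider a textbook perceptron trained on a tiny, made-up corpus. This is not the project's actual program, and the learned weights are not SpamAssassin scores; the corpus, the learning rate, and the function names are all assumptions made for illustration.

```python
def train_scores(corpus, n_rules, epochs=20, learning_rate=0.1):
    """Learn one weight per rule with the classic perceptron update.

    corpus: list of (hits, is_spam) pairs, where hits is a tuple of 0/1
    flags saying which rules fired on that message.
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, is_spam in corpus:
            activation = bias + sum(w for w, h in zip(weights, hits) if h)
            predicted_spam = activation > 0
            if predicted_spam != is_spam:
                # Nudge the weights of the rules that fired toward the
                # correct answer.
                step = learning_rate if is_spam else -learning_rate
                bias += step
                weights = [w + step if h else w
                           for w, h in zip(weights, hits)]
    return weights, bias


def predict(weights, bias, hits):
    return bias + sum(w for w, h in zip(weights, hits) if h) > 0


# Rule 0: "date is 2010 or later" (fires only on spam in this old corpus).
# Rule 1: some weak indicator that fires on spam and ham alike.
corpus = [
    ((1, 0), True),
    ((1, 1), True),
    ((0, 1), False),
    ((0, 0), False),
]
weights, bias = train_scores(corpus, n_rules=2)
```

Because rule 0 never fires on ham in the training data, it ends up with the larger weight. The training cannot know that the corpus will stop being representative once the calendar catches up.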

As of January 1, mail with 2010 dates suddenly became rather more common. With the year-2010 rule now firing on every message, the SpamAssassin threshold was, in effect, lowered from 5.0 to as low as 1.5. That, in turn, caused a fair amount of legitimate email to be classified as spam, a most unwelcome development. Your editor, receiving 5,000 spams every day, has long since stopped scanning the spam folder for false positives; even if they exist (which they almost never do), they represent a needle which is almost impossible to find in a haystack that large. So email classified as spam is, for all practical purposes, simply lost.
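
The arithmetic is simple enough to write down. In this small sketch, the 5.0 threshold and the 3.5 worst-case rule score come from the text; the 2.0 score for a legitimate message is an invented example.

```python
THRESHOLD = 5.0             # default spam threshold
YEAR_2010_RULE_SCORE = 3.5  # worst-case score for the date rule


def effective_threshold(threshold: float, always_firing_score: float) -> float:
    """Score the remaining rules need to reach when one rule always fires."""
    return threshold - always_firing_score


def classified_as_spam(other_rules_score: float, date_rule_fires: bool) -> bool:
    total = other_rules_score + (YEAR_2010_RULE_SCORE if date_rule_fires else 0.0)
    return total >= THRESHOLD


# A legitimate message that happens to trip a couple of minor rules
# (the 2.0 figure is hypothetical).
ham_score = 2.0
before_2010 = classified_as_spam(ham_score, date_rule_fires=False)
after_2010 = classified_as_spam(ham_score, date_rule_fires=True)
```

Here `effective_threshold(THRESHOLD, YEAR_2010_RULE_SCORE)` is 1.5, and the same message that passed before the new year (`before_2010` is false) is rejected afterward (`after_2010` is true).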

As described in Justin Mason's weblog, the year-2010 problem was noted by a SpamAssassin developer in 2008. The rule was duly fixed in the project's repository, and promptly forgotten about. What the SpamAssassin developers did not do was any of (1) informing the user community of the rule change, (2) making a new major release with the fixed rule, or (3) distributing the rule fix through the sa-update channel, which exists for just this purpose. So everybody was caught by surprise - users, distributors, Internet service providers, and the SpamAssassin developers themselves.

All told, the harm caused by this problem was relatively small and mostly recoverable. It is a very small blot on SpamAssassin's long record of making email usable for large numbers of people. But it highlights a few points which are worthy of note:

  • Even those of us who are not running financial exchanges have critical infrastructure based on free software. When something goes wrong with that infrastructure, it can hurt our businesses, social lives, and more.

  • Software which plays a crucial part in our operations should really have a mechanism in place to get important fixes to users quickly. But, just as importantly, that project has to take great care to ensure that important fixes get routed into that channel. SpamAssassin developers had fixed the 2010 problem a long time ago, but that was not helpful for users, who had no way of knowing about the problem or its fix. In the kernel realm, it has taken some years to build the discipline of looking over patches and considering them for stable kernel updates; there's probably still a fair number of important fixes which do not get to stable kernel users because nobody thinks to route them to the stable kernel maintainers.

  • Important software requires a certain amount of development and review time. So it's discouraging to read in Justin's weblog that his SpamAssassin work happens in his scarce spare time, and that the project is, in general, short of active developers. Your editor suspects that the truth of the matter is this: SpamAssassin is long past its period of rapid development. At this point, it works well, to the point that there's not a lot of work to be done. So the interested developers have gone on to other projects.

It would appear that what SpamAssassin needs is some dedicated maintenance talent which is not dependent on evening hours put in by developers committed to other projects. Typically that is the sort of work that requires a paying customer. Given how many people and companies rely on this software, it seems like it should be possible to find the money to motivate somebody to put more time into SpamAssassin maintenance. The hard part is collecting and administering those funds; that's not something that the free software community has yet become reliably good at doing.


Computer science education and free software

By Jake Edge
January 6, 2010

There is an effort underway to enhance the current high school computer science curriculum in the US. Spearheaded by the US National Science Foundation (NSF), the intent is to "transform" high school computing education from one that is focused on application and programming training to one that opens up more of the "magic of computing". The idea is that computing cuts across many different types of activities and jobs, so narrowly focusing on things like word processing or Java programming may not provide a good overview of the field to teenagers.

The NSF executive summary [PDF] of its "Transforming High School Computing" project cites several statistics that highlight the current problems with computing education in the US, along with its plans for addressing them. Essentially, it would like to see three new classes developed that will benefit students who are headed in different directions.

Two of the courses would take the place of today's introductory and advanced placement (AP) computing classes, while an entirely new course would be developed for students who are headed to college and interested in a scientific field. But instead of an introductory class that teaches keyboard use (something that is likely needed by very few high school students today), word processing, and the like, the new "Pre-AP" curriculum would "go beyond mere computer literacy to teaching fluency in the fundamentals of computing and computational thinking, using an inquiry-based instructional approach and engaging students with exciting, 21st century applications."

Likewise, the new AP course for potential science majors will "explore, in more detail and depth, computational concepts introduced in the 'Pre-AP' course, including critical thinking, logic, algorithms, etc." While the text reads a bit like a marketing brochure (which, in some sense, it is), filled with phrases like "rigorous and engaging", it would seem to be a step in the right direction.

Another goal is to train 10,000 new teachers in the new curriculum so that by 2015 the new courses are being taught in 10,000 schools. These are fairly ambitious goals and will require a public/private partnership for funding according to the NSF. There will undoubtedly be large hardware and software companies falling all over themselves to give money and, more importantly from their perspective, hardware and software to schools in support of this effort. That's good, as far as it goes, but the NSF and those working on the project should most certainly consider the role for free software as part of the "transformation".

It is certainly true that there is far more to computing than learning how to use Office and Photoshop (or even OpenOffice and GIMP for that matter). Students will clearly understand computers and computing better if they get a sense for what computers can and cannot do. That implies access to a wide variety of different types of applications, not just those that might be used in an office or programming job, which is something that free software can provide much more easily, at a much lower price than the commercial vendors can.

Consider the breadth of applications available for today's Linux distributions—all installable at the click of a button. Most certainly many of them are not as polished as their commercial counterparts, but they are available to explore. Want to try computer aided design for the birdhouse you are building in wood shop? There's an app for that.

AutoCAD, even provided for free, seems a bit like overkill to explore the idea of CAD. Tracking down the proper computer with the proper license for the CAD software also seems like it would be counterproductive. Free software can be installed easily and abandoned quickly if it does not suit.

Teacher training could also focus on how to find interesting applications, and to note particularly good ones for specific purposes. It is far more useful to understand what a spreadsheet can do, how it works, and how it can help with your homework, than it is to know the specific function names in Excel, for example. Just as good programmers can switch languages fairly easily, computer literate people should be able to switch applications without much difficulty. That is done by understanding the underlying concepts and then being able to apply them widely, which is something that the diversity in free software fosters.

The cost savings of using free software are likely to be quite large, but the commercial companies will try to reduce that advantage as much as they can—and take a tax write-off while they are about it. But the advantages of free software go well beyond the price. For anyone interested in "how it works", free software offers the ultimate inside look. From most proprietary software companies, that can't be bought at any price.

For budding programmers, or those that think they may have an interest, free software provides not only a look at the code, but also a look at the development culture. Finding a bug in some package may be frustrating, but a quick look on Google or the project's web site may find others who have the same problem and have a patch available to fix it. There is a lot to be learned (both good and bad) from grabbing a patch from the internet and rebuilding an application.

All of that is not to say that the entire curriculum should be narrowly focused on free software. There is plenty to be learned from the proprietary brands. Trying to keep Windows and Macs out of the classroom is unlikely to work, and it would be a bad idea in any case. Diversity is important when trying to learn about computers, so seeing how different organizations and projects do things can only help there.

The information available so far is unclear about what tools will be used in the new classes. One hopes that the NSF, which has sponsored a whole lot of free software along the way, doesn't fall into the trap of thinking that Windows and Mac are the only choices. Even if those two do dominate the computer labs in high schools, there is plenty of free software that runs atop them. The benefits of free software outlined here will not surprise many (any) LWN readers, but they may not be obvious to those outside our communities and that's something worth changing.


Looking forward to 2010

By Jonathan Corbet
January 5, 2010
Your editor, not generally known for his good sense, has long made a tradition of putting together a set of Linux-related predictions at the beginning of each year and posting them for the world to see. There is no particular source of inside knowledge behind these predictions, and no real reason to give them more credence than is merited by much of the material found in one's spam folder. Still, it's a fun exercise in pondering how things could go and trying to guess what the important themes will be.

On that note, here are your editor's thoughts for 2010. Any relation to reality is purely coincidental.

Open hardware platforms will be seen as increasingly important by the general public. Anybody who saw Verizon's heavy advertising campaign for its Android-based "Droid" offering will have understood that openness is now seen as a selling point in the mobile phone market - something which was not true even a year or two ago. Apple has done us a favor by showing how painful a restricted platform can be - even if it is a relatively open one. Future offerings, including the much-hyped "tablet" machines, will be judged by many criteria, one of which will be "who decides which applications I can run on it?" Locked-down systems will suffer as a result of their closed nature.

We'll see a number of Linux-based tablet computers offered to the market this year. What may take a bit longer to see is just what all of these machines will really be good for.

Software patents will strike close to home again. Nokia's suit against Apple is an especially ominous development. We are seeing the opening of a whole new computing market where none of the traditionally dominating companies have a commanding share. So it's a bit of a gold rush, and some companies will undoubtedly rush to gain their gold by way of the courts.

Copyright assignment policies will be debated by numerous projects over the course of the year. In the past year, the (attempted, in-progress) acquisition of MySQL (by way of Sun) by Oracle has clearly shown how assignment of copyrights to a corporation can go wrong, and Canonical's imposition of an assignment policy has created a backlash of its own. Even Eben Moglen, who has argued for copyright assignments in the past, has stated publicly that MySQL would be better off with a more diverse ownership structure. Developers in the future will think harder about signing assignment agreements, and projects will wonder whether their interests are truly best served by imposing assignment agreements. Copyright assignment agreements will not go away, but, like heavy-handed trademark policies, they will come to be seen as an impediment to freedom which is often counterproductive.

Speaking of MySQL, Oracle's acquisition of Sun will proceed without the imposition of major changes by the European Union. Regardless of its long-term plans, Oracle will treat MySQL with a light hand in the coming year. There will almost certainly be attempts to fork the project, though, regardless of how Oracle behaves.

The browser war will heat up again, but the main contestants will be free software. Firefox holds a commanding position, but its heavy weight and long startup time are enough to push some users to the competition - which, increasingly, looks to be Google's Chrome. If Google continues to develop the browser, and continues to avoid fatal errors like disallowing ad blocking extensions, Chrome may hold a significant part of the market by the end of the year.

Solid-state storage devices will come into wider use this year, with some interesting results. For example, the above-mentioned long startup time for Firefox tends to just vanish when the browser is SSD-based. Wider use of SSDs will tend to hide lazy or inefficient application development, but it will also put more pressure on the kernel's block subsystem, which will struggle to keep up with rapidly-increasing operation rates.

Adventurous distributors will be offering Btrfs by the end of the year. The filesystem will be feature-complete and stabilizing, but it will still be very much for adventurous (and well backed up) users at that point. Ext4, instead, will be moving beyond community distributions and into "enterprise" production use.

The big kernel lock will be gone from the mainline kernel. Actually, it will probably remain in a number of places, but things will have reached a point where a lock_kernel() call is an indication of old, unmaintained, and unused code. On any reasonably current hardware, a leading-edge kernel will be able to run with no BKL use at all. This work will be part of the larger job of getting the realtime preemption patch set into the mainline, but your editor dares not attempt another prediction on when that task will be complete.

Production use of LLVM will be on the rise as this compiler matures and stabilizes. Some of the most interesting uses are likely to be in nontraditional projects like Unladen Swallow.

There will be a scary security incident involving mobile Linux devices. Our security is pretty good, but it's far from perfect; just think, for example, about the number of bugs likely to be found in wireless network drivers, which are quite complex and reviewed by relatively few people.

Speaking of security, 2010 will be the year of the sandbox. Technologies like SELinux, AppArmor, and TOMOYO will not be going away, but increasing numbers of people will decide that many security objectives are more easily obtained by just placing at-risk processes into their own box.

There will be lots of talk of clouds, with companies stumbling over each other to become the host for some portion of our lives. Your editor can only hope that, at some point, this rush toward highly centralized services will be countered by a push for personal control of data. Perhaps members of our community will make it easy for nontechnical users to set up "cloudlets" for individual or small-group use, with a focus on individual control and portability.

GNOME 3 will be released. Learning from the KDE 4 experience, the GNOME developers will promote their work less and focus more on not breaking things for users. The result will be a launch which draws relatively little attention, of either the good or the bad variety, but which lays the base for the platform's future development.

Developers will start using Python 3 as that language becomes more widely available in community distributions. By the end of the year, a small number of Python 3 programs will be in reasonably wide use. Meanwhile, we'll still be waiting for Perl 6.

Community distributions will grow in commercial importance over the course of the year. Distributions like Debian and Gentoo already show up in surprising places, with prominent organizations choosing them for their combination of stability, broad software selection, and great support. More companies will begin to realize that the "enterprise distribution" model is not perfect for all situations and will go looking for solutions which bring them closer to the communities which create all of that software in the first place.

Linux and free software will be stronger than ever at the end of the year. Yes, your editor makes this prediction every year, but it has proved rather more reliable than most of the others. It makes sense to go with a known winner, and, in any case, this prediction is easy to justify. The software keeps getting better, the community gets larger, and the value of free software is becoming more widely understood. There doesn't seem to be any reason for any of that to change anytime soon.


Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Security: GSM encryption crack made public; New vulnerabilities in automake, kernel, NetworkManager, wireshark,...
  • Kernel: Restricting the network; Memory compaction; RCU strings.
  • Distributions: XtreemOS; news from Debian, Mandriva and Ubuntu.
  • Development: A look at Thunderbird 3, Getting to Samba 4, KDE in 2009, new versions of mpd, MySQL, SQLite, OpenChange, sendmail, conntrack-tools, stdeb, Midgard, JackEQ, jcgui, GNOME, MapOSMatic, SPTK, Scilab, ams, HylaFAX, Firefox, Forban, Numpy, Git, Mercurial, GNU patch.
  • Announcements: Red Hat results, MySQL campaign, sonos experience, Childsplay review, Linux Gazette returns, Kernel performance tracking, OLPC 2012, Bossa Conf cfp, OSCON cfp, OSDC.TW cfp, FOSDEM interviews, LibrePlanet conf.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds