By Jonathan Corbet
January 6, 2010
SpamAssassin is crucial
infrastructure, at least for some of us. So it was with some dismay that
your editor, while performing a quick New Year's Day disaster check, noted
that SpamAssassin had not made the adjustment to 2010 in good form. The
bug was straightforward and easy to fix, but it merits a closer look for
what it reveals about our infrastructure and how we support it.
The task assigned to SpamAssassin, of course, is to look over incoming
email and assign a score to each message indicating how likely that message
is to be spam. It does this job surprisingly well; your editor currently
receives around 5,000 spams per day - one every 17 seconds or so - but it's
a bad day if two dozen of those get past SpamAssassin and show up in the
inbox. Put simply: without
SpamAssassin, your editor's email address would be unusable. All it
takes is a five-minute window without spamd running to see what life would
be like if the incoming mail stream had to be dealt with in its full,
unfiltered glory. This is mission-critical software, so any faults which
turn up in it tend to be of great concern.
The core of SpamAssassin is a vast set of
rules looking for spammy characteristics in incoming email. The rules
match anything that the developers think might indicate spam; some of
the tests include:
- The presence of a rot13-encoded email address.
- Large numbers of blank lines.
- The originating address is in any of a number of network blacklists.
- Discussion of medication in a number of forms.
- HTML messages with huge fonts.
- The presence of URLs registered to known spammers.
...and so on. Each matching rule adds a numeric score to the message; when
the process is complete, the scores are added up to yield a total
spamminess value. The Bayesian recognizer also gets a chance to look at
the message and add a score of its own. At the conclusion of this process,
any message with a score of 5.0 or higher (by default) is considered to be
spam.
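The rule-and-threshold process described above can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's implementation (which is written in Perl): the rule names, the scores, and the string tests are all made up for the example, and the Bayesian component is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List

SPAM_THRESHOLD = 5.0  # the default threshold mentioned in the article


@dataclass
class Rule:
    name: str                       # hypothetical rule name
    score: float                    # illustrative score, not a real SpamAssassin value
    matches: Callable[[str], bool]  # predicate applied to the raw message text


# Toy rules loosely modeled on the tests listed above; names and scores are invented.
RULES: List[Rule] = [
    Rule("MANY_BLANK_LINES", 1.2, lambda msg: "\n\n\n\n" in msg),
    Rule("HUGE_HTML_FONT", 2.1, lambda msg: 'size="7"' in msg.lower()),
    Rule("MENTIONS_MEDICATION", 2.4, lambda msg: "cheap pills" in msg.lower()),
]


def total_score(message: str, rules: List[Rule] = RULES) -> float:
    """Sum the scores of every rule that matches the message."""
    return sum(rule.score for rule in rules if rule.matches(message))


def is_spam(message: str, threshold: float = SPAM_THRESHOLD) -> bool:
    """A message is treated as spam once its total reaches the threshold."""
    return total_score(message) >= threshold
```

With these invented numbers, a message that trips all three rules lands above 5.0 and is flagged, while one that trips none scores zero. The point of the sketch is that no single rule decides the outcome; a rule with a large score, though, leaves little room for the others.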
Some years ago, a SpamAssassin developer noticed that some unwanted mail
came in with dates far in the future. These messages almost certainly
represent an attempt by spammers to take advantage of mail clients which
sort messages by date; a far-future date should show up at the top of the
list. To deal with these messages, said developer wrote a rule matching
any date from the year 2010 or afterward. At the time, 2010 was some years
in the future, so the rule seemed to make sense. Surely somebody would fix
it long before that distant year arrived.
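To make the failure mode concrete, here is a hedged Python sketch contrasting a hard-coded year test with one that compares the Date header against the current time. The regular expression is an assumption written in the spirit of the rule described above, not a copy of the project's rule, and the function names and the two-day slack are likewise hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical pattern: matches any four-digit year from 2010 through 2099
# appearing in the Date header.
HARD_CODED_FUTURE_YEAR = re.compile(r"\b20[1-9][0-9]\b")


def hard_coded_rule(date_header: str) -> bool:
    """Fires on any year >= 2010, no matter what the current date is."""
    return HARD_CODED_FUTURE_YEAR.search(date_header) is not None


def relative_rule(date_header: str, now: datetime,
                  slack: timedelta = timedelta(days=2)) -> bool:
    """Fires only when the Date header is more than `slack` ahead of `now`."""
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable dates are left to other rules
    if sent.tzinfo is None:
        # Headers without a usable zone are treated as UTC in this sketch.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent - now > slack
```

Seen from mid-2008, both functions flag a message dated January 1, 2010. Seen from January 1, 2010 itself, only the hard-coded version still fires, which is exactly the behavior that went wrong. A relative test has no expiration date, though it does depend on the filtering host having a correct clock.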
The scores assigned to rules in SpamAssassin are not random, but neither
are they assigned by the rule authors. Instead, the project uses a "perceptron"
program to determine which combination of scores performs best against a
large body of spam and "ham" email. When this tool was run, legitimate email from
2010 was indeed a rare thing, so the rule turned out to be a very good
positive indicator for spam. As a result, it was assigned a score which,
in some situations, could be as high as 3.5.
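The way a training run can reward such a rule is easy to reproduce with a textbook perceptron. The sketch below is not the project's tool, whose details are not described here; it uses the classic update rule on a tiny invented corpus in which rule 0 (standing in for the future-date rule) fires only on spam and rule 1 fires on spam and ham alike.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], int]  # (rule_hits, label); label 1 = spam, 0 = ham


def train_perceptron(samples: Sequence[Sample], n_rules: int,
                     epochs: int = 20,
                     rate: float = 0.5) -> Tuple[List[float], float]:
    """Learn one weight per rule plus a bias with the classic perceptron update."""
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, label in samples:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                # Nudge the bias and the weights of the rules that fired.
                bias += rate * error
                weights = [w + rate * error * h for w, h in zip(weights, hits)]
    return weights, bias


# Invented corpus: rule 0 never fires on ham, rule 1 fires on both classes.
CORPUS: List[Sample] = [
    ([1, 1], 1),
    ([1, 0], 1),
    ([0, 1], 0),
    ([0, 0], 0),
]

weights, bias = train_perceptron(CORPUS, n_rules=2)
```

On this corpus the rule that never matches ham ends up with the larger, positive weight, while the ambiguous rule is pushed back toward zero. The result is only as good as the corpus: once legitimate mail starts matching rule 0, the learned weight is wrong until the scores are regenerated.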
As of January 1, mail with 2010 dates suddenly became rather more common.
With the year-2010 rule now firing on every message, the SpamAssassin
threshold was, in effect, lowered from 5.0 to as low as
1.5. That, in turn, caused a fair amount of legitimate email to be
classified as spam, a most unwelcome development. Your editor, receiving
5,000 spams every day, has long since stopped scanning the spam folder for
false positives; even if they exist (which they almost never do), they
represent a needle which is almost impossible to find in a haystack that
large. So email classified as spam is, for all practical purposes, simply
lost.
As described in Justin
Mason's weblog, the year-2010 problem was noted by a SpamAssassin
developer in 2008. The rule was duly fixed in the project's repository,
and promptly forgotten about. What the SpamAssassin developers did not do
was any of (1) informing the user community of the rule change,
(2) making a new major release with the fixed rule, or
(3) distributing the rule fix through the sa-update
channel, which exists for just this purpose. So everybody was caught
by surprise - users, distributors, Internet service providers, and the
SpamAssassin developers themselves.
All told, the harm caused by this problem was relatively small and mostly
recoverable. It is a very small blot on SpamAssassin's long record of
making email usable for large numbers of people. But it highlights a few
points which are worthy of note:
- Even those of us who are not running financial exchanges have
critical infrastructure based on free software. When something goes
wrong with that infrastructure, it can hurt our businesses, social
lives, and more.
- Software which plays a crucial part in our operations should really
have a mechanism in place to get important fixes to users quickly.
But, just as importantly, that project has to take great care to
ensure that important fixes get routed into that channel.
SpamAssassin developers had fixed the 2010 problem a long time ago, but
that was not helpful for users, who had no way of knowing about the
problem or its fix. In the kernel realm, it has taken
some years to build the discipline of looking over patches and
considering them for stable kernel updates; there's probably still a
fair number of important fixes which do not get to stable kernel users
because nobody thinks to route them to the stable kernel maintainers.
- Important software requires a certain amount of development and review
time. So it's discouraging to read in Justin's weblog that his
SpamAssassin work happens in his scarce spare time, and that the
project is, in general, short of active developers. Your editor
suspects that the truth of the matter is this: SpamAssassin is long
past its period of rapid development. At this point, it works well,
to the point that there's not a lot of work to be done. So the
interested developers have gone on to other projects.
It would appear that what SpamAssassin needs is some dedicated maintenance
talent which is
not dependent on evening hours put in by developers committed to other
projects. Typically that is the sort of work that requires a paying
customer. Given how many people and companies rely on this software, it
seems like it should be possible to find the money to motivate somebody to
put more time into SpamAssassin maintenance. The hard part is collecting
and administering those funds; that's not something that the free software
community has yet become reliably good at doing.
By Jake Edge
January 6, 2010
There is an effort
underway to enhance the current high school
computer
science curriculum in the US. Spearheaded by the US National Science
Foundation (NSF), the intent is to "transform" high school computing
education from one that is focused on application and programming training
to one that opens
up more of the "magic of computing". The idea is that computing
cuts across many different types of activities and jobs, so narrowly
focusing on things like word processing or Java programming may not provide
a good overview of the field to teenagers.
The NSF executive
summary [PDF] of its "Transforming High School Computing" project cites
several statistics that highlight
the current problems with computing education in the US, along with its
plans for addressing them. Essentially, it would like to see three new classes
developed that will benefit students
who are headed in different directions.
Two of the courses would take the place of today's introductory and
advanced placement (AP) computing classes, while an entirely new course
would be developed for students who are headed to college and interested
in a scientific field. But instead of an introductory class that teaches
how to use a keyboard—something that is likely needed by very few
high school students today—word processing, and the like, the new
"Pre-AP" curriculum would "go beyond mere computer literacy to teaching
fluency in the fundamentals of computing and computational thinking, using
an inquiry-based instructional approach and engaging students with
exciting, 21st century applications."
Likewise, the new AP course for potential science majors will
"explore, in more detail and depth, computational concepts introduced
in the 'Pre-AP' course, including critical thinking, logic, algorithms,
etc." While the text reads a bit like a marketing brochure (which,
in some sense, it is), filled with phrases like "rigorous and
engaging", it would seem to be a step in the right direction.
Another goal is to train 10,000 new teachers in the new curriculum so that
by 2015 the new courses are being taught in 10,000 schools. These are
fairly ambitious goals and will require a public/private
partnership for funding according to the NSF. There will undoubtedly be
large hardware
and software companies falling all over themselves to give money and, more
importantly from their perspective, hardware and software to schools in
support of this effort. That's good, as far as it goes, but the NSF and
those working on the project should most certainly consider the role for
free software as part of the "transformation".
It is certainly true that there is far more to computing than learning how
to use Office and Photoshop (or even OpenOffice and GIMP for that matter).
Students will clearly understand computers and computing better if they get
a sense for what computers can and cannot do. That implies access to a
wide variety of different types of applications, not just those that might
be used in an office or programming job, which is something that free
software can provide much more easily, at a much lower price than the
commercial vendors can.
Consider the breadth of applications available for today's Linux
distributions—all installable at the click of a button. Most
certainly many of them are not as polished as their commercial
counterparts, but they are available to explore. Want to try computer
aided design for the birdhouse you are building in wood shop? There's an
app for that.
AutoCAD, even provided for free, seems a bit
like overkill to explore the idea of CAD.
Tracking down the proper computer with the proper
license for the CAD software also seems like it would be
counterproductive. Free software can be installed easily and abandoned
quickly if it does not suit.
Teacher training could also focus on how to find interesting applications,
and to note particularly good ones for specific purposes. It is far more
useful to understand what a spreadsheet can do, how it works, and how it
can help with your homework, than it is to know the specific function names
in Excel, for example. Just as good programmers can switch languages
fairly easily, computer literate people should be able to switch
applications without much difficulty. That is done by understanding the
underlying concepts and then being able to apply them widely, which is
something that the diversity in free software fosters.
The cost savings of using free software are likely to be quite large, but
the commercial companies will try to reduce that advantage as much as they
can—and take a tax write-off while they are about it. But the
advantages of free software go well beyond the price. For anyone
interested in "how it works", free software offers the ultimate inside
look. From most proprietary software companies, that can't be bought at
any price.
For budding programmers, or those that think they may have an interest,
free software provides not only a look at the code, but also a look
at the development culture. Finding a bug in some package may be
frustrating, but a quick look on Google or the project's web site may find
others who have the same
problem and have a patch available to fix it. There is a lot to be learned
(both good and bad) from grabbing a patch from the internet and rebuilding
an application.
All of that is not to say that the entire curriculum should be narrowly
focused on free software. There is plenty to be learned from the
proprietary brands. Trying to keep Windows and Macs out of the classroom is
not only unlikely to work, but is also a bad idea. Diversity is important when
trying to learn about computers, so seeing how different organizations and
projects do things can only help there.
The information available so far is unclear about what tools will be used
in the new classes. One hopes that the NSF, which has sponsored a whole
lot of free software along the way, doesn't fall into the trap of thinking
that Windows and Mac are the only choices. Even if those two do dominate
the computer labs in high schools, there is plenty of free software that
runs atop them. The benefits of free software outlined here will not
surprise many (if any) LWN readers, but they may not be obvious to those
outside our communities and that's something worth changing.
By Jonathan Corbet
January 5, 2010
Your editor, not generally known for his good sense, has long made a tradition
of putting together a set of Linux-related predictions at the beginning of each
year and posting them for the world to see. There is no particular source
of inside knowledge behind these
predictions, and no real reason to give them more credence than is merited
by much of the material found in one's spam folder. Still, it's a fun
exercise in pondering how things could go and trying to guess what the
important themes will be.
On that note, here are your editor's thoughts for 2010. Any relation to
reality is purely coincidental.
Open hardware platforms will be seen as increasingly important by the
general public. Anybody who saw Verizon's heavy advertising campaign
for its Android-based "Droid" offering will have understood that openness
is now seen as a selling point in the mobile phone market - something which
was not true even a year or two ago. Apple has done us a favor by showing
how painful a restricted platform can be - even if it is a relatively open
one. Future offerings, including the much-hyped "tablet" machines, will be
judged by many criteria, one of which will be "who decides which
applications I can run on it?" Locked-down systems will suffer as a result
of their closed nature.
We'll see a number of Linux-based tablet computers offered to the
market this year. What may take a bit longer to see is just what all of
these machines will really be good for.
Software patents will strike close to home again. Nokia's suit
against Apple is an especially ominous development. We are seeing the
opening of a whole new computing market where none of the traditionally
dominating companies have a commanding share. So it's a bit of a gold
rush, and some companies will undoubtedly rush to gain their gold by way of
the courts.
Copyright assignment policies will be debated by numerous projects
over the course of the year. In the past year, the (attempted,
in-progress) acquisition of MySQL (by way of Sun) by Oracle has clearly shown how
assignment of copyrights to a corporation can go wrong, and Canonical's
imposition of an assignment policy has created a backlash of its own.
Even Eben Moglen, who has argued for copyright
assignments in the past, has stated publicly that MySQL would be better
off with a more diverse ownership structure. Developers in the future will
think harder about signing assignment agreements, and projects will wonder
whether their interests are truly best served by imposing assignment
agreements. Copyright assignment agreements will not go away, but, like
heavy-handed trademark policies, they will come to be seen as an impediment
to freedom which is often counterproductive.
Speaking of MySQL, Oracle's acquisition of Sun will proceed without
the imposition of major changes by the European Union.
Regardless of its long-term plans, Oracle will treat MySQL with a light
hand in the coming year. There will almost certainly be attempts to fork
the project, though, regardless of how Oracle behaves.
The browser war will heat up again, but the main contestants will be
free software. Firefox holds a commanding position, but its heavy weight
and long startup time are enough to push some users to the competition -
which, increasingly, looks to be Google's Chrome. If Google continues to
develop the browser, and continues
to avoid fatal errors like disallowing ad blocking extensions, Chrome
may hold a significant part of the market by the end of the year.
Solid-state storage devices will come into wider use this year, with
some interesting results. For example, the above-mentioned long startup
time for Firefox tends to just vanish when the browser is SSD-based. Wider
use of SSDs will tend to hide lazy or inefficient application development,
but it will also put more pressure on the kernel's block subsystem, which
will struggle to keep up with rapidly-increasing operation rates.
Adventurous distributors will be offering Btrfs by the end of the
year. The filesystem will be feature-complete and stabilizing, but it will
still be very much for adventurous (and well backed up) users at that
point. Ext4, instead, will be moving beyond community distributions and
into "enterprise" production use.
The big kernel lock will be gone from the mainline kernel.
Actually, it will probably remain in a number of places, but things will
have reached a point where a lock_kernel() call is an indication
of old, unmaintained, and unused code. On any reasonably current hardware,
a leading-edge kernel will be able to run with no BKL use at all. This
work will be part of the larger job of getting the realtime preemption
patch set into the mainline, but your editor dares not attempt another
prediction on when that task will be complete.
Production use of LLVM will be on the rise as this compiler matures
and stabilizes. Some of the most interesting uses are likely to be in
nontraditional projects like Unladen Swallow.
There will be a scary security incident involving mobile Linux
devices. Our security is pretty good, but it's far from perfect; just
think, for example, about the number of bugs likely to be found in wireless
network drivers, which are quite complex and reviewed by relatively few
people.
Speaking of security, 2010 will be the year of the sandbox.
Technologies like SELinux, AppArmor, and TOMOYO will not be going away, but
increasing numbers of people will decide that many security objectives are
more easily obtained by just placing at-risk processes into their own box.
There will be lots of talk of clouds, with companies stumbling over
each other to become the host for some portion of our lives. Your editor
can only hope that, at some point, this rush toward highly centralized
services will be countered by a push for personal control of data. Perhaps
members of our community will make it easy for nontechnical users to set up
"cloudlets" for individual or small-group use, with a focus on individual
control and portability.
GNOME 3 will be released. Learning from the KDE 4 experience, the
GNOME developers will promote their work less and focus more on not
breaking things for users. The result will be a launch which draws
relatively little attention, of either the good or the bad variety, but
which lays the base for the platform's future development.
Developers will start using Python 3 as that language becomes more
widely available in community distributions. By the end of the year, a
small number of Python 3 programs will be in reasonably wide use.
Meanwhile, we'll still be waiting for Perl 6.
Community distributions will grow in commercial importance over the
course of the year. Distributions like Debian and Gentoo already show up
in surprising places, with prominent organizations choosing them for their
combination of stability, broad software selection, and great support.
More companies will begin to realize that the "enterprise distribution"
model is not perfect for all situations and will go looking for solutions
which bring them closer to the communities which create all of that
software in the first place.
Linux and free software will be stronger than ever at the end of
the year. Yes, your editor makes this prediction every year, but it has
proved rather more reliable than most of the others. It makes sense to go
with a known winner, and, in any case, this prediction is easy to justify.
The software keeps getting better, the community gets larger, and the value
of free software is becoming more widely understood. There doesn't seem to
be any reason for any of that to change anytime soon.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: GSM encryption crack made public; New vulnerabilities in automake, kernel, NetworkManager, wireshark,...
- Kernel: Restricting the network; Memory compaction; RCU strings.
- Distributions: XtreemOS; news from Debian, Mandriva and Ubuntu.
- Development: A look at Thunderbird 3, Getting to Samba 4, KDE in 2009, new versions of mpd, MySQL, SQLite, OpenChange, sendmail, conntrack-tools, stdeb, Midgard, JackEQ, jcgui, GNOME, MapOSMatic, SPTK, Scilab, ams, HylaFAX, Firefox, Forban, Numpy, Git, Mercurial, GNU patch.
- Announcements: Red Hat results, MySQL campaign, sonos experience, Childsplay review, Linux Gazette returns, Kernel performance tracking, OLPC 2012, Bossa Conf cfp, OSCON cfp, OSDC.TW cfp, FOSDEM interviews, LibrePlanet conf.