, we compared
with a new set of spam
filtering systems based on Bayesian
analysis. The Bayesian approach, while seemingly "dumber," looked poised
to depose SpamAssassin's complex set of rules as the most effective
filtering system. Bayesian filters are fast, and they can learn from the
actual mail received by each potential consumer of anatomical enlargements
and Nigerian investment schemes. The combination of speed and adaptability
looked like a winning pair.
Of course, SpamAssassin has not stood still in the intervening months. The
SpamAssassin team released version 2.50 a couple of weeks ago; this release
is described as a "beta version." No true Linux user fears beta-quality
software, so we decided to have a look.
The SpamAssassin 2.50 rule
set is more impressive than ever. Several hundred rules detect forged
headers, cellphone pitches, Nigerian scams, mortgage refinance pitches,
mentions of Oprah, porn terms ("various types of feline"), suspicious HTML,
"my wife Jody," 100% guarantees, etc. There is
a growing set of rules aimed at catching spam in languages other than
English. One can only marvel at the noble souls who examine that much spam
closely enough to develop these tests.
As is usually the case with a new SpamAssassin release, the new
rules are far more effective than the previous set - for a while at least.
The normal pattern, though, is that spammers begin to figure out how to
evade the latest rules, and the number of false negatives slowly rises over
time.
The truly significant new feature of version 2.50 may help change that
pattern, however. Included in this release is, of course, a Bayesian
filter. SpamAssassin will now remember the terms it finds in both spam and
"ham," and assign to each a probability that the message containing it is
spam. The Bayesian filter is not actually used to classify mail until a
sizeable body of both good and bad mail has been processed. Once that
threshold has been reached, the Bayesian filter will simply assign points
to a message like any other test.
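
The per-term bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not SpamAssassin's actual code: the class name, the whitespace tokenizer, the simple frequency-ratio estimate, and the training thresholds are all assumptions made for the example.

```python
from collections import Counter


class TokenStats:
    """Toy per-term spam statistics (hypothetical, not SpamAssassin's code)."""

    def __init__(self, min_spam=200, min_ham=200):
        # Assumed thresholds: the filter stays silent until it has seen
        # at least this many messages of each kind.
        self.min_spam = min_spam
        self.min_ham = min_ham
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Record which terms appear in one message of known type."""
        tokens = set(text.lower().split())  # count each term once per message
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def ready(self):
        """True once enough spam and ham have been processed."""
        return self.n_spam >= self.min_spam and self.n_ham >= self.min_ham

    def token_probability(self, token):
        """Estimate the chance that a message containing `token` is spam."""
        if not self.ready():
            return None  # not enough training data yet
        spam_freq = self.spam_tokens[token] / self.n_spam
        ham_freq = self.ham_tokens[token] / self.n_ham
        if spam_freq + ham_freq == 0:
            return 0.5  # term never seen: no evidence either way
        return spam_freq / (spam_freq + ham_freq)
```

A term seen only in spam ends up with a probability of 1.0, a term seen only in ham with 0.0, and until both training thresholds are met the sketch declines to answer at all.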
In 2.50, the scoring is set up so that the Bayesian filter, by itself,
cannot condemn a message as spam. Even if the filter says the probability
of spam is 99%, a maximum of 4.3 points (out of the 5 needed by default)
will be assigned. Still, the Bayesian filter is sufficient to tip the
balance on many spams that would have otherwise been classified as real
mail. We ran a quick test with 5,000 messages from the LWN.net inbox;
before training, SpamAssassin flagged 3,935 of them as spam. Afterwards,
it caught 4,139 instead. That's 204 spams we would not have to see, and
that can only be a good thing.
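
To make the scoring arithmetic concrete, here is a small Python sketch. The 5-point threshold, the 4.3-point cap, and the message counts come from the text above; the linear mapping from probability to points and the function names are assumptions for illustration, and the real scoring scheme may well differ.

```python
SPAM_THRESHOLD = 5.0    # default points needed to call a message spam
BAYES_MAX_POINTS = 4.3  # most the Bayesian test can contribute in 2.50


def bayes_points(probability):
    """Map a spam probability in [0, 1] to points.

    Assumed mapping: nothing at or below 0.5, then a linear ramp up to
    the cap at 1.0. Only the cap itself is taken from the article.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability <= 0.5:
        return 0.0
    return BAYES_MAX_POINTS * (probability - 0.5) / 0.5


def is_spam(rule_points, probability):
    """Combine the ordinary rule score with the Bayesian contribution."""
    return rule_points + bayes_points(probability) >= SPAM_THRESHOLD


# Figures from the quick test on the LWN.net inbox.
caught_before_training = 3935
caught_after_training = 4139
newly_caught = caught_after_training - caught_before_training
```

Under these assumptions a message with no rule hits stays below the threshold even at a probability of 1.0, while a single point from the regular rules is enough to tip a 99% message over the line.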
The combination of SpamAssassin's rule base and the Bayesian filter
addresses one of the biggest weaknesses of the Bayesian approach: the
filter must be trained. Everybody gets a unique mix of mail, and, believe
it or not, a unique mix of spam. There is no set of Bayesian rules that
will work for everybody. If you happen to have nicely sorted piles of spam
and real mail sitting around, you can quickly train the SpamAssassin filter
via the sa-learn tool. But most users will simply want the filter
to work without an extra effort on their part.
SpamAssassin's rules can help make that happen. Even without the Bayesian
filter, SpamAssassin can flag most of the spam in a user's mail stream. It
can, thus, train the Bayesian filter by itself; that filter will eventually
start catching spam which sneaks past the regular rules. SpamAssassin has
become a tool which learns how to do its job better over time. As a result, we are
free to spend our time dealing with more interesting things. It would be
hard to find a better example of useful, user-friendly free software.
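
The self-training idea can be sketched as a simple loop, shown here in Python. Everything in it is hypothetical: the function name, the callback signatures, and the two cutoffs are invented for the example, on the assumption that only messages the rules score confidently should be fed to the Bayesian filter.

```python
def auto_train(messages, rule_score, learn, spam_cutoff=10.0, ham_cutoff=0.5):
    """Feed confidently classified mail to a learner.

    rule_score(msg) returns the points assigned by the regular rules;
    learn(msg, is_spam) trains the Bayesian filter. Both cutoffs are
    assumed values; borderline messages are skipped so that a doubtful
    verdict does not pollute the training data.
    """
    counts = {"spam": 0, "ham": 0, "skipped": 0}
    for msg in messages:
        score = rule_score(msg)
        if score >= spam_cutoff:
            learn(msg, True)
            counts["spam"] += 1
        elif score <= ham_cutoff:
            learn(msg, False)
            counts["ham"] += 1
        else:
            counts["skipped"] += 1
    return counts
```

Skipping the middle band is the key design choice in this sketch: the filter learns only from mail the rules are sure about, then gradually takes over the borderline cases itself.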
[This article was contributed by Joe 'Zonker' Brockmeier]
Linux International (LI) has been an
extremely important organization in the growth of Linux. But LI has been
kind of quiet as of late. Its website hasn't been updated in a while,
and the last press release put out by the organization on their home
page is dated March 1, 2002. We caught up with LI president Jon "maddog"
Hall recently and asked him where things stood. He assured us that LI is
still working to promote Linux:
Press releases are not the only indication of life. In the past year I
have addressed over 57 groups with thousands of people (including
hundreds of press) on issues of Linux and Open Source. LI had major
input (for example) into the recent article in Business Week. While the
popular press always puts their own spin on what you say, and while I
did not agree with each point of the article, we got some good
The BusinessWeek articles Hall referred to have generated a lot of discussion for Linux. But they don't reflect LI's influence. Hall says that's the way that he wants it:
It's [LI's influence] there more than you know. It's just that...we
don't put out a lot of press releases and we don't say what we've done.
Part of that is my philosophy...I've been trying to promote Linux and
trying to promote member companies.
He does admit, however, that the modesty can backfire when trying to
justify LI dues to member companies and potential members. Dues for
corporate sponsors are $10,000, and dues for member companies are $2,500
annually. Hall says that it can be difficult to convince companies to
fund LI, even though some can spend much more than that on their
individual marketing efforts.
Another function of LI that doesn't get discussed much is the
stewardship of the Linux trademark. For the most part, that's been spun
off to the Linux Mark Institute
(LMI), a non-profit that controls the Linux trademark and grants use of
the trademark to commercial entities for a fee. Hall says the fee is
necessary to build a "warchest" to deal with legal issues instead of
taking the money from LI's budget. Unless there's a "sticky situation"
with a trademark, says Hall, the process is more or less automatic.
Hall also mentioned that LI/LMI have stepped in when people have tried
to trademark Linux in other countries and attempted to "hold it for
ransom."
Hall did note that LI is undergoing some changes, particularly in terms
of membership. "The industry has changed...used to be lots and lots of
small companies, but the small companies have gone away and Linux has
become not a 'Linux-only' product, but a 'Linux-also' product." Because
of this, he says, companies like IBM and HP wonder what LI can do for
them when they can afford to put money into Linux advertising that is
branded with their company name instead of promoting Linux alone.
So is LI still relevant, if these companies are putting big bucks into
marketing Linux? Hall says yes, because it can do things that IBM and HP
cannot. For example, Hall mentioned that he wants to focus on providing
materials for Linux User Groups to use to promote Linux. Because LI is
vendor-neutral, Hall says that it will have much more success working
with community groups to promote Linux on behalf of its member companies
than those companies would directly:
We've been very good in doing things that are very difficult for one
company to do. Helping the trademark, helping LPI (Linux Professional
Institute), helping standards...there's an important place for us in the
usergroups. I don't think that IBM would want HP by itself addressing
all the user groups.
Jeremy White, CEO of CodeWeavers and an LI supporter, says that he
believes that LI was and continues to be important. White says that the
case studies on LI's site have been useful for him, and also says that
Hall's role as a Linux promoter is very important. "The most important
thing they do is having Jon fly around and evangelize Linux. That's
important to me. Linux, at least on the desktop, is very much in the
adoption phase...I need the whole market to grow."
Hall says that there is still a lot of work to do in promoting Linux:
As long as people have the questions that I have experienced over the
past several weeks, I have to disagree. While people may know what the
word "Linux" means, they do not understand the market or even what free
software stands for. If you believe that Linux has been successfully
"promoted", I submit to you that you are sadly mistaken.
The nasty remotely exploitable vulnerability in sendmail is covered in this
week's Security Page. There is one aspect
of this episode, however, which deserves a separate look: the involvement
of the new U.S. Department of Homeland Security (DHS).
The disclosure of this vulnerability was, it seems, coordinated by the DHS
Analysis & Infrastructure Protection Directorate. This is new:
government agencies have not generally handled the response to
vulnerabilities in the past. Given that the government's security
interests have not always aligned all that well with everybody else's
interests (consider the whole issue of cryptographic code), this
involvement should be watched closely.
On the face of it, the sendmail disclosure was handled reasonably well.
The process took a little while (the vulnerability was first discovered in
December), but the information release went as it should. There were no
early disclosures (which have occasionally been a problem in the past), and
almost everybody was in a position to have an
update available when the disclosure did happen. System administrators who
are paying attention should not get hit when the vulnerability is
exploited.
There is an interesting statement in InfoWorld's
coverage of the disclosure process, however:
When Sendmail patches were ready, the coordinating team managed
their release to the DoD, providing early protection to military
sites on February 25 and 26, four days before the general public
was informed, SANS said. Warnings were more widely issued to
government groups in the U.S. and in other countries on February 27
and 28, including U.S. Cabinet level departments, national cyber
security offices in other countries and Information Sharing and
Analysis Centers (ISACs) for critical infrastructure, SANS said.
In other words, the military got to hear about the problem before anybody
else, and the rest of the government also got a couple of days of lead
time. The DHS put the needs of the government above those of everybody
else, including "critical infrastructure companies."
Things worked out in the sendmail case. As the DHS gets itself
established, however, we should be concerned about how it might handle
vulnerabilities in the future. It does not take a great deal of paranoia
to imagine that disclosure of some problems could be suppressed
altogether. It is also not hard to imagine future regulations
criminalizing the disclosure of vulnerabilities without governmental
approval. After all, that is the sort of thing governments do, and the
current U.S. administration is rather more inclined in that direction than
most.
True information systems security requires disclosure of vulnerabilities.
One can imagine a governmental role in the coordination of that disclosure
to effect a quick and universal availability of patches - though it is far
from clear that this role is truly necessary. But a high level
of vigilance will be required to keep the governmental role from expanding
to where it subverts the disclosure process altogether.
As promised last week, we have implemented a slight change in the pricing
of subscriptions to LWN. One-time subscriptions (or extensions) of at
least ten months will now be given a 10% discount. This discount reflects
the fact that longer-term subscriptions cost us less to manage.
Remember that subscriptions are what keeps LWN.net alive; if you have not
yet subscribed, please consider signing up now.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: The sendmail bug; new vulnerabilities in file, snort, tcpdump, ...
- Kernel: remap_file_pages(); spelling fixes; three new driver porting articles.
- Distributions: Is RPM Doomed? (DistroWatch)
- Development: The Speex RC3 Speech Codec,
ALSA 0.9.0rc8, gEDA/gaf 20030223, Qtopia 1.6,
Aegir CMS 1.0 RC 2, LSB PPC64 and S390X specs,
Crystal Space 0.96r003, Samba 2.2.8pre2, SBCL 0.7.13,
OProfile 0.5.1, Jext 3.2pre3.
- Press: Microsoftens, is Linux suitable for the Desktop?,
Copyright law hurts technology, Linux video recording, OCR for Linux.
- Announcements: Sony and IBM Linux computing grid,
Ximian and SuSE partner, OSDL Database Test Tools,
TCP/IP stack defect report, LPI-News, OpenOffice conf,
Linux Symposium CFP, KDE shirts.