
LWN.net Weekly Edition for March 6, 2003

SpamAssassin 2.50

Last September, we compared SpamAssassin with a new set of spam filtering systems based on Bayesian analysis. The Bayesian approach, while seemingly "dumber," looked poised to depose SpamAssassin's complex set of rules as the most effective filtering system. Bayesian filters are fast, and they can learn from the actual mail received by each potential consumer of anatomical enlargements and Nigerian investment schemes. The combination of speed and adaptability looked like a winning pair.

Of course, SpamAssassin has not stood still in the intervening months. The SpamAssassin team released version 2.50 a couple of weeks ago; this release is described as a "beta version." No true Linux user fears beta-quality software, so we decided to have a look.

The SpamAssassin 2.50 rule set is more impressive than ever. Several hundred rules detect forged headers, cellphone pitches, Nigerian scams, mortgage refinance pitches, mentions of Oprah, porn terms ("various types of feline"), suspicious HTML, "my wife Jody," 100% guarantees, etc. There is a growing set of rules aimed at catching spam in languages other than English. One can only marvel at the noble souls who examine that much spam closely enough to develop these tests.

As is usually the case with a new SpamAssassin release, the new rules are far more effective than the previous set - for a while at least. The normal pattern, though, is that spammers begin to figure out how to evade the latest rules, and the number of false negatives slowly rises over time.

The truly significant new feature of version 2.50 may help change that pattern, however. Included in this release is, of course, a Bayesian filter. SpamAssassin will now remember the terms it finds in both spam and "ham," and assign to each a probability that the message containing it is spam. The Bayesian filter is not actually used to classify mail until a sizeable body of both good and bad mail has been processed. Once that threshold has been reached, the Bayesian filter will simply assign points to a message like any other test.
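
As a rough illustration of the idea described above, here is a minimal sketch of a token-probability filter. It is not SpamAssassin's actual implementation: the whitespace tokenizer, the add-one smoothing, the log-odds combination, and the `min_messages` training threshold are all assumptions made for this example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token-probability filter; NOT SpamAssassin's real algorithm."""

    def __init__(self, min_messages=3):
        # min_messages is a hypothetical training threshold for this demo.
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_count = 0
        self.ham_count = 0
        self.min_messages = min_messages

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase, whitespace-separated, each token counted once.
        return set(text.lower().split())

    def learn(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def ready(self):
        # Stay silent until enough spam AND ham has been seen.
        return (self.spam_count >= self.min_messages
                and self.ham_count >= self.min_messages)

    def token_probability(self, token):
        # Add-one smoothing keeps every probability strictly inside (0, 1).
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.ham_count + 2)
        return p_spam / (p_spam + p_ham)

    def spam_probability(self, text):
        if not self.ready():
            return None
        log_odds = 0.0
        for token in self.tokenize(text):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1 - p)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))
```

Until `ready()` returns true, `spam_probability()` returns `None`, mirroring the way the filter stays out of classification until it has seen enough of both kinds of mail.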

In 2.50, the scoring is set up so that the Bayesian filter, by itself, cannot condemn a message as spam. Even if the filter says the probability of spam is 99%, a maximum of 4.3 points (out of the 5 needed by default) will be assigned. Still, the Bayesian filter is sufficient to tip the balance on many spams that would have otherwise been classified as real mail. We ran a quick test with 5,000 messages from the LWN.net inbox; before training, SpamAssassin flagged 3,935 of them as spam. Afterwards, it caught 4,139 instead. That's 204 spams we would not have to see, and that can only be a good thing.
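
The scoring arithmetic can be sketched as follows. The 5-point threshold and the 4.3-point cap come from the text above; the linear mapping from probability to points in `bayes_points` is a hypothetical stand-in, since the actual mapping used by SpamAssassin is not described here.

```python
SPAM_THRESHOLD = 5.0    # points needed by default, per the article
BAYES_MAX_POINTS = 4.3  # most the Bayesian test can add in 2.50, per the article


def bayes_points(probability):
    """Hypothetical mapping: nothing at or below 50%, scaling up to the cap."""
    if probability is None or probability <= 0.5:
        return 0.0
    scaled = BAYES_MAX_POINTS * (probability - 0.5) / 0.5
    return min(BAYES_MAX_POINTS, scaled)


def classify(rule_points, probability):
    """Return (is_spam, total) for the rule points plus the Bayes points."""
    total = rule_points + bayes_points(probability)
    return total >= SPAM_THRESHOLD, total


# The Bayesian test alone cannot reach the threshold...
alone_is_spam, alone_total = classify(0.0, 0.99)
# ...but it can tip a message that the rules scored only mildly.
tipped_is_spam, tipped_total = classify(1.0, 0.99)

# Improvement seen in the 5,000-message test described above.
extra_caught = 4139 - 3935
```

Under these assumptions a message with no rule hits stays below the threshold no matter how confident the Bayesian test is, while one that picked up a single point from the rules crosses it.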

The combination of SpamAssassin's rule base and the Bayesian filter addresses one of the biggest weaknesses of the Bayesian approach: the filter must be trained. Everybody gets a unique mix of mail, and, believe it or not, a unique mix of spam. There is no set of Bayesian rules that will work for everybody. If you happen to have nicely sorted piles of spam and real mail sitting around, you can quickly train the SpamAssassin filter via the sa-learn tool. But most users will simply want the filter to work without an extra effort on their part.

SpamAssassin's rules can help make that happen. Even without the Bayesian filter, SpamAssassin can flag most of the spam in a user's mail stream. It can, thus, train the Bayesian filter by itself; that filter will eventually start catching spam which sneaks past the regular rules. SpamAssassin has become a tool which learns how to do its job better over time. As a result, we are free to spend our time dealing with more interesting things. It would be hard to find a better example of useful, user-friendly free software.
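
A self-training loop of this kind might look like the sketch below. Everything specific in it is invented for illustration: the keyword rules in `RULES`, their point values, and the `HAM_AUTOLEARN_MAX` cutoff. Skipping messages with in-between scores is likewise an assumption, made so that only confident verdicts feed the learner.

```python
from collections import Counter

# Hypothetical keyword rules and point values, invented for this sketch.
RULES = {"mortgage": 2.5, "guaranteed": 2.0, "nigerian": 3.0}
SPAM_THRESHOLD = 5.0     # default cutoff cited in the article
HAM_AUTOLEARN_MAX = 0.5  # assumption: only very low scores are learned as ham


def rule_score(text):
    """Sum the points of every rule whose keyword appears in the message."""
    words = set(text.lower().split())
    return sum(points for word, points in RULES.items() if word in words)


def auto_train(messages):
    """Feed confidently classified messages to spam and ham token counts."""
    spam_tokens, ham_tokens = Counter(), Counter()
    for text in messages:
        score = rule_score(text)
        tokens = set(text.lower().split())
        if score >= SPAM_THRESHOLD:
            spam_tokens.update(tokens)   # rules are sure: learn as spam
        elif score <= HAM_AUTOLEARN_MAX:
            ham_tokens.update(tokens)    # rules found nothing: learn as ham
        # Scores in between are ambiguous and are not learned from.
    return spam_tokens, ham_tokens


inbox = [
    "nigerian mortgage offer guaranteed",
    "kernel patch review",
    "guaranteed delivery of your order",
]
spam_tokens, ham_tokens = auto_train(inbox)
```

Here the first message is learned as spam, the second as ham, and the third is left out because a single mild rule hit is not enough evidence either way.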


The State of Linux International

[This article was contributed by Joe 'Zonker' Brockmeier]

Linux International (LI) has been an extremely important organization in the growth of Linux. But LI has been rather quiet of late. Its website hasn't been updated in a while, and the most recent press release on its home page is dated March 1, 2002. We caught up with LI president Jon "maddog" Hall recently and asked him where things stood. He assured us that LI is still working to promote Linux:

Press releases are not the only indication of life. In the past year I have addressed over 57 groups with thousands of people (including hundreds of press) on issues of Linux and Open Source. LI had major input (for example) into the recent article in Business Week. While the popular press always puts their own spin on what you say, and while I did not agree with each point of the article, we got some good visibility there.

The BusinessWeek articles Hall referred to have generated a lot of discussion for Linux. But they don't reflect LI's influence. Hall says that's the way that he wants it:

It's [LI's influence] there more than you know. It's just that...we don't put out a lot of press releases and we don't say what we've done. Part of that is my philosophy...I've been trying to promote Linux and trying to promote member companies.

He does admit, however, that the modesty can backfire when trying to justify LI dues to member companies and potential members. Dues for corporate sponsors are $10,000, and dues for member companies are $2,500 annually. Hall says that it can be difficult to convince companies to fund LI, even though some can spend much more than that on their individual marketing efforts.

Another function of LI that doesn't get discussed much is the stewardship of the Linux trademark. For the most part, that's been spun off to the Linux Mark Institute (LMI), a non-profit that controls the Linux trademark and grants use of the trademark to commercial entities for a fee. Hall says the fee is necessary to build a "warchest" to deal with legal issues instead of taking the money from LI's budget. Unless there's a "sticky situation" with a trademark, says Hall, the process is more or less automatic. Hall also mentioned that LI/LMI have stepped in when people have tried to trademark Linux in other countries and attempted to "hold it for ransom."

Hall did note that LI is undergoing some changes, particularly in terms of membership. "The industry has changed...used to be lots and lots of small companies, but the small companies have gone away and Linux has become not a 'Linux-only' product, but a 'Linux-also' product." Because of this, he says, companies like IBM and HP wonder what LI can do for them when they can afford to put money into Linux advertising that is branded with their company name instead of promoting Linux alone.

So is LI still relevant, if these companies are putting big bucks into marketing Linux? Hall says yes, because it can do things that IBM and HP cannot. For example, Hall mentioned that he wants to focus on providing materials for Linux User Groups to use to promote Linux. Because LI is vendor-neutral, Hall says that it will have much more success working with community groups to promote Linux on behalf of its member companies than those companies would directly:

We've been very good in doing things that are very difficult for one company to do. Helping the trademark, helping LPI (Linux Professional Institute), helping standards...there's an important place for us in the usergroups. I don't think that IBM would want HP by itself addressing all the user groups.

Jeremy White, CEO of CodeWeavers and an LI supporter, says that he believes that LI was and continues to be important. White says that the case studies on LI's site have been useful for him, and also says that Hall's role as a Linux promoter is very important. "The most important thing they do is having Jon fly around and evangelize Linux. That's important to me. Linux, at least on the desktop, is very much in the adoption phase...I need the whole market to grow."

Hall says that there is still a lot of work to do in promoting Linux:

As long as people have the questions that I have experienced over the past several weeks, I have to disagree. While people may know what the word "Linux" means, they do not understand the market or even what free software stands for. If you believe that Linux has been successfully "promoted", I submit to you that you are sadly mistaken.


Vulnerability disclosure and government

The nasty remotely exploitable vulnerability in sendmail is covered in this week's Security Page. There is one aspect of this episode, however, which deserves a separate look: the involvement of the new U.S. Department of Homeland Security (DHS).

The disclosure of this vulnerability was, it seems, coordinated by the DHS Information Analysis & Infrastructure Protection Directorate. This is new: government agencies have not generally handled the response to vulnerabilities in the past. Given that the government's security interests have not always aligned all that well with everybody else's interests (consider the whole issue of cryptographic code), this involvement should be watched closely.

On the face of it, the sendmail disclosure was handled reasonably well. The process took a little while (the vulnerability was first discovered in December), but the information release went as it should. There were no early disclosures (which have occasionally been a problem in the past), and almost everybody was in a position to have an update available when the disclosure did happen. System administrators who are paying attention should not get hit when the vulnerability is (inevitably) exploited.

There is an interesting statement in InfoWorld's coverage of the disclosure process, however:

When Sendmail patches were ready, the coordinating team managed their release to the DoD, providing early protection to military sites on February 25 and 26, four days before the general public was informed, SANS said. Warnings were more widely issued to government groups in the U.S. and in other countries on February 27 and 28, including U.S. Cabinet level departments, national cyber security offices in other countries and Information Sharing and Analysis Centers (ISACs) for critical infrastructure, SANS said.

In other words, the military got to hear about the problem before anybody else, and the rest of the government also got a couple of days of lead time. The DHS put the needs of the government above those of everybody else, including "critical infrastructure companies."

Things worked out in the sendmail case. As the DHS gets itself established, however, we should be concerned about how it might handle vulnerabilities in the future. It does not take a great deal of paranoia to imagine that disclosure of some problems could be suppressed altogether. It is also not hard to imagine future regulations criminalizing the disclosure of vulnerabilities without governmental approval. After all, that is the sort of thing governments do, and the current U.S. administration is rather more inclined in that direction than many.

True information systems security requires disclosure of vulnerabilities. One can imagine a governmental role in the coordination of that disclosure to effect a quick and universal availability of patches - though it is far from clear that this role is truly necessary. But a high level of vigilance will be required to keep the governmental role from expanding to where it subverts the disclosure process altogether.


A subscription pricing change

As promised last week, we have implemented a slight change in the pricing of subscriptions to LWN. One-time subscriptions (or extensions) of at least ten months will now be given a 10% discount. This discount reflects the fact that longer-term subscriptions cost us less to manage.

Remember that subscriptions are what keeps LWN.net alive; if you have not yet subscribed, please consider signing up now.


Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Security: The sendmail bug; new vulnerabilities in file, snort, tcpdump, ...
  • Kernel: remap_file_pages(); spelling fixes; three new driver porting articles.
  • Distributions: Is RPM Doomed? (DistroWatch)
  • Development: The Speex RC3 Speech Codec, ALSA 0.9.0rc8, gEDA/gaf 20030223, Qtopia 1.6, Aegir CMS 1.0 RC 2, LSB PPC64 and S390X specs, Crystal Space 0.96r003, Samba 2.2.8pre2, SBCL 0.7.13, OProfile 0.5.1, Jext 3.2pre3.
  • Press: Microsoftens, is Linux suitable for the Desktop?, Copyright law hurts technology, Linux video recording, OCR for Linux.
  • Announcements: Sony and IBM Linux computing grid, Ximian and SuSE partner, OSDL Database Test Tools, TCP/IP stack defect report, LPI-News, OpenOffice conf, Linux Symposium CFP, KDE shirts.

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds