
Leading items

A look at SpamAssassin 3.0

June 2, 2004

This article was contributed by Joe 'Zonker' Brockmeier.

For many of us, SpamAssassin is all that stands between us and an inbox clogged to the gills with unwanted e-mail. With the much-anticipated 3.0 release just around the corner, we decided to see what spam fighters would have to work with in the near future. To that end, we touched base with SpamAssassin developers Theo Van Dinter and Craig Hughes. Hughes left the project recently, but was heavily involved in the development of 3.0 and still has his finger on the pulse of SpamAssassin development.

What's different from the current release, and why the version jump? Both Van Dinter and Hughes noted some important technical improvements in the 3.0 release. Hughes said that the most important feature for 3.0 is its modularity. The 3.0 release is "more modular, easier to write plugins for...easier to plug in other pieces of functionality that aren't distributed with the core package," said Hughes. He noted that prior to 3.0, it was difficult to add in custom code for functions that were not part of SpamAssassin.

Both Hughes and Van Dinter also noted the replacement of SpamAssassin's "genetic algorithm" with a "perceptron learner" for score generation. Van Dinter noted that the new score generation is vastly improved, taking the average time from "[around] 14 hours to less than five minutes per scoreset (there are four)." Van Dinter also told LWN that the message/mime parser for SpamAssassin has been rewritten "essentially from scratch."
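To give a feel for what a perceptron learner does when it generates scores, here is a minimal sketch in Python. It is not SpamAssassin's implementation: the three rules, the training messages, and the function names are all made up for illustration, and the real score generator works on a far larger corpus with its own constraints. The idea is simply that each rule gets a score, a message's scores are summed, and every misclassified message nudges the scores of the rules it hit.

```python
from typing import List, Sequence, Tuple

Message = Tuple[Sequence[int], int]  # (rule hits as 0/1 flags, +1 for spam or -1 for ham)


def train_scores(messages: Sequence[Message], n_rules: int,
                 epochs: int = 50, learning_rate: float = 0.1) -> Tuple[List[float], float]:
    """Learn one score per rule, plus a bias, with the classic perceptron update."""
    scores = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for hits, label in messages:
            total = bias + sum(s * h for s, h in zip(scores, hits))
            predicted = 1 if total > 0 else -1
            if predicted != label:
                # Move the scores of the rules that fired toward the correct label.
                mistakes += 1
                for i, hit in enumerate(hits):
                    scores[i] += learning_rate * label * hit
                bias += learning_rate * label
        if mistakes == 0:
            break  # every training message is classified correctly
    return scores, bias


def is_spam(hits: Sequence[int], scores: Sequence[float], bias: float) -> bool:
    """A message is called spam when its summed rule scores exceed zero."""
    return bias + sum(s * h for s, h in zip(scores, hits)) > 0


# Hypothetical rules: 0 = "mentions pills", 1 = "forged sender", 2 = "known contact".
TRAINING: List[Message] = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], -1),
    ([0, 0, 0], -1),
]

scores, bias = train_scores(TRAINING, n_rules=3)
print("scores:", scores, "bias:", bias)
```

On this toy data the two "spammy" rules end up with positive scores and the "known contact" rule with a negative one, which is the kind of result a score generator is after.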

Another big improvement for 3.0 is improved scalability. The new version supports installations with larger numbers of mailboxes, with preferences stored in an SQL database or LDAP server. The primary focus there, according to Hughes, was for large ISPs that wanted to use SpamAssassin without having a Unix login or home directory for every user.
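As a rough illustration of why database-backed preferences remove the need for a home directory per user, consider the following Python sketch. It uses an in-memory SQLite database; the table layout, the "@GLOBAL" default user, and the preference names are assumptions chosen for the example, not a description of the schema SpamAssassin ships.

```python
import sqlite3
from typing import Dict


def open_prefs_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical per-user preference table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE userpref ("
        " username TEXT NOT NULL,"
        " preference TEXT NOT NULL,"
        " value TEXT NOT NULL)"
    )
    return conn


def load_prefs(conn: sqlite3.Connection, username: str,
               global_user: str = "@GLOBAL") -> Dict[str, str]:
    """Return site-wide defaults overridden by the named user's own rows."""
    prefs: Dict[str, str] = {}
    for who in (global_user, username):  # defaults first, then overrides
        rows = conn.execute(
            "SELECT preference, value FROM userpref WHERE username = ?", (who,)
        )
        for preference, value in rows:
            prefs[preference] = value
    return prefs


conn = open_prefs_db()
conn.executemany(
    "INSERT INTO userpref (username, preference, value) VALUES (?, ?, ?)",
    [
        ("@GLOBAL", "spam_threshold", "5.0"),         # hypothetical preference names
        ("@GLOBAL", "tag_subject", "1"),
        ("alice@example.com", "spam_threshold", "7.5"),
    ],
)
print(load_prefs(conn, "alice@example.com"))
print(load_prefs(conn, "bob@example.com"))
```

The mailbox name is just a key in a table, so a user who exists only as a mail address still gets individual settings. A real deployment would also have to cope with preferences that can take several values, which this dictionary-based sketch ignores.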

While there are plenty of technical improvements in SpamAssassin, Hughes also noted that there's a non-technical rationale for the bump to 3.0. SpamAssassin is in the process of becoming a top-level project of the Apache Software Foundation. This also means a licensing change for the project, which was quite a bit of work according to Hughes:

It's going to be using the Apache License instead of using Perl's licensing, and we've gone through a tremendously long, laborious, tedious even, process of sourcing every line of code...making sure that every author really did have the rights to publish it.

Hughes said that the project met little resistance in switching from the former licensing scheme -- which allowed licensing under either the GPL or the Perl Artistic License -- to the Apache Software License. Hughes said that "only a handful" of developers said they wouldn't allow their code to be relicensed, as well as "two or three we couldn't contact." The end result, he said, was that nothing substantial had to be removed due to licensing issues.

Because of the nature of the project, we were also curious how SpamAssassin manages to stay ahead of spammers. According to Van Dinter, it's not so much staying ahead as an "arms race" between SpamAssassin and spammers:

We filter, they mutate, we start filtering the mutation, they mutate again. Lather, Rinse, Repeat. I'm actually not really involved in the rules (I work on the back-end code more than anything else,) but it basically comes down to looking at the spam that's coming in, seeing which ones aren't caught, and figuring out how to catch them in the future. There are also other useful data points unrelated to the messages themselves. For instance, verifying that the sender isn't forged via SPF (Sender Policy Framework) and utilizing the information provided by SenderBase.

Hughes told LWN that there are two things that help SpamAssassin stay ahead of spammers:

One is that you only have to stay ahead of most spammers. There may be one percent that may be particularly good [at getting by SpamAssassin] but if you can block 99 percent of it, it doesn't matter that much...we're not shooting to be perfect, we're shooting to be as good as we can without trying to squeeze out that last one percent.

The other thing is the sheer complexity of SpamAssassin. It's not just a Bayesian filter, it's not just looking up things in [...] it's all those things together. It's actually very, very non-trivial for a human to be able to craft a message that's a piece of spam and get [...] To defeat all of the system requires a great deal of work, or a lot of luck.

Another piece of good news for SpamAssassin enthusiasts is that it shouldn't be hard to upgrade. According to Hughes, it "should be simple, as long as you're not doing anything really funky" in terms of tweaking and customizing the SpamAssassin code. He noted that the 3.0 release is designed to recognize file format changes, and to automatically upgrade user files that are in the old format.

If the SpamAssassin 3.0 meta-bug dependency tree is any indication, there's not much left to do before the 3.0-final release. Hughes said that the project "looks like it's on target" to meet the June 30 release date. Users are encouraged to help test SpamAssassin prior to the final release.


BayStar leaves the building

Back in October, 2003, the $50 million PIPE investment in the SCO Group by BayStar and the Royal Bank of Canada was seen as good news for SCO. In May, 2004, things have changed to the point that the dissolution of that investment is also seen as good news for the company. SCO, it seems, is in a different world than it was late last year.

BayStar had been left holding 40,000 of the 50,000 shares of "series A-1" preferred stock created by the initial investment. BayStar had also been very public about its desire to redeem those shares and its lack of faith in SCO's management. The result was a dark cloud of potential litigation lurking over SCO; it is not surprising that SCO was looking for a way to settle the issue. As it turns out, SCO did pretty well for itself.

The full stock repurchase agreement is available via the SEC. It calls for SCO to buy back those 40,000 shares of preferred stock; the cost will be $13 million in cash and just over 2.1 million shares of SCO common stock. So, in the end, SCO sold that stock for $50 million, and was able to buy it back (including the 10,000 shares redeemed by RBC) for $13 million and some paper. This is, indeed, a good deal for SCO; BayStar must have wanted out badly.

There are a couple of interesting provisions in the agreement. One is that BayStar is limited in how quickly it can sell the common stock; its sales can't make up more than 10% of the average volume on any given day. The two companies also agree not to badmouth each other. The effect of that agreement would seem to be immediately apparent. In April, BayStar was complaining about SCO's attempts to continue to look like a software company, SCO's management, and its lack of focus on the IBM case. In the press release describing the new agreement, instead, we read:

"After productive and substantial discussions with SCO's management team, board of directors and legal team, BayStar is extremely satisfied with SCO's current operating and cash management plans, new initiatives, management of the litigation, and plans for improving its business going forward," said Larry Goldfarb, managing general partner, BayStar Capital.

It is true that the company would appear to have muzzled Darl McBride recently. Other than that, however, there has been little change. The same management team is in charge, and it's doing the same things. If BayStar were so happy with SCO's progress, what reason could it possibly have for cashing out its investment now at a serious loss? BayStar, instead, gives every indication of running for the exit at full speed, preferably ahead of the quarterly earnings announcement (which has been delayed until June 10).

One other interesting feature of the non-disparagement clause:

...the Company's obligation not to disparage or defame BayStar as set forth above shall be limited to the actions or comments of the Company's executive officers, directors, attorneys, advisors [sic], consultants, representatives and The Canopy Group, Inc.

Canopy is not a party to this agreement. One might well wonder how SCO is able to commit Canopy to keeping its mouth shut.

The end result of all this is that the SCO Group has freed itself from a major distraction, cleared a liability off its books (including the 8% dividends it was supposed to start paying BayStar next year), and obtained $37 million of obligation-free cash (excluding lawyer fees, of course). The company is, indeed, in a better position to concentrate on its many open court cases. It may even be able to turn Darl loose in the near future; life hasn't been the same without his strange pronouncements.

[Looking forward: the next events in SCO's legal calendar include a hearing in the DaimlerChrysler case (June 9), and a ruling due anytime in the Novell case. The Novell ruling will include Novell's motion to dismiss, and, if that is denied, SCO's motion to move the case back to Utah state court.]


SCO shows more code

On the surface, the declaration of Todd M. Shaughnessy filed by IBM in the SCO case looks like fairly boring stuff. It consists of a long list of exhibits filed by IBM. Some of those exhibits, however, have not been seen before, and some of those warrant a look. In particular, exhibit 28 covers SCO's answers to the motions to compel discovery. SCO has now "shown the code," and we can see what the company is claiming.

The first part of the declaration covers code contributed from AIX and Dynix to Linux. In the former case, SCO now contents itself with listing the JFS filesystem. From Dynix, SCO notes the read-copy-update technique and some NUMA support code. The broader claim over Linux's SMP code appears to have quietly gone away.

IBM keeps asking SCO to identify the specific lines of System V code which, SCO claims, IBM contributed to Linux. SCO continues to evade that question. The company did, under duress, provide listings of parts of AIX and Dynix that, it claims, derive from Unix. The bulk of the AIX listing is the curses and terminfo libraries; no kernel files are listed there. For Dynix, some kernel files are listed (along with the source of utilities like awk), but there appears to be no intersection with the Dynix files that, SCO claims, IBM contributed to Linux. SCO says that doesn't matter:

In fact, SCO steadfastly maintains that this item is not relevant to this litigation nor is it likely to lead to the discovery of admissible evidence. The main issue in this case is whether IBM has breached its contract with SCO because it contributed or otherwise disposed of a part of AIX or Dynix/ptx to others in contravention of the terms of the license agreement.

In other words, there is not actually any SCO-owned code in IBM's contributions to Linux, but SCO claims control over those contributions anyway. Nothing particularly new there.

Finally, and, perhaps, most interestingly, SCO has included a set of other files (exhibit 28-G) for which it claims ownership. The first part of this list consists of the Linux streams (LiS) patch which has never been part of the mainline kernel. Interestingly, the LiS distribution was hosted at Caldera for some time. But the company formerly known as Caldera would rather forget that now; the company claims, in its filing, that LiS has not appeared in "any Linux-related product distributed by SCO."

The Free Software Foundation recently claimed that the reason SCO went after the kernel and not the FSF was the latter's copyright assignment policies. So the FSF should be interested to see that SCO claims rights over significant chunks of the glibc and binutils packages. In particular, SCO claims ownership of just about anything which touches the ELF executable file format. Many tens of thousands of lines of FSF-owned code are claimed by SCO. Some of the claims are amusing in typical SCO fashion; for example, the exhibit lists elf/interp.c from glibc, which consists of the copyright header and exactly one line of code:

const char __invoke_dynamic_linker__[] __attribute__ ((section (".interp")))
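For readers wondering what that one line is for: a string placed in the ".interp" section names the program interpreter (the dynamic linker), and the loader finds it through the PT_INTERP entry in the file's program headers. The Python sketch below pulls that string out of a binary. It is only an illustration, handling 64-bit little-endian ELF files and doing no bounds checking; the function name is our own.

```python
import struct
import sys
from typing import Optional

PT_INTERP = 3  # program header type naming the program interpreter


def read_interp(data: bytes) -> Optional[str]:
    """Return the interpreter path recorded in an ELF image, or None if absent."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    # ELF64 header: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for index in range(e_phnum):
        base = e_phoff + index * e_phentsize
        (p_type,) = struct.unpack_from("<I", data, base)
        if p_type == PT_INTERP:
            # ELF64 program header: p_offset at +8, p_filesz at +32.
            (p_offset,) = struct.unpack_from("<Q", data, base + 8)
            (p_filesz,) = struct.unpack_from("<Q", data, base + 32)
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\x00").decode("ascii")
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        print(read_interp(handle.read()))
```

Pointed at a dynamically linked 64-bit executable, this prints the path of the dynamic linker that the binary asks for; a statically linked binary has no such entry and yields None.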

SCO has also added claims to the ELF code in the 2.4.21 kernel, along with the SYSV filesystem and the SYSV interprocess communication code.

SCO acknowledges that it distributed all of the above code (except for LiS), but claims it was unaware that "its intellectual property" was present at the time. One might well question how, if the SCO Group claims to own the ELF file format, it could be unaware that it was distributing ELF-related code. ELF is, after all, the fundamental file format used by Linux. But one should not be surprised by this sort of claim from the SCO Group.

The interesting question, instead, is whether the SCO Group will attempt to pursue its claims to the ELF code. These claims could be used to launch attacks against the FSF, any Linux distributor, or even any of the BSD variants. The last thing SCO needs is yet another lawsuit, but that has not stopped this company before. As SCO's claims against the Linux kernel fall apart, its management may well be tempted to cast a wider net.


Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds