For many of us, SpamAssassin is all that stands between us and an inbox
clogged to the gills with unwanted e-mail. With the much-anticipated 3.0
release just around the corner, we decided to see what anti-spam fighters
would have to work with in the near future. To that end, we touched base
with SpamAssassin developers Theo Van Dinter and Craig Hughes. Hughes left
the project recently, but was heavily involved in the development of 3.0
and still has his finger on the pulse of SpamAssassin development.
What's different from the current release, and why the version jump? Both
Van Dinter and Hughes noted some important technical improvements in the
3.0 release. Hughes said that the most important feature for 3.0 is its
modularity. The 3.0 release is "more modular, easier to write plugins
for...easier to plug in other pieces of functionality that aren't
distributed with the core package," said Hughes. He noted that prior
to 3.0, it was difficult to add in custom code for functions that were not
part of SpamAssassin.
Both Hughes and Van Dinter also noted the replacement of SpamAssassin's
"genetic algorithm" with a "perceptron learner" for score generation. Van
Dinter noted that the new score generation is vastly improved, cutting the
average time from "[around] 14 hours to less than five minutes per scoreset
(there are four)." Van Dinter also told LWN that the message/MIME parser
for SpamAssassin has been rewritten "essentially from scratch."
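
To make the "perceptron learner" idea concrete, here is a minimal sketch in
Python. It is not SpamAssassin's actual score generator; it only shows the
general technique of nudging per-rule weights whenever a training message is
misclassified. The rule names, the toy corpus, the learning rate, and the
epoch count are all hypothetical, chosen purely for illustration.

```python
# Illustrative perceptron that learns one score per rule.
# NOT SpamAssassin's implementation; rule names and data are hypothetical.

def train_perceptron(messages, labels, n_rules, epochs=50, rate=1.0):
    """Learn a weight per rule plus a bias from 0/1 rule-hit vectors.

    messages: list of lists, one 0/1 entry per rule.
    labels:   1 for spam, 0 for ham.
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, label in zip(messages, labels):
            total = bias + sum(w * h for w, h in zip(weights, hits))
            predicted = 1 if total > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error:
                # Move the weights of the rules that fired toward the label.
                for i, h in enumerate(hits):
                    weights[i] += rate * error * h
                bias += rate * error
    return weights, bias


def classify(hits, weights, bias):
    """Return 1 (spam) if the weighted sum of rule hits is positive."""
    total = bias + sum(w * h for w, h in zip(weights, hits))
    return 1 if total > 0 else 0


# Hypothetical rules and a tiny hand-made training corpus.
RULES = ["HYPOTHETICAL_PILLS", "HYPOTHETICAL_ALL_CAPS", "HYPOTHETICAL_KNOWN_SENDER"]
CORPUS = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 0),
    ([0, 1, 1], 0),
    ([0, 0, 0], 0),
]

messages = [hits for hits, _ in CORPUS]
labels = [label for _, label in CORPUS]
weights, bias = train_perceptron(messages, labels, len(RULES))
```

Each pass over the corpus is cheap, since a weight only changes when a
message is misclassified. On this toy data the rules that fire on spam end
up with positive weights and the "known sender" rule with a negative one.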
Another big gain in 3.0 is better scalability. The new version
supports installations with larger numbers of mailboxes, with preferences
stored in an SQL database or LDAP server. The primary focus there,
according to Hughes, was for large ISPs that wanted to use SpamAssassin
without having a Unix login or home directory for every user.
While there are plenty of technical improvements in SpamAssassin, Hughes
also noted that there's a non-technical rationale for the bump to
3.0. SpamAssassin is in the process of becoming a top-level project of the
Apache Software Foundation. This also means a licensing change for the
project, which was quite a bit of work according to Hughes:
It's going to be using the Apache License instead of using Perl's
licensing, and we've gone through a tremendously long, laborious, tedious
even, process of sourcing every line of code...making sure that every
author really did have the rights to publish it.
Hughes said that the project met little resistance in switching from the
former licensing scheme -- which allowed licensing under either the GPL or
the Perl Artistic License -- to the Apache Software License. Hughes said
that "only a handful" of developers refused to allow their code to be
relicensed, and that there were "two or three we couldn't contact." The end
result, he said, was that nothing substantial had to be removed due to
licensing issues.
Because of the nature of the project, we were also curious how SpamAssassin
manages to stay ahead of spammers. According to Van Dinter, it's not so
much staying ahead as an "arms race" between SpamAssassin and spammers:
We filter, they mutate, we start filtering the mutation, they mutate
again. Lather, Rinse, Repeat. I'm actually not really involved in the rules
(I work on the back-end code more than anything else,) but it basically
comes down to looking at the spam that's coming in, seeing which ones
aren't caught, and figuring out how to catch them in the future. There are
also other useful data points unrelated to the messages themselves. For
instance, verifying that the sender isn't forged via SPF (Sender Policy
Framework) and utilizing the information provided by SenderBase.
Hughes told LWN that there are two things that help SpamAssassin stay ahead
of spammers:
One is that you only have to stay ahead of most spammers. There may be one
percent that may be particularly good [at getting by SpamAssassin] but if
you can block 99 percent of it, it doesn't matter that much...we're not
shooting to be perfect, we're shooting to be as good as we can without
trying to squeeze out that last one percent.
The other thing is the sheer complexity of SpamAssassin. It's not just a
Bayesian filter, it's not just looking up things in RBLs...it's all those
things together. It's actually very, very non-trivial for a human to be
able to craft a message that's a piece of spam and get through...to defeat
all of the system requires a great deal of work, or a lot of luck.
Another piece of good news for SpamAssassin enthusiasts is that it
shouldn't be hard to upgrade. According to Hughes, it "should be simple, as
long as you're not doing anything really funky" in terms of tweaking and
customizing the SpamAssassin code. He noted that the 3.0
release is designed to recognize file format changes, and to automatically
upgrade user files that are in the old format.
If the SpamAssassin 3.0 meta-bug dependency tree is any indication, there's
not much left to do before the 3.0-final release. Hughes said that the
project "looks like it's on target" to meet the June 30 release date. Users
are encouraged to help test SpamAssassin prior to the final release.
Back in October, 2003, the $50 million PIPE investment in the SCO Group by
BayStar and the Royal Bank of Canada was seen as good news for SCO. In
May, 2004, things have changed to the point that the dissolution of that
investment is also seen as good news for the company. SCO, it seems, is in
a different world than it was late last year.
BayStar had been left holding 40,000 of the 50,000 shares of "series A-1"
preferred stock created by the initial investment. BayStar had also been
very public about its desire to redeem those shares and its lack of faith
in SCO's management. The result was a dark cloud of potential litigation
lurking over SCO; it is not surprising that SCO was looking for a way to
settle the issue. As it turns out, SCO did pretty well for itself.
The full stock repurchase agreement is available via the SEC. It calls for
SCO to buy back those 40,000 shares of preferred stock; the cost will be
$13 million in cash and just over 2.1 million shares of SCO common stock.
So, in the end, SCO sold that stock for $50 million, and was able to buy it
back (including the 10,000 shares redeemed by RBC) for $13 million and some
paper. This is, indeed, a good deal for SCO; BayStar must have wanted out
badly.
There are a couple of interesting provisions in the agreement. One is that
BayStar is limited in how quickly it can sell the common stock; its sales
can't make up more than 10% of the average volume on any given day. The two
companies also agree not to badmouth each other. The effect of that
agreement would seem to be immediately apparent. In April, BayStar was
complaining about SCO's attempts to continue to look like a software
company, SCO's management, and its lack of focus on the IBM case. In the press
release describing the new agreement, instead, we read:
"After productive and substantial discussions with SCO's management
team, board of directors and legal team, BayStar is extremely
satisfied with SCO's current operating and cash management plans,
new initiatives, management of the litigation, and plans for
improving its business going forward," said Larry Goldfarb,
managing general partner, BayStar Capital.
It is true that the company would appear to have muzzled Darl McBride
recently. Other than that, however, there has been little change. The
same management team is in charge, and it's doing the same things. If
BayStar were so happy with SCO's progress, what reason could it possibly
have for cashing out its investment now at a serious loss? BayStar, instead,
gives every indication of running for the exit at full speed, preferably
ahead of the quarterly earnings announcement (which has been delayed until
June 10).
One other interesting feature of the non-disparagement clause:
...the Company's obligation not to disparage or defame BayStar as
set forth above shall be limited to the actions or comments of the
Company's executive officers, directors, attorneys, advisors [sic],
consultants, representatives and The Canopy Group, Inc.
Canopy is not a party to this agreement. One might well wonder how SCO is
able to commit Canopy to keeping its mouth shut.
The end result of all this is that the SCO Group has freed itself from a
major distraction, cleared a liability off its books (including the 8%
dividends it was supposed to start paying BayStar next year), and obtained
$37 million of obligation-free cash (excluding lawyer fees, of
course). The company is, indeed, in a better position to concentrate on
its many open court cases. It may even be able to turn Darl loose in the
near future; life hasn't been the same without his strange pronouncements.
[Looking forward: the next events in SCO's legal calendar include a hearing
in the DaimlerChrysler case (June 9), and a ruling due anytime in the
Novell case. The Novell ruling will include Novell's motion to dismiss,
and, if that is denied, SCO's motion to move the case back to Utah state
court.]
On the surface, the declaration of Todd M. Shaughnessy filed by IBM in the
SCO case looks like fairly boring stuff. It consists of a long list of
exhibits filed by IBM. Some
of those exhibits, however, have not been seen before, and some of those
warrant a look. In particular, exhibit 28 covers SCO's answers to the
motions to compel discovery. SCO has now "shown the code," and we can see
what the company is claiming.
The first part of the declaration covers code contributed from AIX and
Dynix to Linux. In the former case, SCO now contents itself with listing
the JFS filesystem. From Dynix, SCO notes the read-copy-update technique
and some NUMA support code. The broader claim over Linux's SMP code
appears to have quietly gone away.
IBM keeps asking SCO to identify the specific lines of System V code
which, SCO claims, IBM contributed to Linux. SCO continues to evade that
question. The company did, under duress, provide listings of parts of AIX
and Dynix that, it claims, derive from Unix. The bulk of the AIX listing is
the curses and terminfo libraries; no kernel files are listed there. For
Dynix, some kernel files are listed (along with the source of utilities
like awk), but there appears to be no intersection with the Dynix files
that, SCO claims, IBM contributed to Linux. SCO says that doesn't matter:
In fact, SCO steadfastly maintains that this item is not relevant to
this litigation nor is it likely to lead to the discovery of
admissible evidence. The main issue in this case is whether IBM
has breached its contract with SCO because it contributed or
otherwise disposed of a part of AIX or Dynix/ptx to others in
contravention of the terms of the license agreement.
In other words, there is not actually any SCO-owned code in IBM's
contributions to Linux, but SCO claims control over those contributions
anyway. Nothing particularly new there.
Finally, and, perhaps, most interestingly, SCO has included a set of other
files (exhibit 28-G) for which it claims ownership. The first part of this
list consists of the Linux streams (LiS) patch, which has never been part of
the mainline kernel. Interestingly, the LiS distribution was hosted at
Caldera for some time. But the company formerly known as Caldera would
rather forget that now; the company claims, in its filing, that LiS has not
appeared in "any Linux-related product distributed by SCO."
The Free Software Foundation recently claimed that the
reason SCO went after the kernel and not the FSF was the latter's copyright
assignment policies. So the FSF should be interested to see that SCO
claims rights over significant chunks of the glibc and binutils packages. In
particular, SCO claims ownership of just about anything which touches the
ELF executable file format. Many tens of thousands of lines of FSF-owned
code are claimed by SCO. Some of the claims are amusing in typical SCO
fashion; for example, the exhibit lists elf/interp.c from glibc,
which consists of the copyright header and exactly one line of code:
const char __invoke_dynamic_linker__[] __attribute__ ((section (".interp")))
= RUNTIME_LINKER;
SCO has also added claims to the ELF code in the 2.4.21 kernel, along with
the SYSV filesystem and the SYSV interprocess communication code.
SCO acknowledges that it distributed all of the above code (except for
LiS), but claims it was unaware that "its intellectual property" was
present at the time. One might well question how, if the SCO Group claims
to own the ELF file format, it could be unaware that it was distributing
ELF-related code. ELF is, after all, the fundamental file format used by
Linux. But one should not be surprised by this sort of claim from the SCO
Group.
The interesting question, instead, is whether the SCO Group will attempt to
pursue its claims to the ELF code. These claims could be used to launch
attacks against the FSF, any Linux distributor, or even any of the BSD
variants. The last thing SCO needs is yet another lawsuit, but that has
not stopped this company before. As SCO's claims against the Linux kernel
fall apart, its management may well be tempted to cast a wider net.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Security: CVS vulnerability timeline; New vulnerabilities in apache2, kerberos, mailman, ...
- Kernel: x86 NX support; The staircase scheduler; Finding errors automatically; Diskdump.
- Distributions: If you Need a Firewall...; new: Utkarsh Linux, X-Evian; reviewed: SUSE LINUX 9.1, Fedora Core 2
- Development: Building Packages From Source With CheckInstall,
new versions of ALSA, CLSQL, Perdition, CUPS, ht://Check, MediaWiki,
PythonCAD, Bakery, PCB, MyBudget, gnome-games, DiaCanvas, PyQt,
wxWidgets, Gaim, Rosegarden, OpenOffice.org, Epiphany, Galeon,
SableVM, Python.
- Press: The benefits of open-source, ESR does Samizdat, HP expands open-source support,
open source for Solaris, Oracle to be Linux shop, Matt Szulik interview,
the EFF Patent-Busting Project.
- Announcements: Flash Player 7 for Linux, The Sendmail messaging integrity pilot program,
Sun Java WSDP 1.4, YAPC::Europe Foundation, OMG Information Day: London,
Wizards of OS 3.
- Letters: qmail; Defending against buffer overflows; Five-nines.