By Jake Edge
October 28, 2008
A seemingly innocuous change to the networking code that went into the
2.6.27 kernel is now
causing trouble for various distributions. Ubuntu, Fedora, and openSUSE are
all buttoning up their
packages for a release in the near future—with Ubuntu's due this
week—so kernel changes are not
particularly welcome. Unfortunately, if the problem is not addressed, some
users may never be able to download a
fix because their TCP/IP won't interoperate with some broken equipment
on the internet.
The problem stems from a cleanup of the TCP option code that was merged
back in July as part of the 2.6.27 merge window. TCP options are
a mechanism to expand the functionality of the protocol as conditions
change. There are a handful of commonly used options that the two
endpoints of a connection can agree to use, for things like maximum segment
size (MSS), window scaling, selective acknowledgment (SACK), and
timestamps. Options have been added over time to provide more internet
robustness and performance as well as to support higher-bandwidth
physical connections.
A perfectly
reasonable, if unintended, consequence of the code change was that
the options were put into the header in a slightly different order.
According to the relevant RFCs,
options can appear in any order in the option section of the TCP header.
But some home and/or internet routers seem to expect a fixed order,
refusing to make connections if the order is "wrong".
In particular, it would seem that the MSS option needs to appear before the
SACK option.
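To make the ordering issue concrete, here is a minimal sketch of how TCP options are laid out as kind/length/value entries and why a conforming parser should not care about their order. The option kind numbers come from the TCP RFCs; the two orderings shown are purely illustrative and are not claimed to be the exact layouts produced by any kernel version. The fussy_device_accepts() function is a hypothetical model of a device that insists on MSS before SACK, not a description of what any real router does.

```python
import struct

# Option kinds as assigned by the TCP RFCs.
KIND_EOL, KIND_NOP, KIND_MSS = 0, 1, 2
KIND_WSCALE, KIND_SACK_PERM, KIND_TIMESTAMP = 3, 4, 8


def encode_options(options):
    """Encode (kind, payload) pairs in the order given, NOP-padded to 4 bytes."""
    out = bytearray()
    for kind, payload in options:
        # Each option is: kind byte, total-length byte, then the payload.
        out += bytes([kind, 2 + len(payload)]) + payload
    while len(out) % 4:
        out.append(KIND_NOP)
    return bytes(out)


def parse_options(data):
    """Order-independent parser; returns {kind: payload} in wire order."""
    found = {}
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == KIND_EOL:
            break
        if kind == KIND_NOP:
            i += 1
            continue
        if i + 1 >= len(data):
            break  # truncated option, stop parsing
        length = data[i + 1]
        if length < 2 or i + length > len(data):
            break  # malformed length, stop parsing
        found[kind] = data[i + 2:i + length]
        i += length
    return found


def fussy_device_accepts(data):
    """Hypothetical model of a device that wants MSS before SACK-permitted."""
    kinds = list(parse_options(data))
    if KIND_MSS in kinds and KIND_SACK_PERM in kinds:
        return kinds.index(KIND_MSS) < kinds.index(KIND_SACK_PERM)
    return True


mss = (KIND_MSS, struct.pack("!H", 1460))
sack = (KIND_SACK_PERM, b"")
timestamps = (KIND_TIMESTAMP, struct.pack("!II", 12345, 0))
wscale = (KIND_WSCALE, bytes([7]))

# Two illustrative orderings of the same four options.
mss_first = encode_options([mss, sack, timestamps, wscale])
sack_first = encode_options([sack, timestamps, mss, wscale])

# The bytes on the wire differ, but the parsed content is identical.
assert mss_first != sack_first
assert parse_options(mss_first) == parse_options(sack_first)

# Only the hypothetical order-sensitive device tells them apart.
assert fussy_device_accepts(mss_first)
assert not fussy_device_accepts(sack_first)
```

A receiver written like parse_options() handles either layout, which is what the RFCs call for; the trouble only arises with equipment that behaves more like the fussy model.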
The bug was reported
to Ubuntu Launchpad in early September, but not a lot of progress was
made until it was added to the kernel.org
bugzilla in early October. It seems to have only affected a relatively
small number of users—Red Hat's Dave Jones said that there were no
reports from users of the rawhide 2.6.27 kernel—as it was rather
hardware-specific. This made it difficult to track down for the majority
of folks who couldn't reproduce it. Ubuntu user Aldo Maggi, who filed the
kernel bug,
sets a marvelous example of how to work with the kernel hackers to track
down the problem as can be seen in the bugzilla entry.
Eventually, the option re-ordering problem was discovered and a patch was submitted by Ilpo Järvinen that
restored the order of the options. Along the way, with help from
Mandriva,
it was discovered that
turning off TCP timestamps by way of:
sysctl -w net.ipv4.tcp_timestamps=0
worked around the problem without changing the kernel—at the cost of
losing the TCP timestamp functionality.
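For those who would rather check or change the setting from a script, the sketch below does the same thing as the sysctl command by reading and writing the corresponding file under /proc/sys. It assumes the usual Linux convention that dots in a sysctl name map to directory separators; the helper names read_sysctl() and write_sysctl() are made up for this example, and writing requires root privileges.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file, e.g. net.ipv4.tcp_timestamps."""
    # Assumption: no component of the name contains a literal dot.
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Set a sysctl; on a real system this needs root privileges."""
    sysctl_path(name, root).write_text(f"{value}\n")


if __name__ == "__main__":
    name = "net.ipv4.tcp_timestamps"
    if sysctl_path(name).exists():
        # Only report the value here; changing it is left to the administrator.
        print(name, "=", read_sysctl(name))
    else:
        print(name, "is not available on this system")
```

Either way, a value set at run time does not survive a reboot; making it permanent typically means adding a line like "net.ipv4.tcp_timestamps = 0" to /etc/sysctl.conf.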
So it would seem that the problem has been solved—the patch has been
merged
into Linus Torvalds's tree for 2.6.28—but there are still a few
unresolved issues. The three distributions that are preparing new releases
are all based on 2.6.27, but as yet, there has not been a -stable kernel
release that picks up the patch, though it is likely to come fairly soon.
In the meantime, Fedora has added the patch to its kernel in rawhide, so
Fedora 10 (and eventually Fedora 9 when it gets rebased on 2.6.27) will
have the fix. openSUSE is waiting a bit to see what gets submitted by the
kernel networking developers to the
-stable team. As Novell/SUSE kernel hacker Greg Kroah-Hartman puts it:
"We still have a while to go before the final 11.1
kernel is released, so we feel no pressure here." Unfortunately,
Ubuntu got caught very late in its release cycle as 8.10 (or Intrepid Ibex)
is due on October 30.
The original plan as outlined
by Debian/Ubuntu hacker Steve Langasek was to note the problem in the
release notes
for 8.10, but not address the underlying problem until after the release:
The kernel fix is known upstream; implementing it requires kernel uploads
and installer rebuilds, which it's just not possible to fit in between the
release candidate and the release. We will certainly want to include this
fix in a kernel update as soon as possible after the release, but this is
unfortunately in a class of bugs that we can't fix the week of release (even
turning timestamps off requires a kernel upload, unless we want to
permanently disable tcp timestamp support for Ubuntu 8.10).
That led many in the Launchpad bug thread to note that it was going to be
a real mess, especially for the least technical of users. Nick Lowe sums
up the problem:
[...] You should really delay for this if you need more time...
RC shouldn't mean Release ComeHellOrHighWater
The users who are most likely to hit this are home users behind their
aged/unmaintained consumer routers who are highly unlikely to understand
why they can't access the Web and will just go elsewhere...
Certainly, the release notes are not the first place an affected user would
go if they ran into the problem. More than likely, they would just decide that
Ubuntu—by extension Linux—is simply broken, so it is a relief
to see
that Ubuntu eventually relented. For 8.10, the procps package has
been changed to work around the problem by turning off timestamps. Once a
new kernel package is released with the re-ordering patch included,
timestamps can presumably be restored.
This kind of problem—where affected users may not be able to retrieve an
update to fix it—should really be part of the definition of a
show-stopping (i.e. release date slipping) problem. It was rather galling
to some that Ubuntu
would consider shipping with this known issue, simply to make its 8.10
release in the 10th month of 2008 (which is how Ubuntu releases are numbered).
Ubuntu is justifiably proud of its record of shipping releases on time, but
it cannot do that at the expense of its users. While the workaround that
was implemented was suboptimal, perhaps, it does ensure that
users—especially non-technical users—won't find that web
surfing doesn't work in Linux. It should also allow Ubuntu to release on
schedule.
[ Thanks to Nick Lowe for giving us a heads-up about this issue. ]