
Networking change causes distribution headaches

By Jake Edge
October 28, 2008

A seemingly innocuous change to the networking code that went into the 2.6.27 kernel is now causing trouble for various distributions. Ubuntu, Fedora, and openSUSE are all buttoning up their packages for a release in the near future—with Ubuntu's due this week—so kernel changes are not particularly welcome. Unfortunately, if the problem is not addressed, some users may never be able to download a fix because their TCP/IP won't interoperate with some broken equipment on the internet.

The problem stems from changes made to clean up the TCP option code, which were merged back in July as part of the 2.6.27 merge window. TCP options are a mechanism to expand the functionality of the protocol as conditions change. There are a handful of commonly used options that the two endpoints of a connection can agree to use, for things like maximum segment size (MSS), window scaling, selective acknowledgment (SACK), and timestamps. Options have been added over time to provide more internet robustness and performance, as well as to support higher-bandwidth physical connections.

A perfectly reasonable, if unintended, consequence of the code change was that the options were put into the header in a slightly different order. According to the relevant RFCs, options can appear in any order in the option section of the TCP header. But some home and/or internet routers seem to expect a fixed order, refusing to make connections if the order is "wrong". In particular, it would seem that the MSS option needs to appear before the SACK option.
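The difference can be illustrated at the byte level. Here is a minimal Python sketch of two equally RFC-valid option layouts; the kind and length values come from RFC 793 (MSS) and RFC 2018 (SACK-permitted), but the specific layouts are illustrative, not the exact bytes the 2.6.27 kernel emitted:

```python
import struct

def mss(value):
    # Kind 2 (Maximum Segment Size): length 4, 16-bit value (RFC 793)
    return struct.pack("!BBH", 2, 4, value)

def sack_permitted():
    # Kind 4 (SACK-permitted): length 2 (RFC 2018)
    return struct.pack("!BB", 4, 2)

NOP = b"\x01"  # kind 1: padding toward a 32-bit boundary

# The RFCs allow either layout; the problem routers reportedly
# only accepted connections when MSS came first.
accepted = mss(1460) + sack_permitted() + NOP + NOP
rejected = sack_permitted() + NOP + NOP + mss(1460)

# Identical bytes, different order -- which is all that changed.
assert sorted(accepted) == sorted(rejected)
print(accepted.hex())  # 020405b404020101
print(rejected.hex())  # 04020101020405b4
```

A middlebox that pattern-matches on a fixed layout (rather than walking the kind/length fields as the RFC requires) would treat these two equivalent headers differently.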

The bug was reported to Ubuntu's Launchpad in early September, but not a lot of progress was made until it was added to the kernel.org bugzilla in early October. It seems to have affected only a relatively small number of users—Red Hat's Dave Jones said that there were no reports from users of the rawhide 2.6.27 kernel—as it was rather hardware-specific. This made it difficult to track down for the majority of folks who couldn't reproduce it. Ubuntu user Aldo Maggi, who filed the kernel bug, set a marvelous example of how to work with the kernel hackers to track down a problem, as can be seen in the bugzilla entry.

Eventually, the option re-ordering problem was discovered and a patch was submitted by Ilpo Järvinen that restored the order of the options. Along the way, with help from Mandriva, it was discovered that turning off TCP timestamps by way of:

    sysctl -w net.ipv4.tcp_timestamps=0
worked around the problem without changing the kernel—at the cost of losing the TCP timestamp functionality. (The setting can be made persistent across reboots by adding "net.ipv4.tcp_timestamps = 0" to /etc/sysctl.conf.)

So it would seem that the problem has been solved—the patch has been merged into Linus Torvalds's tree for 2.6.28—but there are still a few unresolved issues. The three distributions that are preparing new releases are all based on 2.6.27, but as yet, there has not been a -stable kernel release that picks up the patch, though it is likely to come fairly soon.

In the meantime, Fedora has added the patch to its kernel in rawhide, so Fedora 10 (and eventually Fedora 9 when it gets rebased on 2.6.27) will have the fix. openSUSE is waiting a bit to see what gets submitted by the kernel networking developers to the -stable team. As Novell/SUSE kernel hacker Greg Kroah-Hartman puts it: "We still have a while to go before the final 11.1 kernel is released, so we feel no pressure here." Unfortunately, Ubuntu got caught very late in its release cycle as 8.10 (or Intrepid Ibex) is due on October 30.

The original plan as outlined by Debian/Ubuntu hacker Steve Langasek was to note the problem in the release notes for 8.10, but not address the underlying problem until after the release:

The kernel fix is known upstream; implementing it requires kernel uploads and installer rebuilds, which it's just not possible to fit in between the release candidate and the release. We will certainly want to include this fix in a kernel update as soon as possible after the release, but this is unfortunately in a class of bugs that we can't fix the week of release (even turning timestamps off requires a kernel upload, unless we want to permanently disable tcp timestamp support for Ubuntu 8.10).

That led many in the Launchpad bug thread to note that it was going to be a real mess, especially for the least technical of users. Nick Lowe sums up the problem:

[...] You should really delay for this if you need more time...

RC shouldn't mean Release ComeHellOrHighWater

The users who are most likely to hit this are home users behind their aged/unmaintained consumer routers who are highly unlikely to understand why they can't access the Web and will just go elsewhere...

Certainly, the release notes are not the first place an affected user would go if they ran into the problem. More than likely, they would just decide that Ubuntu—by extension Linux—is simply broken, so it is a relief to see that Ubuntu eventually relented. For 8.10, the procps package has been changed to work around the problem by turning off timestamps. Once a new kernel package is released with the re-ordering patch included, timestamps can presumably be restored.

This kind of problem—where affected users may not be able to retrieve an update to fix it—should really be part of the definition of a show-stopping (i.e. release date slipping) problem. It was rather galling to some that Ubuntu would consider shipping with this known issue, simply to make its 8.10 release in the 10th month of 2008 (which is how Ubuntu releases are numbered).

Ubuntu is justifiably proud of its record of shipping releases on time, but it cannot do that at the expense of its users. While the workaround that was implemented was perhaps suboptimal, it does ensure that users—especially non-technical users—won't find that web surfing doesn't work in Linux. It should also allow Ubuntu to release on schedule.

[ Thanks to Nick Lowe for giving us a heads-up about this issue. ]



Networking change causes distribution headaches

Posted Oct 28, 2008 20:48 UTC (Tue) by pj (subscriber, #4506) [Link] (11 responses)

Ubuntu clearly needs to be aiming its release date for 10/01 not 10/31 ; that would give it time to slip a bit if needed.

Networking change causes distribution headaches

Posted Oct 28, 2008 21:22 UTC (Tue) by ca9mbu (guest, #11098) [Link] (10 responses)

Even targeting 10/01 (or, less ambiguously, 2008-10-01) wouldn't have helped in this particular case, I don't think. According to http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-..., the fix was only applied on 2008-10-26, which I don't think would have given Ubuntu time enough to respin the release. And, given the nature of the bug, picking up an update post-installation through apt-get is unlikely to have worked.

Yes, it sucks that this had the potential to impact Ubuntu's release schedule, but I guess that's the price one pays for time-based releases (which I'm all in favour of). I don't agree with Ubuntu's decision to work around the issue via a procps update as opposed to a kernel update just to avoid a release slippage, but then I'm not the RM (or even involved with Ubuntu in any capacity).

Regards,

Matt.

Networking change causes distribution headaches

Posted Oct 29, 2008 1:07 UTC (Wed) by jordip (guest, #47356) [Link] (9 responses)

Aiming for 2008-10-01 would have made it impossible to use 2.6.27; using 2.6.26 instead would have meant this problem never appeared.
The problem here is that Ubuntu released too close to a kernel release. That was a known risk, but 2.6.27 had enough benefits to outweigh it.
I think future releases of Ubuntu will be more conservative in this regard.
Also, faster Ubuntu-bugtracker-to-kernel-bugtracker interaction might have saved the day. This is an old discussion now...

On the other hand, a 10/01 date would put Ubuntu's release farther in time from Fedora and openSUSE...

Networking change causes distribution headaches

Posted Oct 29, 2008 5:27 UTC (Wed) by ncm (guest, #165) [Link] (6 responses)

How will the procps workaround ever be removed, once a fixed kernel is installed? I.e., how does procps know what kernels might be on the system, and need it? Is the workaround a script that will test the running kernel's version, and operate only for known-broken kernels?

Networking change causes distribution headaches

Posted Oct 29, 2008 8:24 UTC (Wed) by rvfh (guest, #31018) [Link] (3 responses)

The procps package with the fix will depend on a kernel with a version greater than or equal to the fixed one. Simple.

Networking change causes distribution headaches

Posted Oct 29, 2008 11:25 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (2 responses)

It can't do that, since users may be using unpackaged kernels. Also, the Ubuntu kernel packaging system results in a different package name per version (it's the only way to handle parallel installation on Debian-style systems) so you can only depend on a single specific kernel version.
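The packaging constraint mjg59 describes can be illustrated with a hypothetical debian/control fragment; the package names and version numbers below are made up for illustration, not taken from the actual Ubuntu archive:

```
Package: procps
# Hypothetical sketch: each Ubuntu kernel ABI is its own package name
# (linux-image-2.6.27-7-generic, linux-image-2.6.27-8-generic, ...),
# so a dependency here can only name one specific kernel package:
Depends: linux-image-2.6.27-8-generic
# There is no single versioned "linux-image (>= 2.6.27-8)" package
# spanning all the per-ABI kernels, and an unpackaged, self-built
# kernel satisfies no dependency at all.
```

This is why a clean "only apply the workaround on broken kernels" dependency cannot be expressed in the package metadata alone.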

Networking change causes distribution headaches

Posted Oct 29, 2008 16:30 UTC (Wed) by jzbiciak (guest, #5246) [Link] (1 responses)

If only there were a filesystem that allowed the kernel to export information to user space, such as what kernel version was currently running... ;-)

Seriously, why can't procps look at /proc/version and presume that 2.6.27 is broken, but any other version (including 2.6.27.1). As long as it's looking very specifically for the broken version's version string there, it should work ok.

Sure, if someone installs a broken kernel with a different string, then the workaround won't kick in, but I don't really see a problem with that. If you're installing your own kernel rather than sticking to vendor kernels, then you're signing up to own a bit more of the problem yourself, don't ya think?

Networking change causes distribution headaches

Posted Oct 29, 2008 18:21 UTC (Wed) by jzbiciak (guest, #5246) [Link]

> Seriously, why can't procps look at /proc/version and presume that 2.6.27 is broken, but not any other version (including 2.6.27.1)

oops. :-) Fixed that.
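The check proposed in this sub-thread could be sketched as follows; the function name and exact matching policy are hypothetical, and (as noted elsewhere in the thread) real distribution kernels report strings like "2.6.27-7-generic", which is part of what makes the packaging problem harder than a bare version match:

```python
def timestamp_workaround_needed(proc_version: str) -> bool:
    """Return True only for the exact broken upstream release.

    /proc/version begins with e.g. "Linux version 2.6.27 (...)".
    "2.6.27.1" -- or any other release string -- falls through
    to False, so a fixed -stable kernel re-enables timestamps.
    """
    fields = proc_version.split()
    return (len(fields) >= 3
            and fields[0] == "Linux"
            and fields[1] == "version"
            and fields[2] == "2.6.27")

assert timestamp_workaround_needed("Linux version 2.6.27 (gcc 4.3) #1 SMP")
assert not timestamp_workaround_needed("Linux version 2.6.27.1 (gcc 4.3) #1")
```

As the comment notes, a self-built broken kernel with a different version string would slip past a check this specific.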

Networking change causes distribution headaches

Posted Oct 29, 2008 12:40 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

The workaround is a one-line sysctl.conf addition.

Networking change causes distribution headaches

Posted Oct 29, 2008 18:42 UTC (Wed) by ncm (guest, #165) [Link]

So it appears there's no clean way to phase out the workaround.

Networking change causes distribution headaches

Posted Oct 30, 2008 1:05 UTC (Thu) by sbergman27 (guest, #10767) [Link] (1 responses)

"""
The problem here is that Ubuntu released too close to a kernel release.
"""

Ubuntu has traditionally been more conservative. But in a "practice what we preach" action, they sync'd up with the other major distros which were planning to use 2.6.27, to help synchronize problem finding and debugging focus. Personally, given the way kernel development is done these days, I think the distros need to lag kernel releases a bit more. The fall releases, with the exception of Fedora, should really have targeted 2.6.26. I'm not criticizing the current kernel development process (though I gravely note Andrew's ongoing quality concerns), but the 2.6 dev process means that the distros are responsible for that much more of the QA. And that can't be done in a hurry. This particular issue doesn't seem too severe. But a month before general availability, going gold, or whatever you want to call it, the included kernel shouldn't be physically destroying beta testers' hardware or otherwise exhibiting behavior of baby-eating magnitude.

Networking change causes distribution headaches

Posted Oct 30, 2008 13:50 UTC (Thu) by filipjoelsson (guest, #2622) [Link]

Excuse me but, what are you talking about?

In the earlier series of kernels, the vendor patchsets were much larger - and contained everything from drivers and filesystems to security fixes. I would argue that there was less QA before kernel release in the 2.0, 2.2, and 2.4 series (where the QA was "it boots Linus's/Alan's/Marcelo's computer"). OK, so now there is a much bigger difference between each point version than there was then, but still - the vendors cooperate in the same kernel tree to a much larger extent, and there actually _is_ QA now. Anyone remember versions 2.2.0 and 2.2.1 (what were they, one or two days apart)? Care to have a chat about kernel versions 2.4.0 to 2.4.13?

What we have now is tremendously better tested than the old and ancient series.

"Thanks to Nick Lowe"

Posted Oct 28, 2008 21:55 UTC (Tue) by rfunk (subscriber, #4054) [Link]

Cool, the "Cruel To Be Kind" guy runs Linux! ;-)

(What? What's so funny?)

Networking change causes distribution headaches

Posted Oct 28, 2008 23:34 UTC (Tue) by PaulWay (guest, #45600) [Link] (18 responses)

Is it possible to detect if the dodgy equipment is causing problems, and set a flag in the kernel to transmit the packets in the correct order? E.g. does the dodgy equipment return some response that Linux can see - e.g. an icmp reject on the connection?

Yes, this is a hack, and I for one hate hacks that permit bad behaviour in other devices at the expense of maintainability and simplicity of the non-offending code. But it may be a better option than turning all TCP timestamps off or reverting the kernel.

It might also provide a way to alert users that their networking hardware needs updating, which solves the problem in a more permanent way.

Networking change causes distribution headaches

Posted Oct 28, 2008 23:43 UTC (Tue) by rfunk (subscriber, #4054) [Link] (17 responses)

How does that work better than just making the kernel *always* transmit in
the "right" order? Sounds like adding lots of complication for nearly
zero gain.

Networking change causes distribution headaches

Posted Oct 29, 2008 0:24 UTC (Wed) by dlang (guest, #313) [Link] (8 responses)

the kernel always did transmit in the "right" order according to the RFCs

the problem is that it has been discovered that there are some routers out there that do not follow the RFCs and only work if things get transmitted in one specific order.

so the kernel has been changed (post 2.6.27) to transmit in the order that this batch of broken routers require.

for bonus points, what should the kernel do if another batch of broken routers is discovered that wants a different order?

Networking change causes distribution headaches

Posted Oct 29, 2008 1:08 UTC (Wed) by jamesh (guest, #1159) [Link] (3 responses)

> for bonus points, what should the kernel do if another batch of
> broken routers is discovered that wants a different order?

Presumably, the current broken routers work with the packets generated by Windows. If a new router expected a different option order it wouldn't work with Windows, which is the kind of problem that would be noticed.

Networking change causes distribution headaches

Posted Oct 29, 2008 1:29 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

so we need to reverse engineer how windows does things and never do anything different, even if the RFC allows it?

with that mindset we can never be better than windows.

yes, it is the case with dodgy hardware that sometimes we do end up saying 'windows does it this way and it works; the hardware doesn't follow the specs, so we just need to do it the same way'

but to take that attitude about something that's supposed to be as generic as your network packets can be crippling.

Networking change causes distribution headaches

Posted Oct 29, 2008 2:23 UTC (Wed) by corbet (editor, #1) [Link]

The sad fact is that "what does Windows do?" is a question that kernel developers often have to keep in mind. Whatever Windows does is what's actually tested; it's often the only thing that works. It's a pain.

Networking change causes distribution headaches

Posted Nov 1, 2008 4:32 UTC (Sat) by jbailey (guest, #16890) [Link]

It's not so much a matter of never, so much as knowingly. Linux doing ECN managed to make all sorts of devices on the Internet not cope with Linux, so it had to be disabled in order to work. But it's still there and an option. This thing isn't going to matter one way or the other, so it may as well be done as Windows does it, to avoid any hassle.

Tks,
Jeff Bailey

Networking change causes distribution headaches

Posted Oct 29, 2008 3:48 UTC (Wed) by gdt (subscriber, #6284) [Link] (3 responses)

In the real world, kernels deal with equipment which incorrectly implements specifications all of the time: ranging from hard disks to TCP. TCP itself has one option (the urgent pointer) whose current interpretation differs from the original specification due to an implementation error in early BSD.

This issue is hardly the first home router or firewall issue encountered: some break on ECN, some break on SACK, some incorrectly handle large window scale values. Some of those home routers with bugs run Linux.

It is disappointing that Ubuntu chose to limit the performance of TCP rather than ship a patched kernel.

Networking change causes distribution headaches

Posted Oct 29, 2008 4:02 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

are they limiting the performance of TCP?

I've seen many cases where doing the time calls in the TCP stack becomes the limiting factor, so disabling this should speed up TCP; it limits the features, but not the performance

Networking change causes distribution headaches

Posted Oct 29, 2008 14:52 UTC (Wed) by drag (guest, #31333) [Link] (1 responses)

Well, if your Ubuntu system is failing to, you know, contact the update server to download a fixed kernel because TCP is being blocked by a broken router, when every older version of Ubuntu can do it just fine, and every other OS does it just fine... then yeah, that's a dramatic reduction in performance.

-----------------------------------------

I can't believe Ubuntu people are so closed-minded that they can't understand that if you can't get out on the internet to download a fixed kernel, then you're screwed. Your only option, as an end user, is to download the kernel fix post-installation. But if you can't contact the server because your kernel is triggering a common TCP implementation bug... then you're SOL.

There is a similar issue with DNS brokenness with Linux in general. As in: Linux behaving correctly, but getting bad results because an ISP can't get their shit straight or you have a buggy DNS proxy in some SOHO router. This is pretty common and it prevents end users from being able to reliably use some websites, which otherwise work perfectly well in any other OS. (The fix is usually to install a local DNS caching service like dnsmasq on the system.)

Your bugs / my problem

Posted Oct 29, 2008 18:57 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

This sort of brokenness is universal. Software has bugs. Sometimes the other guy's software has bugs, but you have to pay the price. So long as we don't have some evidence that the bugs were a result of malice, there is nothing much to do except name & shame, and then suck it up.

Prior examples include: DNS servers that silently ignore AAAA requests instead of replying that there's no matching record, causing a timeout for users who merely /enquired/ if they could use IPv6. IP "firewalls" that drop every type of ICMP packet indiscriminately by default. HTTP servers that silently accept pipelined requests, but don't pipeline the answers - so it answers all your HTTP queries, but the results are arbitrarily muddled together. Home routers that silently modify any 4 byte sequence resembling your private IP address to the 4 bytes representing the masqueraded public address? Yes, those really exist. Sometimes it seems like it'd be better to flush it away and start over - but don't make that mistake, we'd make just as many errors next time.

Although they seem to be the worst offenders, the proprietary systems aren't the only ones making these goofs. Samba's buggy attempt at early implementation of a new Windows SMB feature meant that not only could you not use the feature with Samba, but Microsoft had to disable it for Windows clients too, so everyone lost.

And let's not dwell on Debian's OpenSSL goof. To achieve a reasonable expectation of security everyone's SSL implementations should be updated to regard all the affected keys as weak, and reject them outright - but doing that means a permanent increase in the overhead of using SSL forever and for everyone in the whole world. Ouch.

Networking change causes __REGRESSION__

Posted Oct 29, 2008 1:29 UTC (Wed) by brianomahoney (guest, #6206) [Link] (1 responses)

I, for one, am getting _very_fed-up_ with people who don't seem to understand that breaking something that was working is a very bad thing. I completely agree with Linus that these REGRESSIONS are to be avoided, and fixed ASAP.

At the least they are very irritating and usually time-consuming, and there are all too many these days, in the kernel (e.g. this one and e1000e) and in userland (e.g. the latest Firefox/Seamonkey breaking non-CUPS printing).

While it is true that newbies should not be using alpha/beta stuff, it is also true that fewer and fewer corner cases are being tested before shipping the newest-latest... oops!

Networking change causes __REGRESSION__

Posted Oct 29, 2008 4:02 UTC (Wed) by dlang (guest, #313) [Link]

who are you upset at?

the kernel developers did fix it quickly after it was reported.

it's impossible to test against all hardware, as there is nobody in the world who has one of everything to test against (especially when you consider that firmware updates can radically change the behavior as well)

Networking change causes distribution headaches

Posted Oct 29, 2008 2:51 UTC (Wed) by PaulWay (guest, #45600) [Link] (5 responses)

Because the intention is not to wallpaper over the mistake and forget about it; the intention is to alert the user that they have non-compliant hardware on their network and should upgrade.

Because Linux is not an operating system that says "well, it sort of kind of works, that's good enough, why change it?" to decisions like this. Reverting back to the previous behaviour is good to fix the problem short-term, but a long-term solution needs to be developed.

IMO patching device drivers and kernels to make them work with hardware in the machine is (vaguely) acceptable; the further the device is from the machine, the more it's not the kernel's responsibility.

Networking change causes distribution headaches

Posted Oct 29, 2008 4:31 UTC (Wed) by jamesh (guest, #1159) [Link] (2 responses)

In this particular case, there are multiple ways to structure a packet that are considered equally valid according to the RFC and have the same code complexity.

One of the options happens to avoid a bug in certain hardware, probably due to matching the behaviour of a certain competing operating system. Why on earth wouldn't you choose that option?

Your suggestion would result in more complex code that has the potential to be slower and more buggy.

Networking change causes distribution headaches

Posted Oct 30, 2008 7:00 UTC (Thu) by grahammm (guest, #773) [Link] (1 responses)

So what happens when (as is sure to happen some time) option A is needed to avoid a bug in one particular piece of hardware and option B to avoid a bug in a different one?

Networking change causes distribution headaches

Posted Oct 30, 2008 14:44 UTC (Thu) by mrshiny (guest, #4266) [Link]

Worry about that when it happens. Until then, zero-cost workarounds that prevent loss of functionality are more desirable than some sort of notion of purity.

Networking change causes distribution headaches

Posted Oct 29, 2008 12:59 UTC (Wed) by epa (subscriber, #39769) [Link] (1 responses)

> the intention is to alert the user that they have non-compliant hardware on their network and they should upgrade.
Yes, that's exactly what the intention is. Clearly, what the users want most of all is not to get their work done, but to receive useful and informative messages about hardware purchases they need to make in order to remain fully standards-compliant. Imagine a new user's heartfelt shame on first installing Linux and finding out they had been running a router that didn't strictly follow the RFCs, soon turning to joy and gratitude that Linux had revealed their sins and given them an opportunity to buy a replacement, helping to financially support honest manufacturers who test their products with all the world's wide diversity of operating systems.

Compared to these noble goals, it would be baseness and narrow-mindedness indeed for anyone to complain that Linux "doesn't work" or does not let them access networks that seemingly worked with Microsoft Windows. Indeed, we should surely add more of these features to the kernel, righteously refusing to work with any hardware or program that doesn't correctly implement standards, to lead us further towards the goal of a world where all computers work harmoniously together. Let Linux lead the way!

(Excuse the excess of sarcasm, I'm really missing the Linux Hater's Blog since he stopped posting.)

Networking change causes distribution headaches

Posted Oct 29, 2008 18:49 UTC (Wed) by ncm (guest, #165) [Link]

Imagine, further, the joy in the detective work to identify and locate the owner of each intermediate router discarding one's packets, and the further joy of warm human contact achieved while persuading said owners to upgrade their equipment, and said owners' joy in locating and installing upgrades, and in finally having compliant equipment.

Such an outpouring of joy could not but uplift Ubuntu's standing in the world.

Networking change causes distribution headaches

Posted Oct 29, 2008 6:16 UTC (Wed) by davem (guest, #4154) [Link] (9 responses)

I think the procps "fix" is incredibly stupid.

Turning TCP timestamps off has severe consequences, not as
severe as the lack of connectivity this is trying to work
around, but pretty severe.

It has security implications in fact, it makes the range of
what an attacker has to guess to forge packets into your
TCP stream MUCH smaller, for one thing.

It's a really crass move on Ubuntu's part to be so asinine about
making kernel changes, even obvious critical ones like this
option fix, at a late stage. I've run into this problem with them
in the past when they supported sparc64, and this analness wrt.
last minute kernel changes hurts them a lot.

Networking change causes distribution headaches

Posted Oct 29, 2008 7:19 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (7 responses)

No distributor enjoys having to deal with late breaking regressions leading up to a release.

For all the crap I give Canonical for other decisions, I'm not going to beat them up over a time-sensitive judgement call concerning a technical regression in the 11th hour, 57th minute of their release cycle. I would not wish this sort of situation on any distributor with a deadline to meet. They were pressed; they made a judgement call, one which hopefully ensures that all installs have working network connectivity, so that all users can install updates as soon as the install is complete.

Distribution release processes...are painful. I think of it as akin to how childbirth is depicted in older movies. Every time the Fedora release team is in its final week during a release, I feel like I need to be boiling water like you see anxious fathers being told to do by midwives... or something equally futile to stay out of releng's way (lwn posting might count). The release freeze process itself always causes a window of delay where security fixes can crop up that can't be included in the composed "release" tree without scrapping the whole compose process and starting over.

For whatever security implications the chosen quickfix has for Ubuntu users, hopefully Ubuntu will be able to put out a release-day update to all users of 8.10 that fixes the issue properly.

It's moot anyway; most people should be boycotting self-installing 8.10 at release and instead purchasing it as part of a shiny new Dell pre-install, to bolster pre-installed Linux OEM demand statistics for this fiscal quarter. Dell will apply available updates for you as part of the pre-install.

-jef

Networking change causes distribution headaches

Posted Oct 29, 2008 7:45 UTC (Wed) by Cato (guest, #7643) [Link] (2 responses)

I agree about dealing with regressions like this - fixing the kernel in haste would probably have slipped the release date, and most people really don't need TCP timestamps, so it's fine for it to be fixed in a later update - people who do need timestamps (on long fat networks such as satellite links) can enable it quite easily after any routers have been fixed.

Most people should be running 8.04 not 8.10 in any case - as the first release after an LTS release, 8.10 is going to be more risky (hence the 'Intrepid', just like the 'Edgy' for 6.10).

I don't normally run a new Ubuntu release in production until a month or two has elapsed in any case.

Networking change causes distribution headaches

Posted Oct 29, 2008 14:56 UTC (Wed) by drag (guest, #31333) [Link] (1 responses)

The problem is that if you install Ubuntu and you're behind a buggy router, then you may run into severe issues when trying to update your system.

In order to download the "fixed" kernel you need to be able to get on the Internet. If you can't get on the internet due to a "broken" kernel, then how exactly are you supposed to solve your problem?

Catch-22.

The _only_ effective fix for Ubuntu, at this point, is to include the "fixed" kernel with the installer.

And I can understand if they can't pull it off. But it's going to suck for some people that they can't.

(Of course, I am fully aware that it's not Linux being broken, just the environment that Linux is expected to operate in has buggy network hardware sometimes)

Networking change causes distribution headaches

Posted Oct 29, 2008 15:00 UTC (Wed) by drag (guest, #31333) [Link]

*ooops*

that's what I get for not reading it to the end. They are a lot smarter than me, after all.

Networking change causes distribution headaches

Posted Oct 29, 2008 15:18 UTC (Wed) by TRS-80 (guest, #1804) [Link]

I disagree - the reason they had to resort to a workaround instead of applying the actual patch was because there wasn't enough time in their schedule to rebuild the kernel and installer between RC and release. In other words, the Ubuntu schedule made it impossible to fix any show-stopping kernel issue directly if found once the RC is built, which is clearly an avoidable problem.

Networking change causes distribution headaches

Posted Oct 29, 2008 15:40 UTC (Wed) by nevyn (guest, #33129) [Link] (1 responses)

> For whatever security implications the chosen quickfix has for Ubuntu users, hopefully Ubuntu will be able to put out a release day update to all users of 8.10 that addresses the issue which fixes the issue properly.

Understandably you're thinking of rpm here and not dpkg. Because dpkg has no way to do "installonly"-type packages, the kernel has the version in the name... thus there's no good way to say in procps "Requires: kernel >= 2.6.27-2". They might hack it by having a dependency from the fixed kernel on the newer procps, or they might release a procps later and assume no one will install it and still use the GA kernel... but they might just leave timestamps off for 8.10.

Personally it seems like they made a poor choice, but as you point out there are other more fundamental problems ... so this one is not high on the list, IMO.

Networking change causes distribution headaches

Posted Oct 29, 2008 17:06 UTC (Wed) by jspaleta (subscriber, #50639) [Link]

I'm not going to pretend that I have expert knowledge with regard to dpkg.

I can only assume that the Ubuntu release team thought this through and have the ability to push an update out that reverts the quick fix when a proper fix is available and tested.

If there are security implications to turning timestamping off, then Intrepid users should probably impress on the Ubuntu devs the importance of turning timestamping on as an update as soon as possible to limit exposure...in the appropriate Ubuntu communications channel.

I'm not going to falsely stand myself up as a network security expert and make a judgement on the validity of the security concern. Even if the security implications are a valid concern, I think it's reasonable for Ubuntu to use the option of having a release-day update available instead of having to restart their release process to incorporate the upstream fix. As long as a release-day update addresses the security implications by turning timestamping back on and integrates the proper kernel patch for the routing regression, the exposure is mitigated to the level of any security issue which requires a post-release update.

-jef
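
For reference, the quick fix was shipped as a sysctl fragment (the /etc/sysctl.d/10-tcp-timestamps-workaround.conf file mentioned elsewhere in the thread). A sketch of what that fragment presumably contained, and of the revert (exact file contents assumed, not verified):

```
# /etc/sysctl.d/10-tcp-timestamps-workaround.conf (assumed contents)
# Disable TCP timestamps so the option layout matches what the
# broken routers expect; this also disables PAWS as a side effect.
net.ipv4.tcp_timestamps = 0

# Reverting to the kernel default re-enables timestamps (and PAWS):
# net.ipv4.tcp_timestamps = 1
```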

Networking change causes distribution headaches

Posted Oct 29, 2008 23:43 UTC (Wed) by xoddam (subscriber, #2322) [Link]

> "Most people ... Dell pre-install".

Well yes. I did *ask* for it, but Dell for some reason still doesn't sell them in a whole heap of countries, so we simply chose a machine on which Ubuntu is sold pre-installed elsewhere, and installed it ourselves. I'd have chosen the version with Intel graphics (fully supported in free software), but that too isn't available here, so I have a fancy evil nvidia GPU, which luckily is documented to work fine with free-software drivers. And so it did, until I tried to resume after suspend-to-ram. No display.

How fortunate that Ubuntu makes it easy for me to "give up my freedom" and switch to the nasty source-free driver from the GPU maker. I intend to give some time to sorting out the suspend/resume problem with the nv developers (and/or the ubuntu xorg maintainers), but since it means a reboot every time it doesn't work, it will be a very time-consuming process and I couldn't use the machine for its intended purpose meanwhile. Which would annoy my employer, who paid for it.

Networking change causes distribution headaches

Posted Oct 30, 2008 18:23 UTC (Thu) by busterb (guest, #560) [Link]

Look, they already fixed it for real in today's updates:

Setting up procps (1:3.2.7-9ubuntu2.1) ...
Removing obsolete conffile /etc/sysctl.d/10-tcp-timestamps-workaround.conf
 * Setting kernel variables (/etc/sysctl.conf)...                        [ OK ]
 * Setting kernel variables (/etc/sysctl.d/10-console-messages.conf)...  [ OK ]
 * Setting kernel variables (/etc/sysctl.d/10-network-security.conf)...  [ OK ]
 * Setting kernel variables (/etc/sysctl.d/10-process-security.conf)...  [ OK ]
 * Setting kernel variables (/etc/sysctl.d/30-tracker.conf)...           [ OK ]

Setting up linux-headers-2.6.27-7 (2.6.27-7.15) ...
Setting up linux-headers-2.6.27-7-generic (2.6.27-7.15) ...
Examining /etc/kernel/header_postinst.d.

So, which hardware is it?

Posted Oct 29, 2008 15:22 UTC (Wed) by AJWM (guest, #15888) [Link] (5 responses)

I couldn't find a single specific example of offending hardware in this article or any of the comments so far. I'd like to know which makes and models this out-of-spec behaviour has been encountered in so that I know what I might want to replace and what vendors to think twice about in future.

If the hardware doesn't conform to spec in this instance, who knows what other traps lie lurking in the defective hardware implementation?

So, which hardware is it?

Posted Oct 29, 2008 16:56 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (4 responses)

There is a specific router mentioned in the kernel and ubuntu bug reports.

That's the biggest problem with this issue, we don't know how widespread it is.

If there were a way to test for brokenness without having users boot into an affected kernel, something we could have them run as a quick test app, I'd be more than happy to take my rhetorical skills to the Fedora userbase and encourage them to test their network gear for brokenness and report back, so we can get a better handle on which gear manufacturers we need to talk to about firmware updates.

-jef
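
A real quick-test app would have to emit raw SYN packets (and so need root), but the byte-level difference it would probe is easy to sketch. Per the article, the broken routers accept the MSS option before SACK-permitted but reject the reverse, even though both orderings are valid per the RFCs. A minimal Python sketch of the two option layouts (MSS value and padding are illustrative):

```python
import struct

def mss_option(mss):
    # Maximum Segment Size: kind 2, length 4, 16-bit value (RFC 793)
    return struct.pack("!BBH", 2, 4, mss)

def sack_permitted_option():
    # SACK-permitted: kind 4, length 2 (RFC 2018)
    return struct.pack("!BB", 4, 2)

NOP = b"\x01"  # kind 1, single-byte padding

# The ordering the broken routers tolerate: MSS first, then SACK-permitted.
good = mss_option(1460) + sack_permitted_option() + NOP * 2
# Equally valid per the RFCs, but reportedly rejected by some routers.
bad = sack_permitted_option() + mss_option(1460) + NOP * 2

assert len(good) == len(bad) == 8  # both pad to a 32-bit boundary
print(good.hex())  # 020405b404020101
print(bad.hex())   # 0402020405b40101
```

A tester would send SYNs with each layout to a cooperating server and report whether the second one gets an answer.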

So, which hardware is it?

Posted Oct 29, 2008 17:10 UTC (Wed) by jake (editor, #205) [Link] (3 responses)

> There is a specific router mentioned in the kernel and ubuntu bug reports.
>
> That's the biggest problem with this issue, we don't know how widespread it is.

Hmm, I thought another dimension of the problem was that it is not clear that it is only home routers that are problematic. If there is gear installed at the ISPs that is affected by this problem, it doesn't much matter what gear we buy. Which is not to say that it would not be worth knowing, just that no matter how much testing is done and how many new home routers are bought, we may still be routed through bad hardware.

jake

So, which hardware is it?

Posted Oct 29, 2008 18:51 UTC (Wed) by ncm (guest, #165) [Link]

After seven years, routers that discard ECN packets remain common. The routers' owners don't care, and why should they? The only people affected are, y'know, us.

So, which hardware is it?

Posted Oct 30, 2008 15:27 UTC (Thu) by AJWM (guest, #15888) [Link] (1 responses)

Well, those of us who work for ISPs or at companies with major data centers may care (and have influence) over more than just home-based gear. And those of us who work for vendors of such gear can raise their voices towards getting it fixed going forward.

(Minor rant mode: I don't know if it's just me noticing it more, or the problem is getting worse, but lately I'm seeing a lot of messages (posts and emails) complaining about problems without providing any specifics that would allow me (or someone) to investigate/fix the problem. Maybe it's the run-up to the election: all these zero-real-content political messages are causing widespread brain damage. Minor rant mode off.)

So, which hardware is it?

Posted Oct 30, 2008 15:59 UTC (Thu) by jake (editor, #205) [Link]

> Well, those of us who work for ISPs or at companies with major data centers
> may care (and have influence) over more than just home-based gear. And those
> of us who work for vendors of such gear can raise their voices towards
> getting it fixed going forward.

Which would be great of course. It is just not clear to me how Linux users who are experiencing problems with their TCP connectivity will be able to even determine what hardware is causing the problem. They may be able to switch to a known-good home router, but if they still have the problem, it is not obvious (at least to me) how to diagnose it from there. ISPs, at least in my experience, are not very interested in discussing their networking gear with their customers. Alerting them to the problem might help, at least in some cases, but it really isn't ever going to allow Linux to put the options in any arbitrary order.

jake

Networking change causes distribution headaches

Posted Oct 29, 2008 20:29 UTC (Wed) by davem (guest, #4154) [Link] (13 responses)

Btw, if you care at all about your data, you will not run
Ubuntu's release that doesn't fix the kernel and instead
turns timestamps off.

If you turn timestamps off, at rates of 1GB/s and above you
are exposed to possible sequence number wraparound. This in
turn can lead to data corruption. Without timestamps there is
no PAWS protection (Protection Against Wrapped Sequence numbers)
and thus at high enough data rates new data can be interpreted
as old data and vice versa, corrupting your data stream.
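
The wraparound window davem describes is easy to quantify. A back-of-the-envelope sketch (it ignores MSL and real path behavior, so treat the numbers as order-of-magnitude only):

```python
# TCP sequence numbers are 32 bits and count bytes, so the sequence
# space is 2**32 bytes; at a given throughput it wraps in:
SEQ_SPACE = 2 ** 32

def wrap_seconds(bytes_per_second):
    return SEQ_SPACE / bytes_per_second

# At 1 GB/s the space wraps in about 4.3 seconds...
print(round(wrap_seconds(10 ** 9), 1))        # 4.3
# ...and even at 1 Gbit/s (125 MB/s) in about 34 seconds,
print(round(wrap_seconds(125 * 10 ** 6), 1))  # 34.4
# comfortably within the lifetime of a delayed segment on a long
# path, which is why PAWS uses timestamps to tell old data from new.
```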

Ubuntu made the wrong decision, there is simply no argument for
the way this was "handled."

I don't understand why everyone gets their tits in a knot at
even the slightest suggestion of slipping a release in order
to fix a serious bug of this magnitude. It is always the
right thing to do, and it avoids crap like what is happening
here.

To reiterate, if timestamps are off, you are exposed to possible
data corruption.

Networking change causes distribution headaches

Posted Oct 29, 2008 22:04 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (11 responses)

So it's a data corruption issue on top of a security issue.

Has this been communicated into the Ubuntu bug tracker?

-jef

Networking change causes distribution headaches

Posted Oct 29, 2008 22:17 UTC (Wed) by nick.lowe (guest, #54609) [Link] (10 responses)

Yes, I have quoted and posted a link for that very reason.

Networking change causes distribution headaches

Posted Oct 29, 2008 22:32 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (9 responses)

In the context of the problem the workaround attempts to fix... data corruption is a significant problem because it impacts your ability to get updates. Wasn't that the underlying motivation for the quickfix as implemented? Or am I misreading?

-jef

Networking change causes distribution headaches

Posted Oct 29, 2008 22:54 UTC (Wed) by nick.lowe (guest, #54609) [Link] (8 responses)

"data corruption is a significant problem because it impacts your ability to get updates"

No, not at all. :)

It is a separate issue.

The workaround -introduces- a data corruption problem at high data rates because it disables PAWS protection in the TCP/IP stack by virtue of the timestamps no longer being there.

The issue here is that the server release will go out with this, which will be run on machines highly likely to see these data rates!

Networking change causes distribution headaches

Posted Oct 30, 2008 3:02 UTC (Thu) by njs (subscriber, #40338) [Link] (7 responses)

>The issue here is that the server release will go out with this, which will be run on machines highly likely to see these data rates!

Obviously this whole situation is unfortunate, but... is your suggestion really that there are large numbers of people with GB/s equipment who are likely to jump to a non-LTS Ubuntu release, on the first day, and don't read release notes, and don't install updates? Because that seems like a relatively narrow slice of the userbase to me -- not so narrow it should be ignored, so I'm glad you're continuing to help the ubuntu devs keep on top of things, but narrow enough that some of the other folks in this thread could maybe stand to relax a bit...

Networking change causes distribution headaches

Posted Oct 30, 2008 3:40 UTC (Thu) by nick.lowe (guest, #54609) [Link]

Fair point :)

Networking change causes distribution headaches

Posted Oct 30, 2008 6:27 UTC (Thu) by davem (guest, #4154) [Link] (1 responses)

Feel free to ignore the security implications of this change
as I detailed in an earlier comment.

I mentioned the data corrupter just to show how absolutely
insane this was on just about every level.

Want to know the litmus test of how stupid this is? Not one
damn ubuntu kernel developer asked any of the core networking
folks for guidance on how to handle this problem. They didn't
know the implications, and they didn't bother to ask people
who did.

That's the definition of failure.

Networking change causes distribution headaches

Posted Oct 30, 2008 16:08 UTC (Thu) by hppnq (guest, #14462) [Link]

Well, you would have to be doing a distribution upgrade, boot into it immediately and go into production without looking for updates. I think it is fair to assume that not too many people would run into TCP timestamp related corruption. If they really care about their data, obviously their scripts would notice the absence of TCP timestamping with this new release.

Here's a simple explanation for Ubuntu's decision.

As a side note: for home users -- who are extremely unlikely to be running at high enough data rates -- there is (also) the option to revert to the last working kernel. Maybe in a next release, this specific kind of distribution problem will actually be "solved" by Ubuntu. Which would be very nice.

Second side note: a couple of years ago PAWS users were vulnerable -- on a rather big scale -- to a remote DoS. Your mileage will always vary.

Phasing out the broken-workaround procps

Posted Oct 30, 2008 9:49 UTC (Thu) by ncm (guest, #165) [Link] (3 responses)

I still haven't seen a plausible story of how they're going to phase out the buggy workaround. Will there be a good procps package that conflicts with the bad kernel, and a good kernel that conflicts with the workaround procps version? Will they be in the security-updates repository?

Phasing out the broken-workaround procps

Posted Oct 30, 2008 9:56 UTC (Thu) by njs (subscriber, #40338) [Link]

> Will there be a good procps package that conflicts with the bad kernel, and a good kernel that conflicts with the workaround procps version?

Sure, maybe. Is your objection literally that you don't know how they're going to phase it out, or that you're worried that in fact they won't phase it out? I don't know how they're planning to do it (though the suggestion somewhere upthread of checking the runtime version of the kernel sounded plausible to me, and they can just drop it altogether in 9.04 in any case), but I'm pretty confident that they don't want to carry this annoyingness around and will find some way to get rid of it, and the exact mechanism they choose doesn't affect me, so I don't really care what it is.

Phasing out the broken-workaround procps

Posted Oct 31, 2008 1:43 UTC (Fri) by jamesh (guest, #1159) [Link] (1 responses)

I just updated my system, and got a new procps package (1:3.2.7-9ubuntu2.1) that drops the sysctl changes and a new linux kernel package (2.6.27-7.15) that fixes the options order. Both packages came through the intrepid-security repository.

Phasing out the broken-workaround procps

Posted Oct 31, 2008 2:11 UTC (Fri) by ncm (guest, #165) [Link]

Okay, then.

At what point will the download CD/DVD images get the updates?

Networking change causes distribution headaches

Posted Oct 31, 2008 18:50 UTC (Fri) by Cato (guest, #7643) [Link]

If you really mean 1 Gigabyte/second transfer rates, that means a 10G Ethernet link or the equivalent using SONET/DWDM, from a single host, and probably over a long fat pipe (e.g. satellite), which sounds exceedingly unlikely for Ubuntu, which is aimed at consumer and business usage - more like something CERN would do.

IMHO, anyone who is not running Ubuntu on a supercomputer using that type of network connection can safely ignore the chance of the PAWS issue corrupting their data.

Even if you meant 1 Gbps, that's an impressive sustained rate for a high latency network session (i.e. the ones where sequence numbers matter, hence over a WAN). A Gigabit Ethernet LAN would almost by definition have low latencies (a few milliseconds).

I'm generalising here, but I really think the PAWS issue is irrelevant to people who are likely to use Ubuntu. If it was an HPC distro it would be quite different.


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds