
The newest development model and 2.6.14

Posted Nov 3, 2005 12:37 UTC (Thu) by malor (guest, #2973)
Parent article: The newest development model and 2.6.14

The Linux developers have gone off into a very strange, la-la land, if they think two months is A) enough to guarantee a stable kernel, and B) too SLOW.

All code has bugs. Complex code with many interactions has lots of bugs. It takes a long time to test all the cases. While the quality of the code in the linux kernel is very high, you get a geometric increase in bugs from unforeseen interactions. Even if you start with a very low bug per line of code ratio, if you multiply that enough times, you still get big numbers.

The way to get something really stable is to beat on it for a very long time. You have to test and test and test, and then put into production, and test some more. You fix problems by doing a least-possible patch. You don't add new functions. You don't make architectural changes. You do the absolute minimum amount of tweaking possible.

After several years, the 2.4 kernel is pretty stable. It's not perfect; high loads on more esoteric (or cheap) hardware can still cause kernel death. It's mostly good enough for serious production use. Up until 2.4.10 or so, it was a nightmare, but once they branched off to 2.5 and STOPPED MESSING with it, it stabilized.

This new method of kernel development is not about the users, it's about the developers. Their organization is immature and becoming more so. They want to focus on the fun part (writing new code) instead of the boring part (making the code actually work reliably under every possible condition). Yes, they still care about bugs, and they do try to squash them, but it's no longer the primary focus of the dev team. Bugs are an annoying distraction from the fun stuff. The team has more or less waved its hands in the air and decreed that "the distros will make it work". They stop maintaining kernels after sixty days, so if you want further bugfixes, you'll have to take new code along with them. This will NEVER stabilize. New code ALWAYS has bugs.

This is coming at exactly the wrong time. Linux is really starting to get traction, and they've changed the rules midstream. A huge ecosystem was built up around the fact that Linux was solid and reliable and you could trust it. The community exists based on that fundamental social contract, "we're giving you really excellent code that won't fall over, go have fun", and they have changed to "we're giving you rather untested code because we're tired of all that boring bugfixing".

So suddenly, the community doesn't have a stable center anymore. There isn't a canonical kernel. There's no single focus point for the Linux world. Instead, you have a target that's moving so fast that many companies won't even finish testing a new release before it's obsolete. There is no center point which can be taken as THE reference.

Instead, the devs seem to think that the distros will have to do the bugfixing. But there's no guarantee they'll do it in compatible ways. If, say, shared memory has a problem, Red Hat may solve it one way, and Mandrake may solve it another, and it's entirely possible they won't be compatible. So if you, the customer, have a problem with a particular program running (like, say, Oracle), then all you end up with is finger pointing. Oracle blames Red Hat, Mandrake blames Oracle.

What you'll inevitably end up with is a kernel fork, where Oracle will pick, for instance, just one supplier to support. If we get two separate commercial products that choose different distros, well, you're not going to be able to run them both at the same time without a lot of pain. And few commercial vendors are going to choose Debian, which I consider the closest to a 'true Linux' distro out there.

There needs to be a stable center. There needs to be a One True Kernel, the Real Linux. It needs to be rock-solid reliable. It needs to be the reference so that all finger pointing stops. If Oracle doesn't run on the One True Kernel, then it's Oracle's problem. If a distro changes the behavior of the One True Kernel and it breaks someone's app, then it's the distro's fault.

Given that center, there can be explosive innovation in all directions, some of which will end up back in the One True Kernel. It allows both evolution and selection. Competition is wonderful, but without a strong center of gravity, the creative forces in the different kernel trees will inevitably force them into incompatible directions.

The kernel devs need to come back to reality, where people plan deployments months in advance, and take a year to roll something out. The Linux kernel has enjoyed what success it has in business not just because developers liked it, but because system administrators loved it.

This system administrator, for one, has most emphatically fallen out of love.



The newest development model and 2.6.14

Posted Nov 3, 2005 14:18 UTC (Thu) by tkreagan (subscriber, #4548) [Link]

Actually, I agree with this. Linux has turned into an interesting research OS. It's lost interest as a
stable system. And the focus on fun code vs tight code or efficient code means that it's just a
matter of time until some lurking issue surfaces and serves up a big piece of humble pie.

That being said, it's still fascinating to follow the developments and see them feeling their way
towards the future. There should be a parallel group merging the work of the distributors back into
the main line.

I've moved to OpenBSD - lots of interesting stuff to play with, compatible enough, yet so totally
stable that I can run my firewalls/public servers on it and simply forget about it.

--tkr

The newest development model and 2.6.14

Posted Nov 3, 2005 15:26 UTC (Thu) by mingo (subscriber, #31122) [Link]

(disclaimer: i am a Linux kernel developer)

i have read your comments with interest, up to this one:

>>>
Instead, the devs seem to think that the distros will have to do the bugfixing. But there's no guarantee they'll do it in compatible ways. If, say, shared memory has a problem, Red Hat may solve it one way, and Mandrake may solve it another, and it's entirely possible they won't be compatible. So if you, the customer, have a problem with a particular program running (like, say, Oracle), then all you end up with is finger pointing. Oracle blames Red Hat, Mandrake blames Oracle.
<<<

while i mostly agree with the points you raised before this, you are really - i stress - _really_ wrong regarding compatibility, and about the mechanics of Linux kernel bugfixing.

firstly, distros do not fix bugs 'for themselves', they fix them in the _upstream_ kernel, primarily because they don't want the additional maintenance cost of having to carry a special fix with them. So they try really, really hard to have an upstream solution for whatever bug. If upstream has changed in that area too much they _still_ fix the bug in the upstream kernel and backport that fix to their own kernel. That way they'll be able to monitor the upstream kernel for possible side-effects and reap other QA benefits.

the result: distros fix kernel bugs pretty much the same way: they take an existing upstream fix in 90% of the cases, and in 9% of the cases they fix upstream themselves and backport the upstream fix. Only in a small portion (1%) of cases do they write their 'own' special fix, and even then they still try to get 'rid of' such fixes by getting them upstream, because per-distro maintenance is expensive.

secondly, the overwhelming majority of stability fixes are corner-case fixes, and only a small minority of bugfixes can introduce something user-visible like an incompatibility. Distros are very much aware of such issues and they try to avoid incompatible changes like fire: incompatibility almost always means 'apps break on our distro only' which causes follow-up regressions, so it's a big no-no.

out of tens of thousands of distro bugfixes i can recall one or two at most that caused some sort of (incidental and harmless) incompatibility, which was quickly fixed later on. So i believe that what you fear is really not happening in practice. Fact is that distributions almost always 'fork' the upstream kernel - and it's a natural thing. SuSE for example has more than 1000 patches on top of the upstream kernel. So if there was any incentive for an incompatible fork, it would have happened years ago. But it didn't happen, for the reasons i outlined, and for a variety of other reasons.

my personal observation is that the new 2.6 kernel development method has actually improved the dynamics of bugfixing. E.g. Fedora Rawhide (which is as bleeding-edge as you can get - it picks up Linus' kernel tree 2-3 days after Linus _commits_ it to his tree) has quite okay-ish stability to run on most of my boxes, and it even has a daily yum enabled to automatically install all devel RPMs. There's a new kernel RPM almost every day, often with kernel stuff that i wrote perhaps a week ago and which got into Linus' tree a couple of days ago - such a short 'latency of deployment' was unheard of before. In the 2.3/2.4 days we literally had to wait years to get stuff into distro kernels, and having a small latency here helps reliability immensely.

and it is clearly the new 2.6 development method that enabled this: it is stable enough to run bleeding-edge distro code, giving much closer interaction between latest kernel and latest user-space developments. Previously there was a big lag between kernel and userspace - and often the devel kernel didn't even build in a distro setup - let alone boot. So there was a lot of effort wasted on trying to fix bugs that could have been found easily by a large, dedicated team of early adopters. With Ubuntu, OpenSuSE and Fedora these early adopters are there now en masse, and are making a difference big time. They are finding kernel bugs much faster, leading to the seemingly paradoxical situation of "more changes result in more stability".

regarding reliability, there will always be deployments where not even the 2.0 kernel is proven enough. It is the market that decides how fast various distros should go - you can certainly pick something like RHEL to have much more conservative updates. You can vote with your money.

The newest development model and 2.6.14

Posted Nov 3, 2005 22:01 UTC (Thu) by iabervon (subscriber, #722) [Link]

I agree that there are potential concerns there, but you seem to have a strange idea of the 2.4 and before days. At the time, no distro shipped anything similar to the vanilla kernel, and vendor software really only worked at all with a single distro, because dealing with the divergence in backported features, bugfixes, and external patches applied by distros really was too much to handle. (Of course, a certain portion of this was userspace divergence, where everyone had a different libc version and C++ ABI; but it was also true of the kernel). In particular, Oracle would only run on Red Hat, not on any vanilla kernel, and, IIRC, depended on a patch they'd done themselves which was only in Red Hat, because it was too intrusive to get into mainline 2.4. Part of the point of the change was so that, when something was made to work on today's Red Hat, it would work on next week's vanilla kernel, not only on the vanilla kernel in three years when something yet newer was needed. This requires that the kernel development cycle be significantly shorter than a product lifetime.

Really, I think the main useful improvement I could see would be to have a "why isn't 2.6.X being released yet" section on kernel.org, so developers can see what they could do to get to the next development cycle faster. The next thing I could see is having a QA period run by the -stable team when Linus is done with 2.6.X and opens 2.6.X+1. Then, perhaps starting the inclusion period for 2.6.X arbitrarily early, maybe even with a "preview" patchset so that interested people could see what's coming and test if they think they'll be affected (this would be a bit like -mm, but only include things proposed for inclusion in the next vanilla kernel). Two months is plenty of time if new code is already in good shape when it enters the process and there is effective testing and debugging during those two months. No amount of time is enough if nothing is getting done to improve things during the time.

The newest development model and 2.6.14

Posted Nov 14, 2005 19:28 UTC (Mon) by malor (guest, #2973) [Link]

I missed this before, sorry.

My experience with Linux from about 0.8 through 2.4 was very simply this: I rolled my own kernel. Even when I was using a distro, I still rolled my own. I can't do that anymore.

And Debian shipped that way from the very start. They've had to change their procedures quite a bit, in fact, to cope with the steadily-decreasing reliability of the Linux kernel. They've had to do backporting of bugfixes and that sort of thing, where they used to just track either 2.2 or 2.4 vanilla. Even 2.4 wasn't good enough for Debian for a long time, and it's MUCH better than 2.6.

I feel bad for the 2.6 Debian maintainers.

The newest development model and 2.6.14

Posted Nov 4, 2005 4:14 UTC (Fri) by mgb (guest, #3226) [Link]

I've been running production systems on Unixen for a quarter century, including Linux for more than a decade. Late 2.4 was definitely the high point.

2.6 may run better than 2.4 on 1024-way clusters but who cares? For everyday use 2.6 requires 2 to 4 times the RAM to accomplish the same task as 2.4. A laptop that used to handle a KDE desktop now takes more than half an hour just to boot a shell because of all the bureaucracy which replaced the efficient, reliable mknod in /dev.

2.6 kernel quality is not awful, but it's nothing to write home about either. And then we have the ever-changing kernel ABI and the kernel crew's antipathy to MadWifi, which makes Linux WiFi more trouble than it's worth. It's easier to carry around a LinkSys gateway than hunt down new MadWifi RPMs for PCMCIA cards every time someone on the kernel team sneezes.

Many kernel hackers are now being paid professional salaries. Why then has product discipline dropped to amateur levels?

The newest development model and 2.6.14

Posted Nov 4, 2005 7:34 UTC (Fri) by emkey (guest, #144) [Link]

My pet peeve is the way new kernels will occasionally change things such that yesterday's eth0 is now eth5. This should NEVER EVER EVER CHANGE! I feel better now. Having to run tcpdump to figure out which interface goes with which network after a kernel update is just plain wrong.

The old model of kernel development had its issues, but it seems like things have swung to the opposite extreme, which is a pretty typical overreaction.

I predict another major change sometime in the next two to three years. Hopefully to a system that will stick around for the long haul.

The newest development model and 2.6.14

Posted Nov 4, 2005 15:05 UTC (Fri) by busterb (subscriber, #560) [Link]

There is an easy fix for that; give your interfaces fixed (even
meaningful!) names. see 'man nameif'; it's installed by default with
Debian at least.

The newest development model and 2.6.14

Posted Nov 4, 2005 18:52 UTC (Fri) by yokem_55 (subscriber, #10498) [Link]

This sounds like a distro problem as opposed to a kernel problem. I've never seen an ethX interface change its name simply from updating the kernel that's running the device....

The newest development model and 2.6.14

Posted Nov 4, 2005 19:13 UTC (Fri) by emkey (guest, #144) [Link]

How many systems do you have with two or more interfaces? The higher the number the more likely you are to see this.

While I won't say this scenario is frequent, it should be nonexistent. Most recently I've witnessed it when going from RHEL3 to RHEL4 on a system with five interfaces. Could this be Red Hat's fault? Possibly, though based on other experiences I suspect this is more of a 2.4->2.6 issue.

The newest development model and 2.6.14

Posted Nov 4, 2005 22:11 UTC (Fri) by tjw.org (guest, #20716) [Link]

It could very well be RH's problem for changing the modprobe order. If you have different chipsets among your interfaces, the ethX name will be dependent on which kernel module gets loaded first.

You may be able to avoid these problems by just adding something like this to your modules.conf:

alias eth0 e100
alias eth1 e100
...
alias eth5 tulip
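
To illustrate why load order matters, here is a minimal Python sketch, not kernel code. It assumes a simplified model in which each driver, as it loads, hands out the next free ethN name to each card it finds. The function name and the card labels are hypothetical.

```python
def enumerate_interfaces(load_order, cards_by_driver):
    """Hand out ethN names in the order drivers load and find their cards.

    This is a toy model of the behaviour described above, not the
    kernel's actual probing logic.
    """
    names = {}
    index = 0
    for driver in load_order:
        for card in cards_by_driver.get(driver, []):
            names[f"eth{index}"] = card
            index += 1
    return names


# Hypothetical hardware: two e100 cards and one tulip card.
cards = {"e100": ["e100 card A", "e100 card B"], "tulip": ["tulip card"]}

print(enumerate_interfaces(["e100", "tulip"], cards))
print(enumerate_interfaces(["tulip", "e100"], cards))
```

With the same hardware, swapping the load order moves the tulip card from eth2 to eth0. The alias lines above pin down which module is loaded for which name.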

The newest development model and 2.6.14

Posted Nov 5, 2005 0:07 UTC (Sat) by emkey (guest, #144) [Link]

They are all e1000s in this case. And actually we have six total, though only five are in use. Two are built-in copper, and the other four are provided by dual fibre E1000 cards. The first two remained the same (eth0, eth1) after the upgrade. The cards swapped though (i.e., eth2, eth3 became eth4, eth5 and eth4, eth5 became eth2, eth3).

A few minutes with tcpdump solved the mystery as to why things weren't working properly after the upgrade.

There may be a way around this, but there really shouldn't need to be.

The newest development model and 2.6.14

Posted Nov 6, 2005 23:47 UTC (Sun) by zblaxell (subscriber, #26385) [Link]

I think the real problem is that there is no guarantee of stability in unit enumeration in most distros. The mechanism certainly exists in the kernel, and has existed since 2.1.somewhere-near-100, but there isn't a user-space implementation installed by default on most distros.

eth0 is "the first detected ethernet card", eth5 is "the sixth detected ethernet card". If the eth0 card's PCI bus controller dies (as mine did a year or two ago), suddenly eth5 becomes eth4, eth4 becomes eth3, etc. This does horrible things if you were enforcing some kind of security on those devices, and the machine manages to come back up after this sort of failure. This can be triggered with just a lightning strike and a reboot--same kernel version, same distribution, but suddenly some or all of the ethernet cards have new names because a low-numbered one got zapped. It's inconvenient, but don't blame the kernel developers for breaking your fragile configuration.

This isn't a new problem that arose in 2.6, it has *always* been there. Use 'ip link set ... name ...' or 'nameif' to force your network devices to have specific names that don't match any possible default name. Set up your routing and firewall rules to use the specific names, and firewall everything that has an anonymous "eth0"-style name to the DROP target. Once configured, your interfaces will never be renamed again, although now you'll have to update the MAC address table every time you swap out a card or build a new machine.
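
Building that MAC address table means knowing which address each interface currently has. The following is a rough Python sketch of one way to collect that, assuming a kernel with sysfs mounted so that each interface exposes its address at /sys/class/net/<name>/address. The function name list_macs is made up for this example.

```python
import os


def list_macs(sysfs_net="/sys/class/net"):
    """Return {interface name: MAC address} read from sysfs.

    Assumes each interface directory holds an 'address' file; entries
    without one are skipped.
    """
    result = {}
    if not os.path.isdir(sysfs_net):
        return result
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "address")
        try:
            with open(path) as fh:
                result[name] = fh.read().strip().lower()
        except OSError:
            # Not an interface directory, or no readable address file.
            continue
    return result


if __name__ == "__main__":
    for name, mac in list_macs().items():
        print(name, mac)
```

The output is only a starting point: you would still have to decide on a fixed name for each address and record it in whatever table your renaming tool reads.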

Distro vendors could help people in your situation (only read half of the manuals, built a broken configuration, got surprised when the interest payment on technical debt became due) by including a user-space tool which assigns dynamic but persistent device names, so "eth0" would become "the first ethernet card *ever* detected in the system", "eth1" would be "the second ethernet card *ever* detected," etc. Single-user systems would only see "eth0", gateway hosts would have "eth0" through "eth5" that behave the way you expect between reboots, machines which replaced a broken NIC would have just an "eth1" since there's no way for the system to know if the card-formerly-known-as-eth0 might come back one day. It might be somewhat inconvenient to replace a card (you'd have to update routing table and firewalls for eth1 instead of eth0), but that's what you get for not reading the manual--and if you did, you'd probably find the state file that defines the persistent mapping and just edit it manually.
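
The "dynamic but persistent" scheme could look roughly like the Python sketch below. It is an illustration under stated assumptions, not an existing tool: the state is a plain dictionary from MAC address to name (a real tool would keep it in a state file), and assign_persistent_names is a hypothetical function name.

```python
def assign_persistent_names(state, detected_macs, prefix="eth"):
    """Give every MAC address ever seen a name that never changes.

    'state' maps MAC -> name and is updated in place. Entries are never
    removed, so a card that disappears keeps its name reserved.
    Returns the MAC -> name mapping for the cards detected right now.
    """
    current = {}
    for mac in detected_macs:
        mac = mac.lower()
        if mac not in state:
            # Next unused index: the state only grows, so its size works.
            state[mac] = f"{prefix}{len(state)}"
        current[mac] = state[mac]
    return current


state = {}
# First boot: two cards detected.
print(assign_persistent_names(state, ["00:00:00:00:00:0a", "00:00:00:00:00:0b"]))
# Later boot: the first card died and was replaced by a new one.
print(assign_persistent_names(state, ["00:00:00:00:00:0b", "00:00:00:00:00:0c"]))
```

On the second boot the surviving card is still eth1 and the replacement becomes eth2, while eth0 stays reserved for the card that may or may not come back. Editing the saved state by hand is how you would give the replacement the old name.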

The newest development model and 2.6.14

Posted Nov 7, 2005 2:23 UTC (Mon) by emkey (guest, #144) [Link]

In our situation we would never come up with a networking card "missing". We have mechanisms in place to make sure this doesn't happen.

Being able to tie a device to a given MAC address is potentially interesting though.

As for not reading the manuals, I don't read the source code to the kernel either. The environment I work in takes hundreds of pages to document and is changing in small ways on a daily basis. I read LWN and am always on the lookout for new sources of information, but the simple truth of the matter is I do not have the time to be an expert in every single aspect of Linux, networking, etc. I wish I did.

Thanks for the information.

The newest development model and 2.6.14

Posted Nov 7, 2005 2:29 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

Last time I checked, mknod in /dev still works if you want it. udev is just a glorified automated mknod in /dev which the kernel invokes from time to time. devfs had a lot of kernel-side bureaucracy that you could only get rid of by removing it from the kernel, which as of 2.6.13 has now been done permanently. Your distro vendor may give you some trouble, if they've built the system to rely on devfs or udev.

I experienced significant slowdowns on my laptop running 2.6 after upgrading from 2.4, until I configured the I/O scheduler to use cfq instead of the anticipatory scheduler. The defaults seem to be tuned for systems with a pair of high-performance SCSI disks arranged in RAID0 or RAID1...but on a laptop hard drive they multiply boot times by 10.
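
To check which scheduler a disk is using, 2.6 kernels list the available schedulers in /sys/block/<device>/queue/scheduler, with the active one in square brackets. Here is a small Python sketch that parses that line. The helper names are made up for this example, and "sda" stands in for whatever your disk is called.

```python
def parse_scheduler_line(line):
    """Split a scheduler line into (active scheduler, all available ones).

    Example input: "noop [anticipatory] deadline cfq"
    """
    available = []
    active = None
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device="sda"):
    """Read and parse the scheduler line for one block device."""
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return parse_scheduler_line(fh.read())


print(parse_scheduler_line("noop [anticipatory] deadline cfq"))
```

On kernels that support switching at runtime, writing a scheduler name such as cfq to the same file as root selects it for that device. Otherwise the elevator= boot parameter sets the default.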

Measuring memory usage is different in 2.6. There are some new statistics, and statistics with old names are calculated differently, so it's hard to do a 1:1 comparison--and that in and of itself is annoying. It's hard to tell if there are more programs waiting for disk I/O because of increased RAM usage, or due to a new block I/O scheduler or some new kind of lazy or preemptive swapping scheme.

The newest development model and 2.6.14

Posted Nov 14, 2005 18:42 UTC (Mon) by mmarq (guest, #2332) [Link]

"" The community exists based on that fundamental social contract, "we're giving you really excellent code that won't fall over, go have fun", and they have changed to "we're giving you rather untested code because we're tired of all that boring bugfixing". ""

But kernel developers don't have to be bored trying to catch all the bugs. There has always been a large portion of advanced users who live on cutting-edge software, who like to experiment with new stuff, and who know the risks.

So i believe they would be more than thrilled to help, even if they don't have enough expertise, if a good and effective bug tracking and reporting mechanism were in place. I always had a debugger running on my systems, and i usually submit bug reports to the corresponding facilities for gnome, kde and firefox, though most of the time i don't exactly understand what is really going on. So why not for the kernel?

This process could be refined to the point where every single user can submit reports, from every part of their system, requiring very little knowledge of the process and *no_knowledge* at all of programming.

What is needed is a decentralized system with a single point of entry. This gateway can sort reports automatically using a handful of rules, distributing almost all of them to the corresponding registered maintainer, and only rarely prompting for manual attention. That creates the conditions for a hierarchical order for dealing with bugs.
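
As a rough illustration of such a gateway, here is a Python sketch with a handful of rules. Everything in it is hypothetical: the rule table, the destination names and the route_report function are invented for this example and do not describe any existing tracker.

```python
# Hypothetical rules: component prefix -> where the report should go.
ROUTING_RULES = [
    ("kernel", "kernel bug tracker"),
    ("kde", "KDE bug tracker"),
    ("gnome", "GNOME bug tracker"),
    ("firefox", "Mozilla bug tracker"),
]

MANUAL_TRIAGE = "manual triage queue"


def route_report(component, rules=ROUTING_RULES):
    """Pick a destination for a report from the component it names.

    The first rule whose prefix matches wins; anything unmatched is
    left for a human to sort out.
    """
    key = component.strip().lower()
    for prefix, destination in rules:
        if key.startswith(prefix):
            return destination
    return MANUAL_TRIAGE


print(route_report("Kernel/net"))
print(route_report("some-unknown-package"))
```

The first report goes to the kernel tracker and the second falls through to manual attention, which should be the rare case if the rules cover the common components.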

Sure there can be confusion (i don't like calling it war) about the precise bug tracking mechanism. Will it be bugzilla? Can't a standard be created allowing interoperable mechanisms, so that what is for a particular server goes to the maintenance party of that server, what is for the kernel goes to the kernel, and what is for kde, gnome or firefox goes to kde, gnome and mozilla?

I believe this is quite feasible without having packs of developers wanting to kick its proponents in the head for suggesting burying them under tons of meaningless reports.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds