Andrew Morton on kernel development
Years ago, there was a great deal of worry about the possibility of burning out Linus. Life seems to have gotten easier for him since then; now instead, I've heard concerns about burning out Andrew. It seems that you do a lot; how do you keep the pace and how long can we expect you to stay at it?
I'm still keeping up with the reviewing and merging but the -mm release periods are now far too long.
There are of course many things which I should do but which I do not.
Over the years my role has fortunately decreased - more maintainers are running their own trees and the introduction of the linux-next tree (operated by Stephen Rothwell) has helped a lot.
The linux-next tree means that 85% of the code which I used to redistribute for external testing is now being redistributed by Stephen. Some time in the next month or two I will dive into my scripts and will find a way to get the sufficiently-stable parts of the -mm tree into linux-next and then I will hopefully be able to stop doing -mm releases altogether.
So. The work level is ramping down, and others are taking things on.
What can we do to help?
Secondly: it would help if people's patches were less buggy. I still have to fix a stupidly large number of compile warnings and compilation errors and each -mm release requires me to perform probably three or four separate bisection searches to weed out bad patches.
Thirdly: testing, testing, testing.
Fourthly: it's stupid how often I end up being the primary responder on bug reports. I'll typically read the linux-kernel list in 1000-email batches once every few days and each time I will come across multiple bug reports which are one to three days old and which nobody has done anything about! And sometimes I know that the person who is responsible for that part of the kernel has read the report. grr.
Is it your opinion that the quality of the kernel is in decline? Most developers seem to be pretty sanguine about the overall quality problem. Assuming there's a difference of opinion here, where do you think it comes from? How can we resolve it?
When I'm out and about I will very often hear from people whose machines we broke in ways which I'd never heard about before. I ask them to send a bug report (expecting that nothing will end up being done about it) but they rarely do.
So I don't know where we are and I don't know what to do. All I can do is to encourage testers to report bugs and to be persistent with them, and I continue to stick my thumb in developers' ribs to get something done about them.
I do think that it would be nice to have a bugfix-only kernel release. One which is loudly publicised and during which we encourage everyone to send us their bug reports and we'll spend a couple of months doing nothing else but try to fix them. I haven't pushed this much at all, but it would be interesting to try it once. If it is beneficial, we can do it again some other time.
There have been a number of kernel security problems disclosed recently. Is any particular effort being put into the prevention and repair of security holes? What do you think we should be doing in this area?
But a security hole is just a bug - a particular type of bug - so one way in which we can reduce the incidence rate is to write fewer bugs. See above: more careful coding, more careful review, etc.
Now, is there any special pattern to a security-affecting bug? One which would allow us to focus more resources on preventing that type of bug than we do upon preventing "average" bugs? Well, perhaps. If someone were to sit down and go through the past five years' worth of kernel security bugs and pull together an overall picture of what our commonly-made security-affecting bugs are, then that information could perhaps be used to guide code-reviewers' efforts and code-checking tools.
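By way of a concrete illustration - a hypothetical userspace analogue, invented here rather than taken from any real driver - one pattern such a survey would almost certainly turn up is trusting a user-controlled length field before validating it:

```c
/* Illustrative only: a hypothetical userspace analogue of a pattern
 * kernel security surveys commonly flag - an attacker-controlled
 * length is used as a copy size before it is validated. The message
 * format and all names here are invented for the example. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Message format: a 4-byte native-endian length header, followed by
 * that many payload bytes. */

static int parse_msg_broken(const unsigned char *pkt, size_t pkt_len)
{
    unsigned char buf[64];
    uint32_t len;

    if (pkt_len < sizeof(len))
        return -1;
    memcpy(&len, pkt, sizeof(len));     /* read the header safely */

    /* BUG: len is never checked against sizeof(buf) or pkt_len, so a
     * crafted message overruns the stack buffer. */
    memcpy(buf, pkt + sizeof(len), len);
    return buf[0];
}

static int parse_msg_fixed(const unsigned char *pkt, size_t pkt_len)
{
    unsigned char buf[64];
    uint32_t len;

    if (pkt_len < sizeof(len))
        return -1;
    memcpy(&len, pkt, sizeof(len));

    /* The fix: bound the length by both the buffer and the packet. */
    if (len > sizeof(buf) || len > pkt_len - sizeof(len))
        return -1;
    memcpy(buf, pkt + sizeof(len), len);
    return buf[0];
}

int main(void)
{
    unsigned char pkt[128] = { 0 };
    uint32_t len = 5;

    memcpy(pkt, &len, sizeof(len));
    memcpy(pkt + sizeof(len), "hello", 5);
    printf("fixed parser returns %d\n", parse_msg_fixed(pkt, sizeof(pkt)));
    (void)parse_msg_broken;   /* kept only to show the broken pattern */
    return 0;
}
```

The broken and fixed variants differ only in two bounds checks - exactly the kind of small, mechanical property that a reviewer's checklist or a code-checking tool can look for.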
That being said, I have the impression that most of our "security holes" are bugs in ancient crufty old code, mainly drivers, which nobody runs and which nobody even loads. So most metrics and measurements on kernel security holes are, I believe, misleading and unuseful.
Those security-affecting bugs in the core kernel which affect all kernel users are rare, simply because so much attention and work gets devoted to the core kernel. This is why the recent splice bug was such a surprise and head-slapper.
I have sensed that there is a bit of confusion about the difference between -mm and linux-next. How would you describe the purpose of these two trees? Which one should interested people be testing?
The -mm tree used to consist of the following:
- 80-odd subsystem maintainer trees (git and quilt), eg: scsi, usb, net.
- various patches which I picked up which should be in a subsystem maintainer's tree, but which for one of various reasons didn't get merged there. I spend a lot of time acting as backup for leaky maintainers.
- patches which are mastered in the -mm tree. These are now organised as subsystems too, and I count about 100 such subsystems which are mastered in -mm. eg: fbdev, signals, uml, procfs. And memory management.
- more speculative things which aren't intended for mainline in the short-term, such as new filesystems (eg reiser4).
- debugging patches which I never intend to go upstream.
The 80-odd subsystem trees in fact account for 85% of the changes which go into Linux. Pretty much all of the remaining 15% are the only-in-mm patches.
Right now (at 2.6.26-rc4 in "kernel time"), the 80-odd subsystem trees are in linux-next. I now merge linux-next into -mm rather than the 80-odd separate trees.
As mentioned previously, I plan to move more of -mm into linux-next - the 100-odd little subsystem trees.
Once that has happened, there isn't really much left in -mm. Just
- the patches which subsystem maintainers leaked. I send these to the subsystem maintainers.
- the speculative not-for-next-release features
- the not-to-be-merged debugging patches.
Do you have any specific goals for the development of the kernel over the next year or so? What would they be?
I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there! Eventually we should get into more of a maintenance mode where we just fix bugs, tweak performance and add new drivers. Famous last words.
And it's just vaguely possible that we're starting to see that happening now. I do get a sense that there are fewer "big" changes coming in. When I sent my usual 1000-patch stream at Linus for 2.6.26 I actually received an email from him asking (paraphrased) "hey, where's all the scary stuff?"
In the early-May discussions, Linus said a couple of times that he does not think code review helps much. Do you agree with that point of view?
How would you describe the real role of code review in the kernel development process?
It also increases the number of people who have an understanding of the new code - both the reviewer(s) and those who closely followed the review are now better able to support that code.
Also, I expect that the prospect of receiving a close review will keep the originators on their toes - make them take more care over their work.
There clearly must be quite a bit of communication between you and Linus, but much of it, it seems, is out of the public view. Could you describe how the two of you work together? How are decisions (such as when to release) made?
We each know how the other works and I hope we find each other predictable and that we have no particular issues with the other's actions. There just doesn't seem to be much to say, really.
Is there anything else you would like to say to LWN's readers?
Nothing special is needed - just install it on as many machines as you dare and use them in your normal day-to-day activities.
If you do hit a bug (and you will) then please be persistent in getting us to fix it. Don't let us release a kernel with your bug in it! Shout at us if that's what it takes. Just don't let us break your machines.
Our testers are our greatest resource - the whole kernel project would grind to a complete halt without them. I profusely thank them at every opportunity I get :)
We would like to thank Andrew for taking time to answer our questions.
Index entries for this article: Kernel / Development model
Posted Jun 11, 2008 17:27 UTC (Wed) by hmh (subscriber, #3838)
> It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.

Which many will do, causing total chaos in the next merge window. That's the reason why it was not done yet, AFAIK. Now, if we could get a sufficiently big number of kernel regulars (like at least 50% of the ones with more than three patches merged in the last three releases) and all subsystem maintainers (so as to keep the new-feature-craze crowd under control) to pledge to the big bugfix experiment, then it just might work.
Posted Jun 11, 2008 17:59 UTC (Wed) by proski (subscriber, #104)
If some kernel is declared stable, it means that only bugfixes are accepted. In other words, the merge window is skipped. To make the point, the previous kernel could be tagged as rc1 for the stable kernel. I don't know if it's going to work, but it may be worth trying once.
Posted Jun 11, 2008 22:57 UTC (Wed) by erwbgy (subscriber, #4104)
That should be http://test.kernel.org/ and http://test.kernel.org/autotest for documentation.
Posted Jun 13, 2008 18:09 UTC (Fri) by mikov (guest, #33179)
Sigh. I explained this a couple of times.
It is not specific to my hardware. As I already said, I have tested this with several different pl2303 converters, including very expensive ones. I have tested it on different machines with different USB chipsets. I have even tested a couple of different kernel versions. I am not an idiot, you know :-)
The description of the problem is simple and I don't see why I have to keep repeating it over and over. Apparently USB1.1 devices have problems when plugged into USB 2.0 hubs.
I agree that it is not exactly the same thing as described in the linked post, but it looks dangerously similar, and as of 2.6.22 the proposed fix was still marked experimental.
I also agree that it is theoretically possible that only the PL2303 driver has this specific problem. I don't think that is the case though.
See above. If you really want to show me that I don't understand anything, try it with a USB 2.0 hub. Run it for 24 hours, checking whether there is even a single missed byte in either direction. Then tell me that "it works just fine".

Also, I am not refusing to get involved. Did you not see one of my posts asking what a good venue to report this problem is?

I didn't mean to discuss this particular issue in depth. I did not ask for advice on fixing it. I used it just as an example. Apparently not a very good one, because my point did not get through.

In Windows, at least theoretically, either the manufacturer or Microsoft is responsible for doing something if there is a problem. In Linux there is generally no responsibility unless you purchase a support contract, which is much more expensive than the price of a copy of Windows. What qualification would you use for this?
Posted Jun 12, 2008 1:28 UTC (Thu) by proski (subscriber, #104)

Bisecting bugs doesn't require deep knowledge. It requires a fast computer and some time to test kernels for the problem. And you keep the computer after you're done :-)

I realize that the problem you have with USB 2.0 is not bisectable, but many other problems are.
Posted Jun 12, 2008 2:21 UTC (Thu) by pr1268 (guest, #24648)
First of all, I wish to thank Andrew for his thoughts and time in responding. Such discussion relating to kernel development is refreshing.

As for reporting bugs, I've two in particular with 2.6.25[.x] that I've been loath to report: (1) a "make oldconfig" run that left me with a broken configuration despite starting from a working .config, and (2) CDs/DVDs that fail to mount with spurious "no media" errors. Granted, I was hesitant to report either of these (until now) because I was unsure whether (1) was operator error, and whether (2) was expected new behavior given the patches submitted for 2.6.25. Plus, I didn't want to add yet another message to the (already crowded) LKML.

But, I'm curious: would reporting either of these be appropriate? I would certainly love to contribute to the kernel development project--I even subscribed to the LKML (having been inspired by our editor's eagerness to help with the Kill-The-BKL project)--but being a newbie, I could use a little guidance. Thanks again to Andrew for his candor.
Posted Jun 12, 2008 6:41 UTC (Thu) by mingo (guest, #31122)
> (1) isn't really a bug since "make oldconfig" is "expected behavior".

It is a serious upstream kernel regression if "make oldconfig" (used on a .config that worked with a previous version of the kernel) suddenly breaks a working setup. Please report it if you get hit by such a bug/regression and it will be fixed.
We'd be shooting ourselves in the foot if we made it harder to test new kernels.
Ditto for the second bug - if mounting CDs worked well before and it suddenly starts producing spurious "no media" mount failures that's a plain bug/regression.
Please report them on bugzilla.kernel.org.
Also, if you test new kernels, make sure you run the kerneloops client which automatically reports crashes to kerneloops.org.
Posted Jun 12, 2008 7:51 UTC (Thu) by pr1268 (guest, #24648)
Thank you both for the replies. I'm beginning to wonder if the two issues I have with 2.6.25[.x] are related in a weird way.

Another reason why I was loath to report the CD/DVD not mounting issue was because I have some unusual IDE/SATA hardware in my system (a Promise PDC20271 ATA controller card, a Silicon Image SATA controller card, one of the two DVD burners is IDE whilst the other is SATA), but Linux has ordinarily given me no grief whatsoever for running this odd configuration (I also have a mix of IDE and SATA hard drives and a software RAID-0, but that's another story).

Again, I must stress that this could all be a silly case of operator error (I'm good at finding these kinds of bugs ;-) ), or maybe it is a defect that needs the attention of the kernel developers... I will admit that I'm somewhat of an informal kernel tester; I've compiled and run recent (-stable) kernels for the past 3 1/2 years now (thus explaining why I like Slackware--it works well with vanilla kernels), and I've only had to report one show-stopper (Oops in 2.6.15 due to a NULL dereference in usbhid.c). Thanks again for your replies; I'll look into reporting the make oldconfig issue.
Posted Jun 12, 2008 15:23 UTC (Thu) by pr1268 (guest, #24648)
I opened a bugzilla bug (#10898) on the make oldconfig issue. Apparently this is a regression reported by Linus himself, and a patch is in the works. Make oldconfig worked fine for 2.6.24.x (as I mentioned in the bugzilla description). As for the mount(8) CD/DVD issue, well, I'll test that later this afternoon... Time to go to work... I'm still not discounting the possibility that a funky config kernel build combined with my strange mix of hardware (see above--yes, that's all one PC!) might have caused this anomaly.
Posted Jun 13, 2008 17:22 UTC (Fri) by giraffedata (guest, #1954)
> I keep on hoping that kernel development in general will start to ramp down. There cannot be an infinite number of new features out there!

Am I reading this out of context, or is Andrew taking the position that everything's already been invented?
Posted Oct 22, 2008 1:45 UTC (Wed) by sahilahuja (guest, #54826)
Is there a "green list" of hardware supported properly by the Linux kernel?

Whether hardware will be supported by Linux is a very important criterion for me whenever I buy it, and making that decision shouldn't be as hard as it is now.

If such a list became well enough known, it could spark new interest on the hardware producers' side in getting their hardware "properly" supported by the Linux kernel.

(A noteworthy effort exists from linux.com, but I still think it should be easier, centralized, and have more involvement from kernel developers.)
Andrew Morton on kernel development
"I do think that it would be nice to have a bugfix-only kernel release."
Yes, please.
Andrew Morton on kernel development
It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.
Andrew Morton on kernel development
> It may be interesting, unless kernel developers ignore the bugfix-only release and work on new features by themselves in the meantime.

It's not a matter of making developers do something else. It's a priority thing. Most developers work both on new features and on bugfixes. Sometimes bugs are exposed as the code is modified to include new features.
Andrew Morton on kernel development
I preferred the odd/even system we had before 2.6.

I also gave up on reporting kernel bugs. Usually I am the only person with that bug and hardware configuration and nobody will fix it. This is not specific to the kernel though. I think I never got any of the bugs which I reported to Fedora, Red Hat or GNOME fixed.

Two other things: is the kernel bugzilla used at all? Are there any tests, like unit tests, to catch regressions for the kernel? Both are pretty standard for any other open source project nowadays.
Andrew Morton on kernel development
> I also gave up on reporting kernel bugs.

I'm sorry to hear that. I know that reporting bugs is a lot of work.

> Usually I am the only person with that bug and hardware
> configuration and nobody will fix it.

If no one else really has that HW, then there could be lots of reasons:

1) They don't care - many developers don't care about parisc, sparc, 100VG or tokenring networking, scaling up or down (embedded vs large systems), etc.

2) They don't have documentation for the offending HW.

3) No one else was able to reproduce the bug and it's not obvious what is wrong.

> This is not specific to the kernel though. I think I
> never got any of the bugs which I reported to fedora,
> red hat or gnome fixed.

Before someone else suggests it, maybe the way the bugs are reported has something to do with the response rate? There are some good essays/resources out there on how to file useful bug reports. I don't want to suggest yours are not useful, since I've never seen one (or don't know if I have). It's just that when you mention problems across all open source projects, I wonder.

> Two other things: is the kernel bugzilla used at all?
> are there any tests like unit tests to catch regressions for the kernel?
> both are pretty standard for any other open source project nowadays.

Agreed. But to be clear, the kernel is a bit different from most open source projects, since it controls HW and lots of buggy BIOS flavors.

(1) I'm using bugzilla.kernel.org to track tulip driver bugs. Not everyone is doing that. It has helped that akpm has (had?) funding (from google?) for someone to help clean up and poke maintainers about outstanding bugs. Despite not everyone using it, it's still a better tracking mechanism than sending an email to lkml. Do both: email to get attention, and bugzilla to track details. But also send bug reports to topic-specific lists, since it's more likely people who care about your HW will notice the report.

(2) Not that I'm aware of. The kernel interacts with HW a lot, and it's very difficult to emulate or "mock" that interaction. Not impossible, just hard, and the emulation almost never can capture all the nuances of broken HW (see drivers/net/tg3.c for examples). Secondly, we very often can only test large subsystems, or several subsystems at once: e.g. a file system test almost always ends up stressing the VM and IO subsystems, and networking stresses the DMA and sk_buff allocators. UML and other virtualization of the OS make it possible to test some subsystems w/o specific HW. However, there are smaller pieces of the kernel which can be isolated and tested: e.g. bit ops (i.e. ffs()), resource allocators, etc. It's just a lot of work to automate the testing of those bits of code. But this is certainly a good area to contribute to if someone wanted to learn how kernel code works (or doesn't :)). (A minimal sketch of such a test appears below.)

For testing subsystems, see autotest.kernel.org and http://ltp.sourceforge.net/. autotest is attempting to find regressions during the development cycle.
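As a concrete (and deliberately tiny) illustration of the isolated-test idea above: a userspace sketch that exercises the C library's ffs() against an obvious reference implementation. Substituting the kernel's own bitops headers for the libc call is the "lot of work" part; everything here assumes a typical 32-bit int target.

```c
/* ffs_test.c - a minimal userspace sketch of the kind of unit test
 * described above: exercise an isolated find-first-set primitive
 * against a slow but obvious reference. Tests the C library's ffs();
 * wiring in the kernel's own bitops instead is left as the hard part.
 * Assumes a 32-bit int, as on the usual targets. */
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <strings.h>    /* ffs() */

/* Reference: 1-based index of the least significant set bit,
 * 0 if no bit is set. */
static int ffs_ref(int x)
{
    unsigned int u = (unsigned int)x;
    for (int i = 0; i < 32; i++)
        if (u & (1u << i))
            return i + 1;
    return 0;
}

int main(void)
{
    /* The edge cases first. */
    assert(ffs(0) == 0);
    assert(ffs(1) == 1);
    assert(ffs(INT_MIN) == 32);     /* only the sign bit set */
    assert(ffs(-1) == 1);           /* all bits set */

    /* Every single-bit value plus its neighbours. */
    for (int i = 0; i < 31; i++) {
        int v = 1 << i;
        assert(ffs(v) == ffs_ref(v));
        assert(ffs(v | 1) == ffs_ref(v | 1));
        assert(ffs(v - 1) == ffs_ref(v - 1));
    }
    printf("ffs: all checks passed\n");
    return 0;
}
```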
Andrew Morton on kernel development
It seems that autotest.kernel.org doesn't exist...
Andrew Morton on kernel development
Yes - I meant http://test.kernel.org. Sorry about that.
Andrew Morton on kernel development
I think there's a substantial difference in the way he phrased the suggestion here from what I've seen before. People tend to think of a bugfix-only release as one in which the mainline only merges bugfixes. Simply making that policy would almost certainly lead to no more bugfixes than usual, and twice as many features hitting the following release window.

On the other hand, if the process were driven from the other end, it might work: spend some period collecting a lot of unfixed bugs, and saturate developers' development time with them, and, in the cycle after that, there ought to be a lot of bugfixes and no new features, simply because all that will have matured at the merge window will be bugfixes.

So, if there were a period where there was a campaign to collect long-standing bugs and regressions against non-recent versions, with the aim of having all of these get resolved in a particular future version, as the main goal for that release, I think that would be useful.
Andrew Morton on kernel development
I've been bitten by some bugs earlier in the 2.6 series, but I have not had any trouble since around 2.6.18, I believe. It may be luck, it may be hard work from Andrew and everyone else involved. Thank you, everyone!
Sometimes it is depressing
Sometimes I get depressed when thinking about the kernel. Mostly because I feel powerless to affect it in any way - I can't sponsor somebody to work on fixing bugs (that would be the ideal case) and unfortunately in most cases I don't have the expertise to fix bugs myself.

For example, only recently I discovered to my utter amazement that USB 2.0 still doesn't work well! I tried to connect a simple USB->serial converter and it started failing in mysterious ways - e.g. it would work 80% of the time, but then there would be a lost byte, etc.

There are workarounds (disabling USB 2.0 from the BIOS, unloading the USB 2.0 modules, using a USB 1.1 hub, etc), but it is depressing that USB 2.0, which is on practically 100% of all machines, doesn't work. Of course it works nicely under Windows.

I eventually dug out a couple of messages from Greg KH explaining that it has been a known problem for a long time (I don't remember the exact details), but there is simply not enough interest in fixing it.

This is *not* an issue of undocumented hardware!

I can't really complain, since I am not paying for Linux, but it is ... I already said it ... depressing.
Sometimes it is depressing
You don't have to sponsor developers; just send them the misbehaving hardware. Chances are good that if it's useful hardware, it'll get fixed.
Sometimes it is depressing
I am afraid it is not that simple.

I am sure that there isn't a single developer without a USB 2.0 PC, so there is no point in sending them anything. USB 2.0 hubs can be bought for about $30 (and PCs have hubs built in anyway); add another $10 for a USB->serial converter. I don't mind spending that if it would improve the kernel.

As I mentioned, this is not a case of undocumented or expensive hardware. The USB 2.0 kernel subsystem is apparently not quite ready and it can't handle USB 2.0 hubs. At least that is my understanding - I could be wrong.

Even assuming that it made sense to send hardware, where should I send it?
Sometimes it is depressing
I *highly* doubt this is a USB 2.0 host problem. More likely, it's a problem w/ the specific USB device that you're using, or a host bug that's only triggered by your USB device. There are plenty of buggy USB devices out there.

I've used plenty of USB 2.0 devices with no problems. I've also used USB serial adapters with no problems at all. However, your specific USB serial adapter is clearly problematic, and that's not something that other people are likely to see unless they have the same hardware that you have.
Sometimes it is depressing
The device is fine. The USB converter uses the Prolific chip, which as far as I can tell is one of the most common ones and highly recommended for Linux. I have several different converters using it, including a $350 industrial 8-port one. They all fail (also on machines with different USB chipsets) as long as USB 2.0 is enabled. The failure is fairly subtle, so it is not always immediately obvious.

Needless to say, all converters work flawlessly under Windows ...

See this post: http://lkml.org/lkml/2006/6/12/279

To quote from further down the thread:

"Yeah, it's a timing issue with the EHCI TT code. It's never been "correct" and we have had this problem since we first got USB 2.0 support. You were just lucky in not hitting it before"

BTW, I last tried this with a fairly recent kernel (2.6.22).
Sometimes it is depressing
Eh, I have that chip too.

I don't know if it's got anything to do with Linux (my understanding is that the chip asks to be polled over USB every millisecond, and there are only 1000 frames that can go over the USB bus per second, so that device won't work if it has to share the USB bus with anything else).

There is an easy workaround: plug this device into a port where it won't have to share the bus with any other device. I.e. if you have two USB ports on your machine, plug the Prolific chip into one of them and everything else into a hub on the other port.

I have no idea if things are better in Windows; I thought it was an issue with the USB device itself.

BTW, did you try the USB_EHCI_TT_NEWSCHED thing discussed in that thread?
Sometimes it is depressing
I am fairly certain the problem is not related to sharing the USB bus. I had four of those converters connected to an ordinary USB hub working 100% reliably, as long as USB 2.0 was disabled.

Plus, you can buy a fairly expensive (hundreds of $) multi-port converter which internally is nothing more than a couple of cascaded USB hubs and pl2303 chips. I hope that they wouldn't be selling such devices if the underlying chip was fundamentally broken.

Lastly, it all works peachy in Windows.

I tried USB_EHCI_TT_NEWSCHED (it is included in 2.6.22), but it didn't fix it. Alas, I didn't have the chance to dig too deep (and I am not a USB expert, although I have done kernel programming) - sometimes it took many hours to reproduce the errors, and using USB 1.1 solved my immediate problem.

When I saw Greg KH's explanation that there are problems in the USB 2.0 implementation known for years, I lost my hope of improving the situation constructively.

Perhaps I should pick it up again. What is the best forum to report this problem? Apparently not the kernel Bugzilla? :-)
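A minimal sketch of the kind of self-contained reproducer such a report could attach; it assumes a pl2303 appearing as /dev/ttyUSB0 with its TX pin wired back to RX, and the device path, baud rate and pattern are illustrative choices, not details from this thread:

```c
/* loopback_check.c - minimal sketch of a byte-loss checker for a
 * USB-serial converter with TX wired back to RX (loopback plug).
 * The device node and baud rate below are assumptions for the
 * example. Writes a rolling byte pattern and verifies that every
 * byte comes back, in order. */
#define _DEFAULT_SOURCE 1   /* for cfmakeraw() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/ttyUSB0";        /* assumed pl2303 node */
    int fd = open(dev, O_RDWR | O_NOCTTY);
    if (fd < 0) { perror(dev); return 1; }

    struct termios tio;
    if (tcgetattr(fd, &tio)) { perror("tcgetattr"); return 1; }
    cfmakeraw(&tio);                         /* raw 8-bit path, no echo */
    cfsetispeed(&tio, B115200);
    cfsetospeed(&tio, B115200);
    if (tcsetattr(fd, TCSANOW, &tio)) { perror("tcsetattr"); return 1; }

    unsigned long long total = 0;            /* bytes verified so far */
    unsigned char next = 0;                  /* next byte we expect back */

    for (;;) {
        unsigned char out[256], in[256];
        for (int i = 0; i < 256; i++)
            out[i] = (unsigned char)(total + i);
        if (write(fd, out, sizeof(out)) != (ssize_t)sizeof(out)) {
            perror("write");
            return 1;
        }
        for (size_t got = 0; got < sizeof(in); ) {
            ssize_t n = read(fd, in + got, sizeof(in) - got);
            if (n <= 0) { perror("read"); return 1; }
            got += (size_t)n;
        }
        for (int i = 0; i < 256; i++, next++) {
            if (in[i] != next) {             /* a lost or corrupted byte */
                fprintf(stderr, "after %llu bytes: expected %02x, got %02x\n",
                        total + i, next, in[i]);
                return 1;
            }
        }
        total += 256;
        if (total % (1024 * 1024) == 0)
            printf("%llu bytes OK\n", total);
    }
}
```

Left running for the 24-hour stretches described above, a single lost byte shows up as a mismatch at an exact byte offset - something a maintainer can act on far more easily than "it fails after many hours".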
Sometimes it is depressing
You'll note the wording GregKH used: "should be fixed", etc. Mark Lord had to report back that it was still broken. If GregKH actually had the hardware available to reproduce it, development and fix time would be much quicker.

As far as bugs that are known for years: this is free software. The only people that are going to fix it are ones that are either paid to do so, or have an itch to scratch because their hardware is not working correctly. The fact that this is a corner case, and has an easy workaround, makes it pretty clear why it has taken so long to get it fixed. I fail to see what's so depressing.

It's hard enough reproducing bugs when you have the hardware, but not having it available makes fixing bugs many times more difficult (and kills much of the motivation to do anything about it).
Sometimes it is depressing
I don't think that this is a corner case at all. It is unacceptable to have random devices fail subtly and quietly when connected to a standard bus. Especially when such a fundamental and established interface as USB is concerned.

It is disappointing that the kernel has known bugs of this nature which are not being addressed. The problem is not so much that my particular device doesn't work.

The depressing part is that it _really_ is nobody's fault. The development model is what it is. There is nothing better and there is nothing we can do about it.

Red Hat is not going to pay for fixing this because they don't care about desktops with random hardware. Canonical is not going to fix it because they don't contribute that much to the kernel. Nobody is going to pay for fixing it.

There is nothing to be done. That is depressing.
Sometimes it is depressing
It *is* a corner case. A device is plugged into a USB1.1-only hub plugged into a USB2 port. From the thread, my assumption is that the kernel (ehci) thinks 2.0 is supported because the host supports it, and thus attempts to talk 2.0 to the device. The hub in the middle screws things up. Bypass the USB1.1 hub, and things work just fine. If that's _not_ what you're doing, then you are seeing a different bug.
Sometimes it is depressing
This is not what is happening. The problem occurs when a USB 1.1 device is plugged into a USB 2.0 hub. AFAICT, this matches the description of the bug referenced in Greg KH's post.

This is a frequent case - there are many USB 1.1 devices, but at the same time all hubs that can be purchased right now are 2.0.

I suspect that most people are not seeing the problem simply because few people actually use hubs. Since the problem is subtle - a couple of lost bytes every couple of hours - most people wouldn't recognize it anyway.
Sometimes it is depressing
Sometimes it's depressing to see how many posts some people bother to write about their problems to a random forum, when with the same amount of energy one could have filed a bug in bugzilla.kernel.org ...
Sometimes it is depressing
It is even more depressing when the Slashdot trolls start posting on LWN.

First of all, this is not some random forum. Secondly, had you bothered to read the messages, you'd have seen that the bug is already known. Lastly, in case you missed it, the subject is not my specific problem, but the philosophical futility of reporting bugs in something free.

Incidentally, it appears that you don't even realize how much effort and time it takes to make a useful bug report. It is ironic that some people find it more acceptable to pollute bugzilla with useless whining complaints, rather than discussing it in a forum.
Sometimes it is depressing
Once again: no. The original reporter says that when he plugs the pl2303 device directly into the USB2.0 hub, it works just fine. It's only when it goes through a USB1.1 dock/hub that it fails.

So, once again: YOU ARE TALKING ABOUT SOMETHING COMPLETELY DIFFERENT FROM THE LINK YOU POSTED.

Most people aren't seeing the problem because most USB1.1 devices work just fine in USB2.0 hubs. The problem described in the link you supplied is a corner case (some weird built-in serial adapter in a hub/dock thingy). The problem you've described sounds like it's specific to some portion of your hardware.

I dug through my hardware pile and found a pl2303. It works just fine in a USB2.0 port. If you want to moan about how depressing kernel development is, that's fine; but claiming that it's hopeless when you refuse to get involved is just silly.
Sometimes it is depressing
What I find far more depressing is when you've paid a company several million dollars for a support contract and they still don't fix your bugs (I've seen this happen several times). Every software project--free or not--has finite developer resources. Some bugs will take a certain amount of time to fix no matter how many dollars or people are thrown at the issue.

You're making it sound like this problem only exists for Linux when I've seen it far more often with proprietary software. The problem is fundamental and will not go away, but I've found Linux to do a better job of handling it than anything else I've seen. It's still not perfect, but it can't be. The only way you can be guaranteed to get your problem fixed is to have the ability to fix it yourself. With Linux, you theoretically have that option. With proprietary software, you don't.
Sometimes it is depressing
> Bisecting bugs doesn't require deep knowledge. It requires a fast computer and some time to test kernels for the problem.

That is only true if:

a) You've got a simple test that -always- reproduces the bug on one kernel.

b) You're aware of at least one kernel where the bug does NOT happen.

Most of the (suspected!) kernel bugs I've run into in my years of running Linux (since 1.2.13) have fulfilled neither of these two.
My issues with kernel development
Yes - Andrew definitely deserves the kudos he gets.

(1) isn't really a bug, since "make oldconfig" is "expected behavior". Try "make menuconfig" and see if a menu-driven config tool works better for you. Too often, I find the "Help" text useless, and I'm not a kernel newbie. Updating those to be meaningful (e.g. spelling out uncommon acronyms) would help a lot of people.

(2) is a regression and sounds like it's bisectable. In fact, a recent bug on linux-scsi sounds similar to this, though it might not be the same: http://marc.info/?t=121229388800003&r=1&w=2

So reporting the problem to linux-ide and/or linux-scsi might be a good starting point. You don't have to report problems to LKML, since there are plenty of topic-specific mailing lists that have less traffic. See the linux-2.6.25/MAINTAINERS file for the various mailing lists. If you post to the wrong list, people generally will redirect you to the right one. As Andrew suggests, be persistent.

Lastly, regarding "being a newbie", try http://kernelnewbies.org which is one of many starting points. Usually any help with documentation, code review, or testing is something anyone with a computer can do - especially if you are finding bugs, willing to report them and test out (likely bad) theories on the bug. This interaction will lead to learning lots of new stuff.
Please report it in bugzilla
It's possible to have user error with "make oldconfig" on the first try (like getting the wrong config into it), but if you can reproduce it, it's worth reporting. (And if you can't reproduce it, you'll have a correct config...)

There was someone recently reporting problems with mounting optical media if he waited more than 30 seconds after inserting it. It might be related, or it might be a coincidence, but you might want to look into http://lkml.org/lkml/2008/6/6/170. The thread is kind of inconclusive, but you might be able to help if you've got a different failure pattern (you need to wait, while other people need to hurry), but also have a problem with timing and optical media insertion that came up between 2.6.24 and 2.6.25. It's got things to try, anyway.
Does -staging obsolete -mm?
It seems like there's a pretty large overlap between the new -staging tree and much of what -mm does. Thus the question in the subject. =)
Andrew Morton on kernel development
Maybe a bit of both?

I've seen discussion of this theory before on LWN, along with amazement that things hadn't slowed down yet. There are a number of dynamics in play here, of which I'll only consider a couple.

The big one is that for many years, Linux was playing catch-up; that is, the state of the art in kernel technology was so far ahead of Linux that it had to well more than double-time it in order to have any hope of catching up in something like computer-evolution-reasonable time. That Linux was actually doing it surprised a LOT of people, and was a major point behind the SCO suit -- they thought /surely/ IBM or /somebody/ must be "cheating", in order for Linux to be evolving as incredibly fast as it was toward what they were concerned about at that point: a real "enterprise" kernel. Well, we all know where /that/ ended up -- there was little if any cheating going on; it was real "organic" growth, but at a speed nobody could really account for according to previous models, because the Linux model really /is/ different. At the same time, however, it /did/ make us more careful, prompting the introduction of better origins documentation and signed-off-by.

In theory, while various (now) peer kernels may still be more mature than Linux in some areas, that space is largely gone -- we're caught up, or close enough that the speed of change should be slowing down toward that of the more mature kernels as we match them and now forge into new territory on our own. However, this has been predicted since the late 2.4.teen kernels at least, and it just didn't appear to be happening. In hindsight, we weren't as mature as we thought we were back then (a common observation in life, I might add, as one advances in years =8^S) and we still had more growing to do.

Since the 2.6 series, however, there /have/ been some observable changes toward this end. While the raw volume of change hasn't really slacked off yet, the "scariness" of the changes has been decreasing. The first big change was the switch away from the odd/even cycle. At first, people thought it'd be relatively temporary - a couple of years, possibly, before something "big and disruptive" enough to all systems would really need an alternate development tree in which to coordinate all the changes, forcing the opening of a new official development tree. That hasn't happened. We've managed due both to somewhat smaller, less system-wide-disruptive changes, and to an accommodation of more medium-scale changes into the ongoing stable kernel. That this arrangement has continued to work is an indication of relative maturity, both in featureset and in development team and method. The disruptive scale has been reduced both in absolute terms and because we are better able to cope with it in stride than ever before.

That was the first big indication the kernel was maturing, although raw change continued at, if anything, an increased pace. A second, more recent indication, which may or may not prove out over time, is the lack of "scariness" in now really a couple of kernels in a row. If the above change could be said to mark the transition from large to medium-large disruption and the ability to handle it, this new one /may/ be the first indication of the next level, moving from medium-large to simply medium-sized change. It should be noted that while two kernels in a row is somewhat notable, it does not a safe trend make as yet. If we see this continue through the end of the year - say a couple more kernels in a row, for four, or only three but with only one scary one and then back to "medium" - then it's probably safe to say there's a marked trend.

However, that's nowhere near suggesting that everything has been invented now - only that we're finally catching up with the state of the art sufficiently, while at the same time enhancing our ability to cope in-stride with what might formerly have been disruptive, so that things will normally slow down a bit as it becomes /us/ doing the pioneering, breaking the new ground.

Put in the large > med-large > medium language above, that's basically saying we might /possibly/ expect one more notch, to medium-small in the ordinary case, before we settle into a continuing sustainable pace as the new pioneers, where progress is much more hard-fought because nobody's been there before. I don't believe it will slow down much beyond that - indeed I'd hope it doesn't - nor do I believe many people are suggesting that it will. Even then, there are likely to be occasional clusters of difficulty and increased change, back into the medium to medium-large zone for a kernel or three, before settling back into the medium-small zone. However, the prediction is that as we increasingly do our own pioneering, the average will drop to no higher than medium, with the outliers being only medium-large, and large-to-hugely-disruptive changes will be a thing of the past, since on the forefront progress tends to be much more incremental.

That's my view from this observation point. =8^)

Duncan
Andrew Morton on kernel development
Great interview. Thanks Andrew for all the hard work.
A "green list" of hardware?
Right now, no such centralized list shows up on the first page of a Google search for hardware supported by the Linux kernel.