The opening session at the 2007 Linux kernel developers summit was intended
to be a forum in which distributor kernel maintainers could talk about how
the development community could better support them. That topic did come
up, but not before a lengthy discussion of what the community would like to
see from the distributors. In particular, the panel members (Greg
Kroah-Hartman, Kyle McMartin, Dave Jones, and Deepak Saxena) got an
earful on the difficulties created by enterprise distributions and their
stability policies.
There is some real unhappiness in the community about the long lead times
required to get code into enterprise kernels. These lead times are the result
of pressure from customers, who don't want things to change in their
deployed systems - unless the change is a feature they need, of course.
They want kernels to be certified by the distributors, and by the vendors
whose products run on top of those kernels. If, as has often been
expressed, we want to shorten lead times and have enterprise customers
running something closer to current kernels, we have to build confidence
that kernel updates in stable environments are safe.
One much-criticized distributor practice is the policy of avoiding kernel
ABI changes over the lifetime of a distribution release. The mainline
kernel, of course, has no such policy, so imposing it on enterprise kernels
puts a big drag on the progress of those kernels - and strongly limits
things which can be subsequently added. It makes the distributors and
vendors do a lot of backporting work which would be better spent improving
current kernels - and the releases which result from that backporting work
have as much change as an updated kernel would.
Still, raising confidence in current kernels is not going to be an easy
task. Dave Jones noted that, every time he rebases the Fedora kernel, at
least one user-space package breaks. Deepak Saxena said that there are
very similar issues in the embedded world: customers generally don't want
completely new kernels for systems which have been certified with older
software. He also noted that performance regressions can be a big problem
with kernel upgrades; these kinds of problems often do not come up until
a kernel has been put into a specific real-world situation, and they are
very hard to test for. Partly in response to this problem, some big enterprise
customers are increasingly running their internal regression tests on
current mainline kernels. The tests generally cannot be shared, but the
regressions which turn up can be reported; this kind of work can, over
time, help to build confidence in mainline kernels.
Companies like Intel are also regularly running regression tests and
reporting any bugs found. There are concerns, though, that all this
testing can only help so much. It tends not to be visible to users, and,
thus, fails to help build confidence. It was said that this testing
happens too late, that more testing should be done on the -mm kernels. And
it is very hard to find workload-dependent problems with regression
testing, so there may always be surprises lurking for those who deploy the
wrong kernel.
The discussion moved on to the topic of out-of-tree patches carried by the
enterprise distributors. Many of these are patches which seem doomed to
never find their way into the mainline, mostly because significant problems
have been identified with them. Examples include utrace (currently shipped
by Red Hat) and AppArmor (in SUSE kernels). These in-house developments
are seen as a problem by some, though others see such code as an issue for
the distributors and their customers only.
Then things went quickly back to attempts to impose a stable ABI on
enterprise kernels. There were assertions that this policy exists to make
life easier for purveyors of binary-only modules, but there is more to it
than that. Episodes of working code breaking when recompiled - perhaps due
to a change of compiler - are not unheard of. Nervous customers want to be
able to continue to run working code unchanged, without even the need to
rebuild it.
Ingo Molnar pointed out that system upgrades are a very emotional
decision. There is the ever-present fear that an upgrade could break a
previously-working system, but there is also the gratification that comes
with a welcome new feature. In general, the enterprise kernel problem can
be mitigated by reducing the level of fear that customers feel. One thing
that could be done to that end is to create an option for enterprise Linux
customers to run current mainline kernels if they wish.
Toward the end of the allotted hour, the discussion actually turned to the
question of what the community can do to help the distributors. Dave Jones
complained about his list of bugs - 1500 of them - with nothing being done
about them. When developers do respond, they often ask reporters to try
current -rc kernels in the hope that the problem has gone away. There is
not enough effort going into actually figuring out what the cause of the
bug is.
In some cases, the situation is bad enough that some developers fear that
we are losing the users who would otherwise be some of our best testers.
Perversely, the result of this is fewer bug reports - but that is not the
same as fewer bugs. What is really needed, says Ingo Molnar, is some sort
of metric for how much testing of kernels is really happening. We need
positive reports as well as bug reports. Then, if the number of testers
drops, it will be immediately apparent that there is a problem.
Kyle McMartin said that the biggest problem is drivers - there are a lot of
drivers that users want, but which are not in mainline kernels. Squashfs
was mentioned as a perennially out-of-tree module that everybody uses.
There is a fear that the bar for the merging of drivers has been set too
high, causing developers to choose to just keep their code out of the
mainline. The time required to get code into the mainline is
nondeterministic and the process is scary. Maybe, especially in the case
of drivers, it should be made easier to get code into the mainline.
This was a controversial idea; a really bad driver can create obscure
problems all over the kernel. But the fact of the matter is that an ugly
driver is more likely to be fixed in-tree than out. Linus asserted that
anytime a distributor ships an out-of-tree driver, the process has failed.
We are failing our users, missing an opportunity to get the drivers
improved, and driving away testers who need those drivers just to get
going. So, if Linus has his way, we may see drivers having an easier time
getting into the mainline in the future.