Brief items
The current development kernel is 3.0-rc6, released on July 4. It has a new Intel isci driver which adds a significant
chunk of code but, otherwise, it's basic fixes. "It's getting to the
point where I'm thinking I should just release 3.0, because it's been
pretty quiet, and the fixes haven't been earth-shakingly exciting."
See the
full changelog for all the details.
Stable updates: no stable updates have been released in the last
week, and none are in the review process as of this writing.
The number of times I have to explain to industrial and business
customers that Linux doesn't suck but the defaults are stupid is
astounding, and they then wonder why either the authors or their
vendor is a complete and utter moron.
--
Alan Cox
I have to say, I look over these patches and my mind wants to turn
to things like puppies. And ice cream.
--
Andrew Morton
Your changelog fails the basic test by mentioning "corner case" simply
because the whole futex code consists only of corner cases.
--
Thomas Gleixner
I realize that it's annoying to spend a lot of time on a specific
implementation and then see competing code get
merged. Unfortunately, this happens all the time, and the code we
merge is often not the one that has had the most effort spent on
it, but the one that looks most promising at the time when it gets
merged.
--
Arnd Bergmann
And quite frankly, Christoph Hellwig has now _twice_ said good
things about that driver, which is pretty unusual. It might mean
that the driver is great. Of course, it's way more likely that
space aliens are secretly testing their happy drugs on
Christoph. Or maybe he's just naturally mellowing.
--
Linus Torvalds
By Jonathan Corbet
July 7, 2011
The
poll(),
select(), and
epoll_wait() system
calls all allow an application to ask the kernel whether I/O on any of a
list of file descriptors would block and, optionally, to wait until one or
more descriptors become ready for I/O. Internally, they are all
implemented with the
poll() method in the
file_operations
structure:
unsigned int (*poll) (struct file *filp, struct poll_table_struct *pt);
This function returns a value indicating whether non-blocking I/O is
currently possible; it is also expected to add a wait queue to the "poll
table" (pt) passed in. If no file descriptors are ready for I/O,
the calling process will block on all of the accumulated wait queues.
poll() has long implemented an optimization: if an early
poll() function indicates that I/O is possible, the kernel knows
that it will not be blocking the calling process. So it stops accumulating
wait queues; this state is indicated by passing a null pointer for
pt. That all works well except in one case: what if a driver
needs access to some of the information stored in the poll table?
In particular, the driver might want to know whether the caller is
interested in readiness for read or write access, or whether it is looking
for exceptional events. For example, if the application wants to read from the
descriptor, the driver may need to fire up some device machinery to make
that possible. This situation has not come up very often, but it does tend
to affect Video4Linux drivers. In response, Hans Verkuil has posted a patch slightly changing the way
poll() works.
With the patch, the poll table is never passed as null; instead, the "we
will not be blocking" case is marked internally. So the set of events
requested by the application is always available; Hans has provided a
helper function to access that information:
unsigned long poll_requested_events(const poll_table *p);
There has been little discussion of the patch; it doesn't seem like there
is any real reason for it not to go in for 3.1.
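To make the effect of the change concrete, here is a minimal user-space model in C; it is not kernel code. The names model_poll_table, model_requested_events(), model_device, and model_driver_poll() are hypothetical stand-ins, and treating a null table as "any event requested" is an assumption of this sketch rather than something stated in the patch posting. It shows a driver starting its capture machinery only when the caller has asked about readability.

```c
#include <assert.h>
#include <stddef.h>
#include <poll.h>               /* POLLIN and POLLOUT, as seen from user space */

/* Hypothetical stand-in for the kernel's poll table; not the real layout. */
struct model_poll_table {
	unsigned long requested;    /* events the caller asked about */
};

/* Models the new helper. Assumption: a null table means "any event". */
static unsigned long model_requested_events(const struct model_poll_table *pt)
{
	return pt ? pt->requested : ~0UL;
}

/* Hypothetical capture device. */
struct model_device {
	int streaming;              /* nonzero once capture has been started */
	int frame_ready;            /* nonzero when a frame can be read */
};

/* Driver poll method: start capture only when the caller wants to read. */
static unsigned int model_driver_poll(struct model_device *dev,
				      const struct model_poll_table *pt)
{
	unsigned long wanted = model_requested_events(pt);

	assert(dev != NULL);
	if ((wanted & POLLIN) && !dev->streaming)
		dev->streaming = 1;     /* fire up the device machinery */
	return dev->frame_ready ? POLLIN : 0;
}
```

In this model, a caller interested only in POLLOUT leaves the device idle, while a caller asking about POLLIN causes streaming to start even when the "will not block" shortcut would previously have hidden that information from the driver.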
Kernel development news
By Jake Edge
July 7, 2011
Patches to expand the functionality of seccomp ("secure computing") have
been floating around
for two years or more without making any real progress into the mainline.
There are a number of projects that are interested in using an expanded
seccomp, but the patches themselves seem to have run into a "catch-22"
situation. There are conflicting visions of how the feature should be
added, without a clear sense that any of the options will be acceptable
to all of the maintainers involved. That leaves a useful
feature without a clear path into the kernel, which is undoubtedly
frustrating to some.
We first looked at seccomp sandboxing a
little over two years ago, when Adam Langley posted patches that would
provide a way for a process to restrict the system calls that it (and
its children) could make. The idea is to allow processes to sandbox themselves by choosing
which system calls are
available, rather than being restricted to just the four hard-coded system calls
that the existing seccomp implementation allows (read(), write(), exit(), and
sigreturn()). The impetus behind Langley's patches was to provide
an easier mechanism for sandboxing processes in the Chromium web
browser—and to eventually remove the somewhat convoluted sandbox that Chromium
currently uses on Linux.
At the time of that proposal, Ingo Molnar suggested that Ftrace-style filtering would
make the expanded seccomp much more useful. That idea wasn't universally
hailed at the time, and the seccomp feature went mostly dormant until it
was restarted by Will Drewry back in April.
Drewry took Molnar's suggestions and implemented a version of seccomp that
would allow system calls to be enabled, disabled, or filtered with simple
boolean expressions (e.g. sys_read: (fd == 0)).
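The enabled/disabled/filtered distinction can be illustrated with a small user-space model in C. This is only a sketch: the rule table, the rule_kind values, and call_permitted() are hypothetical names invented here, the filter is reduced to a single "first argument equals a value" test, and denying any system call without a rule is an assumption of the sketch. It is not the interface or the filter syntax from Drewry's patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-system-call policy: off, on, or on with a filter. */
enum rule_kind { RULE_DENY, RULE_ALLOW, RULE_ARG0_EQ };

struct rule {
	const char *name;       /* e.g. "sys_read" */
	enum rule_kind kind;
	long value;             /* compared with the first argument for RULE_ARG0_EQ */
};

/* Returns 1 if the call may proceed; calls with no rule are denied. */
static int call_permitted(const struct rule *rules, size_t n,
			  const char *name, long arg0)
{
	assert(rules != NULL && name != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(rules[i].name, name) != 0)
			continue;
		switch (rules[i].kind) {
		case RULE_ALLOW:
			return 1;
		case RULE_ARG0_EQ:      /* models "sys_read: (fd == 0)" */
			return arg0 == rules[i].value;
		case RULE_DENY:
		default:
			return 0;
		}
	}
	return 0;
}
```

With a table holding { "sys_read", RULE_ARG0_EQ, 0 } and { "sys_exit", RULE_ALLOW, 0 }, the model permits a read from descriptor 0 and an exit, and refuses a read from any other descriptor or any unlisted call.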
While Molnar was pleased with the progress, he didn't think it went far enough and suggested
that a perf-like interface be used instead of prctl(), which is
used by the existing seccomp. He had some fairly wide-ranging ideas that
using perf events in a more active way could lead to better kernel security
solutions than the existing Linux Security Modules (LSM) approach
provides. Once again, this idea was not universally popular; the LSM
developers, in particular, were not enamored of it.
Nevertheless, Drewry implemented
a proof of concept along the lines of what Molnar had suggested. That
led to complaints from a somewhat
surprising direction, as both Peter Zijlstra and Thomas Gleixner strongly
objected to perf being used in an active role. Their responses didn't
leave room for any
middle ground, with Zijlstra, who is one of the perf maintainers along with
Molnar, saying that he and Gleixner would
NAK "any and all patches that extend perf/ftrace beyond the passive observing role".
All of which led Drewry, who must be feeling a bit whipsawed at this point,
to return to the patchset that seemed to have the most support: using
Ftrace/perf-style filters, but maintaining the prctl() interface
that is currently used by seccomp. Linus Torvalds had expressed some skepticism that the feature would have any
real users, but Drewry outlined how
it would be used by Chromium, and several other developers spoke up in
favor of expanding seccomp, saying that QEMU, Linux containers (LXC), and
others would use the feature. Those endorsements, along with the
resolution of some other
technical concerns, were enough for Torvalds to remove his
objection to the feature. But, as might be guessed, Molnar is still
not satisfied with the approach.
When Drewry reposted the patchset toward
the end of June, and asked what the next
steps were, Molnar noted that his concerns
were not being addressed: "You are pushing the 'filter engine' approach currently, not the
(much) more unified 'event filters' approach." But Drewry is trying
to find a balance between the needs of the potential users, other
maintainers, and Molnar's requests, which is somewhere between difficult
and impossible:
Based on the support from potential API consumers, I believe there is
interest in this patch series, and I worry that just like with the
last two attempts in the last two years, this series will be relegated
to the lwn archives in anticipation of a future solution that uses
infrastructure that isn't quite ready. I'm trying to approach a
problem that can be addressed today in a flexible, future-friendly
way, rather than try to open up a larger cross-kernel impacting patch
series that I'm unsure of exactly how to integrate sanely and don't
know that I can commit to doing.
But Molnar is adamant that the "filter
engine" approach is short-sighted, citing the diffstats of the various
implementations as evidence:
Not doing it right because "it's too much work", especially as the
trivial 'proof of concept' prototype already gave us something very
promising that worked to a fair degree:
bitmask (2009): 6 files changed, 194 insertions(+), 22 deletions(-)
filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)
are pretty hollow arguments to me. That diffstat sums up my argument
of proper structure pretty well.
But, as Drewry points out, there is still a
lot of work to be done to get beyond the proof-of-concept and to a fully
fleshed-out solution. Given that the approach has already received several
NAKs, the payoff from doing all of that work is very uncertain. Drewry would
like to see the feature be available soon, and is concerned that working on
the larger problem is likely to delay that significantly, if it can ever
get beyond the objections: "If all the other work is a prerequisite
for system call restriction, I'll be very lucky to see anything this
calendar year assuming I can even write the patches in that time."
Molnar is undeterred, however, suggesting
that there is a path into the kernel through the tree
that he co-maintains:
Do it properly generalized - as shown by the prototype patch.
I can give you all help that is needed for that: we can host
intermediate stages in -tip and we can push upstream step by
step. You won't have to maintain some large in-limbo set of
patches. 95% of the work you've identified will be warmly
welcome by everyone and will be utilized well beyond sandboxing!
That's not a bad starting position to get something controversial
upstream: most of the crazy trees are 95% crazy.
The problem, of course, is that the 5% is the piece that Drewry and others
are most interested in seeing in the kernel (i.e. the system call
restrictions for sandboxing). So, what Molnar seems to be offering is a
fairly sizable chunk of work that could, in the end, still leave the
"interesting" part out in the cold. Molnar may be confident that he can
overcome the objections from Zijlstra and Gleixner, but Drewry can hardly
be as sanguine. He describes the problem
as he sees it:
It seems like a catch-22. There's not a perfectly clear path forward,
and anything that looks like the perf-style proof of concept will be
NACK'd by other maintainers. While I believe we could lift perf up
off its foundation and create a shared location for storing perf
events and ftrace events so that they will be inherited the same way
(currently nack'd by linus) and walked the same way (kinda), the
syscall interface couldn't currently be shared (also nack'd by perf),
and creating a new one is possible modeled on the perf one, but it's
also unclear what the ABI should be for a generic filtering system.
Both Zijlstra and Gleixner have been absent from the most recent
discussion, so it's a little hard to guess what their thoughts are. In the
absence of any kind of posting softening their stances, though, it would be
a bad idea to believe that they have changed their minds.
It's a problem that we have seen before, where a new feature is, to some
extent, held hostage to requests that a larger problem be solved. The
problem was
discussed at the 2009 Kernel Summit,
where there was agreement that those requests should be advisory in nature,
rather than demands. In this case, Molnar is not really demanding that
the bigger task be done; he is simply uninterested in taking the code
via the -tip tree unless it solves the larger problem.
It is unclear where things go from here. Drewry said that he would look at
trying to do things Molnar's way ("but if my only chance of any form of this being
ACK'd is to write it such that it shares code with perf and has a
shiny new ABI, then I'll queue up the work for when I can start trying
to tackle it"), but it may be a ways off. In the meantime, there
are various projects interested in using the feature.
If falling back to the bitmask version of the feature solves enough of the
problem for those projects, there is the possibility of trying to get that
into the kernel via another tree (e.g. the security tree). There would
undoubtedly be objections from Molnar, but if enough users lined up behind
it, that might be a reasonable approach. It would create an ABI that would need
to be maintained going forward, which is one of Molnar's objections, but it
would solve problems for Chromium and others.
Steven Rostedt suggested adding the seccomp
expansion as a
discussion item for the Kernel Summit in October, which might provide a
path forward. It's likely that most or all of the interested parties will
be there (unlike the Linux Security Summit that will be held with Plumbers
in September, which was
suggested as an alternative). While a face-to-face discussion could be
helpful, it might be a stretch to believe that the disagreement over active
vs. passive perf could be resolved that way. On the other hand, it could
lead to some kind of decree about the proper direction from
Torvalds. That could go a long way toward resolving the issue.
By Jonathan Corbet
July 5, 2011
LWN recently looked (again) at the
contiguous
memory allocator (CMA) patch set; CMA is intended to provide large,
contiguous DMA buffers to drivers without requiring that memory be set
aside for that exclusive purpose. CMA was recently
reposted with the idea that it is nearly ready
for merging. There is a clear desire to see this code get at least into
the -mm tree, even if it is not yet quite ready for the mainline. Most
reviewers are pleased with CMA; it would seem that there are very few
roadblocks remaining. Except that, as it turns out, one big obstacle
remains.
Over the years, LWN has also looked at ARM's
special memory management
challenges. Recent ARM CPUs are, like those implementing other
architectures, becoming more complex in order to improve performance. So
ARM processors can now do speculative prefetching of memory contents in
surprising ways. This prefetching works well on cached memory, but should not
be used on memory that has been marked as uncached. An additional
complication comes from the fact that virtual memory systems
can have more than one mapping for a given range of memory, and caching is
a feature of the mapping, not the memory itself. So one might well wonder
what happens if different mappings have different caching attributes. On
recent ARM processor designs, what happens is officially undefined; in
practice, it can mean problems like corrupted memory, machine checks, or
simple hangs. As it happens, kernel developers normally go out of their
way to avoid that kind of behavior.
The current CMA mechanism is used as an allocator behind
dma_alloc_coherent(), which creates a cache-coherent DMA buffer.
In the absence of bus-snooping hardware that is able to notice when a DMA
transfer changes memory, "cache-coherent" is likely to mean simply
"uncached." So CMA must, on such systems, create an uncached range of
memory to hand back to the requesting driver. That is easily done, and all
should be well...at least, unless there happens to be another mapping to
the same memory with different caching attributes.
Unfortunately, conflicting mappings can come about easily on a Linux
system. One of the first things the kernel does as it boots is to create a
"linear mapping" which provides kernel-space virtual addresses for most or
all of the memory present in the system. The kernel cannot manipulate
memory directly without such a mapping; putting as much of memory as
possible into a persistent mapping thus makes sense. On a 32-bit system,
just under 1GB of memory can be mapped this way (64-bit systems can always
map all of memory and will be able to do so for quite some time yet). This
kernel-mapped memory is called "low memory"; almost all allocations of
memory for the kernel's use come from the low memory area. Naturally, low
memory is mapped with caching enabled; to do otherwise would destroy the performance of
the system. If a region of low memory is turned into a DMA buffer with an
uncached mapping, the system will have two conflicting mappings for the
same memory and will have moved into "undefined behavior" territory.
These conflicting mappings are the reason behind ARM maintainer Russell
King's strong opposition to the merging of
CMA in its current form. He believes that the code is unsafe on ARM
systems; it should not, he says, be merged until the mapping problem has
been solved.
The interesting thing is that the existing DMA API has the same problem on
ARM; dma_alloc_coherent() uses vanilla alloc_pages() to
obtain a buffer, then changes the caching attributes before giving the
buffer back to the caller. The addition of CMA does not make ARM's DMA API
any more or less safe than it was before; it just perpetuates an existing
problem.
Russell has a patch pending for 3.1 which addresses this problem
by setting aside a chunk of memory which is never mapped into the kernel's
address space. With this memory pool available, coherent DMA mappings can
be set up without endangering the operation of the system.
The whole reason CMA exists, though, is to provide large, contiguous
buffers without the need to set aside memory; Russell's approach thus
defeats the entire purpose. The pressures which have led to the creation
of CMA will not go away anytime soon, so it seems that another solution is
needed. Arnd Bergmann has outlined two
possibilities, neither of which is entirely pleasant:
- CMA could be changed to only allocate from the high memory zone. High
memory is (by definition) not in the kernel's linear mapping, so no
other mappings should exist. The problem with this approach is that
it forces the use of high memory on all systems; ARM-based systems are
reaching the point where some of them need high memory anyway, but
that need is not yet universal. Getting enough memory into the high
memory zone to be useful could require moving the boundary and
shrinking low memory; that is not desirable because low memory is
often a limiting resource already. Even if that obstacle can be
overcome, the ARM architecture poses
unique challenges which would make a high memory implementation
hard.
- Memory that has been turned into a coherent DMA buffer could simply be
removed from the kernel's linear mapping until the buffer is no longer
needed. This approach seems simple until one remembers that the
kernel uses huge pages for the linear mapping. Splitting those huge
pages into smaller pages would increase translation lookaside buffer
(TLB) contention, reducing the performance of the system as a whole.
Compared to these alternatives, simply setting aside a chunk of memory at
boot time might not look like such a bad idea after all. CMA developer
Marek Szyprowski's plan appears to be to go
with the second of those two alternatives; he thinks that it can be done
without significantly hurting performance.
In truth, the best tradeoff will almost certainly differ from one platform
to the next. In some situations, memory will be tight enough that a
significant runtime penalty to avoid making static DMA buffers seems
worthwhile; on others, setting aside a bit of memory may not be a real
problem. So what may come of all this is a set of choices to be made
when configuring a kernel. There does not appear to be a single solution
which just works for everybody on the horizon at this time.
By Jonathan Corbet
July 7, 2011
The developers working on the initial OLPC laptop ran into an interesting
problem: the camera driver would fail to initialize if it was built into
the kernel, but it worked just fine if built as a module. That problem
still exists; it is a symptom of an issue which comes up frequently in
contemporary systems: there is no way to know at build time what
dependencies exist between different hardware units, so there is no way to ensure that
drivers are loaded in the right order. A new patch from Grant Likely tries
to solve that problem in a simple sort of way; it will probably improve the
situation, but a complete solution is still lacking.
The problem with the camera driver is a result of the fact that the
"camera" is, in reality, three devices working in concert: a DMA bridge, a
sensor, and an I2C bus connecting the two. The bridge (which plays the
role of the overall "camera driver") must locate and identify the sensor as
part of its setup routine; if the sensor does not exist, initialization
will fail. But the sensor will not exist until its driver and the I2C bus
driver have been loaded into the system. If all of the drivers are built
into the kernel, the bridge driver's probe() function may be
called first; there will be no sensor, so everything fails.
Contemporary systems - especially those of the mobile variety - are
increasingly built this way. Grant gave another example:
A "sound card" typically consists of multiple devices; one or more
codecs (often i2c or spi attached), a sound bus (often i2s), a dma
controller, and a lump of machine/platform specific code that ties
them all together. Right now the ASoC code is going through all
kinds of gymnastics make each component register with the ASoC
layer and the 'tie together' driver has to wait for each of them to
show up.
The key point to understand is that the various components that make up a
"device" may appear to be entirely unrelated at the hardware level. They
can be on different buses; some of them may be subcomponents of entirely
different devices. A general-purpose kernel has no real way to know what
the real dependencies between devices are until all of the pieces are
present and have started to recognize each other.
Grant's patch takes a simple approach to
solving this problem: drivers which are unable to initialize their devices
as the result of missing resources can request that the operation be
retried at some point in the future. That request is a simple matter of
returning -EAGAIN from the probe() function. The driver
core maintains a simple linked list of drivers that have requested this
sort of deferral; when the time seems right, the deferred probe()
invocations are retried to see if things work any better.
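The retry scheme described above can be sketched as a small user-space model in C. Everything here is hypothetical and greatly simplified: model_driver, register_driver(), retry_deferred(), and the two probe functions are invented for illustration, the retry happens synchronously rather than through a workqueue, and none of this is the code from Grant's patch. Only the -EAGAIN convention and the list of deferred drivers come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver record for the model. */
struct model_driver {
	const char *name;
	int (*probe)(void);                 /* returns 0 or -EAGAIN */
	int bound;                          /* nonzero once probe() has succeeded */
	struct model_driver *next_deferred; /* link in the deferred list */
};

static struct model_driver *deferred_head;  /* drivers waiting for a retry */
static int sensor_present;                  /* set when the sensor driver binds */

static int sensor_probe(void) { sensor_present = 1; return 0; }

/* The bridge cannot initialize until the sensor exists. */
static int bridge_probe(void) { return sensor_present ? 0 : -EAGAIN; }

static void retry_deferred(void);

static void register_driver(struct model_driver *drv)
{
	int ret;

	assert(drv != NULL && drv->probe != NULL);
	ret = drv->probe();
	if (ret == -EAGAIN) {
		/* Missing resource: remember the driver and try again later. */
		drv->next_deferred = deferred_head;
		deferred_head = drv;
	} else if (ret == 0) {
		drv->bound = 1;
		retry_deferred();   /* a new device may unblock deferred drivers */
	}
}

/* Detach the whole deferred list and probe each entry again. */
static void retry_deferred(void)
{
	struct model_driver *list = deferred_head;

	deferred_head = NULL;
	while (list) {
		struct model_driver *drv = list;

		list = drv->next_deferred;
		drv->next_deferred = NULL;
		register_driver(drv);   /* may defer the driver once more */
	}
}
```

Registering the bridge before the sensor leaves the bridge on the deferred list; registering the sensor afterward succeeds, triggers a retry, and the bridge binds on its second attempt, regardless of the order in which the two were registered.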
One of the concerns raised with regard to this patch had to do with the
determination of the right time. How might the kernel know when a failed
initialization might work? The event which may change the situation is the
successful addition of a new device to the system, so the current patch
retries all of the deferred calls every time a new device shows up. The
mechanism used for the retries (a workqueue) will tend to coalesce these
attempts when a lot of devices are being registered (during system boot,
for example), but it still strikes some reviewers as being inefficient.
Grant has promised a revision of the patch which improves the situation.
A related question is: when can the kernel conclude that there is no longer
any point in retrying a specific driver's probe() function? In
today's dynamic hardware environment, there never comes a point where one
can say that no more devices will show up. This question has no real
answer; it could be that, in a poorly configured or broken system, the
process will never terminate. The cost of a driver stuck in the deferred
state should be small, though.
Others have questioned the need for this mechanism at all, but the
responses have made it clear that something needs to be done to address
this kind of hardware. A proper solution in the driver core seems like a
better answer than a bunch of one-off hacks in specific drivers. So
something will probably go in.
Someday perhaps we will see a more elegant and efficient mechanism. One
could imagine an API allowing a driver to specify which resources it is
looking for; that driver's probe() function would then be put on
hold until those resources become available. The driver core already
generates events when new devices become available; some code matching
those events to waiting drivers could be the last piece. But there would
be a need to come up with a language by which a driver could express a need
like "a device at address 42 on this I2C bus"; getting that right could
take some work.
Meanwhile, Grant's patch offers a "good enough" solution which appears
capable of solving the problem most of the time. Accepting "good enough"
when it's truly good enough is a key part of pragmatic programming. So
chances are we'll have deferred driver initialization in the kernel
sometime soon; fancier mechanisms may be rather longer in coming.
Page editor: Jonathan Corbet