Kernel development

Brief items

Kernel release status

The current development kernel is 3.0-rc6, released on July 4. It has a new Intel isci driver which adds a significant chunk of code but, otherwise, it consists of basic fixes. "It's getting to the point where I'm thinking I should just release 3.0, because it's been pretty quiet, and the fixes haven't been earth-shakingly exciting." See the full changelog for all the details.

Stable updates: no stable updates have been released in the last week, and none are in the review process as of this writing.

Quotes of the week

The number of times I have to explain to industrial and business customers that Linux doesn't suck but the defaults are stupid is astounding, and they then wonder why either the authors or their vendor is a complete and utter moron.
-- Alan Cox

I have to say, I look over these patches and my mind wants to turn to things like puppies. And ice cream.
-- Andrew Morton

Your changelog fails the basic test by mentioning "corner case" simply because the whole futex code consists only of corner cases.
-- Thomas Gleixner

I realize that it's annoying to spend a lot of time on a specific implementation and then see competing code get merged. Unfortunately, this happens all the time, and the code we merge is often not the one that has had the most effort spent on it, but the one that looks most promising at the time when it gets merged.
-- Arnd Bergmann

And quite frankly, Christoph Hellwig has now _twice_ said good things about that driver, which is pretty unusual. It might mean that the driver is great. Of course, it's way more likely that space aliens are secretly testing their happy drugs on Christoph. Or maybe he's just naturally mellowing.
-- Linus Torvalds

What are they polling for?

By Jonathan Corbet
July 7, 2011
The poll(), select(), and epoll_wait() system calls all allow an application to ask the kernel whether I/O on any of a list of file descriptors would block and, optionally, to wait until one or more descriptors become ready for I/O. Internally, they are all implemented with the poll() method in the file_operations structure:

    unsigned int (*poll) (struct file *filp, struct poll_table_struct *pt);

This function returns a value indicating whether non-blocking I/O is currently possible; it is also expected to add a wait queue to the "poll table" (pt) passed in. If no file descriptors are ready for I/O, the calling process will block on all of the accumulated wait queues.
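
Seen from user space, the question these calls answer can be illustrated with a minimal sketch. The helper name fd_readable_now() is made up for this example; poll(), struct pollfd, and POLLIN are the standard interfaces. A timeout of zero asks only whether a read would block, without waiting.

```c
#include <poll.h>

/* Return 1 if a read on fd would not block right now, 0 if it would,
 * and -1 on error. A timeout of 0 makes poll() return immediately. */
int fd_readable_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

Called on the read end of an empty pipe, the helper should report 0; once a byte has been written to the other end, it should report 1.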

The poll() implementation has long included an optimization: if the poll() method for an early descriptor indicates that I/O is possible, the kernel knows that it will not be blocking the calling process. So it stops accumulating wait queues; this state is indicated by passing a null pointer for pt to the poll() methods called afterward. That all works well except in one case: what if a driver needs access to some of the information stored in the poll table?

In particular, the driver might want to know whether the caller is interested in readiness for read or write access, or whether it is looking for exceptional events. For example, if the application wants to read from the descriptor, the driver may need to fire up some device machinery to make that possible. This situation has not come up very often, but it does tend to affect Video4Linux drivers. In response, Hans Verkuil has posted a patch slightly changing the way poll() works.

With the patch, the poll table is never passed as null; instead, the "we will not be blocking" case is marked internally. So the set of events requested by the application is always available; Hans has provided a helper function to access that information:

    unsigned long poll_requested_events(const poll_table *p);
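
How a driver might make use of that information can be modeled in user space. The sketch below is not kernel code and not taken from the patch: every type and function prefixed with mock_ is a hypothetical stand-in, and the "start streaming only for readers" policy is just one plausible use of the requested-events mask.

```c
/* Event bits defined locally so the sketch stands alone; the values
 * match the usual Linux POLLIN and POLLOUT. */
#define MOCK_POLLIN  0x0001UL
#define MOCK_POLLOUT 0x0004UL

/* Hypothetical stand-in for the kernel's poll_table. */
struct mock_poll_table {
    unsigned long requested; /* events the application asked about */
    int will_block;          /* cleared once an earlier descriptor was ready */
    int queues_added;        /* how many wait queues have been registered */
};

/* Mock counterpart of poll_requested_events(). */
unsigned long mock_poll_requested_events(const struct mock_poll_table *p)
{
    return p->requested;
}

/* Hypothetical capture device. */
struct mock_capture_dev {
    int streaming;   /* has the capture machinery been started? */
    int frame_ready; /* is there data to read right now? */
};

/* Mock poll method: the table is never NULL, so the requested events
 * can be consulted even when the caller will not block. */
unsigned int mock_capture_poll(struct mock_capture_dev *dev,
                               struct mock_poll_table *pt)
{
    if ((mock_poll_requested_events(pt) & MOCK_POLLIN) && !dev->streaming)
        dev->streaming = 1;      /* start the device only for readers */

    if (pt->will_block)
        pt->queues_added++;      /* stands in for registering a wait queue */

    return dev->frame_ready ? (unsigned int)MOCK_POLLIN : 0;
}
```

In this model a caller interested only in writing never causes the device to start streaming, while a caller interested in reading does, whether or not the "will not block" state has been reached.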

There has been little discussion of the patch; it doesn't seem like there is any real reason for it not to go in for 3.1.

Kernel development news

Seccomp filters: No clear path

By Jake Edge
July 7, 2011

Patches to expand the functionality of seccomp ("secure computing") have been floating around for two years or more without making any real progress into the mainline. There are a number of projects that are interested in using an expanded seccomp, but the patches themselves seem to have run into a "catch-22" situation. There are conflicting visions of how the feature should be added, without a clear sense that any of the options will be acceptable to all of the maintainers involved. That leaves a useful feature without a clear path into the kernel, which is undoubtedly frustrating to some.

We first looked at seccomp sandboxing a little over two years ago, when Adam Langley posted patches that would provide a way for a process to restrict the system calls that it (and its children) could make. The idea is to allow processes to sandbox themselves by choosing which system calls are available, rather than being restricted to just the four hard-coded system calls that the existing seccomp implementation allows (read(), write(), exit(), and sigreturn()). The impetus behind Langley's patches was to provide an easier mechanism for sandboxing processes in the Chromium web browser—and to eventually remove the somewhat convoluted sandbox that Chromium currently uses on Linux.
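
As a rough user-space model of that idea (not Langley's actual patch), the set of permitted system calls can be pictured as a per-process bitmask indexed by system call number. Every identifier below, along with the 512-entry limit, is hypothetical.

```c
#include <limits.h>

#define MOCK_MAX_SYSCALL   512  /* hypothetical upper bound on call numbers */
#define MOCK_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MOCK_MASK_WORDS \
    ((MOCK_MAX_SYSCALL + MOCK_BITS_PER_WORD - 1) / MOCK_BITS_PER_WORD)

/* One bit per system call number; a set bit means "allowed". */
struct mock_syscall_mask {
    unsigned long words[MOCK_MASK_WORDS];
};

/* Mark system call number nr as permitted. */
void mock_mask_allow(struct mock_syscall_mask *m, unsigned int nr)
{
    if (nr < MOCK_MAX_SYSCALL)
        m->words[nr / MOCK_BITS_PER_WORD] |= 1UL << (nr % MOCK_BITS_PER_WORD);
}

/* Return 1 if nr is permitted, 0 otherwise (including out-of-range nr). */
int mock_mask_permits(const struct mock_syscall_mask *m, unsigned int nr)
{
    if (nr >= MOCK_MAX_SYSCALL)
        return 0;
    return (int)((m->words[nr / MOCK_BITS_PER_WORD] >>
                  (nr % MOCK_BITS_PER_WORD)) & 1UL);
}
```

A mask that starts out all zeroes denies everything; the process would then enable just the calls it needs before locking itself in.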

At the time of that proposal, Ingo Molnar suggested that Ftrace-style filtering would make the expanded seccomp much more useful. That idea wasn't universally hailed at the time, and the seccomp feature went mostly dormant until it was restarted by Will Drewry back in April. Drewry took Molnar's suggestions and implemented a version of seccomp that would allow system calls to be enabled, disabled, or filtered with simple boolean expressions (e.g. sys_read: (fd == 0)).
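
The three states described above (enabled, disabled, or filtered by an expression on the arguments) can be modeled in a few lines. This is an illustrative user-space sketch, not Drewry's implementation: the rule structure, the system call numbers, and the restriction to an equality test on the first argument are all assumptions made for the example.

```c
/* Hypothetical system call numbers for the model. */
#define MOCK_NR_READ  0
#define MOCK_NR_WRITE 1

enum mock_rule_kind {
    MOCK_DENY = 0,  /* call is disabled */
    MOCK_ALLOW,     /* call is enabled unconditionally */
    MOCK_FILTER     /* call is enabled only if the expression holds */
};

struct mock_rule {
    enum mock_rule_kind kind;
    long required_arg0; /* for MOCK_FILTER: first argument must equal this */
};

/* Decide whether call number nr with first argument arg0 is permitted.
 * Calls without a rule are denied. */
int mock_syscall_permitted(const struct mock_rule *rules, unsigned int nr_rules,
                           unsigned int nr, long arg0)
{
    if (nr >= nr_rules)
        return 0;

    switch (rules[nr].kind) {
    case MOCK_ALLOW:
        return 1;
    case MOCK_FILTER:
        return arg0 == rules[nr].required_arg0;
    default:
        return 0;
    }
}
```

A rule table with a MOCK_FILTER entry requiring a first argument of 0 for MOCK_NR_READ corresponds to the example expression sys_read: (fd == 0): reads from descriptor 0 pass, reads from any other descriptor are refused.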

While Molnar was pleased with the progress, he didn't think it went far enough and suggested that a perf-like interface be used instead of prctl(), which is used by the existing seccomp. He had some fairly wide-ranging ideas that using perf events in a more active way could lead to better kernel security solutions than the existing Linux Security Modules (LSM) approach provides. Once again, the idea was not universally popular; the LSM developers, in particular, were not enamored of it.

Nevertheless, Drewry implemented a proof of concept along the lines of what Molnar had suggested. That led to complaints from a somewhat surprising direction, as both Peter Zijlstra and Thomas Gleixner strongly objected to perf being used in an active role. Their responses didn't leave room for any middle ground, with Zijlstra, who is one of the perf maintainers along with Molnar, saying that he and Gleixner would NAK "any and all patches that extend perf/ftrace beyond the passive observing role".

All of which led Drewry, who must be feeling a bit whipsawed at this point, to return to the patchset that seemed to have the most support: using Ftrace/perf-style filters, but maintaining the prctl() interface that is currently used by seccomp. Linus Torvalds had expressed some skepticism that the feature would have any real users, but Drewry outlined how it would be used by Chromium, and several other developers spoke up in favor of expanding seccomp, saying that QEMU, Linux containers (LXC), and others would use the feature. Those endorsements, along with the resolution of some other technical concerns, were enough for Torvalds to drop his objection to the feature. But, as might be guessed, Molnar is still not satisfied with the approach.

When Drewry reposted the patchset toward the end of June, and asked what the next steps were, Molnar noted that his concerns were not being addressed: "You are pushing the 'filter engine' approach currently, not the (much) more unified 'event filters' approach." But Drewry is trying to find a balance between the needs of the potential users, other maintainers, and Molnar's requests, which is somewhere between difficult and impossible:

Based on the support from potential API consumers, I believe there is interest in this patch series, and I worry that just like with the last two attempts in the last two years, this series will be relegated to the lwn archives in anticipation of a future solution that uses infrastructure that isn't quite ready. I'm trying to approach a problem that can be addressed today in a flexible, future-friendly way, rather than try to open up a larger cross-kernel impacting patch series that I'm unsure of exactly how to integrate sanely and don't know that I can commit to doing.

But Molnar is adamant that the "filter engine" approach is short-sighted, citing the diffstats of the various implementations as evidence:

Not doing it right because "it's too much work", especially as the trivial 'proof of concept' prototype already gave us something very promising that worked to a fair degree:
       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
 filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
 event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)
are pretty hollow arguments to me. That diffstat sums up my argument of proper structure pretty well.

But, as Drewry points out, there is still a lot of work to be done to get beyond the proof-of-concept and to a fully fleshed-out solution. Given that the approach has already received several NAKs, there is no assurance that all of that work would ever be merged. Drewry would like to see the feature become available soon, and is concerned that working on the larger problem is likely to delay that significantly, if it can ever get beyond the objections: "If all the other work is a prerequisite for system call restriction, I'll be very lucky to see anything this calendar year assuming I can even write the patches in that time."

Molnar is undeterred, however, suggesting that there is a path into the kernel through the tree that he co-maintains:

Do it properly generalized - as shown by the prototype patch. I can give you all help that is needed for that: we can host intermediate stages in -tip and we can push upstream step by step. You won't have to maintain some large in-limbo set of patches. 95% of the work you've identified will be warmly welcome by everyone and will be utilized well beyond sandboxing! That's not a bad starting position to get something controversial upstream: most of the crazy trees are 95% crazy.

The problem, of course, is that the 5% is the piece that Drewry and others are most interested in seeing in the kernel: the system call restrictions for sandboxing. So, what Molnar seems to be offering is a fairly sizable chunk of work that could, in the end, still leave the "interesting" part out in the cold. Molnar may be confident that he can overcome the objections from Zijlstra and Gleixner, but Drewry can hardly be as sanguine. He describes the problem as he sees it:

It seems like a catch-22. There's not a perfectly clear path forward, and anything that looks like the perf-style proof of concept will be NACK'd by other maintainers. While I believe we could lift perf up off its foundation and create a shared location for storing perf events and ftrace events so that they will be inherited the same way (currently nack'd by linus) and walked the same way (kinda), the syscall interface couldn't currently be shared (also nack'd by perf), and creating a new one is possible modeled on the perf one, but it's also unclear what the ABI should be for a generic filtering system.

Both Zijlstra and Gleixner have been absent from the most recent discussion, so it's a little hard to guess what their thoughts are. In the absence of any kind of posting softening their stances, though, it would be a bad idea to believe that they have changed their minds.

It's a problem that we have seen before, where a new feature is, to some extent, held hostage to requests that a larger problem be solved. The problem was discussed at the 2009 Kernel Summit, where there was agreement that those requests should be advisory in nature, rather than demands. In this case, Molnar is not really demanding that the bigger task be done; he is simply uninterested in taking the code via the -tip tree unless it solves the larger problem.

It is unclear where things go from here. Drewry said that he would look at trying to do things Molnar's way ("but if my only chance of any form of this being ACK'd is to write it such that it shares code with perf and has a shiny new ABI, then I'll queue up the work for when I can start trying to tackle it"), but it may be a ways off. In the meantime, there are various projects interested in using the feature.

If falling back to the bitmask version of the feature solves enough of the problem for those projects, there is the possibility of trying to get that into the kernel via another tree (e.g. the security tree). There would undoubtedly be objections from Molnar, but if enough users lined up behind it, that might be a reasonable approach. It would create an ABI that would need to be maintained going forward, which is one of Molnar's objections, but it would solve problems for Chromium and others.

Steven Rostedt suggested adding the seccomp expansion as a discussion item for the Kernel Summit in October, which might provide a path forward. It's likely that most or all of the interested parties will be there (unlike the Linux Security Summit that will be held with Plumbers in September, which was suggested as an alternative). While a face-to-face discussion could be helpful, it might be a stretch to believe that the disagreement between active vs. passive perf could be resolved that way. On the other hand, it could lead to some kind of decree about the proper direction from Torvalds. That could go a long way toward resolving the issue.

CMA and ARM

By Jonathan Corbet
July 5, 2011
LWN recently looked (again) at the contiguous memory allocator (CMA) patch set; CMA is intended to provide large, contiguous DMA buffers to drivers without requiring that memory be set aside for that exclusive purpose. CMA was recently reposted with the idea that it is nearly ready for merging. There is a clear desire to see this code get at least into the -mm tree, even if it is not yet quite ready for the mainline. Most reviewers are pleased with CMA; it would seem that there are very few roadblocks remaining. Except that, as it turns out, one big obstacle remains.

Over the years, LWN has also looked at ARM's special memory management challenges. Recent ARM CPUs are, like those implementing other architectures, becoming more complex in order to improve performance. So ARM processors can now do speculative prefetching of memory contents in surprising ways. This prefetching works well on cached memory, but should not be used on memory that has been marked as uncached. An additional complication comes from the fact that virtual memory systems can have more than one mapping for a given range of memory, and caching is a feature of the mapping, not the memory itself. So one might well wonder what happens if different mappings have different caching attributes. On recent ARM processor designs, what happens is officially undefined; in practice, it can mean problems like corrupted memory, machine checks, or simple hangs. As it happens, kernel developers normally go out of their way to avoid that kind of behavior.

The current CMA mechanism is used as an allocator behind dma_alloc_coherent(), which creates a cache-coherent DMA buffer. In the absence of bus-snooping hardware that is able to notice when a DMA transfer changes memory, "cache-coherent" is likely to mean simply "uncached." So CMA must, on such systems, create an uncached range of memory to hand back to the requesting driver. That is easily done, and all should be well...at least, unless there happens to be another mapping to the same memory with different caching attributes.

Unfortunately, conflicting mappings can come about easily on a Linux system. One of the first things the kernel does as it boots is to create a "linear mapping" which provides kernel-space virtual addresses for most or all of the memory present in the system. The kernel cannot manipulate memory directly without such a mapping; putting as much memory as possible into a persistent mapping thus makes sense. On a 32-bit system, just under 1GB of memory can be mapped this way (64-bit systems can always map all of memory and will be able to do so for quite some time yet). This kernel-mapped memory is called "low memory"; almost all allocations of memory for the kernel's use come from the low memory area. Naturally, low memory is mapped with caching enabled; to do otherwise would destroy the performance of the system. If a region of low memory is turned into a DMA buffer with an uncached mapping, the system will have two conflicting mappings for the same memory and will have moved into "undefined behavior" territory.
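
The condition that gets the system into trouble can be stated compactly: two mappings cover overlapping physical memory but disagree about caching. The following user-space sketch models just that check; the structure and function names are hypothetical, and real mappings carry more attributes than a single "cached" flag.

```c
#include <stdint.h>

/* Hypothetical description of one mapping of physical memory. */
struct mock_mapping {
    uint64_t phys_start; /* first physical address covered */
    uint64_t len;        /* length in bytes */
    int cached;          /* nonzero if the mapping is cacheable */
};

/* Return 1 if the two mappings overlap in physical memory and have
 * different caching attributes, 0 otherwise. */
int mock_mappings_conflict(const struct mock_mapping *a,
                           const struct mock_mapping *b)
{
    int overlap = a->phys_start < b->phys_start + b->len &&
                  b->phys_start < a->phys_start + a->len;

    return overlap && (!a->cached != !b->cached);
}
```

A cached linear mapping covering low memory and an uncached DMA buffer carved out of that same range would be flagged by this check; an uncached buffer outside the linear mapping would not.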

These conflicting mappings are the reason behind ARM maintainer Russell King's strong opposition to the merging of CMA in its current form. He believes that the code is unsafe on ARM systems; it should not, he says, be merged until the mapping problem has been solved. The interesting thing is that the existing DMA API has the same problem on ARM; dma_alloc_coherent() uses vanilla alloc_pages() to obtain a buffer, then changes the caching attributes before giving the buffer back to the caller. The addition of CMA does not make ARM's DMA API any more or less safe than it was before; it just perpetuates an existing problem.

Russell has a patch pending for 3.1 which addresses this problem by setting aside a chunk of memory which is never mapped into the kernel's address space. With this memory pool available, coherent DMA mappings can be set up without endangering the operation of the system. The whole reason CMA exists, though, is to provide large, contiguous buffers without the need to set aside memory; Russell's approach thus defeats the entire purpose. The pressures which have led to the creation of CMA will not go away anytime soon, so it seems that another solution is needed. Arnd Bergmann has outlined two possibilities, neither of which is entirely pleasant:

  • CMA could be changed to only allocate from the high memory zone. High memory is (by definition) not in the kernel's linear mapping, so no other mappings should exist. The problem with this approach is that it forces the use of high memory on all systems; ARM-based systems are reaching the point where some of them need high memory anyway, but that need is not, yet, universal. Getting enough memory into the high memory zone to be useful could require moving the boundary and shrinking low memory; that is not desirable because low memory is often a limiting resource already. Even if that obstacle can be overcome, the ARM architecture poses unique challenges which would make a high memory implementation hard.

  • Memory that has been turned into a coherent DMA buffer could simply be removed from the kernel's linear mapping until the buffer is no longer needed. This approach seems simple until one remembers that the kernel uses huge pages for the linear mapping. Splitting those huge pages into smaller pages would increase translation lookaside buffer (TLB) contention, reducing the performance of the system as a whole.
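
The cost in the second option can be put in rough numbers. The sketch below assumes a 1 MiB huge mapping and 4 KiB small pages, figures that are typical for 32-bit ARM but do not come from the discussion itself; the helper name is likewise made up for illustration.

```c
#include <stddef.h>

#define MOCK_HUGE_PAGE_SIZE  ((size_t)1 << 20) /* assumed: 1 MiB */
#define MOCK_SMALL_PAGE_SIZE ((size_t)1 << 12) /* assumed: 4 KiB */

/* Number of mappings (and thus potential TLB entries) needed to cover
 * `bytes` of memory with pages of `page_size` bytes, rounding up. */
size_t mock_mappings_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Under those assumptions, a region that needed a single huge-page mapping turns into 256 small-page mappings once it is split, all competing for the same TLB.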

Compared to these alternatives, simply setting aside a chunk of memory at boot time might not look like such a bad idea after all. CMA developer Marek Szyprowski's plan appears to be to go with the second of those two alternatives; he thinks that it can be done without significantly hurting performance.

In truth, the best tradeoff will almost certainly differ from one platform to the next. In some situations, memory will be tight enough that a significant runtime penalty to avoid making static DMA buffers seems worthwhile; on others, setting aside a bit of memory may not be a real problem. So what may come of all this is a set of choices to be made when configuring a kernel. There does not appear to be a single solution which just works for everybody on the horizon at this time.

Deferred driver probing

By Jonathan Corbet
July 7, 2011
The developers working on the initial OLPC laptop ran into an interesting problem: the camera driver would fail to initialize if it was built into the kernel, but it worked just fine if built as a module. That problem still exists; it is a symptom of an issue which comes up frequently in contemporary systems: there is no way to know at build time what dependencies exist between different hardware units, so there is no way to ensure that drivers are loaded in the right order. A new patch from Grant Likely tries to solve that problem in a simple sort of way; it will probably improve the situation, but a complete solution is still lacking.

The problem with the camera driver is a result of the fact that the "camera" is, in reality, three devices working in concert: a DMA bridge, a sensor, and an I2C bus connecting the two. The bridge (which plays the role of the overall "camera driver") must locate and identify the sensor as part of its setup routine; if the sensor does not exist, initialization will fail. But the sensor will not exist until its driver and the I2C bus driver have been loaded into the system. If all of the drivers are built into the kernel, the bridge driver's probe() function may be called first; there will be no sensor, so everything fails.

Contemporary systems - especially those of the mobile variety - are increasingly built this way. Grant gave another example:

A "sound card" typically consists of multiple devices; one or more codecs (often i2c or spi attached), a sound bus (often i2s), a dma controller, and a lump of machine/platform specific code that ties them all together. Right now the ASoC code is going through all kinds of gymnastics make each component register with the ASoC layer and the 'tie together' driver has to wait for each of them to show up.

The key point to understand is that the various components that make up a "device" may appear to be entirely unrelated at the hardware level. They can be on different buses; some of them may be subcomponents of entirely different devices. A general-purpose kernel has no real way to know what the real dependencies between devices are until all of the pieces are present and have started to recognize each other.

Grant's patch takes a simple approach to solving this problem: drivers which are unable to initialize their devices as the result of missing resources can request that the operation be retried at some point in the future. That request is a simple matter of returning -EAGAIN from the probe() function. The driver core maintains a simple linked list of drivers that have requested this sort of deferral; when the time seems right, the deferred probe() invocations are retried to see if things work any better.
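
The mechanism can be modeled in user space as follows. All names prefixed with mock_ are invented for this example; only the -EAGAIN convention and the linked list of deferred drivers come from the description above, and the real driver core would do the retries from a workqueue rather than through a direct call.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-simplified driver structure. */
struct mock_driver {
    const char *name;
    int (*probe)(void);               /* returns 0, or -EAGAIN to defer */
    int bound;                        /* set once probe() has succeeded */
    struct mock_driver *next_deferred;
};

/* Singly linked list of drivers waiting for another attempt. */
static struct mock_driver *mock_deferred_list;

/* Probe one driver; park it on the deferred list if it asks for a retry. */
int mock_driver_attach(struct mock_driver *drv)
{
    int ret = drv->probe();

    if (ret == 0) {
        drv->bound = 1;
    } else if (ret == -EAGAIN) {
        drv->next_deferred = mock_deferred_list;
        mock_deferred_list = drv;
    }
    return ret;
}

/* Called after a new device shows up: retry every deferred probe once.
 * Drivers that defer again simply land back on the list. */
void mock_retry_deferred(void)
{
    struct mock_driver *pending = mock_deferred_list;

    mock_deferred_list = NULL;
    while (pending) {
        struct mock_driver *next = pending->next_deferred;

        pending->next_deferred = NULL;
        mock_driver_attach(pending);
        pending = next;
    }
}

/* Example probes modeled on the camera case: the bridge needs the sensor. */
static int mock_sensor_present;

int mock_sensor_probe(void)
{
    mock_sensor_present = 1;
    return 0;
}

int mock_bridge_probe(void)
{
    return mock_sensor_present ? 0 : -EAGAIN;
}
```

If the bridge driver is attached first, its probe defers; attaching the sensor driver and then running the retry pass binds the bridge on the second attempt, regardless of the original ordering.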

One of the concerns raised with regard to this patch had to do with the determination of the right time. How might the kernel know when a failed initialization might work? The event which may change the situation is the successful addition of a new device to the system, so the current patch retries all of the deferred calls every time a new device shows up. The mechanism used for the retries (a workqueue) will tend to coalesce these attempts when a lot of devices are being registered (during system boot, for example), but it still strikes some reviewers as being inefficient. Grant has promised a revision of the patch which improves the situation.

A related question is: when can the kernel conclude that there is no longer any point in retrying a specific driver's probe() function? In today's dynamic hardware environment, there never comes a point where one can say that no more devices will show up. This question has no real answer; it could be that, in a poorly configured or broken system, the process will never terminate. The cost of a driver stuck in the deferred state should be small, though.

Others have questioned the need for this mechanism at all, but the responses have made it clear that something needs to be done to address this kind of hardware. A proper solution in the driver core seems like a better answer than a bunch of one-off hacks in specific drivers. So something will probably go in.

Someday perhaps we will see a more elegant and efficient mechanism. One could imagine an API allowing a driver to specify which resources it is looking for; that driver's probe() function would then be put on hold until those resources become available. The driver core already generates events when new devices become available; some code matching those events to waiting drivers could be the last piece. But there would be a need to come up with a language by which a driver could express a need like "a device at address 42 on this I2C bus"; getting that right could take some work.

Meanwhile, Grant's patch offers a "good enough" solution which appears capable of solving the problem most of the time. Accepting "good enough" when it's truly good enough is a key part of pragmatic programming. So chances are we'll have deferred driver initialization in the kernel sometime soon; fancier mechanisms may be rather longer in coming.

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds