Kernel development
Brief items
Kernel release status
The current stable kernel release is 2.6.18, released by Linus on September 19. Do read the announcement; it appears to have some changelog entries which did not come directly from git. There is a vast amount of new stuff in this release, including priority-inheriting futexes, a new generic interrupt handling layer, a new core time subsystem, the kernel locking validator, the SMPnice work, a bunch of virtual memory work, a huge serial ATA update, the removal of devfs, and much more. See the KernelNewbies LinuxChanges page for a much more detailed list, the LWN 2.6 kernel API changes page for information on internal programming interface changes, or the long-format changelog for thousands of patches' worth of detail.

The current -mm release is 2.6.18-rc7-mm1. Says Andrew:
He also notes that this kernel will not run on distributions with an older version of udev due to some driver core changes, a situation which was discussed here back in August. Other changes to -mm include a "probably wrong" change to the kmap() API to make it handle coherency issues, a new GFP_THISNODE memory allocation flag, the removal of the questionable HDAPS driver for unstated reasons (though it is worth noting that one of the last patches into 2.6.18 made it clear that anonymous code contributions cannot be accepted), the SLIM and integrity measurement security modules, and a number of fixes.
For 2.6.16 users: Adrian Bunk released 2.6.16.29 with a number of fixes on September 13.
The current 2.4 prepatch is 2.4.34-pre3, released on September 19. The main change this time around is the inclusion of the gcc 4.0 patches.
Kernel development news
Tracing infrastructures
Sometimes, things just do not go according to plan. Mathieu Desnoyers is the current maintainer of the Linux Trace Toolkit, a kernel event tracing package which has, despite a significant user base, remained outside of the mainline for many years. He recently posted a new LTT release with the following introduction:
What resulted was a thread of hundreds of messages, many of which could be considered impolite even by linux-kernel standards. Clearly, LTT has hit a nerve - which is especially surprising given that the points of real disagreement are minimal.
At times, people have questioned whether the kernel needs any sort of tracing facility at all. That particular question would appear to have been resolved (affirmatively); the disagreement now would appear to be whether that tracing should be static or dynamic. Static tracing works by putting explicit tracepoints into the source code (they look like function calls); the tracing framework can then enable or disable those tracepoints at run time as desired. In a dynamic system, instead, tracepoints are injected into a running system, usually in the form of a breakpoint instruction.
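The static model can be illustrated with a short, purely userspace sketch; the flag and macro names here are invented for the example, not taken from LTT or any real tracing framework:

```c
#include <stdio.h>

/* Hypothetical global switch; a real framework would keep a flag (or
 * a bitmap) per tracepoint and flip it at run time. */
static int trace_enabled = 0;
static int trace_hits = 0;      /* events actually emitted */

/* A static tracepoint: the call site is always compiled in, so even
 * when disabled it costs a test and a (well-predicted) branch. */
#define TRACE(fmt, ...) \
	do { \
		if (trace_enabled) { \
			trace_hits++; \
			fprintf(stderr, "trace: " fmt "\n", ##__VA_ARGS__); \
		} \
	} while (0)

/* Instrumented code: the tracepoint looks like a function call. */
static void schedule_task(int pid)
{
	TRACE("switching to pid %d", pid);
	/* ... the real work would happen here ... */
}
```

A dynamic tracer, by contrast, would leave schedule_task() untouched in the source and patch a breakpoint into the running kernel at the equivalent spot.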
The kernel already has dynamic tracing in the form of KProbes; LTT, instead, uses (primarily) a static model. So the biggest question, at least on the surface, has been over whether Linux needs a static tracing package in addition to the dynamic mechanism it has now. This debate revolves around a few points:
- Overhead, part 1: when tracing is not being used (the normal situation on most systems), dynamic tracepoints clearly have lower overhead: they do not exist at all. For all the work that is done to make static tracepoints fast when they are not in use, they still exist, and will still have a (small) runtime cost.
- Overhead, part 2: when tracing is being used, static tracepoints will tend to be faster. The breakpoint mechanism used by KProbes can (in the current implementation) take about ten times as many CPU cycles as a static tracepoint. There are projects in the works (djprobes, in particular) which can reduce this overhead considerably; Ingo Molnar also, as part of the discussion, posted a series of patches which cut the KProbes overhead roughly in half.
One might wonder why overhead concerns people in this case. Tracing is often used to track frequent events, so a higher tracepoint overhead can slow things down in a noticeable manner. More to the point, though, heavyweight tracepoints can change the timing of events, leading to the dreaded "heisenbugs" which vanish when the developer actively looks for them.
- Maintenance overhead: some developers are concerned that the addition of static tracepoints to the kernel code will complicate the maintenance of that code. Tracepoints clutter the code itself, and they must continue to work into the indefinite future. In a sense, each one can be thought of as a little system call which, once placed, cannot be changed. Developers also worry that there will be pressure to add increasing numbers of these tracepoints over time.
On the other hand, dynamic tracepoints impose a different sort of overhead: everybody who is interested in a set of tracepoints must take on the maintenance of those tracepoints. As the kernel changes, the tracepoints will need to move around to follow those changes. Keeping a set of dynamic tracepoints current can, in fact, be a nontrivial and tiresome job. Tools like SystemTap help in this regard, but they are far from a complete solution at this time. Static tracepoints placed into the kernel code, instead, will continue to work as that code changes.
- Flexibility: dynamic tracepoints can be placed anywhere at any time, but static tracepoints require, at a minimum, a source code edit, rebuild, and reboot. Dynamic tracepoints can more easily support runtime filtering of events as well. On the other hand, static tracepoints are currently better at accessing local variables.
- Architecture support: KProbes are not currently implemented on all architectures, so they are not available to all Linux users. Static tracepoints tend to require less architecture-specific trickiness, and are thus easier to support universally. On the other hand, it has been argued, the addition of static tracepoints would take away much of the incentive architecture maintainers might have to make KProbes work.
Reading through the discussion, one could be forgiven for going into a state of complete despair. The interesting thing, though, is that the level of disagreement is lower than one might think. There is a near consensus among the participants that there is a place for both static and dynamic tracepoints. Static tracing of events of interest will help a lot of people - user-space developers and system administrators, not just kernel developers - understand what is going on in the system. Forcing all of those people to figure out where, in a specific kernel, to place a tracepoint to report (for example) scheduler changes would make things a lot harder.
The key point, however, is that the value of the static tracepoint is not really its static placement, but the fact that it is a clear indicator of where the tracepoint needs to be. So it has been suggested that an answer which might please everybody is to insert "markers" rather than tracepoints. These markers, which could live in a different section of the kernel image, are simply signs pointing out where a dynamic tracepoint should be inserted, should the need arise. To this end, Mathieu has posted a simple marker patch; it was promptly fired upon for implementation issues, but few people are opposed to the idea.
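One way to picture the marker idea - as a userspace sketch, not Mathieu's actual patch; all of the names below are invented - is as a named hook point in the code which does nothing until a tracer attaches a probe to it at run time:

```c
#include <stdio.h>
#include <string.h>

typedef void (*probe_fn)(const char *name, int arg);

/* A marker: a named spot in the code where a dynamic tracepoint could
 * be inserted; with no probe attached it is a near no-op. */
struct marker {
	const char *name;
	probe_fn    probe;	/* NULL means "nothing attached" */
};

static struct marker markers[] = {
	{ "sched_switch", NULL },
};

#define NMARKERS (sizeof(markers) / sizeof(markers[0]))

/* The annotation placed in the code: it says *where* the interesting
 * event happens, without committing to any tracing framework. */
static void MARK(struct marker *m, int arg)
{
	if (m->probe)
		m->probe(m->name, arg);
}

/* A dynamic tracer hooks into a marker by name at run time. */
static int marker_attach(const char *name, probe_fn fn)
{
	for (unsigned int i = 0; i < NMARKERS; i++) {
		if (strcmp(markers[i].name, name) == 0) {
			markers[i].probe = fn;
			return 0;
		}
	}
	return -1;
}

static int events;

static void my_probe(const char *name, int arg)
{
	events++;
	fprintf(stderr, "%s: arg=%d\n", name, arg);
}

/* Instrumented code: only the stable hook point lives here; the code
 * around it is free to change. */
static void do_context_switch(int next)
{
	MARK(&markers[0], next);
}
```

The maintenance burden on the code's developers is thus limited to keeping the marker in a sensible place, while the tracing machinery itself stays out of the kernel proper.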
So markers may well be the way this work goes forward. If the LTT code could be reworked around the marker concept, then the way might be clear for a discussion of what else needs to happen before that code could be merged (there are a number of issues to talk about there which have been, thus far, overshadowed by the current debate). After suitable consideration, a carefully-selected set of markers/tracepoints could be added to the mainline kernel, enabling anybody to easily hook into and monitor well-known events. Once the smoke clears, there might just be a viable solution which will please almost everybody.
Another container implementation
Containers have been an area of increased developer interest over the last year or so. The container concept offers many of the advantages of full paravirtualization, but at a much lower cost, allowing more virtual machines to be run on the same host. The only problem is getting everybody to agree about just what a container is. The recent container patch set from Rohit Seth is another attempt to flesh out this concept.

Many approaches to containers are oriented around process trees - one process explicitly encloses itself within a container, and becomes the "init" process there; the container is then populated with the children of the initial process. Rohit's patch maintains part of that functionality - when a process calls fork(), the child will belong to the same container as the parent (if any) - but the mechanism is a bit more flexible than that. Arbitrary processes can be added to - and removed from - a container at any time.
Such changes are effected through a configfs interface. If configfs is mounted on /config, the system administrator can work with containers by moving into /config/containers. A new container is created by making a new directory there; containers, thus, are identified through a simple, flat namespace. A container's directory contains several files:
- addtask: writing a process ID into this file will add the corresponding process to the container. Processes already belonging to a container cannot be added directly to a new container; they must be explicitly removed from the old one first.
- rmtask: a process may be removed from a container by writing its ID to this file.
- page_limit: the maximum number of active memory pages which may be used by the container.
There are also a few informational files for getting statistics about how the container is operating.
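Assuming configfs is mounted on /config as described above, a session with this interface might look like the following (the container name "web", the process ID, and the limit value are made up for the example):

```shell
mkdir /config/containers/web                    # create a new container
echo 1234 > /config/containers/web/addtask      # place pid 1234 in it
echo 65536 > /config/containers/web/page_limit  # cap it at 65536 pages
echo 1234 > /config/containers/web/rmtask       # take the process back out
rmdir /config/containers/web                    # destroy the container
```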
The memory limit works by adding a container pointer to each mm_struct and address_space structure on the system. As pages are used or freed, the container's total count is updated accordingly. Should the container go over its limit, a separate process (a workqueue) goes to work freeing up pages belonging to the container. If the limit is exceeded in a big way, processes within the container will (when they try to add pages) be put on hold briefly to let the reaper catch up.
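The accounting scheme can be sketched in heavily simplified userspace form; the names and the over-limit thresholds below are invented for illustration and do not come from Rohit's patch:

```c
/* Simplified model of per-container page accounting: charges are
 * counted against a limit, going over the limit kicks a reaper, and
 * allocations far over the limit are throttled. */
struct container {
	long pages;        /* pages currently charged to the container */
	long page_limit;   /* as set through configfs */
	int  reaper_runs;  /* how often the reclaim workqueue was kicked */
	int  throttled;    /* how often an allocator had to wait */
};

/* Charge one page, as would happen when an mm_struct or address_space
 * belonging to the container gains a page. */
static void charge_page(struct container *c)
{
	c->pages++;
	if (c->pages > c->page_limit) {
		c->reaper_runs++;		/* wake the reaper */
		if (c->pages > c->page_limit * 2)
			c->throttled++;		/* wait for it to catch up */
	}
}

/* Uncharge a page when it is freed. */
static void uncharge_page(struct container *c)
{
	c->pages--;
}
```

The real patch does this bookkeeping from the page allocation and freeing paths, with the reaping done asynchronously by a workqueue rather than in the allocator itself.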
Rohit's containers are thus concerned with controlling aggregate resource usage. In this sense, they resemble the resource beancounters patch - but they do not use any of the beancounter code. These containers also lack one other feature found in most other implementations: any sort of namespace control. Processes placed into one of these containers will still see - and have access to - the entire system.
So these containers are only a partial solution to the problem, at least at this point. Namespace control features could presumably be added later on, though how that control would interact with the ability to add and remove processes at arbitrary times would be interesting to see. Meanwhile we have another approach to (at least part of) the problem to look at.
nopage() and nopfn()
The nopage() address space operation is charged with handling a major page fault within an address range. For address spaces backed by files, there is a generic nopage() method which causes the needed page to be read into memory. Device drivers also occasionally provide nopage() as part of their implementation of mmap(). In the driver case, a page fault is usually handled by finding the struct page corresponding to a memory-mapped buffer and passing that back to the kernel.

There are a couple of errors which can be signaled by nopage(): NOPAGE_SIGBUS for truly bad addresses and NOPAGE_OOM for cases where an out-of-memory situation caused the attempt to handle the fault to fail. What is missing is the ability to indicate that nopage() was interrupted by a signal and that the operation should be retried. That is not a situation which normally comes up in nopage() handlers which, if they must wait, usually do so in a non-interruptible manner. Benjamin Herrenschmidt has run into this issue, however, and has proposed a small change allowing a new NOPAGE_RETRY value. The response would be just as one would expect - the operation is retried later on, after the signal is handled.
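The retry protocol can be illustrated with a small userspace simulation - this is a sketch of the idea, not the actual patch, and names like fake_nopage() are invented:

```c
/* Modeled on the nopage() return values described above, plus the
 * proposed NOPAGE_RETRY. */
enum nopage_status {
	NOPAGE_OK,	/* page found and mapped */
	NOPAGE_SIGBUS,	/* truly bad address */
	NOPAGE_OOM,	/* out of memory while handling the fault */
	NOPAGE_RETRY,	/* interrupted by a signal: fault again later */
};

static int signal_pending_flag;	/* stands in for signal_pending() */

/* A nopage()-style handler which backs out when a signal arrives
 * rather than waiting non-interruptibly. */
static enum nopage_status fake_nopage(unsigned long addr)
{
	(void)addr;
	if (signal_pending_flag)
		return NOPAGE_RETRY;	/* give up, let the signal run */
	return NOPAGE_OK;
}

/* The fault path: handle any pending signal, then retry the fault. */
static int handle_fault(unsigned long addr, int *retries)
{
	enum nopage_status st;

	while ((st = fake_nopage(addr)) == NOPAGE_RETRY) {
		(*retries)++;
		signal_pending_flag = 0;  /* "deliver" the signal */
	}
	return st == NOPAGE_OK ? 0 : -1;
}
```

In the real kernel the retry would happen naturally: the faulting instruction is simply re-executed once the signal handler has run.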
It turns out that Google has a similar patch which it applies internally, though the motivations are different. In Google's case, the patch exists to work around a performance problem that has been experienced there. This patch has not been submitted for merging because of potential denial of service problems and the fact that its author considers it to be a bit of a hack.
Some form of this patch may well be merged eventually, but some more work seems called for first. The two patches make it clear that there are multiple reasons for returning NOPAGE_RETRY, so it might make sense to make that reason available to the higher levels of the page fault handler. That would allow some potential efficiency problems to be addressed, though the DoS scenario still presents potential problems.
Meanwhile, one of the longstanding limitations of nopage() is that it can only handle situations where the relevant physical memory has a corresponding struct page. Those structures exist for main memory, but they do not exist when the memory is, for example, on a peripheral device and mapped into a PCI I/O memory region. Some architectures also do very strange things with special memory and multiple views of the same memory. In such cases, drivers must explicitly map the memory into user space with remap_pfn_range() instead of using nopage().
Jes Sorensen has, for some time, been carrying a patch which adds another address space operation called nopfn(). It is called in response to page faults only if there is no nopage() operation available; its job is to return a physical address (in the form of a page frame number) for the page which will satisfy the fault. That address will be stored directly into the process's page table, with no struct page required, and no reference counting performed. Jes has an IA-64 special memory driver which shows how this operation would be used.
The idea has not been universally popular in the past - Linus has opposed it, as have others. To some it looks like a needless complication of the virtual memory subsystem; these people would rather see code use remap_pfn_range() or create special page structures as needed. There are a number of situations where nopfn() is said to work better, however, and the pressure for its inclusion does not appear to be going away. So it will be interesting to see whether this one makes it into 2.6.19 or not.
Page editor: Jonathan Corbet