Kernel development
Brief items
Kernel release status
The current development kernel is 3.3-rc6, released on March 3. Linus says that things are calming down: "In fact, it's been calm enough that this *might* be the last -rc, but we'll see how the upcoming week goes. If it stays calm (and hopefully even calms down some more), there doesn't seem to be any major reason to drag out the release cycle any more."
Stable updates: The 2.6.32.58 stable kernel update was released on March 4. It contains a long set of important fixes, as usual. Greg says: "This is the last 2.6.32 kernel I will be releasing. The 2.6.32 kernel is now in 'extended-longterm' maintenance, with no set release schedule from now on. I STRONGLY encourage any users of the 2.6.32 kernel series to move to the 3.0 series at this point in time." Willy Tarreau has reaffirmed that he will take over 2.6.32 maintenance - but at a slower pace.
Kernel development news
Two approaches to kernel memory usage accounting
The kernel's memory usage controller allows a system administrator to place limits on the amount of memory used by a given control group. It is a useful tool for systems where memory usage policies must be applied - often systems where virtualization or containers are being used - but it has one notable shortcoming: it only tracks user-space memory. The memory used by the kernel on behalf of a control group is not tracked. For some workloads, the amount of memory involved may be considerable; a control group that accesses large numbers of files, for example, will create a lot of entries in the kernel's directory entry ("dentry") cache. Without the ability to control this kind of memory use in the kernel, the memory controller remains a partial solution.

Given that, it should not be surprising that a patch set adding the ability to track and limit kernel memory use exists. What may be a little more surprising is the fact that two independent patch sets exist, each of which adds that feature in its own way. Both were posted for consideration in late February.
The first set was posted by Glauber Costa, the author of the related per-cgroup TCP buffer limits controller. Glauber's patch works at the slab allocator level; only the SLUB allocator is supported at this time. With this approach, developers must explicitly mark a slab cache for usage tracking with this interface:
```c
struct memcg_cache_struct {
	int index;
	struct kmem_cache *cache;
	int (*shrink_fn)(struct shrinker *shrink, struct shrink_control *sc);
	struct shrinker shrink;
};

void register_memcg_cache(struct memcg_cache_struct *cache);
```
Once a slab cache has been passed to register_memcg_cache(), it is essentially split into an array of parallel caches, one belonging to each control group managed by the memory controller. With some added infrastructure, each of these per-cgroup slab caches is able to track how much memory has been allocated from it; this information can be used to cause allocations to fail should the control group's limits be exceeded. More usefully, the controller can, when limits are exceeded, call the shrink_fn() associated with the cache; that function's job is to find memory to free, bringing the control group back below its limit.
Glauber's patch set includes a sample implementation for the dentry cache. When a control group creates enough dentries to run past its limits, the shrinker function can clean some of them up. That may slow down processes in the affected control group, but it should prevent a dentry-intensive process from affecting processes in other control groups.
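For illustration, registration under this interface might look something like the following sketch. The shrinker callback body and the prune_dcache_for_memcg() helper are hypothetical stand-ins, not Glauber's actual dentry-cache code:

```c
/* Illustrative sketch only: wiring a slab cache into Glauber's
 * proposed per-cgroup accounting. The shrinker body is a placeholder. */
static int dentry_memcg_shrink(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	/* Free up to sc->nr_to_scan objects charged to the over-limit
	 * control group; return how many freeable objects remain. */
	return prune_dcache_for_memcg(sc->nr_to_scan);	/* hypothetical */
}

static struct memcg_cache_struct dentry_memcg = {
	.cache     = dentry_cache,	/* the existing dentry slab cache */
	.shrink_fn = dentry_memcg_shrink,
};

static int __init dentry_memcg_init(void)
{
	register_memcg_cache(&dentry_memcg);
	return 0;
}
```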
The second patch set comes from Suleiman Souhlal. Here, too, the slab allocator is the focal point for memory allocation tracking, but this patch works with the "slab" allocator instead of SLUB. One other significant difference with Suleiman's patch is that it tracks allocations from all caches, rather than just those explicitly marked for such tracking. There is a new __GFP_NOACCOUNT flag to explicitly prevent tracking but, as a whole, it's an opt-out system rather than opt-in. One might argue that, if tracking kernel memory usage is important, one should track all of it. But, as Suleiman acknowledges, the ability to track allocations from all caches "is also the main source of complexity in the patchset".
Under this scheme, slab caches operate as usual until an allocation is made from a specific cache while under the control of a specific cgroup. At that point, the cache is automatically split into per-cgroup caches without the intervention (or knowledge) of the caller. Of course, this splitting requires taking locks and allocating memory - activities that can have inconvenient results if the system is running in an atomic context at the time. In such situations, the splitting of the cache will be pushed off into a workqueue while the immediate allocation is satisfied from the pre-split cache. Much of the complexity in Suleiman's patch set comes from this magic splitting that works regardless of the calling context.
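The flow might be sketched roughly as follows; every helper here (current_memcg(), cache_has_memcg_copy(), and friends) is a hypothetical name used only to illustrate the logic described above:

```c
/* Rough sketch of the lazy, per-cgroup cache split; all of the
 * memcg-related helpers below are hypothetical. */
static void *accounted_slab_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	struct mem_cgroup *memcg = current_memcg();

	if (!cache_has_memcg_copy(cachep, memcg)) {
		if (!(flags & __GFP_WAIT)) {
			/* Atomic context: cannot take locks or allocate
			 * here, so queue the split for a workqueue and
			 * satisfy this request from the pre-split cache. */
			queue_memcg_cache_split(cachep, memcg);
			return kmem_cache_alloc(cachep, flags);
		}
		create_memcg_cache_copy(cachep, memcg);	/* may sleep */
	}
	return kmem_cache_alloc(memcg_cache_of(cachep, memcg), flags);
}
```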
There is no shrinker interface in this patch set, though that is clearly planned for the future.
When a control group is deleted, both implementations shift the accounting up to the parent group. That operation, too, can involve some complexity; the processes that performed the allocation may, like their control group, be gone when the allocations are finally freed. Glauber's patch does no tracking for the root control group; as a result of that decision (and some careful programming), the cost of the kernel memory tracking feature is almost zero if it is not actually being used. Suleiman's patch does track usage for the root cgroup, but that behavior can be disabled with a kernel configuration option.
Neither patch appears to be ready for merging into the mainline prior to the 3.5 development cycle - and, probably, not even then. There are a lot of details to be worked out, the mechanism needs to work with both slab and SLUB (at least), and, somehow, the two patch sets need to turn into a single solution. The two developers are talking to each other and express interest in working together, but there will almost certainly need to be guidance from others before the two patches can be combined. If users of this feature feel that tracking allocations from all slab caches is important, then, clearly, whatever is merged will need to have that feature. If, instead, picking a few large users is sufficient, then a solution requiring the explicit marking of caches to be tracked will do. Thus far, there has not been a whole lot of input from people other than the two developers; until that happens, it will be hard to know which approach will win out in the end.
Statistics for the 3.3 development cycle
As of this writing, the 3.3 development cycle is at 3.3-rc6 and things are starting to look pretty stable. So it must be about time for our traditional summary of interesting statistics for the 3.3 kernel. It has been an active cycle, with some 10,350 changesets merged from just over 1,200 developers. Some 563,000 lines of code were added to the kernel, but 395,000 lines were removed, for a net growth of about 168,000 lines.

The most active developers this time around were:
Most active 3.3 developers
By changesets:

| Developer | Changesets | Percent |
|---|---:|---:|
| Mark Brown | 281 | 2.7% |
| Mauro Carvalho Chehab | 271 | 2.6% |
| Axel Lin | 240 | 2.3% |
| Al Viro | 200 | 2.0% |
| Tejun Heo | 123 | 1.2% |
| Tomi Valkeinen | 101 | 1.0% |
| Russell King | 100 | 1.0% |
| Matthew Wilcox | 99 | 1.0% |
| Ben Skeggs | 94 | 0.9% |
| Johannes Berg | 93 | 0.9% |
| Stanislaw Gruszka | 93 | 0.9% |
| Kuninori Morimoto | 92 | 0.9% |
| Eliad Peller | 90 | 0.9% |
| Takashi Iwai | 88 | 0.9% |
| Eric Dumazet | 87 | 0.8% |
| Dan Carpenter | 86 | 0.8% |
| Franky Lin | 77 | 0.8% |
| Kalle Valo | 74 | 0.7% |
| Lars-Peter Clausen | 73 | 0.7% |
| Artem Bityutskiy | 68 | 0.7% |
By changed lines:

| Developer | Lines changed | Percent |
|---|---:|---:|
| Greg Kroah-Hartman | 88664 | 11.7% |
| Stanislaw Gruszka | 38012 | 5.0% |
| Mathieu Desnoyers | 25968 | 3.4% |
| Mauro Carvalho Chehab | 21063 | 2.8% |
| Alan Cox | 20948 | 2.8% |
| Kumar Gala | 12083 | 1.6% |
| Aurelien Jacquiot | 9998 | 1.3% |
| Mark Brown | 9208 | 1.2% |
| Evgeniy Polyakov | 7979 | 1.1% |
| David Daney | 7684 | 1.0% |
| Manuel Lauss | 7316 | 1.0% |
| Kuninori Morimoto | 7115 | 0.9% |
| Dmitry Kasatkin | 6880 | 0.9% |
| Jussi Kivilinna | 6861 | 0.9% |
| Ben Skeggs | 6699 | 0.9% |
| Axel Lin | 6251 | 0.8% |
| Jesse Gross | 5940 | 0.8% |
| Takashi Iwai | 5140 | 0.7% |
| Rob Clark | 4962 | 0.7% |
| Bart Van Assche | 4711 | 0.6% |
Mark Brown regularly appears in the list of top contributors; for 3.3, he contributed large numbers of patches in the sound and multi-function device subsystems. Mauro Carvalho Chehab is usually better known for routing vast numbers of Video4Linux2 changes into the kernel; this time, he wrote a substantial portion of those patches himself. Axel Lin's contributions were also in the sound subsystem. So, in other words, the top three contributors to the 3.3 kernel were all working with multimedia, which is clearly an area with a lot of development going on. Al Viro is not a media developer; his work was mostly cleaning up interfaces deep within the virtual filesystem layer. Tejun Heo continues to dig into code all over the kernel; this time around he fixed up the memblock allocator, made a number of process freezer improvements, reworked the CFQ block I/O scheduler, and made a number of control group changes.
In the lines-changed column, Greg Kroah-Hartman heads the list again. This time around, almost all of his changes were deletions; a lot of code was removed from staging this time around, often because it graduated to the mainline kernel. Stanislaw Gruszka made a lot of changes to the iwlegacy network driver. Mathieu Desnoyers made the list for having added the LTTng tracing subsystem; unfortunately, that code was subsequently removed and will not appear in the 3.3 release. Alan Cox made the top five for his work with the gma500 graphics driver and its move out of the staging tree.
Just over 200 companies have been identified as having supported contributions to the 3.3 kernel. The most active companies this time around were:
Most active 3.3 employers
By changesets:

| Employer | Changesets | Percent |
|---|---:|---:|
| (None) | 1322 | 12.9% |
| Red Hat | 1290 | 12.6% |
| Intel | 897 | 8.8% |
| (Unknown) | 524 | 5.1% |
| Novell | 450 | 4.4% |
| Texas Instruments | 422 | 4.1% |
| IBM | 357 | 3.5% |
| Wolfson Microelectronics | 282 | 2.8% |
| Qualcomm | 249 | 2.4% |
| (Consultant) | 243 | 2.4% |
| MiTAC | 240 | 2.3% |
| Broadcom | 231 | 2.3% |
| Samsung | 216 | 2.1% |
| | 211 | 2.1% |
| Oracle | 183 | 1.8% |
| Freescale | 163 | 1.6% |
| Wizery Ltd. | 111 | 1.1% |
| Parallels | 108 | 1.1% |
| Renesas Electronics | 104 | 1.0% |
| (Academia) | 102 | 1.0% |
By lines changed:

| Employer | Lines changed | Percent |
|---|---:|---:|
| Novell | 113910 | 15.1% |
| Red Hat | 111338 | 14.7% |
| (None) | 81133 | 10.7% |
| Intel | 68378 | 9.0% |
| Texas Instruments | 35696 | 4.7% |
| Samsung | 27220 | 3.6% |
| EfficiOS | 25990 | 3.4% |
| Freescale | 22266 | 2.9% |
| (Unknown) | 19307 | 2.6% |
| (Consultant) | 18529 | 2.5% |
| IBM | 16026 | 2.1% |
| Wolfson Microelectronics | 13688 | 1.8% |
| Qualcomm | 11736 | 1.6% |
| Broadcom | 11180 | 1.5% |
| Mellanox | 8856 | 1.2% |
| Cavium | 7903 | 1.0% |
| Renesas Electronics | 7574 | 1.0% |
| | 7135 | 0.9% |
| MiTAC | 6491 | 0.9% |
| Nicira Networks | 6004 | 0.8% |
This table has yielded few surprises in recent years; for the most part, the companies listed here remain the same from one cycle to the next. The continued growth in contributions from companies in the mobile and embedded areas is worth calling out, though. These companies are not just contributing support for their hardware; increasingly, they are also contributing to the core kernel and driving its evolution in the directions needed for their particular market. Once upon a time, it was common to hear that Linux kernel development was dominated by the needs of large enterprise deployments; few people make that claim now.
One other trend your editor has noted over time is a slow decline in the percentage of changes coming from people working on their own time. Here is a chart showing the numbers for all kernels since 2.6.25:
The numbers are somewhat noisy, but the trend over the last four years suggests that volunteers are not contributing as much as they once were. It is unclear why that might be. One possibility is that the kernel has reached a point where there are few easy jobs left; the complexity of contemporary kernel development may be discouraging volunteers. Or it may simply be that anybody who demonstrates an ability to get code into the kernel tends not to remain a volunteer for long unless that is what they really want to be; all the rest end up getting hired. The truth may be a combination of both - or something else altogether.
Volunteer developers are important; they help tie the kernel to the wider community and some of them will become next year's professional developers and subsystem maintainers. A kernel that is unattractive to volunteers may find itself short of developers in the future. Thus far, there is nothing to suggest that any such developer shortage is happening; the 3.3 kernel, with 1,200 contributors, is as strong as any in that regard. That said, this trend is worth watching.
As a whole, though, the kernel remains a fast-paced and seemingly healthy project. The 3.3 release should happen sometime in mid-March, right on schedule. There is already a lot of interesting code lining up for merging in 3.4; expect to see another set of big numbers when the 3.4 version of this article appears in roughly 80 days' time.
The x86 NMI iret problem
Interrupts are a source of unpredictable concurrency that can cause no end of trouble for kernel developers. Even most kernel hackers, though, do not need to deal with non-maskable interrupts (NMIs), which bring some additional challenges of their own. Some shortcomings in the NMI implementation in x86 processors have long imposed limits on what can be done in NMI handlers. Recently, those limits have been lifted. This article describes the difficulties imposed by NMIs, covers why the related limitations were getting in the way, and discusses the solution in gory detail.

Normal interrupts
The CPU of a computer is a complex machine that appears to process instructions in the order they are laid out in memory. The hardware may optimize the actual order in which instructions are executed but, in a practical sense, the CPU acts as if it is reading the instructions the way the programmer placed them (from the view of the current processor). When an event happens on an external device, such as a USB drive, a network card, or a timer, the device needs to notify the CPU that it must stop its current sequence of instructions and jump to another set of instructions to process the new event. This new sequence of instructions is called a handler, and the device uses an interrupt to notify the CPU.
If the CPU is currently processing instructions that use data that is also used by the interrupt handler when that interrupt comes in, the interrupt handler could corrupt the data that the CPU was in the process of modifying. To prevent this from happening, the programmer disables interrupts for the duration of the critical path that uses the vulnerable data. With normal interrupts, the programmer can synchronize the processing of instructions in the normal workflow of the CPU with the instructions in the interrupt handler using the ability to disable the interrupt.
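In the kernel, that is conventionally done with the local_irq_save()/local_irq_restore() pair; a minimal sketch (shared_counter is a made-up example variable):

```c
#include <linux/irqflags.h>

static unsigned int shared_counter;	/* also updated by an IRQ handler */

static void update_counter(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* mask interrupts on this CPU */
	shared_counter++;		/* the handler cannot run here */
	local_irq_restore(flags);	/* restore the previous IRQ state */
}
```

As the next section explains, though, this protection does nothing against non-maskable interrupts.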
Non-maskable interrupts
There are some special interrupts that can trigger even when the CPU has interrupts disabled. These non-maskable interrupts are used by tools like profilers and watchdogs. A profiler records information about where the CPU is spending its time; because NMIs ignore the interrupt-disable flag, the profiler can also record time spent in code that runs with interrupts disabled. If profiling used normal interrupts, it could not report that time. Similarly, a watchdog needs to detect whether the kernel is stuck in a location where interrupts are disabled. Again, if a watchdog used normal interrupts, it would not be useful in such situations because it would never trigger while interrupts were disabled.
As you can imagine, code that can trigger at any time requires special care when it is written. For one thing, an NMI handler cannot take any locks that are used anywhere else (it can take locks that are used only in NMI context to synchronize NMIs across CPUs, but even that should be avoided if possible). Ideally, an NMI handler should be as simple as possible to prevent race conditions caused by code that is not expecting to be re-entrant.
Although NMIs can trigger when interrupts are disabled, and even when the CPU is processing a normal interrupt, there is one specific time when an NMI will not trigger: when the CPU is processing another NMI. On most architectures, the CPU will not process a second NMI until the first NMI has finished. When an NMI triggers and calls the NMI handler, new NMIs must wait until the handler of the first NMI has completed. NMI handlers thus do not need to worry about nesting, and the Linux NMI handlers are written with this fact in mind.
The x86 NMI iret flaw
![Stack layout](https://static.lwn.net/images/2012/stack-s.png)
On x86, like other architectures, the CPU will not execute another NMI until the first NMI is complete. The problem with the x86 architecture, with respect to NMIs, is that an NMI is considered complete when an iret instruction is executed. iret is the x86 instruction that is used to return from an interrupt or exception. When an interrupt or exception triggers, the hardware will automatically load information onto the stack that will allow the handler to return back to what it interrupted in the state that it was interrupted. The iret instruction will use the information on the stack to reset the state.
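For reference, the frame that the x86_64 hardware pushes (and that iret consumes) can be pictured as a small structure, lowest address first; this is a descriptive sketch, not a type from the kernel source:

```c
/* The hardware-pushed interrupt stack frame on x86_64; iret pops
 * these five values to restore the interrupted state. */
struct hw_stack_frame {
	unsigned long ip;	/* RIP: where to resume execution */
	unsigned long cs;	/* code segment of the interrupted code */
	unsigned long flags;	/* RFLAGS at the time of the interrupt */
	unsigned long sp;	/* RSP: the interrupted stack pointer */
	unsigned long ss;	/* stack segment */
};
```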
The flaw on x86 is that an NMI will be considered complete if an exception is taken during the NMI handler, because the exception will return with an iret. If the NMI handler triggers either a page fault or a breakpoint, the iret used to return from those exceptions will re-enable NMIs. The NMI handler will not be put back into the state that it was in when the exception triggered; instead, it will be put into a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in, it will jump into code that is not designed for re-entrancy. Even worse, on x86_64, when an NMI triggers, the stack pointer is set to a fixed (per-CPU) address. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack. The result is a very nasty crash on return to the original NMI handler. The NMI handler for i386 uses the current kernel stack, as normal interrupts do, and does not have this specific problem.
A common example of where this can be seen is adding a stack dump of all tasks to an NMI handler. To debug lockups, a kernel developer may put a call to show_state() (which shows the state of all tasks, as sysrq-t does) into the NMI watchdog handler. When the watchdog detects that the system is locked up, show_state() triggers, showing the stack trace of all tasks. The reading of each task's stack is done carefully, because a stack frame may point to a bad memory area, which will trigger a page fault.

The kernel expects that a fault may happen here and handles it appropriately. But the page fault handler still executes an iret instruction, which re-enables NMIs. The print-out of all the tasks may take some time, especially if it is going out over a serial port, making it quite likely that another NMI will trigger before the output is complete and crash the system. The poor developer will be left with a partial dump lacking a backtrace of all the tasks. There is a good chance that the task that caused the problem will not be displayed, and the developer will have to come up with another means to debug the problem.
Because of this x86 NMI iret flaw, NMI handlers must neither trigger a page fault nor hit a breakpoint. It may sound like page faults should not be an issue, but this restriction prevents NMI handlers from using memory allocated by vmalloc(). The vmalloc() code in the kernel maps virtual memory in the kernel address space. The problem is that the memory is mapped into a task's page table when it is first used. If an NMI handler uses the memory, and that happens to be the first time the current task (the one executing when the NMI took place) referenced the memory, it will trigger a page fault.
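A sketch of the trap described here; the handler and record_sample() are hypothetical names used only for illustration:

```c
#include <linux/vmalloc.h>

static unsigned long *nmi_buffer;

static int __init buffer_init(void)
{
	/* The mapping is created in init_mm; it is propagated into a
	 * process's page tables only when that process faults on it. */
	nmi_buffer = vmalloc(PAGE_SIZE);
	return nmi_buffer ? 0 : -ENOMEM;
}

static void my_nmi_handler(void)
{
	/* If the interrupted task has never touched this area, the
	 * dereference page-faults; the fault's iret then re-enables
	 * NMIs, exposing the flaw described above. */
	record_sample(nmi_buffer[0]);	/* hypothetical */
}
```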
![Stack layout](https://static.lwn.net/images/2012/stack2-s.png)
As breakpoints also return with an iret, they must not be placed in NMI handlers either. This prevents kprobes from being placed in NMI handlers. Kprobes are used by ftrace, perf, and several other tracing tools to insert dynamic tracepoints into the kernel. But if a kprobe were added to a function called by an NMI handler, that handler could become re-entrant due to the iret executed by the breakpoint handler.
Why do we care?
For years NMIs were not allowed to take page faults or hit breakpoints; why do we want them to today? In July 2010, the issue came up on linux-kernel, when Mathieu Desnoyers proposed solving the problem of using vmalloc() memory in NMIs. Desnoyers's solution was to make the page fault handler NMI-aware. On return from a page fault, the handler would check whether the fault was triggered in NMI context and, if so, not execute an iret but instead use a normal ret instruction. ret is the x86 instruction to return from a function; unlike iret, it only pops the return address off the stack and does not put the system back into its original state. In Desnoyers's solution, the state would be restored directly, with added instructions to get back to the NMI handler from the page fault without the need for an iret.
Linus Torvalds was not happy with this solution. NMIs, because they can happen anywhere, need special treatment that is unlike other areas of the kernel. Torvalds did not want that treatment to spread to other areas of the kernel, such as the page fault handler. He preferred to make the NMI code even more complex, but to at least contain it only to the NMI handler. NMIs are a special case anyway, and not used for the normal operation of the kernel, whereas page faults are a crucial hot path in the kernel and should not be encumbered with non-important NMI handling.
The immediate solution was to change perf so that it did not need vmalloc() memory within its NMI handler. Of course, Desnoyers's goal was not just to fix perf, but to give LTTng the ability to use vmalloc() memory in an NMI handler. But handling page faults in the NMI handler is not the only reason to fix the x86 NMI iret problem. There is also a strong reason to allow NMI handlers to use breakpoints.
Removing stop machine
There are a few areas in the kernel that require the use of stop_machine(), which is one of the most intrusive acts that the kernel can do to the system. In short, a call to stop_machine() stops execution on all other CPUs so that the calling CPU has exclusive access to the entire system. For machines with thousands of CPUs, a single call to stop_machine() can introduce a very large latency. Currently one of the areas that uses stop_machine() is the runtime modification of code.
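For reference, a caller of this interface looks roughly like the following; the patch descriptor and apply_code_change() are hypothetical, but stop_machine() itself is the real kernel API:

```c
#include <linux/stop_machine.h>

struct patch_desc;			/* hypothetical descriptor */
void apply_code_change(void *data);	/* hypothetical */

/* Runs on one CPU while every other CPU spins with interrupts disabled. */
static int patch_kernel_text(void *data)
{
	apply_code_change(data);
	return 0;
}

static int do_patch(struct patch_desc *desc)
{
	/* A NULL cpumask lets stop_machine() pick any CPU to run on. */
	return stop_machine(patch_kernel_text, desc, NULL);
}
```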
The Linux kernel has a history of using self-modifying code - that is, code that changes itself at run time. For example, distributions do not like to ship more than one kernel, so self-modifying code is used to change the kernel at boot to optimize it for its environment. In the old days, distributions would ship one kernel for uniprocessor machines and another for multiprocessor machines; the same was true for paravirtualized kernels (those that can only run as guests) and kernels running on real hardware. Because the cost of maintaining multiple kernels is quite high, work has been done to modify the kernel at boot: if it finds itself running on a uniprocessor machine, spinlocks and other multiprocessor synchronization instructions are changed into nops; if the kernel is loaded as a guest in a paravirtualized environment, its low-level instructions are converted to use hypercalls.
Modifying code at boot time is not that difficult. The modifications are performed early on, before other processors are initialized and before other services are started. At this stage of the boot process, the system is just like a uniprocessor system. Changing the instruction text is simple, as there is no need to worry about flushing the caches of other processors.
Today, there are several utilities in the Linux kernel that modify the code after boot. These modifications can happen at any time, generally due to actions by the system's administrator. The ftrace function tracer can change the nops that are stubbed at the beginning of almost every function into a call to trace those functions. Netfilter, which is used by iptables, uses jump labels to enable and disable filtering of network packets. Tracepoints used by both perf and ftrace also use jump labels to keep the impact of tracepoints to a minimum when they are not enabled. Kprobes uses breakpoints to place dynamic tracepoints into the code, but when possible it will modify the code into a direct jump in order to optimize the probe.
Modifying code at run time takes much more care than modifying code during boot. On x86 and some other architectures, if code is modified on one CPU while it is being executed on another CPU, it can generate a General Protection Fault (GPF) on the CPU executing the modified code. This usually results in a system crash. The way to get around this is to call stop_machine() to have all CPUs stop what they are doing in order to let a single CPU modify the code as if it were a uniprocessor. It gets a little more complex to handle NMIs happening on the stopped CPUs but that's out of scope for this article.
Being able to modify code without stop_machine() is a very desirable goal. There happens to be a way to do the modification without requiring that the rest of the system stop what it is doing and wait for the modification to finish. That solution requires the use of breakpoints.
The way it works is to insert a breakpoint at the location that will be changed. A breakpoint on x86 is only one byte; the instruction being changed is usually five bytes, as it is a jump to some location or a five-byte nop. The breakpoint, being a single byte, can be substituted for the first byte of the instruction without disrupting the other CPUs. The breakpoint is inserted on the first byte of the instruction and, if another CPU hits that instruction, it will trigger the breakpoint; the breakpoint handler will simply return to the next instruction, skipping the instruction that is in the process of being changed.
Ftrace nop
```
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
0f 1f 44 00 00          nop    (5 bytes)
65 48 8b 04 25 80 c8    mov    %gs:0xc880,%rax
```
Add breakpoint
```
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
cc 1f 44 00 00          <brk>  nop
65 48 8b 04 25 80 c8    mov    %gs:0xc880,%rax
```

After the insertion of the breakpoint, a sync of all CPUs is required in order to make sure that the breakpoint can be seen across the CPUs. To synchronize the CPUs, an interprocessor interrupt (IPI) with an empty handler is sent to all the other CPUs. The interrupt on a CPU will flush the instruction pipeline. When another CPU reads the breakpoint it will jump to the breakpoint handler without processing the other 4 bytes of the instruction that is about to be updated. The handler will set the instruction pointer to return to the instruction after the one being modified. This keeps the modification of the rest of the instruction out of the view of the other CPUs.
After all the other CPUs have had their pipelines flushed by the IPI sent to them, the rest of the instruction (4 bytes) may be modified:
Replace end of instruction
```
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
cc af 71 00 00          <brk>  <mcount>
65 48 8b 04 25 80 c8    mov    %gs:0xc880,%rax
```
Another sync is called across the CPUs. Then the breakpoint is removed and replaced with the first byte of the new instruction:
Remove breakpoint with new instruction
```
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
e8 af 71 00 00          callq  <mcount>
65 48 8b 04 25 80 c8    mov    %gs:0xc880,%rax
```
This works because adding and removing a breakpoint does not cause a GPF on other CPUs. The syncs are required between each step because the other CPUs must always have a consistent view of the rest of the instruction. Since tracepoints and function tracing modify code that may be executed inside an NMI handler, breakpoints must be allowed to be hit within the handler.
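Putting the three steps together, the sequence can be summarized in C-like pseudocode; text_write() and sync_all_cpus() are hypothetical stand-ins for the arch-level implementation:

```c
/* Pseudocode for the breakpoint-based patching sequence described
 * above; text_write() and sync_all_cpus() are hypothetical helpers. */
static void patch_instruction(unsigned char *addr,
			      const unsigned char *new_insn, int len)
{
	const unsigned char brk = 0xcc;		/* int3, one byte */

	text_write(addr, &brk, 1);		/* step 1: arm breakpoint */
	sync_all_cpus();			/* IPI flushes pipelines */

	text_write(addr + 1, new_insn + 1, len - 1); /* step 2: new tail */
	sync_all_cpus();

	text_write(addr, new_insn, 1);		/* step 3: new first byte */
	sync_all_cpus();
}
```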
Handling the x86 NMI iret flaw
Torvalds did not just reject Desnoyers's proposal and leave us without a solution; he came up with one himself. Torvalds's solution was to create a per-CPU pointer to the NMI stack frame in use. When an NMI came in, it would check that pointer: if it was NULL, the NMI was not nested; the handler would update the pointer to hold its return stack frame and continue to process the NMI. This part of the NMI would never be in danger of nesting, as no breakpoints or page faults could happen there. If, instead, the stack frame pointer was already set, the new NMI was a nested NMI - that is, a previous NMI had triggered an exception that returned with an iret, allowing another NMI to nest. The nested NMI would then update the data in the per-CPU NMI frame pointer such that the interrupted NMI would fault on its return.
Then, the nested NMI would return back to the previous NMI without executing an iret, keeping the CPU in NMI context (i.e. preventing new NMIs). When the first NMI returned, it would trigger a fault. In hardware, if an NMI triggers while the CPU is handling a previous NMI (before an iret is issued), the NMI sets a latch, which is released when the iret is issued; that causes the CPU to run the NMI handler again. In Torvalds's solution, the fault from the iret would act as a software latch, and the fault handler would re-run the NMI handler. The atomic nature of iret would also prevent races when returning from the first NMI.
Torvalds's solution seemed like the perfect workaround until an effort was made to implement it. Torvalds suggested having the nested NMI handler cause the preempted NMI handler to fault when it issued the iret, then have the fault handler for the iret repeat the NMI as it would only fault if a nested NMI had happened. Unfortunately, this is not the case.
The iret of all exceptions, including NMIs, already has a fault handler. User-space applications can set their stack pointer to any arbitrary value; as long as the application does not reference the stack pointer, it will run fine. If an interrupt or NMI comes in, though, the iret that resets the stack pointer may fault because the user-space application had a bad stack pointer. Thus the iret already has a fault handler to cover this case, and entry into that fault handler can be caused not only by nested NMIs but by other cases as well. Determining which case actually occurred is not a trivial task.
Another issue was that this approach required access to per-CPU data. That is a bit tricky from the NMI handler because of the way Linux implements per-CPU data on x86: that data is referenced via the %gs register. Because NMIs can trigger anywhere, it takes a bit of work to validate the %gs register - too much wasted effort just to find out whether the NMI is nested or not.
So, in coming up with a solution to the problem, it is best not to go with a faulting iret; other tricks are available to the NMI handler. Because the NMI stack is per-CPU, Peter Zijlstra suggested using a portion of that stack to save a variable, which essentially makes it a poor man's per-CPU variable. When the first NMI comes in, it copies its interrupt stack frame (the information needed to return back to the state that it interrupted) onto its stack - not just once, but twice. Then it sets the special variable on the stack; this variable is set whenever an NMI is processing its handler. On return, the NMI clears the variable and executes the iret to return back to the state that it interrupted.
Now if the NMI triggers a page fault or hits a breakpoint, the iret of the exception re-enables NMIs. If another NMI comes in after that happens, it will first check the special variable on the stack. If the variable is set, then this is definitely a nested NMI handler and a jump is made to the code to handle a nested NMI. Astute readers will realize that this is not enough. What happens if the nested NMI triggers after the first NMI cleared the variable but before it returned with the iret?
If the nested NMI were to continue as if it were not nested, it would still corrupt the first NMI's stack, and the first NMI's iret would incorrectly read the nested NMI's stack frame. So, if the variable is not set, the saved stack pointer on the stack is examined; if that stack pointer is within the current stack (the per-CPU NMI stack), then this too can be considered a nested NMI. Note that checking the stack alone may not be enough, because there are cases where the NMI handler may change its stack pointer. Both the special on-stack variable and the location of the interrupted stack must be examined before determining that this is a nested NMI.
![](https://static.lwn.net/images/2012/stack-nest-nmi-1-s.png)
Processing a nested NMI
OK, so it is determined that the NMI that came in is nested, now what? To simulate the hardware's behavior, the first NMI must somehow be notified that it must repeat the NMI handler. To perform this atomically, the return stack of the first NMI is updated to point into a trampoline. This is similar to what Torvalds proposed, except that, instead of using the exception handling of a faulting iret, the information on the stack is updated such that the iret will simply jump to the code to handle a nested NMI.
Note that a nested NMI must not update the previous NMI if the previous NMI is executing in this trampoline area. The instruction pointer is examined by the nested NMI and, if it is determined that it preempted a previous NMI while that NMI was on the trampoline, it simply returns without doing anything; the previous NMI is about to trigger another NMI handler anyway. This is still similar to what the hardware does: if more than one NMI triggers while the CPU is processing an NMI, only one NMI will be repeated. The first NMI can only be on the trampoline if it was previously interrupted by a nested NMI, so the second NMI that happened while the first was on the trampoline may be discarded, just as the hardware would discard the second of two NMIs triggering while a previous NMI was being processed.
Remember that the initial NMI saved its stack frame twice, which means there are three copies of the interrupt stack frame. The first frame is written by the hardware on entry to the NMI; if there is a nested NMI, this frame will be overwritten by it. One copy is used to return from the NMI handler; if there is a nested NMI, it will update that copy so that the first NMI does not return to where it triggered, but instead to the trampoline that sets up the repeat of the NMI. In that case, the third copy of the interrupt stack frame is needed: it is copied back into the location that the NMI will use to return to the place where the first NMI triggered.
The trampoline sets the special variable on the stack again, to notify new NMIs coming in that an NMI is in progress, and then jumps back to start the next NMI. At this point NMIs are still enabled, because the nested NMI used an iret rather than the ret trick to keep the CPU in NMI context. The reason for not doing so is that it would make the code even more complex; now that there is a safe way to handle nested NMIs, there is really no reason to prevent new ones after a previous NMI has triggered. Future code may change things to keep NMI context when returning back to the first NMI from a nested NMI.
The above craziness was the solution for x86_64; what about i386 (x86_32)? As i386 does not have a separate per-CPU stack for the NMI, and just uses the current kernel stack wherever it was interrupted, the solution is pretty straightforward and, even better, it is handled in C code. On entry to the C NMI handler (called from assembly), a check is made against a per-CPU variable to see whether this NMI is nested. If it is, the variable is switched to NMI_LATCHED and the handler returns; otherwise the variable is set to NMI_EXECUTING. On exit from the NMI, a cmpxchg() is used to atomically update the per-CPU variable from NMI_EXECUTING to NMI_NOT_RUNNING; if the update fails, it means that a nested NMI came in and the C handler must be repeated.
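In C, that scheme looks roughly like the following simplified sketch (the state names come from the article; handle_nmi() is a hypothetical stand-in for the real NMI work):

```c
/* Simplified sketch of the i386 nested-NMI latch described above. */
enum nmi_states { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };
static DEFINE_PER_CPU(enum nmi_states, nmi_state);

void do_nmi(struct pt_regs *regs)
{
	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
		/* Nested NMI: latch it and let the first one repeat. */
		this_cpu_write(nmi_state, NMI_LATCHED);
		return;
	}
restart:
	this_cpu_write(nmi_state, NMI_EXECUTING);

	handle_nmi(regs);	/* the real NMI work (hypothetical name) */

	/* If a nested NMI latched while we ran, go around again. */
	if (this_cpu_cmpxchg(nmi_state, NMI_EXECUTING, NMI_NOT_RUNNING)
	    != NMI_EXECUTING)
		goto restart;
}
```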
Conclusion
NMI handling has always been a bane of kernel development; NMIs are an anomaly that has cost many fine developers hours of sleep. Linus Torvalds is correct: it is best to keep the beast contained. This code may sound like a horrible hack but, because it is contained tightly within the NMI source, it is actually an elegant solution. If critical code throughout the kernel had to accommodate NMIs, the kernel would soon become an unmaintainable disaster. Keeping the NMI code in one place also makes it clear where to look if a problem arises. All this work has been well worth the effort, as it has opened the doors to the removal of yet another bane of the kernel source: stop_machine().