The x86 NMI iret problem
Normal interrupts
The CPU of a computer is a complex machine that appears to process instructions in the order in which they are laid out in memory. The hardware may optimize the actual order in which instructions are executed but, in a practical sense, the CPU acts as if it is executing the instructions the way the programmer placed them (from the point of view of the current processor). When an event happens on an external device, such as a USB drive, a network card, or a timer, the device needs to notify the CPU that it must stop its current sequence of instructions and jump to another set of instructions to process the new event. This new sequence of instructions is called a handler, and the device uses an interrupt to notify the CPU.
If an interrupt comes in while the CPU is processing instructions that use data also used by the interrupt handler, the handler could corrupt the data that the CPU was in the middle of modifying. To prevent this from happening, the programmer disables interrupts for the duration of the critical section that uses the vulnerable data. With normal interrupts, the ability to disable the interrupt lets the programmer synchronize the instructions in the CPU's normal workflow with the instructions in the interrupt handler.
Non-maskable interrupts
There are some special interrupts that can trigger even when the CPU has interrupts disabled. These non-maskable interrupts (NMIs) are used by facilities like profilers and watchdogs. For profiling, information about where the CPU is spending its time is recorded and, because NMIs ignore the interrupt-disabled state, the profiler can record time spent with interrupts disabled; if profiling used normal interrupts, it could not report that time. Similarly, a watchdog needs to detect if the kernel is stuck in a location where interrupts are disabled. Again, if a watchdog used normal interrupts, it would not be useful in such situations, because it would never trigger while interrupts were disabled.
As you can imagine, having code that can trigger at any time needs special thought in writing. For one thing, it cannot take any locks that are used anywhere else (although it can take locks that are used only in NMI context to synchronize NMIs across CPUs, but that should be avoided if possible). Ideally, an NMI handler should be as simple as possible to prevent race conditions that can be caused by code that is not expecting to be re-entrant.
Although NMIs can trigger when interrupts are disabled, and even when the CPU is processing a normal interrupt, there is one specific time when an NMI will not trigger: when the CPU is processing another NMI. On most architectures, the CPU will not process a second NMI until the first NMI has finished. When an NMI triggers and calls the NMI handler, new NMIs must wait until the handler of the first NMI has completed. NMI handlers thus do not need to worry about nesting, and the Linux NMI handlers are written with this fact in mind.
The x86 NMI iret flaw
![Stack layout](https://static.lwn.net/images/2012/stack-s.png)
On x86, like other architectures, the CPU will not execute another NMI until the first NMI is complete. The problem with the x86 architecture, with respect to NMIs, is that an NMI is considered complete when an iret instruction is executed. iret is the x86 instruction that is used to return from an interrupt or exception. When an interrupt or exception triggers, the hardware will automatically load information onto the stack that will allow the handler to return back to what it interrupted in the state that it was interrupted. The iret instruction will use the information on the stack to reset the state.
The flaw on x86 is that an NMI will be considered complete if an exception is taken during the NMI handler, because the exception will return with an iret. If the NMI handler triggers either a page fault or a breakpoint, the iret used to return from those exceptions will re-enable NMIs. The NMI handler will not be put back to the state that it was in when the exception triggered, but instead will be put into a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in, it will jump into code that is not designed for re-entrancy. Even worse, on x86_64, when an NMI triggers, the stack pointer is set to a fixed (per-CPU) address. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack. The result is a very nasty crash on return to the original NMI handler. The NMI handler for i386 uses the current kernel stack, like normal interrupts do, and does not have this specific problem.
A common example where this can be seen is to add a stack dump of a task into an NMI handler. To debug lockups, a kernel developer may put in a show_state() (shows the state of all tasks like the sysrq-t does) into the NMI watchdog handler. When the watchdog detects that the system is locked up, the show_state() triggers, showing the stack trace of all tasks. The reading of the stack of all tasks is carefully done because a stack frame may point to a bad memory area, which will trigger a page fault.
The kernel expects that a fault may happen here and handles it appropriately. But the page fault handler still executes an iret instruction. This will re-enable NMIs. The print-out of all the tasks may take some time, especially if it is going out over the serial port. This makes it highly possible for another NMI to trigger before the output is complete, causing the system to crash. The poor developer will be left with a partial dump and not have a backtrace of all the tasks. There is a good chance that the task that caused the problem will not be displayed, and the developer will have to come up with another means to debug the problem.
Because of this x86 NMI iret flaw, NMI handlers must neither trigger a page fault nor hit a breakpoint. It may sound like page faults should not be an issue, but this restriction prevents NMI handlers from using memory allocated by vmalloc(). The vmalloc() code in the kernel maps virtual memory in the kernel address space. The problem is that the memory is mapped into a task's page table when it is first used. If an NMI handler uses the memory, and that happens to be the first time the current task (the one executing when the NMI took place) referenced the memory, it will trigger a page fault.
![Stack layout](https://static.lwn.net/images/2012/stack2-s.png)
As breakpoints also return with an iret, they must not be placed in NMI handlers either. This prevents kprobes from being placed in NMI handlers. Kprobes are used by ftrace, perf, and several other tracing tools to insert dynamic tracepoints into the kernel. But if a kprobe were added to a function called by an NMI handler, the handler could be re-entered due to the iret executed by the breakpoint handler.
Why do we care?
For years NMIs were not allowed to take page faults or hit breakpoints; why do we want them to today? In July 2010, the issue came up on linux-kernel, when Mathieu Desnoyers proposed solving the problem of using vmalloc() memory in NMIs. Desnoyers's solution was to have the page fault handler become NMI-aware. On return from a page fault, the handler would check if it was triggered in NMI context and, if so, not do an iret but instead use a normal ret instruction. The ret instruction is the x86 instruction used to return from a function; unlike iret, ret only pops the return address off the stack and does not put the system back into its original state. In Desnoyers's solution, the state would be restored directly, with added instructions to get back to the NMI handler from the page fault without the need for an iret.
Linus Torvalds was not happy with this solution. NMIs, because they can happen anywhere, need special treatment that is unlike other areas of the kernel. Torvalds did not want that treatment to spread to other areas of the kernel, such as the page fault handler. He preferred to make the NMI code even more complex, but to at least contain it only to the NMI handler. NMIs are a special case anyway, and not used for the normal operation of the kernel, whereas page faults are a crucial hot path in the kernel and should not be encumbered with non-important NMI handling.
The immediate solution was to change perf to not have to use vmalloc() memory within its NMI handler. Of course Desnoyers's goal was not to just fix perf, but to give LTTng the ability to use vmalloc() memory in an NMI handler. But handling page faults in the NMI handler is not the only reason to fix the x86 NMI iret problem. There is also strong reason to allow NMI handlers to use breakpoints.
Removing stop machine
There are a few areas in the kernel that require the use of stop_machine(), which is one of the most intrusive acts that the kernel can do to the system. In short, a call to stop_machine() stops execution on all other CPUs so that the calling CPU has exclusive access to the entire system. For machines with thousands of CPUs, a single call to stop_machine() can introduce a very large latency. Currently one of the areas that uses stop_machine() is the runtime modification of code.
The Linux kernel has a history of using self-modifying code; that is, code that changes itself at run time. For example, distributions do not like to ship more than one kernel, so self-modifying code is used to change the kernel at boot to optimize it for its environment. In the old days, distributions would ship a separate kernel for uniprocessor machines and another for multiprocessor machines. The same is true for a paravirtual kernel (one that can only run as a guest) and a kernel to run on real hardware. Because the cost of maintaining multiple kernels is quite high, work has been done to modify the kernel at boot if it finds that it is running on a uniprocessor machine (spin locks and other multiprocessor synchronization are changed into nops). If the kernel is loaded as a virtual guest in a paravirtualized environment, it will convert the kernel's low-level instructions to use hypercalls.
Modifying code at boot time is not that difficult. The modifications are performed early on before other processors are initialized and before other services are started. At this stage of the boot process the system is just like a uniprocessor system. Changing the instruction text is simple as there is no worry about needing to flush the caches of other processors.
Today, there are several utilities in the Linux kernel that modify the code after boot. These modifications can happen at any time, generally due to actions by the system's administrator. The ftrace function tracer can change the nops that are stubbed at the beginning of almost every function into a call to trace those functions. Netfilter, which is used by iptables, uses jump labels to enable and disable filtering of network packets. Tracepoints used by both perf and ftrace also use jump labels to keep the impact of tracepoints to a minimum when they are not enabled. Kprobes uses breakpoints to place dynamic tracepoints into the code, but when possible it will modify the code into a direct jump in order to optimize the probe.
Modifying code at run time takes much more care than modifying code during boot. On x86 and some other architectures, if code is modified on one CPU while it is being executed on another CPU, it can generate a General Protection Fault (GPF) on the CPU executing the modified code. This usually results in a system crash. The way to get around this is to call stop_machine() to have all CPUs stop what they are doing in order to let a single CPU modify the code as if it were a uniprocessor. It gets a little more complex to handle NMIs happening on the stopped CPUs but that's out of scope for this article.
Being able to modify code without stop_machine() is a very desirable result. There happens to be a way to do the modification without requiring that the rest of the system stop what it is doing and wait for the modification to finish. That solution requires the use of breakpoints.
The way it works is to insert a breakpoint at the location that will be changed. A breakpoint on x86 is only one byte. The instruction that is changed is usually 5 bytes, as it is a jump to some location or a 5 byte nop. The breakpoint, being a single byte, may be substituted as the first byte of the instruction without disrupting the other CPUs. The breakpoint is inserted on the first byte of the instruction and, if another CPU hits that instruction, it will trigger the breakpoint and the breakpoint handler will simply return to the next instruction, skipping the instruction that is in the process of being changed.
Ftrace nop
```
55                    push   %rbp
48 89 e5              mov    %rsp,%rbp
0f 1f 44 00 00        nop    (5 bytes)
65 48 8b 04 25 80 c8  mov    %gs:0xc880,%rax
```
Add breakpoint
```
55                    push   %rbp
48 89 e5              mov    %rsp,%rbp
cc 1f 44 00 00        <brk>  nop
65 48 8b 04 25 80 c8  mov    %gs:0xc880,%rax
```

After the insertion of the breakpoint, a sync of all CPUs is required in order to make sure that the breakpoint can be seen across the CPUs. To synchronize the CPUs, an interprocessor interrupt (IPI) with an empty handler is sent to all the other CPUs. The interrupt on a CPU will flush its instruction pipeline. When another CPU reads the breakpoint, it will jump to the breakpoint handler without processing the other four bytes of the instruction that is about to be updated. The handler will set the instruction pointer to return to the instruction after the one being modified. This keeps the modification of the rest of the instruction out of the view of the other CPUs.
After all the other CPUs have had their pipelines flushed by the IPI sent to them, the rest of the instruction (four bytes) may be modified:
Replace end of instruction
```
55                    push   %rbp
48 89 e5              mov    %rsp,%rbp
cc af 71 00 00        <brk>  <mcount>
65 48 8b 04 25 80 c8  mov    %gs:0xc880,%rax
```
Another sync is called across the CPUs. Then the breakpoint is removed and replaced with the first byte of the new instruction:
Remove breakpoint with new instruction
```
55                    push   %rbp
48 89 e5              mov    %rsp,%rbp
e8 af 71 00 00        callq  <mcount>
65 48 8b 04 25 80 c8  mov    %gs:0xc880,%rax
```
This works because adding and removing a breakpoint does not cause a GPF on other CPUs. The syncs are required between each step because the other CPUs must always have a consistent view of the rest of the instruction. Since tracepoints and function tracing can change code that is executed within NMI handlers, breakpoints must be allowed within those handlers.
Handling the x86 NMI iret flaw
Torvalds did not just reject Desnoyers's proposal and leave us without a solution; he came up with one himself. Torvalds's solution was to create a per-CPU pointer to the NMI stack frame. When an NMI came in, it would check the per-CPU NMI stack frame pointer. If the pointer was NULL, the NMI was not nested; the handler would set the pointer to its return stack frame and continue to process the NMI. This part of the NMI would never be in danger of nesting, as no breakpoints or page faults could happen here. If, instead, the stack frame pointer was already set, the new NMI was a nested NMI; that is, a previous NMI had triggered an exception that returned with an iret, allowing another NMI to nest. The nested NMI would then update the data in the per-CPU NMI frame pointer such that the interrupted NMI would fault on its return.
Then, the nested NMI would return back to the previous NMI without doing an iret, keeping the CPU in NMI context (i.e. preventing new NMIs). When the first NMI returned, it would trigger a fault. In hardware, if an NMI triggers while the CPU is handling a previous NMI (before an iret is issued), the NMI sets a latch, which is released when the iret is issued; that causes the CPU to run the NMI handler again. In Torvalds's solution, the fault from the iret would act like a software latch, and the fault handler would re-run the NMI handler. The atomic nature of an iret would also prevent races when returning from the first NMI.
Torvalds's solution seemed like the perfect workaround until an effort was made to implement it. Torvalds suggested having the nested NMI handler cause the preempted NMI handler to fault when it issued the iret, then have the fault handler for the iret repeat the NMI as it would only fault if a nested NMI had happened. Unfortunately, this is not the case.
The iret of all exceptions, including NMIs, already has a fault handler. User-space applications can set their stack pointer to any arbitrary value; as long as the application does not reference the stack pointer, it will run fine. If an interrupt or NMI comes in, though, the iret that resets the stack pointer may fault because the user-space application had a bad stack pointer. Thus the iret already has a fault handler to handle this case, and entering the fault handler from an iret can be caused not only by nested NMIs but by other cases as well. Determining which case actually occurred is not a trivial task.
Another issue was that it required access to per-CPU data. This is a
bit tricky from the NMI handler because of the way Linux implements per-CPU
data on x86. That data is referenced by the %gs register. Because NMIs
can trigger anywhere, it takes a bit of work to validate the %gs
register. That would make for too much wasted effort just to know if the NMI is
nested or not.
So, in coming up with a solution to the problem, it is best not to go with a faulting iret. Instead, other tricks are available to the NMI handler. Because the NMI stack is per-CPU, Peter Zijlstra suggested using a portion of that stack to save a variable, which essentially makes it into a poor-man's per-CPU variable. When the first NMI comes in, it will copy its interrupt stack frame (the information needed to return back to the state that it interrupted) onto its stack. Not just once, but it will make two copies! Then it will set the special variable on the stack; this variable is set while an NMI is processing its handler. On return, it will clear the variable and call the iret to return back to the state that it interrupted.
Now if the NMI triggers a page fault or hits a breakpoint, the iret of the exception re-enables NMIs. If another NMI comes in after that happens, it will first check the special variable on the stack. If the variable is set, then this is definitely a nested NMI handler and a jump is made to the code to handle a nested NMI. Astute readers will realize that this is not enough. What happens if the nested NMI triggers after the first NMI cleared the variable but before it returned with the iret?
If the nested NMI were to continue as if it were not nested, it would still corrupt the first NMI's stack, and the first NMI's iret would incorrectly read the nested NMI's stack frame. If the variable is not set, then the saved stack pointer on the stack is examined. If that stack pointer is within the current stack (the per-CPU NMI stack), then this can be considered a nested NMI. Note that just checking the stack may not be enough, because there are cases where the NMI handler may change its stack pointer. Both the special on-stack variable and the location of the interrupted stack need to be examined before determining that this is a nested NMI.
![Nested NMI stack layout](https://static.lwn.net/images/2012/stack-nest-nmi-1-s.png)
Processing a nested NMI
OK, so it is determined that the NMI that came in is nested, now what? To simulate the hardware's behavior, the first NMI must somehow be notified that it must repeat the NMI handler. To perform this atomically, the return stack of the first NMI is updated to point into a trampoline. This is similar to what Torvalds proposed, except that, instead of using the exception handling of a faulting iret, the information on the stack is updated such that the iret will simply jump to the code to handle a nested NMI.
Note that a nested NMI must not update the previous NMI if the previous NMI
is executing in this trampoline area. The instruction pointer is examined
by the nested NMI, and if it is determined that it preempted a previous NMI
while it was on the trampoline it simply returns without doing anything.
The previous NMI is about to trigger another NMI handler anyway. This is
still similar to what the hardware does, as if more than one NMI were to
trigger while the CPU was processing an NMI, only one NMI would be
repeated. The first NMI can only be on the trampoline if it was previously
interrupted by a nested NMI, thus the second NMI that happened while the
first is on the trampoline may be discarded like the hardware would do with
the second of two NMIs triggering while processing a previous NMI.
Remember that the initial NMI saved its stack frame twice, which means it has three copies of the interrupt stack frame. The first frame is written by the hardware on entry to the NMI; if there is a nested NMI, this frame will be overwritten by it. One copy is used to return from the NMI handler; if there is a nested NMI, it will update that copy so that the first NMI returns not to where it triggered, but to the trampoline that sets up the repeat of the NMI. In that case, the third copy of the interrupt stack frame is needed to restore the copy that the NMI will use to return to the location where the first NMI triggered.
The trampoline will set the special variable on the stack again to notify new NMIs coming in that an NMI is in progress, and then jump back to start the next NMI again. This time NMIs are still enabled, because the nested NMI still uses an iret and doesn't use the ret trick to keep the CPU in NMI context. The reason for not doing so is that it would make the code even more complex. Now that there's a safe way to handle nested NMIs, there's really no reason to prevent new ones after a previous NMI triggered. Future code may change things to keep NMI context when returning back to the first NMI from a nested NMI.
The above craziness was the solution for x86_64. What about i386 (x86_32)? As i386 does not have a separate per-CPU stack for the NMI, and just uses the current kernel stack where it was interrupted, the solution is pretty straightforward, and, even better, it's handled in C code. On entry to the NMI C handler (called from assembly), a check is made against a per-CPU variable to see if the NMI is nested or not. If it is, that variable is switched to NMI_LATCHED and the handler returns; otherwise it is set to NMI_EXECUTING. On exit from the NMI, a cmpxchg() is used to atomically update the per-CPU variable from NMI_EXECUTING to NMI_NOT_RUNNING. If the update fails, it means that a nested NMI came in and the C handler needs to be repeated.
Conclusion
NMI handling has always been a bane of kernel development. NMIs are
an anomaly that has caused many fine developers hours of sleepless
nights. Linus Torvalds is correct: it is best to keep the beast
contained. This code may sound like a horrible hack, but because it is
contained tightly with the NMI source it is actually an elegant solution. If
critical code throughout the kernel needs to accommodate NMIs then the
kernel will soon become an unmaintainable disaster. Keeping the NMI code in
one place also lets us know where a problem is if one were to arise. All
this work has been well worth the effort as it has opened the doors to the
removal of yet another bane of the kernel source: stop_machine().
Index entries for this article: GuestArticles: Rostedt, Steven
Comments

Posted Mar 9, 2012 11:06 UTC (Fri) by kugel (subscriber, #70540):

The article was a very interesting read. I'm convinced the solution is an elegant hack :-)

Posted Mar 11, 2012 22:03 UTC (Sun) by Julie (guest, #66693):

Thanks, Steve.

Posted Mar 12, 2012 13:01 UTC (Mon) by nevets (subscriber, #11875):

Thanks Julie for the nice compliment. But I need to give credit to Jon and especially Jake for making it clearly written. It goes through several drafts before it makes the final cut. Without their help, it would look like an article written by a kernel hacker ;-)

Posted Mar 12, 2012 18:04 UTC (Mon) by dag- (guest, #30207):

I noticed that the article gives little information about which kernel started using this new NMI implementation. It mentions July 2010, when a specific discussion started, but says little about when the solution was implemented and shipped as part of a kernel. Is this 3.3 material, or have we been using this for many months already? :-)

Posted Mar 15, 2012 1:53 UTC (Thu) by slashdot (guest, #22014), "Broken on 386":

On 386 CPUs, cmpxchg() is emulated by disabling interrupts, but of course NMIs aren't disabled, so there's a race that can lead to the loss of a nested NMI. This seems fixable by replacing cmpxchg() with local_dec_and_test() after setting NOT_RUNNING = 0, EXECUTING = 1, LATCHED = 2, which doesn't have a race condition, since the LATCHED -> EXECUTING transition is harmless.

Posted Mar 16, 2012 15:07 UTC (Fri) by nevets (subscriber, #11875):

I actually like your solution. The assignment is still needed, but only for the first instance of the NMI, although the label could be moved. It's great that you brought this up on LWN, but it would be much better if you posted a patch :-)