March 7, 2012
This article was contributed by Steven Rostedt
Interrupts are a source of unpredictable concurrency that can cause no end
of trouble for kernel developers. Even most kernel hackers, though, do not
need to deal with non-maskable interrupts (NMIs), which bring some
additional challenges of their own. Some shortcomings in the NMI
implementation in x86 processors have long imposed limits on what can be
done in NMI handlers. Recently, those limits have been lifted. This
article describes the difficulties imposed by NMIs, covers why the related
limitations were getting in the way, and discusses the solution in gory
detail.
Normal interrupts
The CPU of a computer is a complex Turing-complete machine that seemingly
processes instructions in the order in which they are laid out in memory. The
hardware may optimize the actual order in which the instructions are read,
but, in a practical sense, the CPU acts as if it is reading the
instructions the way the programmer placed them (from the view of the
current processor). When an event happens on an external device, such as a
USB drive, a network card, or a timer, it needs to notify the CPU that it must
stop its current sequence of instructions and jump to another set of
instructions to process the new event. This new sequence of instructions is
called a handler, and the device uses an interrupt to notify the CPU.
If, when an interrupt arrives, the CPU is executing instructions that use
data also used by the interrupt handler, the handler could corrupt the
data that the CPU was in the process of modifying. To prevent this from
happening, the programmer disables interrupts for the duration of the
critical section that uses the vulnerable data. With normal interrupts,
then, the programmer can synchronize the normal workflow of the CPU with
the instructions in the interrupt handler by disabling the interrupt.
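For example, in the Linux kernel, a driver protecting data shared with its
interrupt handler can disable interrupts on the local CPU around the
critical section. A minimal sketch (the counter and function names here are
hypothetical):

    #include <linux/irqflags.h>
    #include <linux/types.h>

    /* Data shared with this driver's interrupt handler (hypothetical). */
    static u64 event_count;

    static void bump_event_count(void)
    {
        unsigned long flags;

        /*
         * Disable interrupts on the local CPU so the handler cannot
         * run in the middle of the read-modify-write below.
         */
        local_irq_save(flags);
        event_count++;
        local_irq_restore(flags);
    }

Note that disabling interrupts only affects the local CPU; if the data is
shared across CPUs, a spinlock (spin_lock_irqsave()) is needed as well.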
Non-maskable interrupts
There are some special interrupts that can trigger even
when the CPU has interrupts disabled. These non-maskable interrupts are
used by tools like profilers and
watchdogs. A profiler records information about where the CPU is spending
its time; because NMIs ignore the interrupt-disabled state, the profiler
can also record time spent with interrupts disabled. If profiling used
normal interrupts, it could not report that time. Similarly, a watchdog
needs to detect whether the kernel is stuck in a location where interrupts
were disabled. Again, if a watchdog used normal interrupts, it would be
useless in exactly those situations, because it would never trigger while
interrupts were disabled.
As you can imagine, code that can trigger at any time requires
special thought when it is written. For one thing, it cannot take any locks
that are used anywhere else (it can take locks that are used only in
NMI context to synchronize NMIs across CPUs, but even that should be avoided
if possible). Ideally, an NMI handler should be as simple as possible to
prevent race conditions caused by code that is not expecting to be
re-entrant.
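As a concrete illustration, here is a sketch of an NMI callback using the
x86 NMI handler infrastructure (the handler and counter names are
hypothetical, and the register_nmi_handler() interface is assumed as it
existed around this time); note that it touches only per-CPU data and takes
no locks:

    #include <asm/nmi.h>        /* register_nmi_handler(), NMI_LOCAL */
    #include <linux/percpu.h>
    #include <linux/init.h>

    static DEFINE_PER_CPU(unsigned long, my_nmi_count);

    /*
     * Runs in NMI context: no locks shared with other contexts,
     * no page faults, no breakpoints.
     */
    static int my_nmi_handler(unsigned int type, struct pt_regs *regs)
    {
        this_cpu_inc(my_nmi_count);
        return NMI_HANDLED;     /* this NMI was ours */
    }

    static int __init my_nmi_init(void)
    {
        return register_nmi_handler(NMI_LOCAL, my_nmi_handler, 0,
                                    "my_nmi");
    }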
Although NMIs can trigger when interrupts are disabled, and even when
the CPU is processing a normal interrupt, there is a specific time when an
NMI will not trigger: when the CPU is processing another NMI. On most
architectures, the CPU will not process a second NMI until the first NMI has
finished. When an NMI triggers and calls the NMI handler, new NMIs must wait
until the handler of the first NMI has completed. NMI handlers thus do not
need to worry about nesting, and the Linux NMI handlers are written with
this fact in mind.
The x86 NMI iret flaw
[Diagram: First NMI on x86_64]
On x86, as on other architectures, the CPU will not execute another NMI
until the first NMI is complete. The problem with the x86 architecture,
with respect to NMIs, is that an NMI is considered complete when an
iret instruction is executed. iret is the x86
instruction used to return from an interrupt or exception.
When an interrupt or exception triggers, the hardware automatically loads
information onto the stack that allows the handler to return to what was
interrupted, in the state it was in when it was interrupted; the
iret instruction uses that information on the stack to restore the state.
The flaw on x86 is that an NMI will be considered complete if an
exception is taken during the NMI handler, because the exception will
return with an iret. If the NMI handler triggers either a page
fault or breakpoint, the iret used to return from those exceptions
will re-enable NMIs. The NMI handler will not be put back to the state that
it was at when the exception triggered, but instead will be put back to a
state that will allow new NMIs to preempt the running NMI handler. If
another NMI comes in, it will jump into code that is not designed for
re-entrancy. Even worse, on x86_64, when an NMI triggers, the stack pointer
is set to a fixed address (per CPU). If another NMI comes in before the
first NMI handler is complete, the new NMI will write all over the
preempted NMI's stack. The result is a very nasty crash on return to the
original NMI handler. The NMI handler for i386 uses the current kernel
stack, like normal interrupts do, and does not have this specific problem.
A common example where this can be seen is to add a stack dump of a
task into an NMI handler. To debug lockups, a kernel developer may put in
a show_state() call (which shows the state of all tasks, as
sysrq-t does) into the NMI watchdog handler. When the watchdog
detects that the system is locked up, the show_state() triggers,
showing the stack trace of all tasks. The reading of the stack of all tasks
is carefully done because a stack frame may point to a bad memory area,
which will trigger a page fault.
The kernel expects that a fault may happen
here and handles it appropriately. But the page fault handler still executes an
iret instruction. This will re-enable NMIs. The print-out of all
the tasks may take some time, especially if it is going out over the serial
port. This makes it highly likely that another NMI will trigger before the
output is complete, causing the system to crash. The poor developer will be
left with a partial dump and not have a backtrace of all the tasks. There
is a good chance that the task that caused the problem will not be
displayed, and the developer will have to come up with another means to
debug the problem.
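In sketch form, the debugging hack described above looks something like
this (lockup_detected() is a hypothetical stand-in for the watchdog's
actual check):

    #include <linux/sched.h>    /* show_state() */

    /* Hypothetical NMI watchdog callback illustrating the hazard. */
    static int watchdog_nmi(unsigned int type, struct pt_regs *regs)
    {
        if (lockup_detected()) {
            /*
             * show_state() walks every task's stack; a stale frame
             * can point at bad memory and take a page fault, whose
             * iret silently re-enables NMIs while this handler is
             * still running.
             */
            show_state();
        }
        return NMI_HANDLED;
    }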
Because of this x86 NMI iret flaw, NMI handlers must neither
trigger a page fault nor hit a breakpoint. It may sound like page faults
should not be an issue, but this restriction prevents NMI handlers from
using memory allocated by vmalloc(). The vmalloc() code
in the kernel maps virtual memory in the kernel address space. The problem
is that the memory is mapped into a task's page table when it is first
used. If an NMI handler uses the memory, and that happens to be the first
time the current task (the one executing when the NMI took place)
referenced the memory, it will trigger a page fault.
[Diagram: Nested NMI on x86_64]
A vmalloc() page fault does not need to take locks, as all it does is fill
in the task's page table, so there should be no problem with using
vmalloc() memory in an NMI handler. But, because the iret from the page
fault will re-enable NMIs, vmalloc() memory must be avoided in NMIs to
prevent the race described above. Kernel modules are loaded using
vmalloc(), and the text sections of a loaded module are in virtual memory
that is faulted in on use. If a module were to register an NMI handler
callback, that callback could cause the NMI to become re-entrant.
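A sketch of how code could stumble into this trap (the names and buffer
size are hypothetical):

    #include <linux/vmalloc.h>
    #include <linux/errno.h>

    #define TRACE_BUF_SIZE (128 * 1024)     /* hypothetical size */

    static char *trace_buf;                 /* lives in vmalloc space */

    static int trace_buf_alloc(void)
    {
        trace_buf = vmalloc(TRACE_BUF_SIZE);
        return trace_buf ? 0 : -ENOMEM;
    }

    /*
     * Called from NMI context. If the task that was running when
     * the NMI fired has never touched this mapping, the store below
     * takes a vmalloc page fault, and the fault's iret re-enables
     * NMIs while this handler is still running.
     */
    static void nmi_record(char c, unsigned long pos)
    {
        trace_buf[pos % TRACE_BUF_SIZE] = c;
    }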
As breakpoints also return with an iret, they must not be
placed in NMI handlers either. This prevents kprobes from being placed in NMI
handlers. Kprobes are used by ftrace, perf, and several other tracing tools
to insert dynamic tracepoints into the kernel. But if a kprobe is added to
a function called by an NMI handler, the handler may become re-entrant due to the
iret called by the
breakpoint handler.
Why do we care?
For years, NMIs were not allowed to take page faults or hit breakpoints; why
allow them now? In July 2010, the issue came
up on linux-kernel, when Mathieu Desnoyers proposed solving the problem
of using
vmalloc() memory in NMIs. Desnoyers's solution was to have the
page fault handler become NMI aware. On return from a page fault, the
handler would check if it was triggered in NMI context and, if so, simply not
do an iret but instead use a normal ret instruction. The
ret instruction is the x86 instruction used to return from a function;
unlike iret, ret only pops the return address off the stack and jumps to
it, and does not put the system back into its original state. In
Desnoyers's solution, the state would be restored directly, with added
instructions to get back to the NMI handler from the page fault without the
need of an iret.
Linus Torvalds was not happy with this
solution. NMIs, because they can happen anywhere, need special
treatment that is unlike other areas of the kernel. Torvalds did not want
that treatment to spread to
other areas of the kernel, such as the page fault handler. He preferred
to make the NMI code even more complex, as long as the complexity was at
least contained to the NMI handler itself. NMIs are a special case anyway, and
not used for the normal operation of the kernel, whereas page faults
are a crucial hot path that should not be encumbered with handling for the
NMI special case.
The immediate solution was to change perf to not have to use
vmalloc() memory within its NMI handler. Of course Desnoyers's
goal was not to just fix perf, but to give LTTng the ability to use
vmalloc() memory in an NMI handler. But handling page faults in
the NMI handler is not the only reason to fix the x86 NMI iret
problem. There is also a strong reason to allow NMI handlers to use breakpoints.
Removing stop machine
There are a few areas in the kernel that require the use of
stop_machine(), which is one of the most
intrusive acts that the kernel can do to the system. In short, a call to
stop_machine() stops execution on all other CPUs so that the
calling CPU has exclusive access to the entire system. For machines with
thousands of CPUs, a single call to stop_machine() can introduce a
very large latency. Currently one of the areas that uses
stop_machine() is the runtime modification of code.
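A sketch of how code patching via stop_machine() looks; the structure and
function names here are illustrative, not the kernel's actual ones:

    #include <linux/stop_machine.h>
    #include <linux/string.h>

    /* One code modification (illustrative). */
    struct text_patch {
        void *addr;
        const void *insn;
        size_t len;
    };

    /*
     * Runs on a single CPU while every other CPU spins with
     * interrupts disabled, so nothing can execute the bytes while
     * they are being rewritten.
     */
    static int do_patch(void *data)
    {
        struct text_patch *p = data;

        memcpy(p->addr, p->insn, p->len);
        return 0;
    }

    static void patch_text(struct text_patch *p)
    {
        /* NULL cpumask: the callback runs on one (any) CPU. */
        stop_machine(do_patch, p, NULL);
    }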
The Linux kernel has a history of using self-modifying code. That is,
code that changes itself at run time. For example, distributions
do not like
to ship more than one kernel, so self-modifying code is used to change the
kernel at boot to optimize it for its environment. In the old days,
distributions would ship a separate kernel for a uniprocessor machine and
another for a multiprocessor machine. The same is true for a paravirtual
kernel (one that can only run as a guest) and a kernel to run on real
hardware. Because the maintenance burden of supporting multiple kernels is
quite high, work has been done to modify the kernel at boot: if it finds
that it is running on a uniprocessor machine, spinlocks and other
multiprocessor synchronization instructions are changed into nops. If the
kernel is loaded as a virtual guest in a paravirtualized environment, it
will convert the kernel's low-level instructions to use hypercalls.
Modifying code at boot time is not that difficult. The modifications are
performed early on before other processors are initialized and before other
services are started. At this stage of the boot process the system is just
like a uniprocessor system. Changing the instruction text is simple, as
there is no need to worry about flushing the caches of other processors.
Today, there
are several utilities in the Linux kernel that modify the code after
boot. These modifications can happen at any time, generally due to actions by
the system's administrator. The ftrace function tracer can change the
nops that are stubbed at the beginning of almost every function
into a call to trace those functions. Netfilter, which is used by
iptables, uses jump
labels to enable and disable filtering of network packets. Tracepoints
used by both perf and ftrace also
use jump labels to keep the impact of tracepoints to a minimum when they
are not enabled. Kprobes uses breakpoints to place dynamic tracepoints
into the code, but when possible it will modify the code into a direct jump
in order to optimize the probe.
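For example, placing a dynamic tracepoint with a kprobe takes just a few
lines; the probed symbol below is an arbitrary example:

    #include <linux/kprobes.h>
    #include <linux/printk.h>
    #include <linux/init.h>

    static int my_pre(struct kprobe *p, struct pt_regs *regs)
    {
        pr_info("%s hit\n", p->symbol_name);
        return 0;
    }

    static struct kprobe my_probe = {
        .symbol_name = "do_fork",   /* arbitrary example target */
        .pre_handler = my_pre,
    };

    /*
     * register_kprobe() patches the probed instruction; when
     * possible, kprobes optimizes the breakpoint into a jump.
     */
    static int __init my_probe_init(void)
    {
        return register_kprobe(&my_probe);
    }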
Modifying code at run time takes much more care than modifying code
during boot. On x86 and some other architectures, if code is modified on
one CPU while it is being executed on another CPU, it can generate a General
Protection Fault (GPF) on the CPU executing the modified code. This usually
results in a system crash. The way to get around this is to call
stop_machine() to have all CPUs stop what they are doing in order
to let a single CPU modify the code as if it were running on a
uniprocessor. Handling NMIs on the stopped CPUs adds a little more
complexity, but that is out of scope for this article.
Being able to modify code without
stop_machine() is a very desirable result. There happens to be a
way to do the modification without requiring that the rest of the system
stop what it is doing and wait for the modification to finish. That
solution requires the use of breakpoints.
The way it works is to
insert a breakpoint at the location that will be changed. A breakpoint on
x86 is only one byte. The instruction being changed is usually five bytes,
as it is a jump to some location or a 5-byte nop. The breakpoint,
being a single byte, may be substituted as the first byte of the instruction
without disrupting the other CPUs.
The breakpoint is inserted on the first byte of the instruction and, if
another CPU hits that instruction, it will trigger the breakpoint and the
breakpoint handler will simply return to the next instruction, skipping the
instruction that is in the process of being changed.
Ftrace nop

    55                              push   %rbp
    48 89 e5                        mov    %rsp,%rbp
    0f 1f 44 00 00                  nop    (5 bytes)
    65 48 8b 04 25 80 c8 00 00      mov    %gs:0xc880,%rax

Add breakpoint

    55                              push   %rbp
    48 89 e5                        mov    %rsp,%rbp
    cc 1f 44 00 00                  <brk>  nop
    65 48 8b 04 25 80 c8 00 00      mov    %gs:0xc880,%rax
After the insertion of the breakpoint, a sync of all CPUs is required in
order to make sure that the breakpoint can be seen across the CPUs. To
synchronize the CPUs, an interprocessor interrupt (IPI) with an empty
handler is sent to all the other CPUs. The interrupt on a CPU will flush
the instruction pipeline. When another CPU reads the breakpoint it will
jump to the
breakpoint handler without processing the other 4 bytes of the instruction
that is about to be updated. The handler will set the instruction pointer
to return to the instruction after the one being modified. This keeps the
modification of the rest of the instruction out of the view of the other
CPUs.
After all the other CPUs have had their pipelines flushed by the IPI sent to
them, the rest of the instruction (the remaining four bytes) may be modified:
Replace end of instruction

    55                              push   %rbp
    48 89 e5                        mov    %rsp,%rbp
    cc af 71 00 00                  <brk>  <mcount>
    65 48 8b 04 25 80 c8 00 00      mov    %gs:0xc880,%rax
Another sync is called across the CPUs. Then the breakpoint is removed
and replaced with the first byte of the new instruction:
Remove breakpoint with new instruction

    55                              push   %rbp
    48 89 e5                        mov    %rsp,%rbp
    e8 af 71 00 00                  callq  <mcount>
    65 48 8b 04 25 80 c8 00 00      mov    %gs:0xc880,%rax
This works because adding and removing a breakpoint to code does not
have the effect of causing a GPF on other CPUs. The syncs are required
between each step because the other CPUs must still have a consistent view
of the rest of the instruction. Since tracepoints and function tracing can
change code that runs within an NMI handler, breakpoints must be allowed
within the handler.
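Putting the steps together, the sequence looks roughly like this sketch,
which leans on the kernel's text_poke() helper from the alternatives code
and an empty-handler IPI for the syncs (the breakpoint handler that skips
over the instruction being patched is not shown):

    #include <linux/smp.h>              /* on_each_cpu() */
    #include <linux/types.h>
    #include <asm/alternative.h>        /* text_poke() */

    #define INT3 0xcc

    /*
     * Empty IPI handler: taking the interrupt serializes each CPU,
     * flushing any partially decoded copy of the old instruction.
     */
    static void sync_cpus(void *info) { }

    static void patch_5byte_insn(u8 *addr, const u8 *newinsn)
    {
        u8 brk = INT3;

        /* Step 1: arm a breakpoint on the first byte. */
        text_poke(addr, &brk, 1);
        on_each_cpu(sync_cpus, NULL, 1);

        /* Step 2: rewrite the remaining four bytes. */
        text_poke(addr + 1, newinsn + 1, 4);
        on_each_cpu(sync_cpus, NULL, 1);

        /* Step 3: replace the breakpoint with the new first byte. */
        text_poke(addr, newinsn, 1);
    }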
Handling the x86 NMI iret flaw
Torvalds did not just reject Desnoyers's proposal and leave us without
a solution; instead, he came up with one himself. Torvalds's solution was
to create a per-CPU
pointer to the NMI stack frame to use. When an NMI came in, it would check
the per-CPU NMI stack frame pointer. If that pointer was NULL, then the NMI
was not nested; it would update the pointer to hold its return stack frame
and continue to process the NMI handler. This part of the NMI would never
be in danger of nesting, as no breakpoints or page faults could happen
there. If, instead, the stack frame pointer was already set, the new NMI
was a nested NMI; that is, a previous NMI had triggered an exception that
returned with an iret, allowing another NMI to nest. The nested NMI would
then update the data in the per-CPU NMI frame pointer such that the
interrupted NMI would fault on its return.
Then, the nested NMI would return back to the previous
NMI without doing an iret and keep the CPU in NMI context
(i.e. preventing new NMIs). When the first NMI returns, it would trigger a
fault. In hardware, if an NMI triggers while the CPU is still handling a
previous NMI (before its iret is issued), the new NMI sets a latch, which
is released when the iret is issued. That would cause the
CPU to run the NMI handler again. In Torvalds's solution, the fault from the
iret would act like a software latch and the fault handler would
re-run the NMI handler. The atomic nature of iret would also prevent
races when returning from the first NMI.
Torvalds's solution seemed like the perfect workaround until an effort
was made to implement it. Torvalds suggested having the nested NMI
handler cause the preempted NMI handler to fault when it issued its
iret, then having the fault handler for that iret repeat the
NMI, on the assumption that the iret would only fault if a nested NMI had
happened. Unfortunately, that assumption does not hold.
The iret of all exceptions, including NMIs, already
has a fault handler. User-space applications can set their stack pointer to
any arbitrary value. As long as the application does not reference the
stack pointer, it will run fine. If an interrupt or NMI comes in, though,
when it
resets the stack pointer via the iret, the iret may fault
because the user-space application had a bad stack pointer. Thus the
iret already has a fault handler to handle this case; entering the fault
handler from an iret can be caused not only by nested NMIs but by other
situations as well. Determining which case actually occurred is
not a trivial task.
Another issue was that it required access to per-CPU data. This is a
bit tricky from the NMI handler because of the way Linux implements per-CPU
data on x86. That data is referenced by the %gs register. Because NMIs
can trigger anywhere, it takes a bit of work to validate the %gs
register. That would be too much wasted effort just to determine whether
the NMI is nested.
So, in coming up with a solution to the problem, it is best not to go
with a faulting iret. Instead, other tricks are available to the
NMI handler. Because the NMI stack is per-CPU, Peter Zijlstra suggested
using a portion of that stack to save a variable, which essentially makes
it a poor man's per-CPU variable. When the first NMI comes in, it copies
its interrupt stack frame (the information needed to return to the state
that it interrupted) onto its stack; not just once, but twice! Then it
sets the special variable on the stack, which indicates that an NMI is
processing its handler. On return, it clears the variable and executes the
iret to return to the state that it interrupted.
Now if the NMI triggers a page fault or hits a breakpoint, the
iret of the exception re-enables NMIs. If another NMI comes in
after that happens, it
will first check the special variable on the stack. If the variable is set,
then this is definitely a nested NMI handler and a jump is made to the code
to handle a nested NMI. Astute readers will realize that this is not
enough. What happens if the nested NMI triggers after the first NMI cleared
the variable but before it returned with the iret?
If the nested NMI were to continue as if it were not nested, it would
still corrupt the first NMI's stack, and the first NMI's iret would
incorrectly read the nested NMI's stack frame. So, if the variable is not
set, the saved stack pointer on the stack is examined; if that stack
pointer is within the current stack (the per-CPU NMI stack), then this can
also be considered a nested NMI. Note that just checking the stack may not
be enough because there are
cases where the NMI handler may change its stack pointer. Both the special
on-stack variable and the location of the interrupted stack need to be
examined before determining that this is a nested NMI.
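In C terms, the test described above amounts to something like the
following sketch; the real logic is x86_64 assembly in the kernel's entry
code, and the names here are illustrative:

    #include <linux/types.h>

    /*
     * Nested if the on-stack "NMI executing" flag is set, or if the
     * interrupted stack pointer already lies within this CPU's
     * dedicated NMI stack.
     */
    static bool nmi_is_nested(bool nmi_executing, unsigned long saved_sp,
                              unsigned long nmi_stack_base,
                              unsigned long nmi_stack_top)
    {
        if (nmi_executing)
            return true;

        return saved_sp >= nmi_stack_base && saved_sp < nmi_stack_top;
    }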
Processing a nested NMI
OK, so it has been determined that the incoming NMI is nested; now what?
To simulate the hardware's behavior, the first NMI must somehow be notified
that it
must repeat the NMI handler. To perform this atomically, the return stack
of the first NMI is updated to point into a trampoline. This is similar to
what Torvalds proposed, except that, instead of using the exception
handling of a
faulting iret, the information on the stack is updated such that
the iret will simply jump to the code to handle a nested NMI.
Note that a nested NMI must not update the previous NMI if the previous NMI
is executing in this trampoline area. The instruction pointer is examined
by the nested NMI, and if it is determined that it preempted a previous NMI
while it was on the trampoline it simply returns without doing anything.
The previous NMI is about to trigger another NMI handler anyway. This is
still similar to what the hardware does, as if more than one NMI were to
trigger while the CPU was processing an NMI, only one NMI would be
repeated. The first NMI can only be on the trampoline if it was previously
interrupted by a nested NMI, thus the second NMI that happened while the
first is on the trampoline may be discarded like the hardware would do with
the second of two NMIs triggering while processing a previous NMI.
Remember that the initial NMI saved its stack frame twice, which means
it has three copies of the interrupt stack frame. The first frame is
written by the hardware on entry to the NMI; if a nested NMI arrives, that
frame will be overwritten by it. One copy is used to return from the NMI
handler; if there is a nested NMI, it will update that copy so that the
return goes not to where the NMI triggered, but instead to the trampoline
that sets up the repeat of the NMI. That is why the third copy of the
interrupt stack frame is needed: it is copied back into the location that
the NMI uses to return to the place where the first NMI triggered.
The trampoline will set the special variable on the stack again to
notify new NMIs coming in that an NMI is in progress, and then jump back to
start the next NMI again. This time NMIs are still enabled because the
nested NMI still uses an iret and doesn't use the ret
trick to keep the CPU in NMI context. The reason for not doing so is that
it would make the code even more complex. Now that there's a safe way to
handle nested NMIs there's really no reason to prevent new ones after a
previous NMI triggered. Future code may change things to keep NMI
context when returning back to the first NMI from a nested NMI.
The above craziness was the solution for x86_64. What about i386
(x86_32)? Well, as i386 does not have a separate per-CPU stack for the NMI,
and just uses the current kernel stack where it was interrupted, the
solution is pretty straightforward and, even better, it is handled in C
code. On entry to the C NMI handler (called from assembly), a check is made
against a per-CPU variable to see if the NMI is nested or not. If it is,
that variable is switched to NMI_LATCHED and the handler returns; otherwise
the variable is set to NMI_EXECUTING. On exit from the NMI, cmpxchg() is
used to atomically update the per-CPU variable from NMI_EXECUTING to
NMI_NOT_RUNNING. If the update fails, it means that a nested NMI came in
and the C handler must be repeated.
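A sketch of that logic in C, following the state names above; the handler
body and the surrounding details are simplified, and handle_nmi() is a
stand-in name:

    #include <linux/percpu.h>
    #include <linux/ptrace.h>

    enum nmi_states { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };
    static DEFINE_PER_CPU(enum nmi_states, nmi_state);

    void nmi_c_handler(struct pt_regs *regs)
    {
        /* Nested NMI: latch it and let the first one repeat. */
        if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
            this_cpu_write(nmi_state, NMI_LATCHED);
            return;
        }
    restart:
        this_cpu_write(nmi_state, NMI_EXECUTING);

        handle_nmi(regs);       /* the real NMI work (stand-in) */

        /*
         * If no NMI nested, this atomically moves EXECUTING to
         * NOT_RUNNING and we are done. If one was latched, the
         * cmpxchg fails and the handler is repeated, mimicking
         * the hardware latch.
         */
        if (this_cpu_cmpxchg(nmi_state, NMI_EXECUTING,
                             NMI_NOT_RUNNING) != NMI_EXECUTING)
            goto restart;
    }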
Conclusion
NMIs have always been a bane of kernel development; they are
an anomaly that has caused many fine developers hours of sleepless
nights. Linus Torvalds is correct: it is best to keep the beast
contained. This code may sound like a horrible hack, but because it is
contained tightly with the NMI source it is actually an elegant solution. If
critical code throughout the kernel needs to accommodate NMIs then the
kernel will soon become an unmaintainable disaster. Keeping the NMI code in
one place also lets us know where a problem is if one were to arise. All
this work has been well worth the effort as it has opened the doors to the
removal of yet another bane of the kernel source: stop_machine().