LWN.net Logo

The source of the e1000e corruption bug

By Jonathan Corbet
October 21, 2008
When LWN last looked at the e1000e hardware corruption bug, the source of the problem was, at best, unclear. Problems within the driver itself seemed like a likely culprit, but it did not take long for those chasing this problem to realize that they needed to look further afield. For a while, the X server came under scrutiny, as did a number of other system components. When the real problem was found, though, it turned out to be a surprise for everybody involved.

Tracking down intermittent problems is hard. When those problems result in the destruction of hardware, finding them is even harder. Even the most dedicated testers tend to balk when faced with the prospect of shipping their systems back to the manufacturer for repairs. So the task of finding this issue fell to Intel; engineers there locked themselves into a lab with a box full of e1000e adapters and set about bisecting the kernel history to identify the patch which caused the problem. Some time (and numerous fried adapters) later, the bisection process turned up an unlikely suspect: the ftrace tracing framework.

Developers working on tracing generally put a lot of effort into minimizing the impact of their code on system performance. Every last bit of runtime overhead is scrutinized and eliminated if at all possible. As a general rule, bricking the hardware is a level of overhead which goes well beyond the acceptable parameters. So the ftrace developers, once informed of the bisection result, put in some significant work of their own to figure out what was going on.

One of the features offered by ftrace is a simple function call tracing operation; ftrace will output a line with the called function (and its caller) every time a function call is made. This tracing is accomplished by using the venerable profiling mechanism built into gcc (and most other Unix-based compilers). When code is compiled with the -pg option, the compiler will place a call to mcount() at the beginning of every function. The version of mcount() provided by ftrace then logs the relevant information on every call.

As noted above, though, tracing developers are concerned about overhead. On most systems, it is almost certain that, at any given time, nobody will be doing function call tracing. Having all those mcount() calls happening anyway would be a measurable drag on the system. So the ftrace hackers looked for a way to eliminate that overhead when it is not needed. A naive solution to this problem might look something like the following. Rather than put in an unconditional call to mcount(), get gcc to add code like this:

    if (function_tracing_active)
        mcount();

But the kernel makes a lot of function calls, so even this version will have a noticeable overhead; it will also bloat the size of the kernel with all those tests. So the favored approach tends to be different: run-time patching. When function tracing is not being used, the kernel overwrites all of the mcount() calls with no-op instructions. As it happens, doing nothing is a highly optimized operation in contemporary processors, so the overhead of a few no-ops is nearly zero. Should somebody decide to turn function tracing on, the kernel can go through and patch all of those mcount() calls back in.

Run-time patching can solve the performance problem, but it introduces a new problem of its own. Changing the code underneath a running kernel is a dangerous thing to do; extreme caution is required. Care must be taken to ensure that the kernel is not running in the affected code at the time, processor caches must be invalidated, and so on. To be safe, it is necessary to get all other processors on the system to stop and wait while the patching is taking place. The end result is that patching the code is an expensive thing to do.

The way ftrace was coded was to patch out every mcount() call point as it was discovered through an actual call to mcount(). But, as noted above, run-time patching is very expensive, especially if it is done a single function at a time. So ftrace would make a list of mcount() call sites, then fix up a bunch of them later on. In that way, the cost of patching out the calls was significantly reduced.

The problem now is that things might have changed between the time when an mcount() call is noticed and when the kernel gets around to patching out the call. It would be very unfortunate if the kernel were to patch out an mcount() call which no longer existed in the expected place. To be absolutely sure that unrelated data was not being corrupted, the ftrace code used the cmpxchg operation to patch in the no-ops. cmpxchg atomically tests the contents of the target memory against the caller's idea of what is supposed to be there; if the two do not match, the target location will be left with its old value at the end of the operation. So the no-ops will only be written to memory if the current contents of that memory are a call to mcount().

This all seems pretty safe, except that it fell down in one obscure, but important case. One obvious place where an mcount() call could go away is in loadable modules. This can happen if the module is unloaded, of course, but there is another important case too: any code marked as initialization code will be removed once initialization is complete. So a module's initialization function (and any other code marked __init) could leave a dangling reference in the "mcount() calls to be patched out" list maintained by ftrace.

The final piece of this puzzle comes from this little fact: on 32-bit architectures, memory returned from vmalloc() and ioremap() share the same address space. Both functions create mappings to memory from the same range of addresses. Space for loadable modules is allocated with vmalloc(), so all module code is found within this shared address space. Meanwhile, the e1000e driver uses ioremap() to map the adapter's I/O memory and NVRAM into the kernel's address space. The end result is this fatal sequence of events:

  1. A module is loaded into the system. As part of the module's initialization, a number of mcount() calls are made; these call sites are noted for later patching.

  2. Module initialization completes, and the module's __init functions are removed from memory. The address space they occupied is freed up for future use.

  3. The e1000e driver maps its I/O memory and NVRAM into the address range recently occupied by the above-mentioned initialization code.

  4. Ftrace gets around to patching out the accumulated list of mcount() calls. But some of those "calls" are now, actually, I/O memory belonging to the e1000e device.

Remember that the ftrace code was very careful in its patching, using cmpxchg to avoid overwriting anything which is not an mcount() call. But, as Steven Rostedt noted in his summary of the problem:

The cmpxchg could have saved us in most cases (via luck) - but with ioremap-ed memory that was exactly the wrong thing to do - the results of cmpxchg on device memory are undefined. (and will likely result in a write)

The end result is a write to the wrong bit of I/O memory - and a destroyed device.

In hindsight, this bug is reasonably clear and understandable, but it's not at all surprising that it took a long time to find. One should note that there were, in fact, two different bugs here. One of them is ftrace's attempt to write to a stale pointer. But the other one was just as important: the e1000e driver should never have left its hardware configured in a mode where a single stray write could turn it into a brick. One never knows where things might go wrong; hardware should never be left in such a vulnerable state if it can be helped.

The good news is that both bugs have been fixed. The e1000e hardware was locked down before 2.6.27 was released, and the 2.6.27.1 update disables the dynamic ftrace feature. The ftrace code has been significantly rewritten for 2.6.28; it no longer records mcount() call sites on the fly, no longer uses cmpxchg, and, one hopes, is generally incapable of creating such mayhem again.


(Log in to post comments)

The source of the e1000e corruption bug

Posted Oct 23, 2008 2:26 UTC (Thu) by modernjazz (guest, #4185) [Link]

There seem to be other bugs that were fixed by disabling CONFIG_DYNAMIC_FTRACE: see, e.g.,
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263059
It's interesting that this was discovered by studying what might be the scariest case (bricking the hardware), rather than in a much "easier" case of studying hangs-on-boot. It goes to show you, intense motivation can overcome a lot of the barriers of inconvenience!

The source of the e1000e corruption bug

Posted Oct 23, 2008 3:31 UTC (Thu) by nevets (subscriber, #11875) [Link]

Another issue was that ftrace was not a suspect at the time. I corrected any bugs that were passed on to me.

We were designing a new (more robust) version of ftrace in the linux-tip tree. This new version does not have the problems that the old version (in 2.6.27) had. But since the new version was a new design, we held off pushing it to Linus.

Unfortunately, all our testing of the old design never showed any of these issues. It took going out to a larger audience to have them appear.

The source of the e1000e corruption bug

Posted Oct 23, 2008 9:03 UTC (Thu) by alonz (subscriber, #815) [Link]

Wouldn't it be better to simply dump the entire contents of the mcount buffer whenever any code is unmapped, instead of just disabling this (useful) optimization in a kernel that is likely to have a long life?

The source of the e1000e corruption bug

Posted Oct 23, 2008 12:11 UTC (Thu) by nevets (subscriber, #11875) [Link]

Wouldn't it be better to simply dump the entire contents of the mcount buffer whenever any code is unmapped, instead of just disabling this (useful) optimization in a kernel that is likely to have a long life?

From a safety point of view, no. Anything other than disabling it was unacceptable in the stable release. If we found a simple bug (off by one, or array out of bounds) then we could have fixed it. But the bug was a design issue (which has changed in 2.6.28).

How would we know for sure that we got every place that kernel text was freed? How do we know that we don't add more bugs with this "dump the mcount on release".

Now if you would like to have dynamic ftrace in 2.6.27, it would not be hard for me to port the new design. I've already ported it to 2.6.24-rt. Just do not expect this backport to show up in the stable branch.

64 bit archs safe?

Posted Oct 23, 2008 8:52 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

The final piece of this puzzle comes from this little fact: on 32-bit architectures, memory returned from vmalloc() and ioremap() share the same address space.

So those of us running x86_64 kernels were safe? What about other 64bit architectures with PCI-Express bus?

64 bit archs safe?

Posted Oct 23, 2008 12:03 UTC (Thu) by nevets (subscriber, #11875) [Link]

So those of us running x86_64 kernels were safe?

Yes, those running on x86_64 were safe from this bug because the init code and the NVM never shared the same address space.

What about other 64bit architectures with PCI-Express bus?

I will not say yes for sure. But most likely. The mapping of iospace is arch specific. But I don't see why a 64bit address space arch would share the iospace with anything els.

Non-pessimal patching is possible

Posted Oct 23, 2008 14:39 UTC (Thu) by jreiser (subscriber, #11027) [Link]

Extreme caution is required. Yes, but "live" patching can be done, perhaps including this case. I have done it when all writes are naturally aligned, when the updated code makes sense after any subset of individual writes, and when the requirements for multi-processor synchronization can be postponed (as for Read-Copy-Update).

In the particular case of x86, "call mcount" is five bytes: the one-byte opcode 0xe8, followed by four bytes of displacement. With a one-byte write, this can be changed to "test $displ,%eax" [opcode 0xe9] or "cmp $displ,%eax" [opcode 0x3d]. In this case both of these are equivalent to a no-op because of the software convention that the condition code is not busy (either as input or as output) at call. So, as long as mcount does not care about "extra" or "missing" calls [from caches or other processors] during a patch update, then live patching works and can be done inexpensively. Depending on the instruction-stream decoder, surrounding instructions, cache-line boundaries, etc., then the average time cost per patch site is most likely 0, 1/3, or 1/2 cycle; the maximum is 1 cycle.

Non-pessimal patching is possible

Posted Oct 23, 2008 15:35 UTC (Thu) by Trou.fr (subscriber, #26289) [Link]

Actually, those instructions wouldn't be really noops :
- they would take time to be executed
- they update the flags

Such instructions wouldn't be accepted as nop replacements (at least I wouldn't)

Non-pessimal patching is possible

Posted Oct 23, 2008 17:59 UTC (Thu) by jimparis (subscriber, #38647) [Link]

Updating flags would be fine, because the compiler has already assumed that the flags would get messed up by the original function call.

Non-pessimal patching is possible

Posted Oct 23, 2008 19:07 UTC (Thu) by nevets (subscriber, #11875) [Link]

Talking with Intel, they told me that updating code that might be running on another CPU is dangerous. Even in my tests, I found that the other CPU would take an GPF if it was executing code that changed.

Basically they told me "don't do that". Modifying code on the fly is out of the question. Luckily we do not need to do that anymore. The nop patching is now done on system boot up before SMP is even initialized. The dynamic code now only updates .text section that never leaves once it is there (except for module unloading). In the case of module unloading, we now have a hook to remove the references in the ftrace table.

We still check on code patching if what we modify is what we expect. If we fail here, we print a nasty warning and disable the function tracer. So far in my testing, I have not hit this warning. If anyone sees a warning coming out of the ftrace code, I hope they report it ASAP. And please CC me (rostedt@goodmis.org).

Note: Some of this code is still in queue to be pulled.

Non-pessimal patching is possible

Posted Nov 2, 2008 17:37 UTC (Sun) by kasperd (guest, #11842) [Link]

Do you check the pointers for validity before they are inserted in the table? If the pointer is not from the static kernel code or from module code, then it is worth investigating.

The source of the e1000e corruption bug

Posted Oct 23, 2008 18:02 UTC (Thu) by jimparis (subscriber, #38647) [Link]

> So the no-ops will only be written to memory if the current contents of that memory are a call to mcount().
> This all seems pretty safe, except that it fell down in one obscure, but important case

There's another bad case: When the memory was freed and reallocated with vmalloc, then filled with non-code data that includes the same byte sequence as the original call to mcount(). No I/O remapping required and now you've corrupted something. Although the chance of having some random data in kernel memory exactly match the pattern that was there before is probably vanishingly small, it's still there. I'm glad to hear the ftrace code has been reworked to not do this anymore.

The source of the e1000e corruption bug

Posted Oct 23, 2008 18:34 UTC (Thu) by nevets (subscriber, #11875) [Link]

Why do you think I stressed in my quote:

The cmpxchg could have saved us in most cases (via luck) - but with ioremap-ed memory that was exactly the wrong thing to do - the results of cmpxchg on device memory are undefined. (and will likely result in a write)

;-)

The source of the e1000e corruption bug

Posted Oct 23, 2008 19:37 UTC (Thu) by jimparis (subscriber, #38647) [Link]

Aah, I did misread that, as "if we're lucky then the memory was not ioremapped and so the cmpxchg saves us". Nevermind, carry on :)

The source of the e1000e corruption bug

Posted Oct 24, 2008 9:45 UTC (Fri) by NAR (subscriber, #1313) [Link]

if (function_tracing_active) mcount();

But the kernel makes a lot of function calls, so even this version will have a noticeable overhead;

Exactly how noticable? I was wondering, because the erlang VM also has a similar trace capability that can be turned on and off at runtime. I don't know how it's implemented, but I doubt there's NOOP-ing of instructions involved - still, it is used in fairly performance-critical applications. I just can't help the feeling that NOOP-ing was done because modifying code on-the-fly is sexy, not because it's that much faster.

The source of the e1000e corruption bug

Posted Oct 24, 2008 12:42 UTC (Fri) by nevets (subscriber, #11875) [Link]

Our first version was not to replace the calls by nops, but by jmps (jmp three bytes forward, two bytes for the jmp call, three nops to skip). This itself showed a 1 or 2% overhead. Not much, but enough to make it not acceptable.

Now adding a branch to the equation will definitely bring the overhead up. Remember, this is called at every function call inside the kernel.

The source of the e1000e corruption bug

Posted Oct 24, 2008 12:47 UTC (Fri) by madhatter (subscriber, #4665) [Link]

This is a fascinating writeup, and so clear even I can understand it. It's now my second-favourite account of "how we delved the technical depths of a nasty problem" after the one at http://www.justpasha.org/folk/rm.html (though there the delving is done while fixing the problem, and here the delving is done while understanding it).

Nicely done, Jon; thanks.

The source of the e1000e corruption bug

Posted Oct 25, 2008 19:35 UTC (Sat) by Velmont (guest, #46433) [Link]

This is a big reason why I subscribe LWN.net. I'm more of a graphics and «user» person, and not so much a developer (although I'm thoroughly fascinated by it).

With Jonathan Corbet's extremely nice, easy to follow and oh-so-nice-and-technical articles I'm always getting a better understanding of computer tech and how the Linux kernel work. It's very interesting.

Just two bugs?

Posted Nov 2, 2008 17:54 UTC (Sun) by kasperd (guest, #11842) [Link]

I would count hardware being able to brick itself as a bug as well.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds