User: Password:
|
|
Subscribe / Log in / New account

Probable e1000e corruption culprit found (and 2.6.27.1 released)

From:  Steven Rostedt <rostedt-AT-goodmis.org>
To:  LKML <linux-kernel-AT-vger.kernel.org>, stable-AT-kernel.org
Subject:  [PATCH -stable] disable CONFIG_DYNAMIC_FTRACE due to possible memory corruption on module unload
Date:  Wed, 15 Oct 2008 18:21:44 -0400 (EDT)
Message-ID:  <alpine.DEB.1.10.0810151815110.18960@gandalf.stny.rr.com>
Cc:  Linus Torvalds <torvalds-AT-linux-foundation.org>, Andrew Morton <akpm-AT-linux-foundation.org>, Arjan van de Ven <arjan-AT-infradead.org>, gregkh-AT-suse.de, jesse.brandeburg-AT-intel.com, Thomas Gleixner <tglx-AT-linutronix.de>, Ingo Molnar <mingo-AT-elte.hu>


While debugging the e1000e corruption bug with Intel, we discovered
today that the dynamic ftrace code in mainline is the likely source of
this bug.

For the stable kernel we are providing the only viable fix patch: labeling
CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)

We will follow up with a backport patch that contains the fixes. But since
the fixes are not a one liner, the safest approach for now is to
disable the code in question.

The cause of the bug is due to the way the current code in mainline
handles dynamic ftrace.  When dynamic ftrace is turned on, it also
turns on CONFIG_FTRACE which enables the -pg config in gcc that places
a call to mcount at every function call. With just CONFIG_FTRACE this
causes a noticeable overhead.  CONFIG_DYNAMIC_FTRACE works to ease this
overhead by dynamically updating the mcount call sites into nops.

The problem arises when we trace functions and modules are unloaded.
The first time a function is called, it will call mcount and the mcount
call will call ftrace_record_ip. This records the calling site and
stores it in a preallocated hash table. Later on a daemon will
wake up and call kstop_machine and convert any mcount callers into
nops.

The evolution of this code first tried to do this without the kstop_machine
and used cmpxchg to update the callers as they were called. But I
was informed that this is dangerous to do on SMP machines if another
CPU is running that same code. The solution was to do this with
kstop_machine.

We still used cmpxchg to test if the code that we are modifying is
indeed code that we expect to be before updating it - as a final
line of defense.

But on 32bit machines, ioremapped memory and modules share the same
address space. When a module would load its code into memory and execute
some code, that would register the function.

On module unload, ftrace incorrectly did not zap these functions from
its hash (this was the bug). The cmpxchg could have saved us in most
cases (via luck) - but with ioremap-ed memory that was exactly the wrong
thing to do - the results of cmpxchg on device memory are undefined.
(and will likely result in a write)

The pending .28 ftrace tree does not have this bug anymore, as a general push
towards more robustness of code patching, this is done differently: we do not
use cmpxchg and we do a WARN_ON and turn the tracer off if anything deviates
from its expected state. Furthermore, patch sites are statically identified
during build time so there's no runtime discovery of dynamic code areas
anymore, and no room for code unmaps to cause the hash to become out of date.

We believe the fragility of dynamic patching has been sufficiently
addressed in the development code via the static patching method, but further
suggestions to make it more robust are welcome.

Signed-off-by: Steven Rostedt <srostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/Kconfig |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-compile.git/kernel/trace/Kconfig
===================================================================
--- linux-compile.git.orig/kernel/trace/Kconfig	2008-10-02 10:18:49.000000000 -0400
+++ linux-compile.git/kernel/trace/Kconfig	2008-10-15 17:29:34.000000000 -0400
@@ -103,7 +103,8 @@ config CONTEXT_SWITCH_TRACER
 	  all switching of tasks.
 
 config DYNAMIC_FTRACE
-	bool "enable/disable ftrace tracepoints dynamically"
+	bool "enable/disable ftrace tracepoints dynamically (BROKEN)"
+	depends on BROKEN
 	depends on FTRACE
 	depends on HAVE_DYNAMIC_FTRACE
 	default y

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


(Log in to post comments)

IS 2.6.27 safe ?

Posted Oct 16, 2008 15:12 UTC (Thu) by xav (subscriber, #18536) [Link]

An uninformed reader (i.e. me) could believe that the current workaround against the e1000 bug included in .27 isn't efficient against this bug.
Is that true ?

IS 2.6.27 safe ?

Posted Oct 16, 2008 15:14 UTC (Thu) by corbet (editor, #1) [Link]

The 2.6.27 workaround will prevent the bug from destroying your hardware. But the buggy code is still there and could potentially create other difficulties. So upgrading to 2.6.27.1 (or just configuring out ftrace) might be a prudent thing to do.

yes, 2.6.27 is safe wrt. e1000e

Posted Oct 16, 2008 15:24 UTC (Thu) by mingo (subscriber, #31122) [Link]

Correct - while .27 is safe wrt. e1000e EEPROM corruption (write modifications to the EEPROM were disabled by the driver), updating to latest .27.1 will get you the memory corruption fix as well.

Note that it is CONFIG_DYNAMIC_FTRACE=y (a performance feature) that was causing the sporadic memory corruption - plain ftrace was still fine. Both options were default-off.

The development ftrace code does not have this bug.

EEPROM restore utility?

Posted Oct 16, 2008 16:25 UTC (Thu) by kalbakk (guest, #54118) [Link]

I was hit by this bug on a Thinkpad T61p. Fortunately I have a copy of the eeprom from a previous ethtool dump, but the data is quite useless without some restore method.

Any news on the proposed EEPROM restore utility?

EEPROM restore utility?

Posted Oct 16, 2008 20:31 UTC (Thu) by proski (subscriber, #104) [Link]

Did you have CONFIG_DYNAMIC_FTRACE enabled? That's really the first question we should be asking to make sure that we don't have other issues affecting e1000e hardware.

EEPROM restore utility?

Posted Oct 16, 2008 21:06 UTC (Thu) by kalbakk (guest, #54118) [Link]

yeah, CONFIG_DYNAMIC_FTRACE was enabled.

It's even worse now. Stupid me, taking advice from some random guy on Ubuntu launchpad: "Try diagnostic with the Windows driver."

I ran the hardware diagnostic in the Windows driver. EEPROM verified ok, but 'register test' failed. Then the card died, now it doesn't even init on the PCI bus. lspci shows no sign of it. Guess I'll just return the damn laptop.

EEPROM restore utility?

Posted Oct 19, 2008 3:46 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"""
Then the card died, now it doesn't even init on the PCI bus. lspci shows no sign of it. Guess I'll just return the damn laptop.
"""

Think about that next time you complain about a problem and someone "responds" by asking you where you were when the project was asking for beta testers. Tell them just how dangerous testing can be. Good thing you were in warranty. Which brings up another question. The vendor likely sold the machine with an OS that did not destroy the hardware. What should their options be?

random advice :(

Posted Nov 20, 2008 21:36 UTC (Thu) by gvy (guest, #11981) [Link]

Hope you're at least not going to listen anymore to win-advice of random guys on *shrug* launchpads...

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 16, 2008 22:46 UTC (Thu) by paragw (guest, #45306) [Link]

>But on 32bit machines, ioremapped memory and modules share the same address space. When a module would load its code into memory and execute some code, that would register the function.

Presumably that is not the case on 64-bit and so it's mostly a conserve address space type of reason for 32-bit to share the ioremap and module code address space?

Purely from a how to avoid this in future at design stage itself standpoint - Wondering which one of the following is the real culprit here?

1) Not knowing cmpexchg semantics on ioremapped memory and not knowing that ioremapped memory and modules share the same address space or both and thus failing to take extra care?

2) Kernel not designed well enough to prevent/avoid reuse of address space between text and ioremap space given the consequences of a mistake are deadly?

3) Just some 32-bit limitation that mandates the bad practice of sharing ioremap address space with text?

4) Something else?

Any insights appreciated.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 16, 2008 23:38 UTC (Thu) by nevets (subscriber, #11875) [Link]

Being the author of the code, I will admit it was a little of all of the above.

Remember, that the original developmental design modified the code as it was hit. This was determined to easily cause failures if the code being modified was executing on another CPU (even when protected with locks). Later, we found that using kstop_machine to halt the system so that only one CPU was running, we could now modify the code without worrying about other CPUS (note, NMIs may still be an issue, but we have a way to handle that too). But to use kstop_machine, the updates had to be done at a later time, after the callers to mcount were recorded.

Knowing that this code touches every part of the kernel, we tried very hard to shutdown the tracer when an anomaly was detected.

Going back to your ideas of where I failed.

1) I did not realize that kernel text and NVM could share the same address space. This was my own ignorance, and the fact that I test mostly on x86_64 does not help the matter.

Ironically, we kept the cmpxchg that was used by the "on-the-fly" code to be added protection. With the kstop_machine, it was not needed, but we kept it in because we were paranoid. This also shows that I was ignorant to the dangers of cmpxchg on non cache memory. I should have known, I have read the specification on this in the past.

2+3) The limited supply of the 32 bit address space forces the kernel to share iomapped memory with kernel text (for modules). Again, this is not an issue for 64 bit architectures.

4) The hardware should never let the software permanently disable it. The fact that a random bug was able to harm the NVM is a design flaw of the hardware itself. This could have been caused by some random bug anywhere in the kernel, that did a cmpxchg to a bad address that happened to be mapped in an IO region.

With robustness always in mind, we have been redesigning the code to handle many more cases. The latest ftrace code thats in our development tree, tries very hard to avoid writing into kernel addresses that may have changed. We even redesigned the code to record the mcount calling addresses at compile time and simply modify the code into nops at early bootup, before any other CPU is running.

Unfortunately, most of these safe guards that would have definitely prevented the issue, required design changes that, (again) ironically, we thought was too intrusive to push into mainline after the merge window had closed.

I currently have a patch that backports some of the ftrace safeguards that are in our development tree, and I will post once we finish testing it. As for the stable branch, we recommend keeping dynamic ftrace disabled.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:06 UTC (Fri) by paragw (guest, #45306) [Link]

Thanks for explaining in detail - it is always very educating and interesting to learn from what follows failures.

On the hardware bug point - I think if we consider that it's just a bunch of memory that allows to be written to, will it not complicate the hardware interface if it were to validate what was being written to? Or does other pieces of hardware provide some type of protection against such things? (Coming to think of it - if the BIOS allows flashing and kernel bug overwrites BIOS - would that be BIOS's fault?)

It's hard to think what the hardware bug is here and how it could prevent it without going through unreasonable complexity - the fundamental problem is that we allow ioremap space to be shared - I would think that at least having a option of not sharing it for people who don't use all of the 32-bit address space will provide much more security than any other measures. Don't you think we need a CONFIG_SHARED_IOREMAP_SPACE=n, like now?

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:25 UTC (Fri) by nevets (subscriber, #11875) [Link]

Hardware simply should not be permanently disabled simply by writing to some random register. Remember, this didn't cause the device to send out corrupted data. Simply writing into the NVM caused the card to turn into a brick.

The issue with the cmpxchg, is that it will always write data, even if the data it finds did not match. It still performs a write. It will write back the same data it read if the compare fails. By performing the cmpxchg in the IO address, it did not matter what it read. It would corrupt the data it found.

The fact that the old workaround was to make the NVM read only, was not just a work around for this bug. But a proper fix to the driver code as well. This will keep any other random bugs from bricking the card.

As for the CONFIG_SHARED_IOREMAP_SPACE, I don't think that is needed. That may have protected the bug with ftrace, but it does not protect other bugs writing into bad memory areas.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:38 UTC (Fri) by paragw (guest, #45306) [Link]

>As for the CONFIG_SHARED_IOREMAP_SPACE, I don't think that is needed. That may have protected the bug with ftrace, but it does not protect other bugs writing into bad memory areas.

Well it could have saved some people's cards from bricking - who knows how many other cards allow bricking if bad stuff is written to some magic ioremap()able area. Difference between writing to other areas vs. writing to ioremap space is that as we witnessed the later is fatal, former is always recoverable on boot.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 5:45 UTC (Fri) by Cato (subscriber, #7643) [Link]

A simple idea to safeguard the hardware interface is to ensure that immediately before writing any hardware registers (mapped to memory), the driver must also write a 'magic number' to a fixed location (and then clear it afterwards). This minimise the period during which a bug could stomp on the hardware although it probably doesn't eliminate it unless it can be run without any interrupts.

I am not a kernel programmer so the above may be implausible/impractical though.

Magic number safeguard

Posted Oct 21, 2008 8:55 UTC (Tue) by i3839 (guest, #31386) [Link]

Such protection mechanism, a specific sequence of actions that needs to be done before doing anything potentially dangerous should and can be implemented by the hardware, and is already done in e.g. some microcontrollers to protect flash/eprom from accidental writes.

It can't be done in software in the kernel. Or rather, it can, but is useless, because only functions that think they're going to do something dangerous do the checking, while in this case it's a regular cpu instruction that caused the corruption.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:16 UTC (Fri) by paragw (guest, #45306) [Link]

Also am I right in saying that this bug was also present on x86_64 and the only difference was that it scribbled on non io-remapped memory? Surprising though there were no consequences of this bug on 64-bit.

Probable e1000e corruption culprit found (and 2.6.27.1 released)

Posted Oct 17, 2008 0:43 UTC (Fri) by nevets (subscriber, #11875) [Link]

Actually no, not really.

The code did a cmpxchg which simply looks at the contents of the address, and if it matches to what it thinks it does, then it writes the new data. Otherwise, it does not write the new data (I did recently learn that cmpxchg will always write something: either the same data it read, on failure, or the new data if the read data matches).

The code that did this cmpxchg also had fault protection on to catch writing to non-writable memory. Since the cmpxchg tests 32 bits, there still exists a 1 in 4 billion chance that it somehow could be pointing into a kernel data space that has the same data as a call to mcount, and it would update the data.

NOTE!!! I can not stress this more. The new code to ftrace, that is queued for 28, goes through considerable pains to prevent this from happening.

1) All functions labeled with __init (the code that gets removed at boot up) is also labeled with notrace (to prevent this code from being placed in the ftrace update tables).

2) Modules now call a ftrace_release function that will allow ftrace to remove all functions it knows about from its update tables, when the module unloads.

3) On x86 (and soon on PPC) we record the mcount call sites at compile time. At early boot up (before SMP starts), these call sites are modified into nops. Only when ftrace is enabled and starts tracing is the kstop_machine needed to update the call sites. But this time because of the above two points, the changes made are only to locations in the kernel that we know is text.

4) We do not use cmpxchg anymore. We use __copy_from/to_user to read and write the data. We read it, compare it, then write it. If any of these steps fail, you will see a big nasty warning.

Again, if we didn't think these changes were so intrusive, we would have pushed them into 27.

Is CONFIG_HAVE_DYNAMIC_FTRACE safe?

Posted Oct 17, 2008 4:11 UTC (Fri) by tetromino (subscriber, #33846) [Link]

I do not have CONFIG_DYNAMIC_FTRACE set. However, I do have

CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_TRACING=y
# CONFIG_FTRACE is not set

in my .config. Am I safe from the ftrace-induced memory corruption?

Is CONFIG_HAVE_DYNAMIC_FTRACE safe?

Posted Oct 17, 2008 4:33 UTC (Fri) by nevets (subscriber, #11875) [Link]

Yes you are safe.

The CONFIG_HAVE_DYNAMIC_FTRACE is just a way for the architecture to tell the rest of the kernel that it supports dynamic ftrace. The x86 architecture supports it, so that will always be enabled on an x86 build.


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds