Not logged in
Log in now
Create an account
Subscribe to LWN
An unexpected perf feature
LWN.net Weekly Edition for May 16, 2013
A look at the PyPy 2.0 release
PostgreSQL 9.3 beta: Federated databases and more
LWN.net Weekly Edition for May 9, 2013
>But on 32bit machines, ioremapped memory and modules share the same address space. When a module would load its code into memory and execute some code, that would register the function.
Presumably that is not the case on 64-bit and so it's mostly a conserve address space type of reason for 32-bit to share the ioremap and module code address space?
Purely from a how to avoid this in future at design stage itself standpoint - Wondering which one of the following is the real culprit here?
1) Not knowing cmpexchg semantics on ioremapped memory and not knowing that ioremapped memory and modules share the same address space or both and thus failing to take extra care?
2) Kernel not designed well enough to prevent/avoid reuse of address space between text and ioremap space given the consequences of a mistake are deadly?
3) Just some 32-bit limitation that mandates the bad practice of sharing ioremap address space with text?
4) Something else?
Any insights appreciated.
Probable e1000e corruption culprit found (and 18.104.22.168 released)
Posted Oct 16, 2008 23:38 UTC (Thu) by nevets (subscriber, #11875)
Remember, that the original developmental design modified the code as it was hit. This was determined to easily cause failures if the code being modified was executing on another CPU (even when protected with locks). Later, we found that using kstop_machine to halt the system so that only one CPU was running, we could now modify the code without worrying about other CPUS (note, NMIs may still be an issue, but we have a way to handle that too). But to use kstop_machine, the updates had to be done at a later time, after the callers to mcount were recorded.
Knowing that this code touches every part of the kernel, we tried very hard to shutdown the tracer when an anomaly was detected.
Going back to your ideas of where I failed.
1) I did not realize that kernel text and NVM could share the same address space. This was my own ignorance, and the fact that I test mostly on x86_64 does not help the matter.
Ironically, we kept the cmpxchg that was used by the "on-the-fly" code to be added protection. With the kstop_machine, it was not needed, but we kept it in because we were paranoid. This also shows that I was ignorant to the dangers of cmpxchg on non cache memory. I should have known, I have read the specification on this in the past.
2+3) The limited supply of the 32 bit address space forces the kernel to share iomapped memory with kernel text (for modules). Again, this is not an issue for 64 bit architectures.
4) The hardware should never let the software permanently disable it. The fact that a random bug was able to harm the NVM is a design flaw of the hardware itself. This could have been caused by some random bug anywhere in the kernel, that did a cmpxchg to a bad address that happened to be mapped in an IO region.
With robustness always in mind, we have been redesigning the code to handle many more cases. The latest ftrace code thats in our development tree, tries very hard to avoid writing into kernel addresses that may have changed. We even redesigned the code to record the mcount calling addresses at compile time and simply modify the code into nops at early bootup, before any other CPU is running.
Unfortunately, most of these safe guards that would have definitely prevented the issue, required design changes that, (again) ironically, we thought was too intrusive to push into mainline after the merge window had closed.
I currently have a patch that backports some of the ftrace safeguards that are in our development tree, and I will post once we finish testing it. As for the stable branch, we recommend keeping dynamic ftrace disabled.
Posted Oct 17, 2008 0:06 UTC (Fri) by paragw (subscriber, #45306)
On the hardware bug point - I think if we consider that it's just a bunch of memory that allows to be written to, will it not complicate the hardware interface if it were to validate what was being written to? Or does other pieces of hardware provide some type of protection against such things? (Coming to think of it - if the BIOS allows flashing and kernel bug overwrites BIOS - would that be BIOS's fault?)
It's hard to think what the hardware bug is here and how it could prevent it without going through unreasonable complexity - the fundamental problem is that we allow ioremap space to be shared - I would think that at least having a option of not sharing it for people who don't use all of the 32-bit address space will provide much more security than any other measures. Don't you think we need a CONFIG_SHARED_IOREMAP_SPACE=n, like now?
Posted Oct 17, 2008 0:25 UTC (Fri) by nevets (subscriber, #11875)
The issue with the cmpxchg, is that it will always write data, even if the data it finds did not match. It still performs a write. It will write back the same data it read if the compare fails. By performing the cmpxchg in the IO address, it did not matter what it read. It would corrupt the data it found.
The fact that the old workaround was to make the NVM read only, was not just a work around for this bug. But a proper fix to the driver code as well. This will keep any other random bugs from bricking the card.
As for the CONFIG_SHARED_IOREMAP_SPACE, I don't think that is needed. That may have protected the bug with ftrace, but it does not protect other bugs writing into bad memory areas.
Posted Oct 17, 2008 0:38 UTC (Fri) by paragw (subscriber, #45306)
Well it could have saved some people's cards from bricking - who knows how many other cards allow bricking if bad stuff is written to some magic ioremap()able area. Difference between writing to other areas vs. writing to ioremap space is that as we witnessed the later is fatal, former is always recoverable on boot.
Posted Oct 17, 2008 5:45 UTC (Fri) by Cato (subscriber, #7643)
I am not a kernel programmer so the above may be implausible/impractical though.
Magic number safeguard
Posted Oct 21, 2008 8:55 UTC (Tue) by i3839 (guest, #31386)
It can't be done in software in the kernel. Or rather, it can, but is useless, because only functions that think they're going to do something dangerous do the checking, while in this case it's a regular cpu instruction that caused the corruption.
Posted Oct 17, 2008 0:16 UTC (Fri) by paragw (subscriber, #45306)
Posted Oct 17, 2008 0:43 UTC (Fri) by nevets (subscriber, #11875)
The code did a cmpxchg which simply looks at the contents of the address, and if it matches to what it thinks it does, then it writes the new data. Otherwise, it does not write the new data (I did recently learn that cmpxchg will always write something: either the same data it read, on failure, or the new data if the read data matches).
The code that did this cmpxchg also had fault protection on to catch writing to non-writable memory. Since the cmpxchg tests 32 bits, there still exists a 1 in 4 billion chance that it somehow could be pointing into a kernel data space that has the same data as a call to mcount, and it would update the data.
NOTE!!! I can not stress this more. The new code to ftrace, that is queued for 28, goes through considerable pains to prevent this from happening.
1) All functions labeled with __init (the code that gets removed at boot up) is also labeled with notrace (to prevent this code from being placed in the ftrace update tables).
2) Modules now call a ftrace_release function that will allow ftrace to remove all functions it knows about from its update tables, when the module unloads.
3) On x86 (and soon on PPC) we record the mcount call sites at compile time. At early boot up (before SMP starts), these call sites are modified into nops. Only when ftrace is enabled and starts tracing is the kstop_machine needed to update the call sites. But this time because of the above two points, the changes made are only to locations in the kernel that we know is text.
4) We do not use cmpxchg anymore. We use __copy_from/to_user to read and write the data. We read it, compare it, then write it. If any of these steps fail, you will see a big nasty warning.
Again, if we didn't think these changes were so intrusive, we would have pushed them into 27.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds