Being the author of the code, I will admit it was a little of all of the above.
Remember that the original development design modified the code as it was hit. This was found to cause failures easily if the code being modified was executing on another CPU (even when protected with locks). Later, we found that by using kstop_machine to halt the system so that only one CPU was running, we could modify the code without worrying about other CPUs (note that NMIs may still be an issue, but we have a way to handle that too). But to use kstop_machine, the updates had to be done at a later time, after the callers of mcount were recorded.
Knowing that this code touches every part of the kernel, we tried very hard to shut down the tracer when an anomaly was detected.
Going back to your ideas of where I failed:
1) I did not realize that kernel text and NVM could share the same address space. This was my own ignorance, and the fact that I test mostly on x86_64 does not help the matter.
Ironically, we kept the cmpxchg that was used by the "on-the-fly" code as added protection. With kstop_machine it was not needed, but we kept it in because we were paranoid. This also shows that I was ignorant of the dangers of cmpxchg on non-cached memory. I should have known; I have read the specification on this in the past.
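To make the "added protection" idea concrete, here is a minimal user-space sketch of a compare-before-write patch. It is not the kernel code: the helper name patch_if_unchanged is hypothetical, and it uses C11 atomics on a 32-bit word rather than the kernel's cmpxchg on instruction bytes. The caveat that matters here is that, per Intel's instruction reference, cmpxchg issues a write cycle to the destination even when the comparison fails, so a "failed" compare is still a write as far as memory-mapped I/O is concerned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): replace the word at *site with
 * new_insn only if it still holds the value we expect to find there.
 * Returns true if the word was replaced, false if it had changed.
 */
static bool patch_if_unchanged(_Atomic uint32_t *site,
                               uint32_t expected, uint32_t new_insn)
{
    assert(site != NULL);
    /* On failure, 'expected' is overwritten with the value actually found. */
    return atomic_compare_exchange_strong(site, &expected, new_insn);
}
```

On ordinary cached RAM a failed compare leaves the contents unchanged, which is why this looks like a harmless extra check. The sketch says nothing about what happens when the address is really a device register.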
2+3) The limited size of the 32-bit address space forces the kernel to share iomapped memory with kernel text (for modules). Again, this is not an issue for 64-bit architectures.
4) The hardware should never let the software permanently disable it. The fact that a random bug was able to harm the NVM is a design flaw of the hardware itself. This could have been caused by a random bug anywhere in the kernel that did a cmpxchg to a bad address that happened to be mapped in an IO region.
With robustness always in mind, we have been redesigning the code to handle many more cases. The latest ftrace code that's in our development tree tries very hard to avoid writing into kernel addresses that may have changed. We even redesigned the code to record the mcount calling addresses at compile time and simply modify the code into nops at early bootup, before any other CPU is running.
Unfortunately, most of these safeguards, which would definitely have prevented the issue, required design changes that (again, ironically) we thought were too intrusive to push into mainline after the merge window had closed.
I currently have a patch that backports some of the ftrace safeguards that are in our development tree, and I will post it once we finish testing it. As for the stable branch, we recommend keeping dynamic ftrace disabled.
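The compile-time table approach can be sketched roughly as follows. This is a user-space illustration under stated assumptions, not the ftrace implementation: the function name nop_out_call_sites, the offset table, and the byte buffer standing in for kernel text are all hypothetical, and the check only looks at the x86 call opcode (0xe8) rather than the full expected instruction. The point is the shape of the logic: verify each recorded site before writing, and stop at the first anomaly so the caller can shut the tracer down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CALL_LEN 5  /* x86 "call rel32" is one opcode byte plus 4 offset bytes */

/* A 5-byte x86 NOP: nopl 0x0(%rax,%rax,1) */
static const uint8_t nop5[CALL_LEN] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Hypothetical sketch: walk a table of call-site offsets recorded ahead of
 * time and turn each call into a NOP. A site is only written if it lies
 * inside the buffer and still starts with the call opcode. On the first
 * anomaly, *ok is set to false and no further sites are touched.
 * Returns the number of sites patched.
 */
static size_t nop_out_call_sites(uint8_t *text, size_t text_len,
                                 const size_t *sites, size_t nsites,
                                 bool *ok)
{
    size_t patched = 0;

    assert(text != NULL && sites != NULL && ok != NULL);
    *ok = true;

    for (size_t i = 0; i < nsites; i++) {
        size_t off = sites[i];

        /* Bounds check first, then confirm the bytes are what we expect. */
        if (off > text_len || text_len - off < CALL_LEN || text[off] != 0xe8) {
            *ok = false;  /* anomaly: caller should disable the tracer */
            break;
        }
        memcpy(text + off, nop5, CALL_LEN);
        patched++;
    }
    return patched;
}
```

Because the table is fixed before anything runs, nothing has to be recorded at run time, and the writes can all happen while only one CPU is up.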