The "rare write" mechanism
One of the ways to harden the kernel is by tightening permissions on memory to write-protect as much run-time data as possible. This means the kernel makes some data structures read-only to prevent malicious or accidental corruption. However, inevitably, most data structures need read/write access at some point. Because of this, a blanket read-only policy for these structures wouldn't work. Therefore, we need a mechanism that keeps sensitive data structures read-only when "at rest", but allows writes when the need arises.
Kees Cook proposed such a mechanism based on similar functionality that exists in the PaX/grsecurity patch set. Cook calls it "rare write", but that is a bit of misnomer considering that write access to some of the target data structures is not always rare. The resulting discussion on the Linux kernel mailing list fleshed out how the mechanism could be implemented, from a standpoint of both architecture-neutrality and performance.
Cook's proposed series of eleven patches for the rare_write() infrastructure consists of sample implementations for x86 and ARM, plus usage examples. Two new kernel configuration options, HAVE_ARCH_RARE_WRITE and HAVE_ARCH_RARE_WRITE_MEMCPY, were added to let architectures define whether or not they have implemented the new mechanism. Cook's implementation contains architecture-specific code that relies on CPU features on x86 and ARM that selectively enable and disable write access to areas of memory. Thus, the data write routines cannot be preemptible, as interrupting the kernel after it enables writing on read-only memory would result in leaving a window of vulnerability open. Cook noted that his code is inlined to discourage return-oriented programming attacks on the write-enable routines, which would otherwise be a juicy target for such attacks.
The architecture-specific nature of the code became a point of contention in the resulting discussion on the linux-kernel and kernel-hardening mailing lists. But first, let's look at how Cook's proposed mechanism works.
The kernel stores data in different ELF sections depending on how that data will be accessed: .data or .bss for normal, read-write data, in .rodata for read-only data, or in .data..ro_after_init for data that will become read-only after initialization. Cook's newly introduced __wr_rare annotation, which for now is just a #define alias for __ro_after_init, would be used to mark data structures to be converted to read-only. Write access would be enabled via a new rare_write() macro, or in a critical section protected by rare_write_begin() and rare_write_end() macros. Cook gave a simple example of the usage of the single rare_write() call by converting a function in net/core/sock_diag.c from:
sock_diag_handlers[family] = NULL;To:
rare_write(sock_diag_handlers[family], NULL);
There are also helper functions to deal with linked lists that utilize the rare_write_begin() and rare_write_end() macro pair internally.
Implementation
On x86 CPUs, read-only data is mapped onto pages marked with the read-only bit set in their page-table entries. CR0 is a control register on the x86 CPUs, and toggling the write-protect bit (bit 16) of CR0 allows this read-only protection to be either enabled or disabled. When the write-protect bit has been cleared on a CPU, any data, regardless of the page-table permissions, can be written by the kernel when running on that CPU. After some back and forth discussion, Cook's implementation of the rare write functions on x86 became the following:
static __always_inline unsigned long __arch_rare_write_begin(void) { unsigned long cr0; cr0 = read_cr0() ^ X86_CR0_WP; BUG_ON(cr0 & X86_CR0_WP); write_cr0(cr0); return cr0 ^ X86_CR0_WP; } static __always_inline unsigned long __arch_rare_write_end(void) { unsigned long cr0; cr0 = read_cr0() ^ X86_CR0_WP; BUG_ON(!(cr0 & X86_CR0_WP)); write_cr0(cr0); return cr0 ^ X86_CR0_WP; }
Mathias Krause raised a
concern that non-maskable interrupts would still be able to interrupt
the CPU inside the critical section, and questioned if PaX's implementation
that Cook's patch was based on whether the implementation was
indeed correct. Thomas Gleixner suggested
that the proper way to handle writing to a read-only memory location
is to create a writable shadow mapping to that physical memory location, and
update the value of the data in that location through the shadow map. This
approach
would not require clearing the CR0 write-protect bit which, he suggested, is
too dangerous to ever do. Gleixner provided this
piece of pseudocode to illustrate his idea:
write_rare(ptr, val) { mp = map_shadow_rw(ptr); *mp = val; unmap_shadow_rw(mp); }
He added:
map_shadow_rw() is essentially the same thing as we do in the highmem case where the kernel creates a shadow mapping of the user space pages via kmap_atomic().
It's valid (at least on x86) to have a shadow map with the same page attributes but write enabled. That does not require any fixups of CR0 and just works.
After subsequent discussions on which method would be better with regard to performance, Andy Lutomirski suggested setting up a separate mm_struct that had a writable alias of the target read-only data and using use_mm() to access it. Lutomirski argued this was more architecture-neutral and, if anyone was inclined to test its performance, they could do so against the CR0-toggle method. While Mark Rutland agreed with this approach, Gleixner wasn't too sure if this was efficient. Cook replied that there was a need for performance and efficiency, as this mechanism will be used for data structures that need extra protection, and not just those are are written to rarely:
I probably chose the wrong name for this feature (write rarely). That's _usually_ true, but "sensitive_write()" was getting rather long. The things that we need to protect with this are certainly stuff that doesn't get much writing, but some things are just plain sensitive (like page tables) and we should still try to be as fast as possible with them.
PaX Team responded that the weakness of the use_mm() approach is that it wouldn't scale, and that you couldn't use the approach inside use_mm() and switch_mm() themselves. These limitations also concerned Cook:
Lutomirski replied
that page table writes aren't rare, so there may need to be "multiple
levels of rareness
" to accommodate this.
Cook also included an implementation for ARM-based CPUs. The ARM architecture organizes memory into regions called "domains", and each domain can have different access permissions. Cook created a domain called DOMAIN_WR_RARE to store read-only data; whenever rare_write() is invoked on ARM, the code will modify the domain permissions of DOMAIN_WR_RARE to enable writes, and disable them again after it is done.
Conclusion
After the various comments that were posted on the mailing list, Cook is working on another iteration of his patch set. An open question is how something like this can be implemented on architectures that do not have architecture support for enabling or disabling writing to read-only memory. Earlier, in his first RFC series on the kernel-hardening list, Cook stated that he isn't sure how to handle other architectures that do not have the requisite hardware support, but he does plan to get a "viable interface" for the architectures that do. Perhaps the way forward is to use the suggested method of writing to read-only data sections via shadow mappings, which can be implemented with kmap_atomic() as proposed by Gleixner.
Index entries for this article | |
---|---|
Kernel | Security/Kernel hardening |
Security | Linux kernel/Hardening |
GuestArticles | Hussein, Nur |
Posted Jun 1, 2017 21:39 UTC (Thu)
by minipli (guest, #69735)
[Link] (1 responses)
static __always_inline rare_write_begin(void)
static __always_inline rare_write_end(void)
As can be seen, the rare_write_begin() primitive disables preemption and IRQ delivery on the local CPU. (The latter is a bug, actually, as the code could be called from a section that already had interrupts disabled (think of spin_lock_irqsave() et al.) and would wrongly re-enable them in rare_write_end(). But that's just another flaw of that code I don't want to focus on too much.) The point is, the local_irq_disable() doesn't prevent NMIs from interrupting a rare_write_begin()...rare_write_end() block. The NMI handler would run with CR0.WP disabled and, in turn, would allow an attacker to exploit this situation as mechanisms like COW won't work any more. So I raised that concern, as mentioned in the article, but I didn't question the implementation in PaX! In fact, well, let me just quote my email here as well:
"""
Shouldn't you instead question yourself why PaX can do it "just" with
As I'm not a native speaker my understanding of the second sentence might differ from yours, but, what I wanted to say was: PaX's implementation doesn't need to disable interrupts, it does handle the NMI case (and IRQ case, for that matter) perfectly fine and has a working COW semantic even then. I wasn't questioning PaX's implementation, I was questioning Kees' proposal. In fact, my question was asking *him* why PaX can have such a simple implementation of the rare_write_begin() helper (called pax_open_kernel() in the PaX patch) that is correct while his is not. (Hint: He'd missed a few hunks while copy-paste-bike-shedding.) But that led nowhere, just to Thomas arguing CR0.WP is a bad arch hack to begin with and that it shouldn't be used at all. I didn't quite agree, though.
Posted Jun 2, 2017 14:07 UTC (Fri)
by nurh (guest, #114270)
[Link]
Posted Jun 2, 2017 0:22 UTC (Fri)
by PaXTeam (guest, #24616)
[Link] (16 responses)
can we please stop with these bullshit plagiarism based articles? most if not all of the code attributed to Kees in the article is basically mine (he said so himself in the very post you linked to), pointless renaming and introduced bugs aside.
Posted Jun 2, 2017 0:25 UTC (Fri)
by corbet (editor, #1)
[Link] (3 responses)
Posted Jun 2, 2017 1:08 UTC (Fri)
by PaXTeam (guest, #24616)
[Link] (2 responses)
you can't say you weren't warned. some specific examples of apparently wilfully misattributed authorship claims:
> Cook's implementation contains architecture-specific code that relies on CPU features on x86 and ARM that selectively
the x86 specific code is mine, not his.
> Cook noted that his code is inlined [...]
again, it's my code which he copied.
> Cook's newly introduced __wr_rare annotation[...]
it's my __read_only attribute.
> Cook gave a simple example of the usage of the single rare_write() call by converting a function in net/core/sock_diag.c from:
it's my code too.
> Cook's implementation of the rare write functions on x86 became the following:
it's all my code.
Posted Jun 2, 2017 7:02 UTC (Fri)
by itvirta (guest, #49997)
[Link] (1 responses)
I think there might be more appropriate forums for copyright infringement claims than comments on a news site.
Posted Jun 2, 2017 13:02 UTC (Fri)
by madscientist (subscriber, #16861)
[Link]
Posted Jun 2, 2017 1:39 UTC (Fri)
by jeff_marshall (subscriber, #49255)
[Link] (11 responses)
The article seems to point this out in a straightforward (to me) manner:
> Kees Cook proposed such a mechanism based on similar functionality that exists in the PaX/grsecurity patch set
What would you propose that the LWN editors do differently?
Posted Jun 2, 2017 10:09 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (10 responses)
Posted Jun 4, 2017 0:44 UTC (Sun)
by giraffedata (guest, #1954)
[Link] (9 responses)
If it's basically code copied with some renaming, and Cook said so in the talk as PaXTeam says, then "based on code in the PaX/grsecurity patch set" or "derived from code ..." would be less misleading, if not, "largely copied from ..."
Posted Jun 5, 2017 7:00 UTC (Mon)
by jospoortvliet (guest, #33164)
[Link] (8 responses)
Posted Jun 6, 2017 3:17 UTC (Tue)
by flussence (guest, #85566)
[Link] (7 responses)
Posted Jun 6, 2017 10:56 UTC (Tue)
by PaXTeam (guest, #24616)
[Link] (1 responses)
Posted Jun 6, 2017 17:32 UTC (Tue)
by likryol (guest, #115542)
[Link]
Posted Jun 6, 2017 11:05 UTC (Tue)
by minipli (guest, #69735)
[Link] (4 responses)
Posted Jun 6, 2017 11:21 UTC (Tue)
by tao (subscriber, #17563)
[Link] (1 responses)
I find it hard to say that it lacks proper attribution though.
Posted Jun 26, 2017 9:33 UTC (Mon)
by jospoortvliet (guest, #33164)
[Link]
Posted Jun 6, 2017 11:24 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 6, 2017 18:53 UTC (Tue)
by minipli (guest, #69735)
[Link]
This is just a prime example that they don't understand what they're copying. How would such an approach gain Linux security in any sensible way? It's always the bloody details that count -- that little corner case that also needs to be taken care of to make a security feature complete. But so far KSPP has just copy-pasted PaX/grsec code without fully understanding the interconnections with other features and code changes. I'd say, they need to slow down and start reading the whole thing, trying to understand what they just read and repeat -- until no new "Ahhh!" moments appear any more. What happens now isn't much more than a grep 'n sed over the patch without understanding what those hunks really do -- what other changes they need the grep didn't catch.
Posted Jun 2, 2017 15:03 UTC (Fri)
by Tara_Li (guest, #26706)
[Link]
Posted Jun 2, 2017 20:10 UTC (Fri)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
Posted Jun 3, 2017 15:52 UTC (Sat)
by minipli (guest, #69735)
[Link]
The "rare write" mechanism
{
preempt_disable();
local_irq_disable();
barrier();
__arch_rare_write_begin();
barrier();
}
{
barrier();
__arch_rare_write_end();
barrier();
local_irq_enable();
preempt_enable_no_resched();
}
Well, doesn't look good to me. NMIs will still be able to interrupt
this code and will run with CR0.WP = 0.
preempt_disable() instead?!
"""
The "rare write" mechanism
The "rare write" mechanism
Dear PaX Team:
The provenance of the code in question is clearly described in the article itself. It is, meanwhile, development work that is relevant to the kernel development community and worthy of coverage. No bullshit, no plagiarism. No stopping.
The "rare write" mechanism
The "rare write" mechanism
> enable and disable write access to areas of memory.
The "rare write" mechanism
The "rare write" mechanism
The "rare write" mechanism
The "rare write" mechanism
If the code is largely copied from the PaX/grsecurity patch set, saying that Cook proposed a mechanism based on functionality (or function) in that patch set is inadequate. That says to me original code that does something similar to what the PaX/grsecurity code does.
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The relevant code from the PaX Patch for v4.9.24 looks astonishingly similar to Kees' proposal, no?
So, now, please tell me this is not just a poor rip-off with minor bikeshedding applied!?
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
allows HAVE_ARCH_RARE_WRITE to work on x86."
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The "rare write" mechanism - attribution
The "rare write" mechanism
The "rare write" mechanism
That might work, however, I don't see any such changes in the proposed patch.
The "rare write" mechanism