The "rare write" mechanism

June 1, 2017

This article was contributed by Nur Hussein

One of the ways to harden the kernel is by tightening permissions on memory to write-protect as much run-time data as possible. This means the kernel makes some data structures read-only to prevent malicious or accidental corruption. However, inevitably, most data structures need read/write access at some point. Because of this, a blanket read-only policy for these structures wouldn't work. Therefore, we need a mechanism that keeps sensitive data structures read-only when "at rest", but allows writes when the need arises.

Kees Cook proposed such a mechanism based on similar functionality that exists in the PaX/grsecurity patch set. Cook calls it "rare write", but that is a bit of misnomer considering that write access to some of the target data structures is not always rare. The resulting discussion on the Linux kernel mailing list fleshed out how the mechanism could be implemented, from a standpoint of both architecture-neutrality and performance.

Cook's proposed series of eleven patches for the rare_write() infrastructure consists of sample implementations for x86 and ARM, plus usage examples. Two new kernel configuration options, HAVE_ARCH_RARE_WRITE and HAVE_ARCH_RARE_WRITE_MEMCPY, were added to let architectures define whether or not they have implemented the new mechanism. Cook's implementation contains architecture-specific code that relies on CPU features on x86 and ARM that selectively enable and disable write access to areas of memory. Thus, the data write routines cannot be preemptible, as interrupting the kernel after it enables writing on read-only memory would result in leaving a window of vulnerability open. Cook noted that his code is inlined to discourage return-oriented programming attacks on the write-enable routines, which would otherwise be a juicy target for such attacks.

The architecture-specific nature of the code became a point of contention in the resulting discussion on the linux-kernel and kernel-hardening mailing lists. But first, let's look at how Cook's proposed mechanism works.

The kernel stores data in different ELF sections depending on how that data will be accessed: .data or .bss for normal, read-write data, in .rodata for read-only data, or in .data..ro_after_init for data that will become read-only after initialization. Cook's newly introduced __wr_rare annotation, which for now is just a #define alias for __ro_after_init, would be used to mark data structures to be converted to read-only. Write access would be enabled via a new rare_write() macro, or in a critical section protected by rare_write_begin() and rare_write_end() macros. Cook gave a simple example of the usage of the single rare_write() call by converting a function in net/core/sock_diag.c from:

    sock_diag_handlers[family] = NULL;

To:

    rare_write(sock_diag_handlers[family], NULL);

There are also helper functions to deal with linked lists that utilize the rare_write_begin() and rare_write_end() macro pair internally.

Implementation

On x86 CPUs, read-only data is mapped onto pages marked with the read-only bit set in their page-table entries. CR0 is a control register on the x86 CPUs, and toggling the write-protect bit (bit 16) of CR0 allows this read-only protection to be either enabled or disabled. When the write-protect bit has been cleared on a CPU, any data, regardless of the page-table permissions, can be written by the kernel when running on that CPU. After some back and forth discussion, Cook's implementation of the rare write functions on x86 became the following:

    static __always_inline unsigned long __arch_rare_write_begin(void)
    {
           unsigned long cr0;

           cr0 = read_cr0() ^ X86_CR0_WP;
           BUG_ON(cr0 & X86_CR0_WP);
           write_cr0(cr0);
           return cr0 ^ X86_CR0_WP;
    }

    static __always_inline unsigned long __arch_rare_write_end(void)
    {
           unsigned long cr0;

           cr0 = read_cr0() ^ X86_CR0_WP;
           BUG_ON(!(cr0 & X86_CR0_WP));
           write_cr0(cr0);
           return cr0 ^ X86_CR0_WP;
    }

Mathias Krause raised a concern that non-maskable interrupts would still be able to interrupt the CPU inside the critical section, and questioned ~~if PaX's implementation that Cook's patch was based on~~ whether the implementation was indeed correct. Thomas Gleixner suggested that the proper way to handle writing to a read-only memory location is to create a writable shadow mapping to that physical memory location, and update the value of the data in that location through the shadow map. This approach would not require clearing the CR0 write-protect bit which, he suggested, is too dangerous to ever do. Gleixner provided this piece of pseudocode to illustrate his idea:

    write_rare(ptr, val)
    {
            mp = map_shadow_rw(ptr);
            *mp = val;
            unmap_shadow_rw(mp);
    }

He added:

map_shadow_rw() is essentially the same thing as we do in the highmem case where the kernel creates a shadow mapping of the user space pages via kmap_atomic().

It's valid (at least on x86) to have a shadow map with the same page attributes but write enabled. That does not require any fixups of CR0 and just works.

After subsequent discussions on which method would be better with regard to performance, Andy Lutomirski suggested setting up a separate mm_struct that had a writable alias of the target read-only data and using use_mm() to access it. Lutomirski argued this was more architecture-neutral and, if anyone was inclined to test its performance, they could do so against the CR0-toggle method. While Mark Rutland agreed with this approach, Gleixner wasn't too sure if this was efficient. Cook replied that there was a need for performance and efficiency, as this mechanism will be used for data structures that need extra protection, and not just those are are written to rarely:

I probably chose the wrong name for this feature (write rarely). That's _usually_ true, but "sensitive_write()" was getting rather long. The things that we need to protect with this are certainly stuff that doesn't get much writing, but some things are just plain sensitive (like page tables) and we should still try to be as fast as possible with them.

PaX Team responded that the weakness of the use_mm() approach is that it wouldn't scale, and that you couldn't use the approach inside use_mm() and switch_mm() themselves. These limitations also concerned Cook:

These are the limitations that concern me: what will we NOT be able to make read-only as a result of the use_mm() design choice? My RFC series included a simple case and a constify case, but I did not include things like making page tables read-only, etc.

Lutomirski replied that page table writes aren't rare, so there may need to be "multiple levels of rareness" to accommodate this.

Cook also included an implementation for ARM-based CPUs. The ARM architecture organizes memory into regions called "domains", and each domain can have different access permissions. Cook created a domain called DOMAIN_WR_RARE to store read-only data; whenever rare_write() is invoked on ARM, the code will modify the domain permissions of DOMAIN_WR_RARE to enable writes, and disable them again after it is done.

Conclusion

After the various comments that were posted on the mailing list, Cook is working on another iteration of his patch set. An open question is how something like this can be implemented on architectures that do not have architecture support for enabling or disabling writing to read-only memory. Earlier, in his first RFC series on the kernel-hardening list, Cook stated that he isn't sure how to handle other architectures that do not have the requisite hardware support, but he does plan to get a "viable interface" for the architectures that do. Perhaps the way forward is to use the suggested method of writing to read-only data sections via shadow mappings, which can be implemented with kmap_atomic() as proposed by Gleixner.

Index entries for this article
Kernel	Security/Kernel hardening
Security	Linux kernel/Hardening
GuestArticles	Hussein, Nur

The "rare write" mechanism

Posted Jun 1, 2017 21:39 UTC (Thu) by minipli (guest, #69735) [Link] (1 responses)

The article only shows the x86 specific helpers and is missing the arch agnostic primitives. However, those are required to understand my concerns. Let me quote them here for completeness:

static __always_inline rare_write_begin(void)
{
preempt_disable();
local_irq_disable();
barrier();
__arch_rare_write_begin();
barrier();
}

static __always_inline rare_write_end(void)
{
barrier();
__arch_rare_write_end();
barrier();
local_irq_enable();
preempt_enable_no_resched();
}

As can be seen, the rare_write_begin() primitive disables preemption and IRQ delivery on the local CPU. (The latter is a bug, actually, as the code could be called from a section that already had interrupts disabled (think of spin_lock_irqsave() et al.) and would wrongly re-enable them in rare_write_end(). But that's just another flaw of that code I don't want to focus on too much.) The point is, the local_irq_disable() doesn't prevent NMIs from interrupting a rare_write_begin()...rare_write_end() block. The NMI handler would run with CR0.WP disabled and, in turn, would allow an attacker to exploit this situation as mechanisms like COW won't work any more. So I raised that concern, as mentioned in the article, but I didn't question the implementation in PaX! In fact, well, let me just quote my email here as well:

"""
Well, doesn't look good to me. NMIs will still be able to interrupt
this code and will run with CR0.WP = 0.

Shouldn't you instead question yourself why PaX can do it "just" with
preempt_disable() instead?!
"""

As I'm not a native speaker my understanding of the second sentence might differ from yours, but, what I wanted to say was: PaX's implementation doesn't need to disable interrupts, it does handle the NMI case (and IRQ case, for that matter) perfectly fine and has a working COW semantic even then. I wasn't questioning PaX's implementation, I was questioning Kees' proposal. In fact, my question was asking *him* why PaX can have such a simple implementation of the rare_write_begin() helper (called pax_open_kernel() in the PaX patch) that is correct while his is not. (Hint: He'd missed a few hunks while copy-paste-bike-shedding.) But that led nowhere, just to Thomas arguing CR0.WP is a bad arch hack to begin with and that it shouldn't be used at all. I didn't quite agree, though.

The "rare write" mechanism

Posted Jun 2, 2017 14:07 UTC (Fri) by nurh (guest, #114270) [Link]

I did misread the intent of your LKML post, and I apologize. We've corrected the error. Thank you for the clarification.

The "rare write" mechanism

Posted Jun 2, 2017 0:22 UTC (Fri) by PaXTeam (guest, #24616) [Link] (16 responses)

Dear LWN editors,

can we please stop with these bullshit plagiarism based articles? most if not all of the code attributed to Kees in the article is basically mine (he said so himself in the very post you linked to), pointless renaming and introduced bugs aside.

The "rare write" mechanism

Posted Jun 2, 2017 0:25 UTC (Fri) by corbet (editor, #1) [Link] (3 responses)

Dear PaX Team: The provenance of the code in question is clearly described in the article itself. It is, meanwhile, development work that is relevant to the kernel development community and worthy of coverage. No bullshit, no plagiarism. No stopping.

The "rare write" mechanism

Posted Jun 2, 2017 1:08 UTC (Fri) by PaXTeam (guest, #24616) [Link] (2 responses)

Dear Jon,

you can't say you weren't warned. some specific examples of apparently wilfully misattributed authorship claims:

> Cook's implementation contains architecture-specific code that relies on CPU features on x86 and ARM that selectively
> enable and disable write access to areas of memory.

the x86 specific code is mine, not his.

> Cook noted that his code is inlined [...]

again, it's my code which he copied.

> Cook's newly introduced __wr_rare annotation[...]

it's my __read_only attribute.

> Cook gave a simple example of the usage of the single rare_write() call by converting a function in net/core/sock_diag.c from:

it's my code too.

> Cook's implementation of the rare write functions on x86 became the following:

it's all my code.

The "rare write" mechanism

Posted Jun 2, 2017 7:02 UTC (Fri) by itvirta (guest, #49997) [Link] (1 responses)

> the code is mine, not his.

I think there might be more appropriate forums for copyright infringement claims than comments on a news site.

The "rare write" mechanism

Posted Jun 2, 2017 13:02 UTC (Fri) by madscientist (subscriber, #16861) [Link]

No one is talking about copyright infringement (which is good, since that's not a valid claim). What's being discussed is attribution, and in particular attribution in the article (not attribution in the patches). I think the comments section of the article is a valid place to discuss that. I have no opinion on the merits.

The "rare write" mechanism

Posted Jun 2, 2017 1:39 UTC (Fri) by jeff_marshall (subscriber, #49255) [Link] (11 responses)

PaXTeam,

The article seems to point this out in a straightforward (to me) manner:

> Kees Cook proposed such a mechanism based on similar functionality that exists in the PaX/grsecurity patch set

What would you propose that the LWN editors do differently?

The "rare write" mechanism

Posted Jun 2, 2017 10:09 UTC (Fri) by ballombe (subscriber, #9523) [Link] (10 responses)

Just write 'The implementation' instead of 'Cook implementation' etc.

The "rare write" mechanism - attribution

Posted Jun 4, 2017 0:44 UTC (Sun) by giraffedata (guest, #1954) [Link] (9 responses)

If the code is largely copied from the PaX/grsecurity patch set, saying that Cook proposed a mechanism based on functionality (or function) in that patch set is inadequate. That says to me original code that does something similar to what the PaX/grsecurity code does.

If it's basically code copied with some renaming, and Cook said so in the talk as PaXTeam says, then "based on code in the PaX/grsecurity patch set" or "derived from code ..." would be less misleading, if not, "largely copied from ..."

The "rare write" mechanism - attribution

Posted Jun 5, 2017 7:00 UTC (Mon) by jospoortvliet (guest, #33164) [Link] (8 responses)

I can't judge the merits either but when I read the article my impression is that most of the code is written by Cook, merely inspired by the work pax did. If, in reality, 80-90% is copied from PAX and the patch is essentially a port and perhaps clean-up of PAX code to the current kernel I suggest the wording is indeed flawed.

The "rare write" mechanism - attribution

Posted Jun 6, 2017 3:17 UTC (Tue) by flussence (guest, #85566) [Link] (7 responses)

I can understand how something like that might be overlooked. If grsecurity wasn't de-facto proprietary anyone could do a diff instead of having to assume paxteam is once again acting in bad faith.

The "rare write" mechanism - attribution

Posted Jun 6, 2017 10:56 UTC (Tue) by PaXTeam (guest, #24616) [Link] (1 responses)

yeah man, makes one wonder where Kees could have possibly copied my code from if it was 'proprietary'. if you are unable to tell then Kees/LWN failed at properly attributing the origin of my code so complain to them instead of spewing ad hominem. of course, a smart cookie such as yourself could have also just looked at the test patch collection linked from, wait for it, the PaX homepage but then assuming that you failed to do that so that you can justify going on an irrelevant diatribe would be bad faith, wouldn't it? ;)

The "rare write" mechanism - attribution

Posted Jun 6, 2017 17:32 UTC (Tue) by likryol (guest, #115542) [Link]

Why can't you (validly) point out that your code is easily available without attacking people? Your stance here is right, but the tone and delivery (on both sides of the argument to be fair) is why there is friction.

The "rare write" mechanism - attribution

Posted Jun 6, 2017 11:05 UTC (Tue) by minipli (guest, #69735) [Link] (4 responses)

The relevant code from the PaX Patch for v4.9.24 looks astonishingly similar to Kees' proposal, no?
So, now, please tell me this is not just a poor rip-off with minor bikeshedding applied!?

The "rare write" mechanism - attribution

Posted Jun 6, 2017 11:21 UTC (Tue) by tao (subscriber, #17563) [Link] (1 responses)

"Based on PaX's x86 pax_{open,close}_kernel() implementation, this
allows HAVE_ARCH_RARE_WRITE to work on x86."

I find it hard to say that it lacks proper attribution though.

The "rare write" mechanism - attribution

Posted Jun 26, 2017 9:33 UTC (Mon) by jospoortvliet (guest, #33164) [Link]

I believe that that wasn't disputed, Kees properly credited. The article, however, seems to suggest he wrote the code, merely inspired by the PAX feature, which is clearly not the case.

The "rare write" mechanism - attribution

Posted Jun 6, 2017 11:24 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

You say "a poor rip-off with minor bikeshedding": anyone else would say "changes of the sort routinely needed for upstreaming which PaXTeam is unwilling to do". If that makes a "poor rip-off", then the *entire kernel* is a "poor rip-off", which makes it odd that grsecurity is based on it. (This is a general problem with the grsecurity position: if the kernel is so all-around terrible and run by thieves, scoundrels, liars, incompetents and scum as they constantly claim, why on earth are they using it as a basis for their work? Perhaps because it's not anywhere near as bad as all that after all. Further, if it's thievery to take the grsecurity work and modify and upstream it, why on earth is it not thievery for grsecurity to take the kernel and modify it, particularly if they then accompany the result with contracts that try to restrict its further distribution? I mean, yes, both are legally permissible, but if I look for scurrilous behaviour here I do not see it in the kernel maintainers. Oh noes, they're giving people's work a wide audience and all you have to do in exchange for this is act like a decent human being!)

The "rare write" mechanism - attribution

Posted Jun 6, 2017 18:53 UTC (Tue) by minipli (guest, #69735) [Link]

It's a "poor rip-off" simply because Kees' code is broken. It introduces a Dirty COW alike bug and, ironically enough, this time brought to you by the team making Linux "more secure" -- well, not quite, in this case. ;)

This is just a prime example that they don't understand what they're copying. How would such an approach gain Linux security in any sensible way? It's always the bloody details that count -- that little corner case that also needs to be taken care of to make a security feature complete. But so far KSPP has just copy-pasted PaX/grsec code without fully understanding the interconnections with other features and code changes. I'd say, they need to slow down and start reading the whole thing, trying to understand what they just read and repeat -- until no new "Ahhh!" moments appear any more. What happens now isn't much more than a grep 'n sed over the patch without understanding what those hunks really do -- what other changes they need the grep didn't catch.

The "rare write" mechanism

Posted Jun 2, 2017 15:03 UTC (Fri) by Tara_Li (guest, #26706) [Link]

I'm thinking that "rare" in this case might be specifying more the ratio of writes to reads - something like a webpage with comments where the webpage is read a large number of times relative to the comments made to it - so it might make sense to just re-write the page as a static HTML page to add comments, rather than doing massive database look-ups each time the article is presented to users. Of course the viability of this depends on the ratio between time spent reading and time spent writing - so maybe a range of solutions is needed such as write_1_in_100(), write_1_in_10000(), write_1_in_1m(), etc.

The "rare write" mechanism

Posted Jun 2, 2017 20:10 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (1 responses)

Shouldn't the nmi handler just save and clear, then restore the state of X86_CR0_WP or what ever architecture specific register to close that specific worry?

The "rare write" mechanism

Posted Jun 3, 2017 15:52 UTC (Sat) by minipli (guest, #69735) [Link]

That might work, however, I don't see any such changes in the proposed patch.