Extending the use of RO and NX

By Jake Edge
January 12, 2011

Pages of memory that are managed by the kernel are governed by access control flags that are somewhat analogous to the permissions which are applied to files. Those flags govern whether the page can be written to and whether its contents can be executed. Both attributes are useful to restrict what can happen to those pages in the presence of programming errors or security attacks. A pair of patches that were merged in the current merge window will further extend the usage of these flags for the x86 architecture.

The page access flags, unlike file permissions, are enforced by the memory management hardware. The flags of interest for these patches are "write" and "execute", both of which imply "read" access, so they are often specified as follows: RO+X (read-only and execute) or RW+NX (read-write and no-execute). By restricting the usage of these pages, the scope of security flaws can be reduced because, for example, a buffer overflow in an NX page will not be directly useful for code execution.
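As a rough illustration of the notation above, the following Python sketch models a page's permissions as a small flag set. The names `Perm`, `describe`, and `violates_w_xor_x` are hypothetical and exist only for this example; real enforcement happens in the page tables and the memory management hardware, not in code like this.

```python
from enum import Flag


class Perm(Flag):
    """Toy model of per-page access bits; read access is always implied."""
    WRITE = 1
    EXEC = 2


def describe(perm: Perm) -> str:
    """Render a permission set in the RO/RW plus X/NX notation."""
    rw = "RW" if Perm.WRITE in perm else "RO"
    x = "X" if Perm.EXEC in perm else "NX"
    return f"{rw}+{x}"


def violates_w_xor_x(perm: Perm) -> bool:
    """A page that is both writable and executable can host injected code."""
    return Perm.WRITE in perm and Perm.EXEC in perm
```

In this model, a code page would be `Perm.EXEC` (RO+X) and a data page `Perm.WRITE` (RW+NX); only the combination of both bits is flagged as risky.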

The kernel has been able to mark the memory holding its read-only data (i.e. the .rodata segment) as read-only since 2.6.16 in early 2006, depending on the setting of CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata segment was additionally marked NX (i.e. no-execute), but only for the x86_64 architecture. A patch that was originally created for 2.6.30 (for both the 32- and 64-bit x86 architectures) expanded the use of NX to all kernel data pages, including the read-write sections for initialized data and BSS.

That patch was created by Siarhei Liakh and Xuxian Jiang but had fallen by the wayside after causing some boot crashes on one of Ingo Molnar's test systems. When Kees Cook brought up the idea of doing better page access protection of the kernel's memory, Molnar remembered that Matthieu Castet had "dusted off those patches and submitted two of them", back in August. After a few iterations, Molnar pulled them into the -tip tree, and Linus Torvalds pulled that for the mainline in the current 2.6.38 merge window.

The revised patch itself is fairly straightforward. If CONFIG_DEBUG_RODATA is set, various sections of the kernel (.text and .rodata) are page aligned for both their start and end addresses. The NX bit is set for all pages from the end of the .text (i.e. code) section to the _end address that marks the end of the kernel's data section.
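The alignment and range arithmetic can be sketched in a few lines of Python. This is only a model, assuming 4 KiB pages; the function names and the addresses used with them are made up for illustration and do not come from the patch.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages, the usual x86 base page size


def page_align_up(addr: int) -> int:
    """Round addr up to the next page boundary (no-op if already aligned)."""
    return (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def nx_page_range(text_end: int, kernel_end: int) -> range:
    """Start addresses of the pages from the aligned end of .text up to _end.

    In the model, every page in this range would get the NX bit.
    """
    start = page_align_up(text_end)
    stop = page_align_up(kernel_end)
    return range(start, stop, PAGE_SIZE)
```

With a hypothetical .text ending at 0x1234 and `_end` at 0x4000, `nx_page_range` yields the pages starting at 0x2000 and 0x3000. The page containing the tail of .text stays executable, which is why the sections need to be page aligned in the first place.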

There were two other pieces of the puzzle addressed in the patch, the first of which was presumably the cause of the boot crashes that Molnar had with the earlier patch. Some older systems that use PCI BIOS require that some pages in the 640K-1M region be executable. There are also some ISA mappings that require read-write access to that region. Rather than try to work all of that out, and potentially run afoul of buggy hardware, the patch just sets pages in that region to be RW+X on systems where PCI BIOS is used. The second change simply modifies free_init_pages() to turn on NX for any pages that are freed that way, so that those pages have to be explicitly allowed to store executable code when they are reused.
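The special case for the low-memory region amounts to a simple range check, sketched below in Python. The constant and function names are hypothetical, and the RW+NX default for everything outside the region is an assumption made to keep the example small.

```python
BIOS_BEGIN = 0xA0000   # 640K
BIOS_END = 0x100000    # 1M (exclusive upper bound)


def low_region_perms(addr: int, pcibios_enabled: bool,
                     default: str = "RW+NX") -> str:
    """Return the permissions a page at addr would get in this toy model.

    Pages in the 640K-1M region stay writable and executable when PCI BIOS
    is in use; the default for everything else is an assumption.
    """
    if pcibios_enabled and BIOS_BEGIN <= addr < BIOS_END:
        return "RW+X"
    return default
```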

A related patch adds read-only and no-execute flags to the pages used by kernel modules. It came from the same developers, and seems to have been dropped from -tip along with the NX patch. As with the other patch, it was Castet who pushed it the last bit of the way into the mainline.

The patch splits the module_core and module_init regions into three parts: code, read-only data, and read-write data. Each of those parts is page aligned and the page access permissions are set just before load_module() returns. For the code pieces, RO+X are set, while the data parts get NX and either RO or RW depending on the type of data. These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
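To make the three-way split concrete, here is a minimal Python model of such a layout, assuming 4 KiB pages. The function name, the part names, and the tuple format are all hypothetical; the actual layout code in the module loader is more involved.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def align_up(size: int) -> int:
    """Round a size up to a whole number of pages."""
    return (size + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE


def module_layout(text_size: int, ro_size: int, rw_size: int):
    """Lay out code, read-only data, and read-write data back to back.

    Returns a list of (part, offset, padded_size, permissions) tuples, with
    each part starting on a page boundary so it can have its own permissions.
    """
    parts = (
        ("text", text_size, "RO+X"),
        ("rodata", ro_size, "RO+NX"),
        ("data", rw_size, "RW+NX"),
    )
    layout = []
    offset = 0
    for name, size, perms in parts:
        padded = align_up(size)
        layout.append((name, offset, padded, perms))
        offset += padded
    return layout
```

For example, `module_layout(5000, 100, 9000)` places the code at offset 0 (two pages), the read-only data at offset 8192 (one page), and the read-write data at offset 12288 (three pages).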

Beyond setting the page access control flags at module load time, the kernel must also reset those flags to RW+NX when the module is unloaded. In addition, the module_init region is freed after initialization is completed and its pages need to be put back to RW+NX. There is one further wrinkle: Ftrace needs to be able to modify the code in modules to enable tracepoints, so the patch provides a means for all module text pages to be set RW while Ftrace is making those changes, and then to set them back to RO afterward.
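The "make writable, patch, make read-only again" pattern used for Ftrace can be modeled with a Python context manager. This is a sketch only: the dictionary of page states and the `text_writable` name are invented for the example, and nothing here reflects how the kernel actually tracks page permissions.

```python
from contextlib import contextmanager


@contextmanager
def text_writable(pages: dict):
    """Temporarily flip every text page in the model from RO to RW.

    pages maps a page address to "RO" or "RW". The pages are restored to
    "RO" even if the code-patching step raises an exception.
    """
    for addr in pages:
        pages[addr] = "RW"
    try:
        yield pages
    finally:
        for addr in pages:
            pages[addr] = "RO"
```

Putting the restore step in `finally` mirrors the important property of the real mechanism: the window during which text pages are writable is kept as short as possible and is always closed afterward.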

Marking the kernel module pages as RO and/or NX is important not only because it is consistent with how the rest of the kernel pages are handled, but also because it makes other kernel protection efforts actually work for modules. For example, there has been an effort to declare structures of function pointers as const, so that exploits cannot change the pointers for their own nefarious purposes, but that only works if the .rodata pages are actually marked RO.

The main cost of these patches is some bits of wasted memory from page aligning the various sections. Since that cost is probably not significant for any but the most resource-constrained embedded systems, it would make sense for CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to be turned on for most distributions—or to default to "on", though that is generally frowned upon by Torvalds and others.
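The size of that waste is easy to estimate: each page-aligned section loses at most one page minus one byte to padding. A small Python sketch, assuming 4 KiB pages and using made-up section sizes:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def padding_waste(section_sizes) -> int:
    """Total bytes of padding needed to round each section up to whole pages."""
    return sum((-size) % PAGE_SIZE for size in section_sizes)
```

For hypothetical sections of 5000, 100, and 9000 bytes, `padding_waste` returns 10476 bytes; the worst case for three sections is just under 12 KiB.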

It is unfortunate that these patches have been around for a while without quite making the jump into the mainline. No person or group is currently shepherding core kernel security patches along, though Cook and Dan Rosenberg have recently been making an effort to push these kinds of changes. Cook's query helped resurrect both of these patches; they might have languished far longer without that interest.

It is also worth noting that most or all of the protections embodied in these patches have long been available in the grsecurity/PaX kernels. While no wholesale import of the features from those kernels is ever going to happen, piecemeal patches that implement "sane" (at least in Torvalds's eyes) features can be adopted. That should lead to better kernel security, which is something that is certainly worth shooting for.



Extending the use of RO and NX

Posted Jan 13, 2011 16:48 UTC (Thu) by kronos (subscriber, #55879) [Link]

How does this work with runtime code modifications (e.g. SMP alternatives or tracing)?

Extending the use of RO and NX

Posted Jan 13, 2011 17:47 UTC (Thu) by nevets (subscriber, #11875) [Link]

The SMP alternatives are performed during boot up before setting of RO takes effect. As for tracing, there are two ways:

1) The slower method is to allocate a new page table entry that points to the same page but with write permission, and to make the modification via this new page table entry.

2) The method used by the ftrace function tracer: just before calling stop machine, all pages are converted back to RW; then stop machine is called and all changes are made. When stop machine finishes, the pages are put back to RO.

Extending the use of RO and NX

Posted Jan 13, 2011 22:46 UTC (Thu) by PaXTeam (subscriber, #24616) [Link]

> The SMP alternatives are performed during boot up before setting of RO takes effect.

they're also performed whenever all but one CPUs are offlined (SMP->UP) or when the second CPU comes online later (UP->SMP). this can be achieved by explicit action for a CPU (via /sys/devices/system/cpu/cpu*/online) or when suspend/reboot/halt occurs.

Extending the use of RO and NX

Posted Jan 16, 2011 19:56 UTC (Sun) by oak (guest, #2786) [Link]

> Beyond setting the page access control flags at module load time, the kernel must also reset those flags to RW+NX when the module is unloaded.

Why do the pages need to be accessible at all after the module has been unloaded and those pages are no longer(?) used?

Extending the use of RO and NX

Posted Jan 16, 2011 20:44 UTC (Sun) by quotemstr (subscriber, #45331) [Link]

So that these pages can be reused for other purposes.

Extending the use of RO and NX

Posted Jan 17, 2011 18:44 UTC (Mon) by oak (guest, #2786) [Link]

Don't their access rights anyway need to be changed whenever they're re-used, depending on what they will be re-used for?

Or is RW+NX use so much more likely that it makes sense to set only RO+X stuff explicitly when pages are taken into that kind of use?

I mean, why it isn't RO+NX when not in use? When pages get re-used for execution purposes, are they all cleared, or is it possible that end of the (last) page has some old data that doesn't get overwritten?

Extending the use of RO and NX

Posted Jan 18, 2011 0:32 UTC (Tue) by Blaisorblade (guest, #25465) [Link]

> Or is RW+NX use so much more likely that it makes sense to set only RO+X stuff explicitly when pages are taken into that kind of use?
Basically yes: the module loader allocates memory through the vmalloc() memory allocator, which is also used in lots of places in the kernel. (Most) other users of this allocator use it for data, so it has to return RW memory, and it is safe to use +NX for it. The permissions could be set when memory is allocated again, but given that module loading/unloading is also likely to be infrequent, you can (and should) make that more expensive rather than slowing down all vmalloc() allocations.

> I mean, why it isn't RO+NX when not in use?
Interesting question, but:
1. Can you more easily exploit a system that doesn't do that? My pretty long answer below is "under extreme circumstances, unlikely to happen in the real world, and (I think) never observed until now". However, maybe some world-class security expert knows of some obscure exception to this.

If each page is either writable or executable, you can't execute code of your choosing.
If you write to unused memory (which you could do through a buffer overflow overwriting a pointer), that can only affect the system if that memory is not initialized later before being used. That's a further bug needed, one quite likely to cause crashes without any hostile attack and therefore less likely to survive in a kernel release, and quite unlikely to allow you to execute arbitrary code (which is what you need). So you need 2 bugs, one of which is of a pretty unlikely kind, and you still can hardly do anything. If you're happy with remotely crashing the system, it's likely that you can do it via a plain buffer overflow.

2. If you wanted to make this change, either you protect all unused pages or it isn't worth it, and that surely has a performance cost: you'd need to flush the TLB to apply the permissions change at each memory allocation. You can't even protect all unused memory ranges, as with any memory allocator you end up with partially unused pages.
Moreover, I bet nobody would ever accept this slowdown for such tiny advantages; if you were ready to pay this price, then why not rather make all these problems impossible by running a kernel written in a memory-safe (i.e. garbage-collected) language, like Singularity? More pragmatically, move more drivers into userspace; there's already effort going into this.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds