Two security improvements for GCC

By Jonathan Corbet
September 24, 2021

LPC
It has often been said that the competition between the GCC and LLVM compilers is good for both of them. One place where that competition shows up is in the area of security features; if one compiler adds a way to harden programs, the other is likely to follow suit. Qing Zhao's session at the 2021 Linux Plumbers Conference told the story of how GCC successfully played catch-up for two security-related features that were of special interest to the kernel community.

Call-used register wiping

Zhao started with a list of security features that kernel developers had been asking for, noting that the LLVM Clang compiler already had a number of them, but GCC did not. She has been working to fill in that gap, starting with the feature known as "call-used register wiping" — clearing the contents of registers used by a function before returning. There are a couple of reasons why one might want this feature in a compiler.

The first of those is to frustrate return-oriented programming (ROP) attacks, which feature regularly in published exploits. A ROP attack works by chaining together a set of "gadgets": code fragments that perform some useful (to the attacker) function followed by a return. If an attacker can place the right series of "return addresses" on the stack, they can string together a collection of gadgets and get the kernel to do just about anything that they want.

ROP attacks must usually, sooner or later, call some other kernel function to carry out a needed task; the called function will look at the processor registers for its parameters. Making a ROP attack work thus requires getting the right values into those registers; clearing the registers at each function return can be highly effective at frustrating those attacks. It breaks the chain of gadgets that the attacker is trying to assemble.

The other reason to clear registers on return, of course, is to prevent information leaks. It is often surprising to see what attackers can learn from whatever data may have been left in a CPU register.

So clearing registers is good, but there is still the question of which registers need clearing. If the objective is frustrating ROP attacks, clearing only the registers that are used for function parameters is sufficient. Protecting against information leaks, instead, requires clearing all of the registers used. A related question is whether registers should be zeroed or set to random values. For GCC, zero was seen as the safest choice, since it is the least likely to produce values that seem meaningful to other code. It also leads to a smaller and faster implementation.

This functionality is part of the GCC 11 release, controlled by the -fzero-call-used-regs= compiler option, which has a number of possible values to control which registers should be cleared. There is also a new function attribute (zero_call_used_regs) that can be used to control register clearing for a specific function. The implementation is in the form of a new compiler pass that looks at all exit blocks, finds each return instruction, computes the set of registers to clear (which includes tracking which registers are actually used), and emits the instructions to actually perform the clearing. This functionality initially supported the x86 and Arm64 architectures; SPARC was added a bit later.
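As a rough illustration of how the option and the attribute fit together, here is a minimal sketch. The function names (`check_pin`, `add_small`) and the `ZERO_REGS` macro are hypothetical; the `__has_attribute` guard is an assumption added so that the file still builds with compilers that lack the attribute. The choice strings `used-gpr` and `skip` are among those documented for GCC 11.

```c
/* Possible build line (GCC 11 or later):
 *   gcc -O2 -fzero-call-used-regs=used-gpr -c wipe.c
 */
#include <stddef.h>

/* Hypothetical wrapper: expands to nothing if the attribute is unknown. */
#if defined(__has_attribute)
#  if __has_attribute(zero_call_used_regs)
#    define ZERO_REGS(choice) __attribute__((zero_call_used_regs(choice)))
#  endif
#endif
#ifndef ZERO_REGS
#  define ZERO_REGS(choice)
#endif

/* Handles secret data, so ask for the general-purpose registers that the
 * function actually used to be zeroed before it returns. */
ZERO_REGS("used-gpr")
int check_pin(const char *candidate, const char *secret, size_t len)
{
    unsigned char diff = 0;

    /* Accumulate differences without an early exit. */
    for (size_t i = 0; i < len; i++)
        diff |= (unsigned char)(candidate[i] ^ secret[i]);
    return diff == 0;
}

/* Explicitly opt this function out, whatever the command line says. */
ZERO_REGS("skip")
int add_small(int a, int b)
{
    return a + b;
}
```

The command-line option sets the policy for the whole translation unit, while the attribute adjusts it for one function; comparing the output of `objdump -d` with and without the option should show the extra clearing instructions ahead of each return.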

Support for register clearing when compiling with GCC was merged into the mainline kernel for the 5.15 release; the changelog notes that it reduces the number of usable ROP gadgets in the kernel by about 20%.

Stack variable initialization

The C programming language famously specifies that automatic (stack) variables are not initialized by the compiler. If code uses such a variable before assigning a value to it, it will be working with garbage data that can lead to all kinds of problems. Erroneous outcomes are clearly one of those, but it gets worse; if an attacker can find a way to place a value on the stack where an automatic variable is allocated, they may well be able to compromise the system. If an uninitialized variable is used as a lock, the result can be uncontrolled race conditions. This is all worth avoiding.

There are a number of tools around that can try to detect the use of uninitialized stack variables. Both GCC and Clang support the -Wuninitialized option, which causes warnings to be emitted at compile time, for example. Both compilers also have a -fsanitize= option to detect these usages at run time. Beyond the compilers, tools like Valgrind can be used to find uninitialized-variable usage.

These tools are useful, but they have their limits, Zhao said. Static (compile-time) tools can only perform analysis within individual functions, which can require making assumptions about what other functions do. Their ability to detect problems with uninitialized array elements or values accessed via pointers is limited. So they miss problems while, at the same time, failing to prune out infeasible paths through the code and generating false-positive warnings. Dynamic (run-time) tools cannot cover all paths, so they will miss problems; they also impose a significant run-time overhead.
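The per-function limitation can be seen in a small sketch like the following; the function names are hypothetical and the code is not taken from the talk. The code is correct, but a tool that analyzes `half_or_minus_one()` without seeing the body of `half_if_even()` (because it lives in another translation unit, say) has to guess whether `value` was written.

```c
#include <stdbool.h>

/* Writes *out only when key is even; the return value says whether it did. */
bool half_if_even(int key, int *out)
{
    if (key % 2 != 0)
        return false;
    *out = key / 2;
    return true;
}

int half_or_minus_one(int key)
{
    int value;  /* no initializer: assigned on every path, but only
                 * if the analysis understands the callee's contract */

    if (!half_if_even(key, &value))
        value = -1;     /* fallback for the "not written" case */
    return value;
}
```

If the analysis assumes that the callee might not write through the pointer, it reports a false positive here; if it assumes that the callee always writes, it would stay silent about a version of this function that forgot the fallback assignment.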

Starting with the upcoming GCC 12 release, the -ftrivial-auto-var-init= option will control the automatic initialization of on-stack variables. Its default value, uninitialized, maintains the current behavior. If it is set to pattern, variables will be initialized to values that are likely to result in crashes if they are used; this option is intended for debugging use. Setting it to zero, instead, simply initializes all on-stack variables to zero; this option is for hardening production code. There is a new variable attribute (uninitialized) that can be used to mark variables that are deliberately not initialized.
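A minimal sketch of how the option and the attribute might be combined follows. The function name, buffer size, and `AUTO_UNINIT` macro are hypothetical, and the `__has_attribute` guard is an assumption added so that the file builds with compilers that lack the attribute.

```c
/* Possible build lines:
 *   gcc -O2 -ftrivial-auto-var-init=zero    -c scratch.c   (hardening)
 *   gcc -O2 -ftrivial-auto-var-init=pattern -c scratch.c   (debugging)
 */
#include <stddef.h>

/* Hypothetical wrapper: expands to nothing if the attribute is unknown. */
#if defined(__has_attribute)
#  if __has_attribute(uninitialized)
#    define AUTO_UNINIT __attribute__((uninitialized))
#  endif
#endif
#ifndef AUTO_UNINIT
#  define AUTO_UNINIT
#endif

#define SCRATCH_SIZE 4096

unsigned checksum_sequence(unsigned char seed, size_t n)
{
    /* Every element that is read below is written first, so this buffer
     * is deliberately excluded from automatic initialization. */
    unsigned char scratch[SCRATCH_SIZE] AUTO_UNINIT;
    /* Ordinary locals like this one remain covered by the option. */
    unsigned sum = 0;

    if (n > SCRATCH_SIZE)
        n = SCRATCH_SIZE;
    for (size_t i = 0; i < n; i++)
        scratch[i] = (unsigned char)(seed + i);
    for (size_t i = 0; i < n; i++)
        sum += scratch[i];
    return sum;
}
```

The attribute is meant for cases like the scratch buffer, where the code demonstrably writes before it reads and the extra initialization would only cost time.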

Regardless of the setting of this new option, the compiler will still issue warnings if -Wuninitialized is set. The idea behind this option is not to "fork the language", but to add an extra level of safety; code that fails to properly initialize variables should still be fixed. This work was committed to the GCC trunk in early September; there are some bugs still in need of fixing that should be taken care of soon.

Zhao didn't talk about support for this feature in the kernel. Clang has had support for this option for a while, though, and the kernel can make use of it, so making use of GCC's support once it is available will be straightforward. That should help prevent whole classes of bugs, and may spell the beginning of the end for the structleak GCC plugin that is supported by the kernel now. While the development of these features was driven by a kernel wishlist, they should both prove useful well beyond the kernel context.

The video for this talk is available on YouTube.




Two security improvements for GCC

Posted Sep 25, 2021 2:23 UTC (Sat) by kees (subscriber, #27264) [Link] (2 responses)

The kernel's support for -ftrivial-auto-var-init={pattern, zero} already works with GCC since it shares the same command-line option as Clang. :)

Two security improvements for GCC

Posted Sep 25, 2021 18:28 UTC (Sat) by khim (subscriber, #9252) [Link] (1 responses)

I'm really glad that while both Clang and GCC developers are fiercely competitive they don't deliberately introduce incompatibilities just because they can.

Kudos!

Two security improvements for GCC

Posted Sep 25, 2021 19:09 UTC (Sat) by madscientist (subscriber, #16861) [Link]

Well, it's not all ponies and rainbows. Clang has at least one unpleasant incompatibility: clang uses a completely different format for its precompiled header files than GCC, which is fine. But, Clang will look for files named with the GCC precompiled header extension (.gch) and if found try to read them. If they are not readable (because they are GCC-generated files which Clang doesn't understand) then Clang will fail.

This causes difficult problems when you're trying to use both GCC and Clang in the same environment (for example, you're compiling with GCC but using libclangd for LSP servers or similar).

This is, in my opinion, a "deliberate incompatibility": Clang wanted a free ride on GCC's popularity by being a drop-in replacement, so they used the same filenames as GCC for PCH even though the format is incompatible, rather than ask people to update their build environment to use a different file extension when they switch to Clang. It's not cool.

Zeroing registers

Posted Sep 25, 2021 6:28 UTC (Sat) by epa (subscriber, #39769) [Link] (14 responses)

Maybe processors should have a special instruction to zero all registers from a certain number upwards. If the function call returns results in registers (rather than on the stack) you wouldn’t want to zero all of them. ‘ZERORS 5’ would zero all registers from register 5 upwards.

Zeroing registers

Posted Sep 25, 2021 7:54 UTC (Sat) by pm215 (subscriber, #98099) [Link] (2 responses)

That would effectively be baking in calling convention details to the instruction set; and I think most calling conventions in use today don't put the callee-saves registers in a single nice neat contiguous range that extends up to the top of the register file. Eg on 64-bit Arm you'd need to clear x1-x8 (assuming a return value in x0) and x19-x29, but leave untouched x9-x18 and x30-x31. I suppose you could have an insn that took an upper and lower bound of registers to clear, and use it twice. It would be interesting to hear from somebody who knows about modern CPU implementations about whether it could be implemented internally more simply than 'break into micro-ops zeroing each register'...

Zeroing registers

Posted Sep 26, 2021 12:23 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)

The way you usually solve this (all the way back from m68k, and probably even earlier) is to not give a range, but a bitmask of which bits to work on.

Zeroing registers

Posted Sep 27, 2021 7:58 UTC (Mon) by pm215 (subscriber, #98099) [Link]

That only works if you have a small enough register set to be able to fit that bitfield in an instruction. For 64-bit Arm, there are 32 integer registers, so even if you skip the bit that would be the stack pointer it won't reasonably fit in an instruction. For architectures with only 16 registers, a 16-bit bitfield obviously does fit (and you can shave off bits for SP etc). That's still a fairly large chunk of the instruction encoding space to use for one instruction, though. I think these days ISA architects prefer to be parsimonious with encoding space, because it's a limited resource and there will always be new features and instructions coming along in future that you'd like to fit in...
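The three encodings discussed in this sub-thread (a lower bound, a pair of bounds, and a bitmask) can be compared with a small software model. This is only a sketch: the `regfile` structure and the function names are hypothetical, and no real instruction set is being described.

```c
#include <stdint.h>

#define NUM_REGS 32

/* Toy model of an integer register file. */
struct regfile {
    uint64_t r[NUM_REGS];
};

/* "ZERORS n" as proposed above: clear register n and everything after it. */
void zero_from(struct regfile *rf, unsigned first)
{
    for (unsigned i = first; i < NUM_REGS; i++)
        rf->r[i] = 0;
}

/* Range variant: clear registers lo..hi inclusive. */
void zero_range(struct regfile *rf, unsigned lo, unsigned hi)
{
    for (unsigned i = lo; i <= hi && i < NUM_REGS; i++)
        rf->r[i] = 0;
}

/* Bitmask variant: bit i of the mask selects register i. */
void zero_mask(struct regfile *rf, uint32_t mask)
{
    for (unsigned i = 0; i < NUM_REGS; i++)
        if (mask & (UINT32_C(1) << i))
            rf->r[i] = 0;
}
```

In this model the mask form needs a full 32-bit operand for 32 registers, which is the encoding-space cost raised above, while the range form needs two 5-bit fields but must be issued once per contiguous group.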

Zeroing registers

Posted Sep 25, 2021 9:42 UTC (Sat) by excors (subscriber, #95769) [Link] (9 responses)

> Maybe processors should have a special instruction to zero all registers from a certain number upwards.

That sounds like it may have similar problems to ARM's LDM/STM instructions (which load/store an arbitrary subset of registers R0-R15 to consecutive memory addresses, in a single instruction), which they removed from ARMv8 and replaced with LDP/STP (which load/store an arbitrary pair of registers). In particular the instruction will take many cycles to execute, and you might get an interrupt halfway through executing the instruction, and you probably don't want to delay the interrupt handler for many cycles, so you need a way to suspend and resume the partially-executed instruction.

It's not necessarily safe to jump to PC-4 and re-execute the instruction, because the PC register can be in the instruction's target register list and might have been overwritten already (unless you make sure PC is always the last register - fortunately true on ARMv7), or because it's accessing a special memory region where reads/writes are not idempotent and you can't safely re-execute them (though that's not so relevant for the hypothetical zero-multiple instruction).

It looks like Cortex-M usually implements LDM/STM by storing "interruptible-continuable instruction bits" in a status register, so it can resume from partway through the instruction when returning from an interrupt handler. Cortex-A doesn't, it just tells the programmer that the instruction might get restarted so you really shouldn't be using it on non-idempotent memory regions. It works but it seems unpleasantly messy.

I suppose zeroing registers is a lot simpler because it doesn't touch memory, but it also seems unnecessary because modern CPUs can already do that in zero cycles. E.g. the Cortex-A78 Optimization Guide (https://developer.arm.com/documentation/102160/latest) describes "Zero latency MOVs" which includes "MOV Xd, #0" - they're handled entirely by the register renaming stage and don't use any execution resources.

Zeroing registers

Posted Sep 25, 2021 9:46 UTC (Sat) by roc (subscriber, #30627) [Link] (1 responses)

The new ARM memcpy/memset instructions must have some strategy for handling interrupts...

Zeroing registers

Posted Sep 25, 2021 12:07 UTC (Sat) by excors (subscriber, #95769) [Link]

Ah, that's interesting. https://community.arm.com/developer/ip-products/processor... lists them like "CPY[F]Px [dst]!,[src]!,num_bytes!" and "SETPx [dst]!,num_bytes!,data", i.e. all the registers are auto-incrementing/decrementing, so perhaps an interrupt after copying N bytes will simply leave the registers as dst+N, src+N, num_bytes-N, and you can jump back to that instruction and resume exactly where it left off. All the state needed for resumption is in the explicitly-listed registers, unlike LDM/STM where it has to be hidden in a status register (like on ARMv7-M) or dropped entirely (like on ARMv7-A). That's just a guess but it seems like a clean way of implementing it.

Zeroing registers

Posted Sep 25, 2021 9:59 UTC (Sat) by atnot (guest, #124910) [Link]

> they're handled entirely by the register renaming stage and don't use any execution resources.

To me this was an argument for such an instruction, not against, because one resource they do take is space in the instruction cache. I don't know how much of an issue it is in practice, but reducing code size is rarely bad for performance.

Zeroing registers

Posted Sep 25, 2021 13:46 UTC (Sat) by epa (subscriber, #39769) [Link] (5 responses)

I meant zero multiple registers in one clock cycle. So it would need hardware support, not just microcode. But surely a switch to flip the value to zero is pretty trivial.

Zeroing registers

Posted Sep 25, 2021 18:44 UTC (Sat) by khim (subscriber, #9252) [Link] (4 responses)

No. On a modern CPU it's not trivial at all. Look at Fog's tables: VZEROALL is either slow or really slow on modern CPUs.

MOV register, #0 may have zero latency, but only if there is one such MOV. Add a dozen of them in a row and there would be a noticeable slowdown. You forget about register renaming; it plays poorly with such instructions.

Zeroing registers

Posted Sep 25, 2021 19:38 UTC (Sat) by epa (subscriber, #39769) [Link]

Yes, it’s slow. I am saying “gee, it would be nice if it were fast”. Perhaps this is an unrealistic wish with modern CPUs. Although if it only happens when returning from a function call — when you have to branch to an address on the stack, and perhaps even reset other CPU state to avoid Spectre-type attacks — perhaps in that particular place it could be done with a single instruction and without much slowdown.

Zeroing registers

Posted Sep 26, 2021 3:26 UTC (Sun) by willy (subscriber, #9762) [Link] (2 responses)

Hm? I'm not a CPU expert, but I was under the impression that many CPUs have a zero register that they rename the zeroed register to. That's why it's a zero cycle instruction to zero a register (yes this sentence is too complicated; forgive me)

Zeroing registers

Posted Sep 28, 2021 21:16 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

But think about what you said one more time: how can you do anything in zero time? Anything at all? You can't. Everything a computer does takes at least some resources and thus time.

Then how can these instructions ever be zero-latency/zero-time? μop fusion. Most programs don't zero out a register for the sake of zeroing out registers. They zero out a register and then use it for something. In such a case you can replace the instruction which zeroes the register with a simple mark in the other instruction which says “use zero as input instead of any real register”.

But an instruction which clears a dozen registers is quite different. It needs to, somehow, make physical registers zero. That is not something you can just hide with μop fusion.

IOW: “make register zero” is fast (and sometimes even takes “zero time”) precisely in cases which we don't talk about when we are discussing these security-related zeroings.

Zeroing registers

Posted Sep 28, 2021 21:37 UTC (Tue) by willy (subscriber, #9762) [Link]

I didn't say it was free. I said it took zero cycles. That is, the CPU can do it in parallel with everything else, and it takes no extra time. There's a limit to how many registers can be renamed per cycle (see discussion here, for example: https://www.agner.org/optimize/blog/read.php?i=857#852).

It's not related to uops fusion. Or at least it doesn't have to be. One way to implement register renaming could be to implement an array of register numbers, so that when your insn says "load r4", it looks up the physical register number in the 4th index and uses the 87th physical register. I'm not saying that's a good implementation, but it's one that could rename a lot of registers to zero very quickly.
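The array-based renaming scheme described here can be modeled in a few lines. This is a toy sketch under stated assumptions: the structure, the sizes, and the convention that physical register 0 always holds zero are all hypothetical, and real hardware is far more involved.

```c
#include <stdint.h>
#include <string.h>

#define ARCH_REGS 16
#define PHYS_REGS 128
#define PHYS_ZERO 0     /* assumed: physical register 0 always holds zero */

struct rename_state {
    uint8_t  map[ARCH_REGS];    /* architectural -> physical register */
    uint64_t phys[PHYS_REGS];   /* physical register contents */
};

/* Start with every architectural register mapped to the zero register. */
void rename_init(struct rename_state *s)
{
    memset(s, 0, sizeof(*s));
}

/* Reading goes through the map, as in the "load r4" example above. */
uint64_t read_reg(const struct rename_state *s, unsigned arch)
{
    return s->phys[s->map[arch]];
}

/* Zeroing only rewrites a map entry; no physical register is written. */
void zero_reg(struct rename_state *s, unsigned arch)
{
    s->map[arch] = PHYS_ZERO;
}
```

In this model, zeroing a dozen architectural registers means updating a dozen map entries, so the cost is bounded by how many entries can be rewritten per cycle rather than by any execution unit.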

Zeroing registers

Posted Sep 27, 2021 16:40 UTC (Mon) by arjan (subscriber, #36785) [Link]

zeroing a register is so cheap that cpus do it using various internal shortcuts, usually not even taking up any execution resources
(so on say an Intel Icelake cpu you can do +- 5 per cycle since your bottleneck is elsewhere in the pipeline)

most code doesn't dirty more than 5 to 10, so that could in theory still be 2 cycles. But... in an out of order engine,
in part these will fit otherwise vacant slots (no input dependencies so they can go basically at any time) and often just vanish entirely

Two security improvements for GCC

Posted Oct 7, 2021 6:32 UTC (Thu) by jpfrancois (subscriber, #65948) [Link]

This article discusses the real benefits of the -fzero-call-used-regs option and argues that it is mostly useless.
It is a bit obscure for non-security specialists.

https://dustri.org/b/paper-notes-clean-the-scratch-regist...


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds