LWN: Comments on "Two security improvements for GCC"

Two security improvements for GCC

jpfrancois — Thu, 07 Oct 2021 06:32:55 +0000

This article discusses the real benefits of the -fzero-call-used-regs option, and argues that it is mostly useless.
It is a bit obscure for anyone who is not a security specialist.

https://dustri.org/b/paper-notes-clean-the-scratch-regist...
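
For readers who have not seen the option in use, here is a minimal sketch of the per-function form, assuming a compiler that implements the zero_call_used_regs attribute (documented for GCC 11 and later). The macro ZERO_USED_GPRS and the function mix_secret are hypothetical names invented for this illustration; "used-gpr" is one of the documented choices and asks for only the general-purpose registers the function actually used to be zeroed on return.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper macro: expands to the attribute only when the
 * compiler reports support for it, so the file still builds elsewhere. */
#if defined(__has_attribute)
#  if __has_attribute(zero_call_used_regs)
#    define ZERO_USED_GPRS __attribute__((zero_call_used_regs("used-gpr")))
#  endif
#endif
#ifndef ZERO_USED_GPRS
#  define ZERO_USED_GPRS /* attribute not available: no zeroing requested */
#endif

/* Hypothetical function that handles a sensitive value. With the
 * attribute active, the compiler zeroes the call-used general-purpose
 * registers this function touched before it returns. */
ZERO_USED_GPRS
static uint32_t mix_secret(uint32_t secret, uint32_t nonce)
{
    uint32_t t = secret ^ (nonce * UINT32_C(2654435761));
    return (t << 7) | (t >> 25);   /* rotate left by 7 */
}
```

The same effect can be requested for a whole translation unit with -fzero-call-used-regs=used-gpr on the command line; whether the cost is worth it is exactly what the linked notes question.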

Zeroing registers

willy — Tue, 28 Sep 2021 21:37:18 +0000

I didn't say it was free. I said it took zero cycles. That is, the CPU can do it in parallel with everything else, and it takes no extra time. There's a limit to how many registers can be renamed per cycle (see discussion here, for example: https://www.agner.org/optimize/blog/read.php?i=857#852).

It's not related to uops fusion. Or at least it doesn't have to be. One way to implement register renaming could be to implement an array of register numbers, so that when your insn says "load r4", it looks up the physical register number in the 4th index and uses the 87th physical register. I'm not saying that's a good implementation, but it's one that could rename a lot of registers to zero very quickly.
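
A toy model of that array-of-register-numbers idea, written as a sketch rather than a description of any real core. The sizes, the choice of physical register 0 as a hard-wired zero, and all the names (rename_state, rename_zero and so on) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ARCH_REGS 16   /* registers the instruction set exposes (assumed) */
#define NUM_PHYS_REGS 128  /* physical registers in the toy core (assumed) */
#define ZERO_PREG     0    /* physical register that always holds zero (assumed) */

typedef struct {
    uint8_t  map[NUM_ARCH_REGS];   /* architectural number -> physical number */
    uint64_t phys[NUM_PHYS_REGS];  /* physical register file */
} rename_state;

/* Start with every architectural register pointing at the zero register. */
static void rename_init(rename_state *s)
{
    for (int p = 0; p < NUM_PHYS_REGS; p++)
        s->phys[p] = 0;
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        s->map[r] = ZERO_PREG;
}

/* Write a value: store it in physical register p and map r to p. */
static void rename_write(rename_state *s, int r, int p, uint64_t value)
{
    assert(r >= 0 && r < NUM_ARCH_REGS);
    assert(p > ZERO_PREG && p < NUM_PHYS_REGS);
    s->phys[p] = value;
    s->map[r] = (uint8_t)p;
}

/* Read through the table, as "load r4" would in the comment above. */
static uint64_t rename_read(const rename_state *s, int r)
{
    assert(r >= 0 && r < NUM_ARCH_REGS);
    return s->phys[s->map[r]];
}

/* Zeroing touches only the table entry; no data is moved or cleared. */
static void rename_zero(rename_state *s, int r)
{
    assert(r >= 0 && r < NUM_ARCH_REGS);
    s->map[r] = ZERO_PREG;
}
```

In this model, zeroing many registers is just several small table writes, which is why the limit that matters is how many table entries can be updated per cycle.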

Zeroing registers

khim — Tue, 28 Sep 2021 21:16:31 +0000

But think about what you said one more time: how can you do anything in zero time? Anything at all? You can't. Everything a computer does takes at least some resources and thus time.

Then how can these instructions ever be zero-latency/zero-time? μops fusion. Most programs don't zero out a register for the sake of zeroing it out. They zero out a register and then use it for something. And in such a case you can replace the instruction which zeroes the register with a simple mark in the other instruction which says “use zero as input instead of any real register”.

But an instruction which clears a dozen registers is quite different. It needs to, somehow, make the physical registers zero. That is something you can't just hide with μops fusion.

IOW: “make register zero” is fast (and sometimes even takes “zero time”) precisely in cases which we don't talk about when we are discussing these security-related zeroings.

Zeroing registers

arjan — Mon, 27 Sep 2021 16:40:39 +0000

zeroing a register is so cheap that cpus do it using various internal shortcuts, usually not even taking up any execution resources
(so on say an Intel Icelake cpu you can do +- 5 per cycle since your bottleneck is elsewhere in the pipeline)

most code doesn't dirty more than 5 to 10, so that could in theory still be 2 cycles. But... in an out of order engine,
in part these will fit otherwise vacant slots (no input dependencies so they can go basically at any time) and often just vanish entirely

Zeroing registers

pm215 — Mon, 27 Sep 2021 07:58:19 +0000

That only works if you have a small enough register set to be able to fit that bitfield in an instruction. For 64-bit Arm, there are 32 integer registers, so even if you skip the bit that would be the stack pointer it won't reasonably fit in an instruction. For architectures with only 16 registers, a 16-bit bitfield obviously does fit (and you can shave off bits for SP etc). That's still a fairly large chunk of the instruction encoding space to use for one instruction, though. I think these days ISA architects prefer to be parsimonious with encoding space, because it's a limited resource and there will always be new features and instructions coming along in future that you'd like to fit in...

Zeroing registers

Sesse — Sun, 26 Sep 2021 12:23:02 +0000

The way you usually solve this (all the way back from m68k, and probably even earlier) is to not give a range, but a bitmask of which registers to work on.
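
A sketch of what such a mask-driven operation would do, modelled in software. The function zero_by_mask and the 16-register file are hypothetical; no existing instruction is being described.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16  /* assumed register count, so the mask fits in 16 bits */

/* Hypothetical "zero by mask" operation: bit i of mask selects regs[i]. */
static void zero_by_mask(uint64_t regs[NUM_REGS], uint16_t mask)
{
    assert(regs != 0);
    for (int i = 0; i < NUM_REGS; i++) {
        if (mask & (1u << i))
            regs[i] = 0;   /* selected register is cleared */
    }
}
```

With mask 0x00F6 this clears registers 1, 2 and 4 through 7 and leaves the rest alone, so the caller can skip a return-value register or the stack pointer simply by leaving its bit clear.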

Zeroing registers

willy — Sun, 26 Sep 2021 03:26:14 +0000

Hm? I'm not a CPU expert, but I was under the impression that many CPUs have a zero register that they rename the zeroed register to. That's why it's a zero cycle instruction to zero a register (yes this sentence is too complicated; forgive me)

Zeroing registers

epa — Sat, 25 Sep 2021 19:38:16 +0000

Yes, it’s slow. I am saying “gee, it would be nice if it were fast”. Perhaps this is an unrealistic wish with modern CPUs. Although if it only happens when returning from a function call — when you have to branch to an address on the stack, and perhaps even reset other CPU state to avoid Spectre-type attacks — perhaps in that particular place it could be done with a single instruction and without much slowdown.

Two security improvements for GCC

madscientist — Sat, 25 Sep 2021 19:09:39 +0000

Well, it's not all ponies and rainbows. Clang has at least one unpleasant incompatibility: clang uses a completely different format for its precompiled header files than GCC, which is fine. But, Clang will look for files named with the GCC precompiled header extension (.gch) and if found try to read them. If they are not readable (because they are GCC-generated files which Clang doesn't understand) then Clang will fail.

This causes difficult problems when you're trying to use both GCC and Clang in the same environment (for example, you're compiling with GCC but using libclangd for LSP servers or similar).

This is, in my opinion, a "deliberate incompatibility": Clang wanted a free ride on GCC's popularity by being a drop-in replacement, so they used the same filenames as GCC for PCH even though the format is incompatible, rather than ask people to update their build environment to use a different file extension when they switch to Clang. It's not cool.

Zeroing registers

khim — Sat, 25 Sep 2021 18:44:28 +0000

No. On a modern CPU it's not trivial at all. Look at Agner Fog's tables: VZEROALL is either slow or really slow on modern CPUs.

MOV register, #0 may have zero latency, but only if there is one such mov. Add a dozen of them in a row and there would be a noticeable slowdown. You forget about register renaming: it plays poorly with such instructions.

Two security improvements for GCC

khim — Sat, 25 Sep 2021 18:28:19 +0000

I'm really glad that while both Clang and GCC developers are fiercely competitive they don't deliberately introduce incompatibilities just because they can.

Kudos!

Zeroing registers

epa — Sat, 25 Sep 2021 13:46:16 +0000

I meant zero multiple registers in one clock cycle. So it would need hardware support, not just microcode. But surely a switch to flip the value to zero is pretty trivial.

Zeroing registers

excors — Sat, 25 Sep 2021 12:07:45 +0000

Ah, that's interesting. https://community.arm.com/developer/ip-products/processor... lists them like "CPY[F]Px [dst]!,[src]!,num_bytes!" and "SETPx [dst]!,num_bytes!,data", i.e. all the registers are auto-incrementing/decrementing, so perhaps an interrupt after copying N bytes will simply leave the registers as dst+N, src+N, num_bytes-N, and you can jump back to that instruction and resume exactly where it left off. All the state needed for resumption is in the explicitly-listed registers, unlike LDM/STM where it has to be hidden in a status register (like on ARMv7-M) or dropped entirely (like on ARMv7-A). That's just a guess but it seems like a clean way of implementing it.
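
That guess can be sketched in software as follows. This is a model of the idea only, not of the actual Arm instructions: the struct copy_regs, the function copy_step and the "budget" that stands in for an interrupt arriving are all invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The three explicitly-listed registers; they hold all resume state. */
typedef struct {
    unsigned char       *dst;  /* advances as bytes are written */
    const unsigned char *src;  /* advances as bytes are read */
    size_t               n;    /* bytes still to copy */
} copy_regs;

/* Copy at most `budget` bytes, then stop as if an interrupt arrived.
 * Returns 1 once the whole copy has finished, 0 if work remains. */
static int copy_step(copy_regs *r, size_t budget)
{
    assert(r != 0);
    while (r->n > 0 && budget > 0) {
        *r->dst++ = *r->src++;
        r->n--;
        budget--;
    }
    return r->n == 0;
}
```

Calling copy_step again with the same copy_regs is the equivalent of jumping back to the instruction: it picks up at dst+N, src+N, num_bytes-N with no hidden status bits involved.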

Zeroing registers

atnot — Sat, 25 Sep 2021 09:59:50 +0000

> they're handled entirely by the register renaming stage and don't use any execution resources.

To me this was an argument for such an instruction, not against. Because one resource they do take is space in the instruction cache. I don't know how much of an issue it is in practice, but reducing code size is rarely bad for performance.

Zeroing registers

roc — Sat, 25 Sep 2021 09:46:37 +0000

The new ARM memcpy/memset instructions must have some strategy for handling interrupts...

Zeroing registers

excors — Sat, 25 Sep 2021 09:42:11 +0000

> Maybe processors should have a special instruction to zero all registers from a certain number upwards.

That sounds like it may have similar problems to ARM's LDM/STM instructions (which load/store an arbitrary subset of registers R0-R15 to consecutive memory addresses, in a single instruction), which they removed from ARMv8 and replaced with LDP/STP (which load/store an arbitrary pair of registers). In particular the instruction will take many cycles to execute, and you might get an interrupt halfway through executing the instruction, and you probably don't want to delay the interrupt handler for many cycles, so you need a way to suspend and resume the partially-executed instruction.

It's not necessarily safe to jump to PC-4 and re-execute the instruction, because the PC register can be in the instruction's target register list and might have been overwritten already (unless you make sure PC is always the last register - fortunately true on ARMv7), or because it's accessing a special memory region where reads/writes are not idempotent and you can't safely re-execute them (though that's not so relevant for the hypothetical zero-multiple instruction).

It looks like Cortex-M usually implements LDM/STM by storing "interruptible-continuable instruction bits" in a status register, so it can resume from partway through the instruction when returning from an interrupt handler. Cortex-A doesn't, it just tells the programmer that the instruction might get restarted so you really shouldn't be using it on non-idempotent memory regions. It works but it seems unpleasantly messy.

I suppose zeroing registers is a lot simpler because it doesn't touch memory, but it also seems unnecessary because modern CPUs can already do that in zero cycles. E.g. the Cortex-A78 Optimization Guide (https://developer.arm.com/documentation/102160/latest) describes "Zero latency MOVs" which includes "MOV Xd, #0" - they're handled entirely by the register renaming stage and don't use any execution resources.

Zeroing registers

pm215 — Sat, 25 Sep 2021 07:54:16 +0000

That would effectively be baking in calling convention details to the instruction set; and I think most calling conventions in use today don't put the callee-saves registers in a single nice neat contiguous range that extends up to the top of the register file. Eg on 64-bit Arm you'd need to clear x1-x8 (assuming a return value in x0) and x19-x29, but leave untouched x9-x18 and x30-x31. I suppose you could have an insn that took an upper and lower bound of registers to clear, and use it twice. It would be interesting to hear from somebody who knows about modern CPU implementations about whether it could be implemented internally more simply than 'break into micro-ops zeroing each register'...
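
The "upper and lower bound, used twice" variant can be modelled like this. The function zero_range is hypothetical, and the register numbers in the usage note are simply the ones quoted in the comment above, not a statement about any particular ABI document.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32  /* integer registers on 64-bit Arm, per the comment */

/* Hypothetical range-clearing operation: zero regs[lo] through regs[hi]. */
static void zero_range(uint64_t regs[NUM_REGS], unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < NUM_REGS);
    for (unsigned i = lo; i <= hi; i++)
        regs[i] = 0;
}
```

Two calls, zero_range(regs, 1, 8) followed by zero_range(regs, 19, 29), cover the two blocks named above while leaving x0, x9-x18 and x30-x31 untouched. Two 5-bit bounds need 10 bits of encoding per instruction, compared with 32 bits for a full mask.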

Zeroing registers

epa — Sat, 25 Sep 2021 06:28:42 +0000

Maybe processors should have a special instruction to zero all registers from a certain number upwards. If the function call returns results in registers (rather than on the stack) you wouldn’t want to zero all of them. ‘ZERORS 5’ would zero all registers from register 5 upwards.

Two security improvements for GCC

kees — Sat, 25 Sep 2021 02:23:55 +0000

The kernel's support for -ftrivial-auto-var-init={pattern, zero} already works with GCC since it shares the same command-line option as Clang. :)