Cook: Security things in Linux v4.20
Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use 'uninitialized' bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.
Posted Dec 27, 2018 21:51 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (5 responses)
English translation: It's more than 1% for some workloads we don't care about.
Posted Dec 27, 2018 22:21 UTC (Thu)
by lkundrak (subscriber, #43452)
[Link] (1 response)
It seems to me that the option can be disabled precisely because the workloads where it would cause a significant performance penalty are being cared about.
Posted Dec 28, 2018 16:37 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link]
They cannot possibly have tested "most workloads". Consequently, they've done some workload-simulation testing, and sometimes the performance impact was <1% and sometimes it wasn't. They believe that the <1% case will cover "most [real] workloads", as stated. But this is a value judgement and not a statement of fact: it's conjectured not to hurt anything they consider important enough to worry about, i.e., "the [overwhelming] majority of cases".
Posted Dec 28, 2018 1:56 UTC (Fri)
by flussence (guest, #85566)
[Link]
Posted Dec 28, 2018 15:27 UTC (Fri)
by ncm (guest, #165)
[Link] (1 response)
If some workloads are slowed by 10%, but they are Python jobs, it probably doesn't matter, because anybody who cares would not be using Python for that job. (Substitute Ruby, Perl, bash, PHP, any aggressively slow language, as needed.)
I would be interested to learn how many security holes this closes. I would rather see the stack randomized than zeroed, to better flush out bugs.
Posted Dec 28, 2018 17:00 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link]
People who care design special-purpose hardware for "the job". Everybody else is obviously fair game!
Posted Dec 28, 2018 5:06 UTC (Fri)
by jreiser (subscriber, #11027)
[Link] (4 responses)
Posted Dec 28, 2018 6:10 UTC (Fri)
by eru (subscriber, #2753)
[Link]
Posted Dec 28, 2018 21:17 UTC (Fri)
by ncm (guest, #165)
[Link] (2 responses)
Posted Dec 29, 2018 23:52 UTC (Sat)
by jreiser (subscriber, #11027)
[Link] (1 response)
Posted Dec 30, 2018 9:50 UTC (Sun)
by ncm (guest, #165)
[Link]
Yeah, I know it doesn't seem too easy.
But remember, by definition no other core is looking at your dead stack frames, so there's no problem with coherency. It would suffice to clear dirty flags just from L1 cache. L1 cache is small enough you could afford an extra bit per byte, indicating which are (still) considered dirty. You wouldn't need to be very precise about what was being cleared -- from SP down to the next page boundary below would do. Any fraction of that would be better than nothing -- even just down to the next cache line. Returning up a call chain could clear bits in various lines incrementally, and any line with no dirty bits left, then, wouldn't trigger a bus cycle.
Similarly for writes into new stack frames -- no need to fill in misses around writes with a bus cycle, because it's all the same anyway.
Intel and AMD architectural engineers are supposed to be always on the lookout for ways to use up the extra transistors the process people keep providing. If the cache people can't bring themselves to assume dead stack frames are dead, I bet executing an extra instruction per stack-frame op -- assurance provided by compilers -- would be cheap compared to the bus cycles they could obviate.
Of course there's no way to know what effect this could have, without serious simulations. It might be that stack memory is hot enough that most writes are themselves overwritten before they can get flushed to RAM, so that you only get misses on the way down the first time, and writebacks only after bouncing up and down the stack for a good while, after a lot of work has been done.
Posted Dec 28, 2018 22:13 UTC (Fri)
by berndp (guest, #52035)
[Link] (1 response)
The referenced blog entry is quite short ...
Posted Dec 30, 2018 0:07 UTC (Sun)
by jreiser (subscriber, #11027)
[Link]
> be using Python for that job. (Substitute Ruby, Perl, bash, PHP, any aggressively slow language, as needed.)
[SCNR]
... this will need compiler support for pre-initialization ... For some architectures this can be implemented sooner by re-writing vmlinux.o, or by using a script to edit assembly language as generated by gcc. For instance, some current code on x86_64:
Pre-clearing a local frame
shift_arg_pages: # fs/exec.c
callq __fentry__ # 5 bytes of .text
push %r15
push %r14
push %r13
push %r12
push %rbp
push %rbx
sub $0xa0,%rsp # 7 bytes of .text
Binary re-writing can change the sub instruction to
movb $(0xa0-0x80)/8,%al; call push_zeroes
which can clear up to 3KiB; up to 6KiB if counting by 16. If the space to be cleared is less than 0x80 bytes then the sub is only 4 bytes of .text, and in-place re-writing is not as easy; so co-operate with the call __fentry__ or perhaps other intervening instructions. The number of different subroutine prologs is not that large, and many of them can be unified anyway (especially with a few well-chosen rotations of instructions from in-line to closed-subroutine), so handling all the cases is merely tedious.
Hardware makes this difficult. "Inhibit writeback" (when available, such as dcbst and friends on IBM PowerPC) works only for whole cache lines, which today are 32, 64, or 128 bytes (or larger). Most frames are not cache-aligned; a large number intersect at most two cache lines, and dynamically determining which whole lines lie inside larger frames is not fast. Also, this would destroy automatic cache consistency, which for decades has been one of the triumphs of Intel hardware.
Logging the discovery