Cook: Security things in Linux v4.20
Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use 'uninitialized' bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.
Posted Dec 27, 2018 21:51 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (5 responses)
English translation: It's more than 1% for some workloads we don't care about.
Posted Dec 27, 2018 22:21 UTC (Thu)
by lkundrak (subscriber, #43452)
[Link] (1 response)
It seems to me that the option can be disabled precisely because the workloads where it would cause a significant performance penalty are being cared about.
Posted Dec 28, 2018 16:37 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link]
They cannot possibly have tested "most workloads". Consequently, they've done some workload-simulation testing, and sometimes the performance impact was <1% and sometimes it wasn't. They believe that the <1% case will cover "most [real] workloads", as stated. But this is a value judgement and not a statement of fact: it's conjectured not to hurt anything they consider important enough to worry about, i.e., "the [overwhelming] majority of cases".
Posted Dec 28, 2018 1:56 UTC (Fri)
by flussence (guest, #85566)
[Link]
Posted Dec 28, 2018 15:27 UTC (Fri)
by ncm (guest, #165)
[Link] (1 response)
If some workloads are slowed by 10%, but they are Python jobs, it probably doesn't matter, because anybody who cares would not be using Python for that job. (Substitute Ruby, Perl, bash, PHP, any aggressively slow language, as needed.)
I would be interested to learn how many security holes this closes. I would rather see the stack randomized than zeroed, to better flush out bugs.
Posted Dec 28, 2018 17:00 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link]
People who care design special-purpose hardware for "the job". Everybody else is obviously fair game!
Posted Dec 28, 2018 5:06 UTC (Fri)
by jreiser (subscriber, #11027)
[Link] (4 responses)
Posted Dec 28, 2018 6:10 UTC (Fri)
by eru (subscriber, #2753)
[Link]
Posted Dec 28, 2018 21:17 UTC (Fri)
by ncm (guest, #165)
[Link] (2 responses)
Posted Dec 29, 2018 23:52 UTC (Sat)
by jreiser (subscriber, #11027)
[Link] (1 response)
Posted Dec 30, 2018 9:50 UTC (Sun)
by ncm (guest, #165)
[Link]
Yeah, I know it doesn't seem too easy.
But remember, by definition no other core is looking at your dead stack frames, so there's no problem with coherency. It would suffice to clear dirty flags just from L1 cache. L1 cache is small enough you could afford an extra bit per byte, indicating which are (still) considered dirty. You wouldn't need to be very precise about what was being cleared -- from SP down to the next page boundary below would do. Any fraction of that would be better than nothing -- even just down to the next cache line. Returning up a call chain could clear bits in various lines incrementally, and any line with no dirty bits left, then, wouldn't trigger a bus cycle.
Similarly for writes into new stack frames -- no need to fill in misses around writes with a bus cycle, because it's all the same anyway.
Intel and AMD architectural engineers are supposed to be always on the lookout for ways to use up the extra transistors the process people keep providing. If the cache people can't bring themselves to assume dead stack frames are dead, I bet executing an extra instruction per stack-frame op -- assurance provided by compilers -- would be cheap compared to the bus cycles they could obviate.
Of course there's no way to know what effect this could have, without serious simulations. It might be that stack memory is hot enough that most writes are themselves overwritten before they can get flushed to RAM, so that you only get misses on the way down the first time, and writebacks only after bouncing up and down the stack for a good while, after a lot of work has been done.
Posted Dec 28, 2018 22:13 UTC (Fri)
by berndp (guest, #52035)
[Link] (1 response)
The referenced blog entry is quite short ...
Posted Dec 30, 2018 0:07 UTC (Sun)
by jreiser (subscriber, #11027)
[Link]
> be using Python for that job. (Substitute Ruby, Perl, bash, PHP, any aggressively slow language, as needed.)
[SCNR]
... this will need compiler support for pre-initialization ... For some architectures this can be implemented sooner by re-writing vmlinux.o, or by using a script to edit assembly language as generated by gcc. For instance, some current code on x86_64:
Pre-clearing a local frame
shift_arg_pages: # fs/exec.c
callq __fentry__ # 5 bytes of .text
push %r15
push %r14
push %r13
push %r12
push %rbp
push %rbx
sub $0xa0,%rsp # 7 bytes of .text
Binary re-writing can change the sub instruction to
movb $(0xa0-0x80)/8,%al; call push_zeroes
which can clear up to 3KiB; up to 6KiB if counting by 16. If the space to be cleared is less than 0x80 bytes then the sub is only 4 bytes of .text, and in-place re-writing is not as easy; so co-operate with the call __fentry__ or perhaps other intervening instructions. The number of different subroutine prologs is not that large, and many of them can be unified anyway (especially with a few well-chosen rotations of instructions from in-line to closed-subroutine), so handling all the cases is merely tedious.
Hardware makes this difficult. "Inhibit writeback" (when available, such as dcbst and friends on IBM PowerPC) works only for whole cache lines, which today are 32, 64, or 128 bytes (or larger). Most frames are not cache-aligned; a large number intersect at most two cache lines, and dynamically determining which whole lines lie inside larger frames is not fast. Also, this would destroy automatic cache consistency, which for decades has been one of the triumphs of Intel hardware.
Logging the discovery