
Finding Spectre vulnerabilities with smatch

Posted Apr 23, 2018 23:34 UTC (Mon) by excors (subscriber, #95769)
In reply to: Finding Spectre vulnerabilities with smatch by mchehab
Parent article: Finding Spectre vulnerabilities with smatch

> as far as I understand, "elsewhere" is actually limited to L1 cache size

I don't see why that would be the case. An attacker could set e.g. f->index = 0x80000000. The CPU may (incorrectly) speculatively predict the bounds check will pass, then speculatively read format[f->index].name (which is about 32GB after 'format'), then speculatively execute the strlcpy and read characters from that name string. Any read can leak information about the address that was accessed (via its effect on the caches), and in this case the address is the value at an attacker-controlled location in a ~64GB region after 'format', so the attacker could use it to leak the contents of potentially sensitive memory.
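
For concreteness, here is a minimal sketch of the gadget shape being described. The struct layout and names are hypothetical stand-ins, loosely modelled on the v4l2-style code under discussion, not the actual kernel source:

    #include <stddef.h>
    #include <string.h>

    #define NUM_FORMATS 32

    /* Hypothetical stand-ins for the structures involved. */
    struct fmt_desc {
        const char *name;     /* pointer loaded during speculation */
        unsigned int fourcc;  /* padded: sizeof(struct fmt_desc) == 16 */
    };

    struct fmt_request {
        unsigned int index;   /* attacker-controlled, e.g. 0x80000000 */
    };

    static struct fmt_desc format[NUM_FORMATS];

    int enum_format(const struct fmt_request *f, char *out, size_t outlen)
    {
        if (f->index >= NUM_FORMATS)  /* bounds check the CPU may mispredict */
            return -1;

        /* Under misprediction this still executes speculatively:
         * format[f->index] reads ~32GB past the array for index
         * 0x80000000 (2^31 * 16-byte elements), and the copy then
         * dereferences whatever value was read as a string pointer,
         * leaving cache footprints that encode the secret data. */
        const char *name = format[f->index].name;
        strncpy(out, name, outlen);   /* stands in for the kernel's strlcpy */
        if (outlen > 0)
            out[outlen - 1] = '\0';
        return 0;
    }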



Finding Spectre vulnerabilities with smatch

Posted Apr 28, 2018 23:25 UTC (Sat) by dvdeug (guest, #10998)

According to https://gist.github.com/jboner/2841832 , a main memory access is 20 times as slow as a mispredicted branch: 100 ns versus 5 ns. I can imagine designs that would do speculative main memory accesses, but even before Spectre, the cost of loading something that won't be used into the cache (evicting data that will actually be used) and clogging the memory bus, all to turn a 100 ns load into a 95 ns load, seems unproductive.

Finding Spectre vulnerabilities with smatch

Posted Apr 29, 2018 14:06 UTC (Sun) by excors (subscriber, #95769)

The benefit can be much bigger than that. E.g. in code like "struct { int n; bool last; char pad[56]; } *p; while (!p->last) { sum += p->n; ++p; }", if you did all the loads and branches sequentially based on their dependencies, it would take ~100ns per iteration (since you can't start the next load until you've checked the result of the previous one). But if you predict the branches then you can queue up dozens of (speculative) loads at once, and complete dozens of iterations per 100ns (limited only by memory bandwidth and queue sizes), which is a massive improvement. That extra parallelism is worth a tiny reduction in cache efficiency.
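
A compilable version of that snippet, as a sketch (field sizes are chosen so each record fills one 64-byte cache line; as noted below, a real benchmark would need less regular code to defeat the hardware prefetcher):

    #include <stdbool.h>

    /* 4 + 1 + 56 bytes, padded to 64: one record per cache line. */
    struct node {
        int n;
        bool last;
        char pad[56];
    };

    long sum_nodes(const struct node *p)
    {
        long sum = 0;
        /* Executed strictly in order, each iteration waits ~100ns for
         * p->last to arrive from main memory before the loop branch
         * can resolve. With the branch predicted taken, the CPU can
         * issue the loads for many future iterations in parallel. */
        while (!p->last) {
            sum += p->n;
            ++p;
        }
        return sum;
    }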

(In practice you'd need slightly more complicated code to avoid simply being optimised by the cache prefetcher etc, but presumably that kind of code comes up enough in benchmarks and/or real applications to be a worthwhile optimisation, given that Intel has been doing it for two decades.)

