Pre-clearing a local frame
Posted Dec 28, 2018 21:17 UTC (Fri) by ncm (guest, #165)
In reply to: Pre-clearing a local frame by jreiser
Parent article: Cook: Security things in Linux v4.20
Posted Dec 29, 2018 23:52 UTC (Sat)
by jreiser (subscriber, #11027)
Posted Dec 30, 2018 9:50 UTC (Sun)
by ncm (guest, #165)
Yeah, I know it doesn't seem too easy.
But remember, by definition no other core is looking at your dead stack frames, so there's no problem with coherency. It would suffice to clear dirty flags just from L1 cache. L1 cache is small enough you could afford an extra bit per byte, indicating which are (still) considered dirty. You wouldn't need to be very precise about what was being cleared -- from SP down to the next page boundary below would do. Any fraction of that would be better than nothing -- even just down to the next cache line. Returning up a call chain could clear bits in various lines incrementally, and any line with no dirty bits left, then, wouldn't trigger a bus cycle.
Similarly for writes into new stack frames -- no need to fill in misses around writes with a bus cycle, because it's all the same anyway.
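To make the dirty-bit idea concrete, here is a toy software model, not a description of any real cache. It assumes 64-byte lines, 4096-byte pages, and a downward-growing stack; the names `ToyL1`, `write`, `retire_below`, and `writebacks_pending` are invented for illustration. It keeps one dirty bit per byte and, on return, clears the bits from just below SP down to the page boundary, so that a line with no dirty bits left needs no writeback.

```python
LINE = 64    # assumed cache-line size in bytes
PAGE = 4096  # assumed page size in bytes


class ToyL1:
    """Toy model only: one dirty bit per byte, grouped by cache line."""

    def __init__(self):
        # line number -> bitmask of dirty bytes within that line
        self.dirty = {}

    def write(self, addr, size):
        """Mark the bytes [addr, addr + size) as dirty."""
        for a in range(addr, addr + size):
            line, off = divmod(a, LINE)
            self.dirty[line] = self.dirty.get(line, 0) | (1 << off)

    def retire_below(self, sp):
        """Clear dirty bits from just below sp down to the page boundary."""
        page_base = (sp - 1) // PAGE * PAGE
        for line in range(page_base // LINE, (sp - 1) // LINE + 1):
            if line not in self.dirty:
                continue
            base = line * LINE
            lo, hi = max(base, page_base), min(base + LINE, sp)
            mask = ((1 << (hi - lo)) - 1) << (lo - base)
            self.dirty[line] &= ~mask
            if self.dirty[line] == 0:
                del self.dirty[line]  # nothing left that would need a bus cycle

    def writebacks_pending(self):
        """Lines that would still have to be written back on eviction."""
        return sorted(self.dirty)


cache = ToyL1()
sp = 0x10210                  # caller's stack pointer, not line-aligned
cache.write(sp, 32)           # live data in the caller's frame
cache.write(sp - 96, 96)      # callee's frame, dead once it returns
print(cache.writebacks_pending())  # [1030, 1031, 1032]
cache.retire_below(sp)
print(cache.writebacks_pending())  # [1032]
```

Only the line that still holds live caller bytes stays dirty; the two lines holding nothing but the dead frame drop out. The model says nothing about what the extra bits or the clearing logic would cost in hardware.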
Intel and AMD architectural engineers are supposed to be always on the lookout for ways to use up the extra transistors the process people keep providing. If the cache people can't bring themselves to assume dead stack frames are dead, I bet executing an extra instruction per stack-frame op -- assurance provided by compilers -- would be cheap compared to the bus cycles they could obviate.
Of course there's no way to know what effect this could have, without serious simulations. It might be that stack memory is hot enough that most writes are themselves overwritten before they can get flushed to RAM, so that you only get misses on the way down the first time, and writebacks only after bouncing up and down the stack for a good while, after a lot of work has been done.
Hardware makes this difficult. "Inhibit writeback" (when available, such as dcbi and friends on IBM PowerPC) works only for whole cache lines, which today are 32, 64, or 128 bytes (or larger). Most frames are not cache-aligned, a large number intersect at most two cache lines, and dynamically determining which whole lines are inside larger frames is not fast. Also, this would destroy automatic cache consistency, which for decades has been one of the triumphs of Intel hardware.
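The whole-line restriction can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: 64-byte lines, a frame given as a half-open byte range, and a hypothetical helper name `whole_lines`. It only shows which part of a frame a whole-line operation could cover, not how quickly hardware or a function prologue could work that out.

```python
def whole_lines(frame_lo, frame_hi, line=64):
    """Return (start, end) of the cache lines lying entirely inside
    the frame [frame_lo, frame_hi), or None if there are none."""
    start = (frame_lo + line - 1) // line * line  # round up to a line boundary
    end = frame_hi // line * line                 # round down to a line boundary
    return (start, end) if start < end else None


# A 64-byte frame that is not line-aligned touches two lines and owns neither.
print(whole_lines(0x1010, 0x1050))  # None

# A larger unaligned frame: only the interior lines qualify.
print(whole_lines(0x1010, 0x1100))  # (4160, 4352), i.e. 0x1040 to 0x1100
```

In the second case the first 48 bytes of the frame share a line with whatever sits below it, so a whole-line operation could not touch them without also affecting that neighbouring data.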