Gregg: CPU Utilization is Wrong
Posted May 11, 2017 17:33 UTC (Thu) by flussence (guest, #85566)
In reply to: Gregg: CPU Utilization is Wrong by zlynx
Parent article: Gregg: CPU Utilization is Wrong
Bandwidth is easy to fix by just throwing more parallelism at it, though; memory latency is much more of a pain and is only getting worse. Whoever invents something to fill the gap between DRAM and SRAM will make a lot of money.
Posted May 11, 2017 19:15 UTC (Thu) by excors (subscriber, #95769)
(Some blocks can even be compressed to 0.5 bytes and 0 memory cycles - a few tag values are reserved for solid (0,0,0,255) and (0,0,0,0) etc, so the block can be 'decompressed' without even looking at VRAM. That also means you can clear an entire framebuffer to zero without any VRAM writes at all.)
That works because 256 bytes is an 8x8 block of RGBA pixels, and the GPU has lots of caches that handle blocks of that size; a lot of GPU operations have strong spatial locality, so they make good use of those large cache lines, and 8x8 chunks of framebuffers often contain losslessly-compressible data (flat colours etc).
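A minimal sketch of how a compressor might classify a 256-byte (8x8 RGBA) block along the lines described above. The tag names, the flat-colour check, and the helper itself are illustrative assumptions, not any real GPU's scheme:

```python
# Classify an 8x8 RGBA framebuffer block (256 bytes) for compression.
# Tag values and categories here are hypothetical, for illustration only.
SOLID_BLACK = bytes([0, 0, 0, 255]) * 64  # opaque black, (0,0,0,255)
SOLID_CLEAR = bytes([0, 0, 0, 0]) * 64    # fully transparent, (0,0,0,0)

def classify_block(block: bytes) -> str:
    assert len(block) == 256  # 8x8 pixels, 4 bytes each
    if block == SOLID_CLEAR:
        return "tag:clear"    # reserved tag: 'decompressed' with no VRAM read
    if block == SOLID_BLACK:
        return "tag:black"    # likewise reserved in tag memory
    pixels = {block[i:i + 4] for i in range(0, 256, 4)}
    if len(pixels) == 1:
        return "tag:flat"     # one colour: store 4 bytes instead of 256
    return "tag:uncompressed"

print(classify_block(SOLID_CLEAR))                 # tag:clear
print(classify_block(bytes([7, 7, 7, 255]) * 64))  # tag:flat
```

Clearing a framebuffer then amounts to writing the reserved tag for every block, with no VRAM traffic at all.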
It also works because the driver knows where framebuffers are stored in memory, and can enable/disable compression per 64KB page (with a 12-bit index into tag memory per page). That means it can only compress 256MB of its address space (and possibly much less in practice). That's still worthwhile since framebuffers are often read and written many times per frame (perhaps hundreds if you've got a lot of alpha-blending), whereas most of the other stuff in VRAM is read once per frame or not at all.
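The 256MB limit falls straight out of the 12-bit index: at most 2^12 pages can point into tag memory, at 64KB per page. A quick check of that arithmetic:

```python
# Address space that can be covered by a 12-bit per-page index into
# tag memory, at 64 KiB per page.
PAGE = 64 * 1024
compressible = (2 ** 12) * PAGE  # 4096 pages
print(compressible // (1024 ** 2), "MiB")  # 256 MiB
```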
For CPUs, it looks like modern Intel ones read 8B per cycle per memory channel, and each 64B cache line is stored in a single channel, so there's the same potential 8:1 compression ratio. But I guess the big problem is the tag memory: if you have 64GB of RAM and all of it can be compressed, with a 4-bit tag per cache line, you need 512MB of tag memory, perhaps stored in RAM with a large TLB-like specialised cache so the CPU can access it with minimal latency, which sounds very expensive. If you only support compression on a fraction of RAM at once, and let the application decide where to enable it, that still sounds like a lot of effort for everyone (hardware, OS, application) that will only benefit very specialised use cases.
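The 512MB figure follows from one 4-bit tag per 64-byte cache line over the whole 64GB. Working it through:

```python
# Tag-memory overhead if all of a 64 GiB RAM is compressible,
# with a 4-bit tag per 64-byte cache line.
RAM = 64 * 1024 ** 3
LINE = 64
TAG_BITS = 4

lines = RAM // LINE                 # 2**30 cache lines
tag_bytes = lines * TAG_BITS // 8   # total tag storage
print(tag_bytes // (1024 ** 2), "MiB of tag memory")  # 512 MiB
```

A 128th of RAM spent on tags, before counting the specialised cache needed to reach them quickly, which is why restricting compression to a fraction of RAM looks tempting despite the software burden.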