
Gregg: CPU Utilization is Wrong

Posted May 11, 2017 17:33 UTC (Thu) by flussence (guest, #85566)
In reply to: Gregg: CPU Utilization is Wrong by zlynx
Parent article: Gregg: CPU Utilization is Wrong

I wonder if wholesale compression of data to reduce the RAM bottleneck will become the norm. We already do something like that with compact instruction encodings on x86/ARM, and some language runtimes play tricks to store shorter pointers. Graphics land is already there, using all kinds of weird home-grown texture formats (and a few standard ones) and letting the GPU expand them 60 times a second, because that works out faster than keeping fully-rendered textures in VRAM. (Removing bufferbloat seems to be a good idea in a lot of places - who'd have thought?)

Bandwidth is easy to fix by just throwing more parallelism at it, though - memory latency is much more of a pain and it's only getting worse. Whoever invents something to fill the gap between DRAM and SRAM will make a lot of money.



Gregg: CPU Utilization is Wrong

Posted May 11, 2017 19:15 UTC (Thu) by excors (subscriber, #95769)

On some NVIDIA GPUs, I believe framebuffer compression works by losslessly compressing each 256-byte block to somewhere between 32.5 and 256.5 bytes when writing to VRAM. (The 0.5 bytes is a tag stored in special on-chip memory, describing how to decompress when reading.) Each block still has the full 256 B reserved in VRAM, but only a portion of those bytes needs to be written or read. The memory bus width is often 256 bits (32 bytes), so a well-compressed block can be read in 1 memory cycle instead of 8, which is a big saving in bandwidth (but has little effect on latency).

(Some blocks can even be compressed to 0.5 bytes and 0 memory cycles - a few tag values are reserved for solid (0,0,0,255) and (0,0,0,0) etc, so the block can be 'decompressed' without even looking at VRAM. That also means you can clear an entire framebuffer to zero without any VRAM writes at all.)
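The cycle counts in the two paragraphs above can be sketched as a toy model. The constants (256 B blocks, 32 B bus) come from the comment itself; the function name and rounding scheme are purely illustrative, not a description of any real NVIDIA hardware:

```python
# Toy model of tag-based framebuffer compression: each 256 B block
# (an 8x8 RGBA tile) compresses to some payload size, and the per-block
# tag tells the reader how many bus transfers the payload occupies.

BLOCK_BYTES = 256   # one 8x8 tile of 4-byte RGBA pixels
BUS_BYTES = 32      # 256-bit memory bus = 32 bytes per cycle

def read_cycles(compressed_bytes: int) -> int:
    """Memory cycles needed to fetch a block whose payload compressed
    to `compressed_bytes`. 0 means the tag alone encodes the block
    (e.g. solid black), so no VRAM access happens at all."""
    assert 0 <= compressed_bytes <= BLOCK_BYTES
    # Round the payload up to whole bus transfers (ceiling division).
    return -(-compressed_bytes // BUS_BYTES)

print(read_cycles(256))  # uncompressed block: 8 cycles
print(read_cycles(32))   # best lossless case: 1 cycle
print(read_cycles(0))    # tag-only 'solid colour' block: 0 cycles
```

The zero-payload case is what makes the fast framebuffer clear possible: writing the reserved tag value touches only on-chip tag memory, not VRAM.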

That works because 256 bytes is an 8x8 block of RGBA pixels: the GPU has lots of caches that handle blocks of that size, many GPU operations have strong spatial locality so they make good use of those large cache lines, and 8x8 chunks of framebuffers often contain losslessly-compressible data (flat colours etc).

It also works because the driver knows where framebuffers are stored in memory, and can enable/disable compression per 64KB page (with a 12-bit index into tag memory per page). That means at most 256MB of the address space can be compressed at once (and possibly much less in practice). That's still worthwhile, since framebuffers are often read and written many times per frame (perhaps hundreds if you've got a lot of alpha-blending), whereas most of the other stuff in VRAM is read once per frame or not at all.
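The 256MB coverage figure above follows directly from the stated parameters - a quick check, assuming (as the comment does) a 12-bit per-page index and 64KB pages:

```python
# A 12-bit index can name at most 2**12 tag-memory slots, and each
# slot governs one 64 KiB page, so that bounds the compressible span.
PAGE_BYTES = 64 * 1024
TAG_SLOTS = 2 ** 12

compressible = TAG_SLOTS * PAGE_BYTES
print(compressible // 1024 ** 2)  # 256 (MiB of compressible address space)
```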

For CPUs, it looks like modern Intel ones read 8B per cycle per memory channel, and each 64B cache line is stored in a single channel, so there's the same potential 8:1 compression ratio. But I guess the big problem is the tag memory: if you have 64GB of RAM and all of it can be compressed, with a 4-bit tag per cache line, you need 512MB of tag memory. That would presumably have to live in RAM itself, behind a large TLB-like specialised cache so the CPU can access tags with minimal latency, which sounds very expensive. If you only support compression on a fraction of RAM at once, and let the application decide where to enable it, that still sounds like a lot of effort for everyone (hardware, OS, application) that will only benefit very specialised use cases.
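The 512MB estimate is just the tag overhead scaled up to the whole address space - a sketch of that arithmetic, using the comment's figures (64GB of RAM, 64B cache lines, 4-bit tags):

```python
# Tag storage needed to make every cache line in RAM compressible.
RAM_BYTES = 64 * 1024 ** 3   # 64 GiB of RAM
LINE_BYTES = 64              # one cache line
TAG_BITS = 4                 # per-line compression tag

num_lines = RAM_BYTES // LINE_BYTES          # 2**30 cache lines
tag_bytes = num_lines * TAG_BITS // 8        # pack 2 tags per byte
print(tag_bytes // 1024 ** 2)  # 512 (MiB of tag memory)
```

That's a fixed ~0.8% overhead of total RAM, which is modest as a fraction but enormous as a dedicated on-chip structure - hence the suggestion that tags would need their own cache hierarchy.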


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds