
Gregg: CPU Utilization is Wrong

Brendan Gregg asserts that CPU utilization is the wrong metric to be looking at when tuning a system. Much of the time when the CPU appears to be busy, it's actually just waiting for memory. "The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions were completed for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as 4-wide, referring to the instruction fetch/decode path. Which means, the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPUs are running at 19.5% of their top speed. The new Intel Skylake processors are 5-wide."
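Spelling out the arithmetic in that excerpt: IPC is retired instructions divided by elapsed cycles, and the fraction of the core's issue width in use is IPC divided by the width:

\[
\mathrm{IPC} = \frac{\text{instructions retired}}{\text{cycles}}, \qquad \frac{\mathrm{IPC}}{\text{width}} = \frac{0.78}{4.0} \approx 19.5\%
\]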




Gregg: CPU Utilization is Wrong

Posted May 10, 2017 16:26 UTC (Wed) by Homer512 (subscriber, #85295) [Link] (7 responses)

When I don't have perf immediately available, I optimize by ear:
Low fan noise -> low IPC.
High fan noise -> high IPC.
Very high fan noise -> Oh look, GPU computing!

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 16:59 UTC (Wed) by linuxrocks123 (subscriber, #34648) [Link] (6 responses)

Posting based just on that excerpt (sorry, but I'm tired right now), this sounds very wrong. Yes, a four-wide processor can retire four instructions in a cycle, and occasionally does, but the average IPC of a processor is never going to be anywhere near its theoretical maximum. Also, maximizing the IPC is the job of the compiler writers and processor architects, and people typically look at OS-provided CPU utilization statistics to figure out whether the computer is overloaded. Knowing that the computer wouldn't be overloaded if the compiler and CPU designers had done a better job and gotten the average IPC of the currently running code closer to the theoretical maximum isn't helpful.

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 17:11 UTC (Wed) by Homer512 (subscriber, #85295) [Link] (4 responses)

IPC is a decent indicator for the efficiency of the memory access pattern and that's something that you as a programmer control, not the compiler. Sure, perf also reports cache misses but not stalling the CPU (read: IPC) is what you ultimately care about.

One aspect where IPC is misleading is vectorization. Higher vectorization lowers IPC.
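A minimal sketch of that effect (not from the original comment; assumes an AVX-capable x86-64 CPU, compiled with something like "gcc -O2 -mavx -fno-tree-vectorize" so the scalar loop stays scalar): both functions do the same additions, but the AVX version retires roughly 1/8 as many instructions, so perf stat reports a lower IPC even though it finishes sooner.

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)      /* one add per element */
        dst[i] = a[i] + b[i];
}

static void add_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {        /* one add per 8 elements */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                  /* scalar tail */
        dst[i] = a[i] + b[i];
}

int main(void)
{
    size_t n = 1 << 12;                 /* small enough to stay in cache */
    float *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    float *dst = malloc(n * sizeof *dst);
    for (size_t i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    for (int r = 0; r < (1 << 16); r++) add_scalar(dst, a, b, n);
    for (int r = 0; r < (1 << 16); r++) add_avx(dst, a, b, n);

    printf("%f\n", dst[n - 1]);         /* keep the results live */
    return 0;
}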

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 19:49 UTC (Wed) by zlynx (guest, #2285) [Link] (3 responses)

Vector instructions are also a good way to reach peak memory bandwidth. Once you do that, there isn't anywhere else to go with optimization. In a lot of cases, trying a "better" algorithm just introduces a bunch of pipeline stalls. Yeah it did less work, but it took longer to do it.

Although it can be a serious problem for the other CPU cores if you have one program looping updates as fast as it can, using 20 GB/s of RAM. The other cores have a hard time getting a word in edgewise.

A "word" get it? Hahaha. I have to laugh at my own jokes. Who else will?

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 14:52 UTC (Thu) by sdalley (subscriber, #18550) [Link]

> I have to laugh at my own jokes. Who else will?

Well, I chuckled too...

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 17:33 UTC (Thu) by flussence (guest, #85566) [Link] (1 responses)

I wonder if wholesale compression of data to reduce the RAM bottleneck will become the norm. We already do that with operators in x86/ARM, and some language runtimes do tricks to store shorter pointers. Graphics land is already there, using all kinds of weird home-grown texture formats (and a few standard ones) and letting the GPU expand them 60 times a second, because it works out faster than keeping fully-rendered textures in VRAM. (Removing bufferbloat seems to be a good idea in a lot of places - who'd have thought?)

Bandwidth is easy to fix by just throwing more parallelism at it though - memory latency is much more of a pain and only getting worse. Whoever invents something to fill the gap between DRAM and SRAM will make a lot of money.

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 19:15 UTC (Thu) by excors (subscriber, #95769) [Link]

On some NVIDIA GPUs, I believe framebuffer compression works by losslessly compressing 256 bytes to somewhere between 32.5 and 256.5 bytes when writing to VRAM. (The extra 0.5 bytes is a tag stored in special on-chip memory, describing how to decompress when reading.) Each block still has 256B reserved in VRAM but it only needs to write/read a portion of those bytes. The memory bus width is often 256 bits (32 bytes), so it can read a compressed block in 1 memory cycle instead of 8, which is a big saving in bandwidth (but has little effect on latency).

(Some blocks can even be compressed to 0.5 bytes and 0 memory cycles - a few tag values are reserved for solid (0,0,0,255) and (0,0,0,0) etc, so the block can be 'decompressed' without even looking at VRAM. That also means you can clear an entire framebuffer to zero without any VRAM writes at all.)

That works because 256 bytes is 8x8 RGBA pixels, and the GPU has lots of caches that handle blocks of that size, and a lot of GPU operations have strong spatial locality so they make good use of those large cache lines, and 8x8 chunks of framebuffers often contain losslessly-compressible data (flat colours etc).

It also works because the driver knows where framebuffers are stored in memory, and can enable/disable compression per 64KB page (with a 12-bit index into tag memory per page). That means it can only compress 256MB of its address space (and possibly much less in practice). That's still worthwhile since framebuffers are often read and written many times per frame (perhaps hundreds if you've got a lot of alpha-blending), whereas most of the other stuff in VRAM is read once per frame or not at all.

For CPUs, it looks like modern Intel ones read 8B per cycle per memory channel, and each 64B cache line is stored in a single channel, so there's the same potential 8:1 compression ratio. But I guess the big problem is the tag memory: if you have 64GB of RAM and all of it can be compressed, with a 4-bit tag per cache line, you need 512MB of tag memory, perhaps stored in RAM with a large TLB-like specialised cache so the CPU can access it with minimal latency, which sounds very expensive. If you only support compression on a fraction of RAM at once, and let the application decide where to enable it, that still sounds like a lot of effort for everyone (hardware, OS, application) that will only benefit very specialised use cases.
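For reference, the 512MB figure follows directly from the numbers above (64GB of RAM, 64-byte cache lines, a 4-bit tag per line):

\[
\frac{64\,\mathrm{GiB}}{64\,\mathrm{B}} = 2^{30}\ \text{cache lines}, \qquad 2^{30} \times 4\,\mathrm{bit} = 2^{32}\,\mathrm{bit} = 512\,\mathrm{MiB}
\]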

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 17:30 UTC (Wed) by bgregg (guest, #46639) [Link]

Yes, in the article I'm not recommending people seek an IPC of 4.0. Check it out.

IPC, and any of the other PMC metrics, are useful for developers for identifying the nature of CPU usage, and pointing to directions for them to tune.

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 17:20 UTC (Wed) by intgr (subscriber, #39733) [Link] (3 responses)

> The kernel varying the clock rate with speed step.

I think this is just as important a factor as memory stalls. A process showing 10% utilization may actually be severely underclocked by power management, which on a memory-light workload may well mean 2% utilization of maximum frequency. But of course memory speed doesn't scale with CPU frequency, so that's not always the case. I don't think Linux (or any other OS?) compensates for the frequency scaling factor when reporting CPU utilization?
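As a worked example of that claim (the clock ratio of one fifth is an assumption, picked only to make the numbers match): if the reported utilization u is measured against the current frequency f, then the utilization against the maximum frequency is u times f/f_max:

\[
0.10 \times \frac{f}{f_{\max}} = 0.10 \times \frac{1}{5} = 0.02 = 2\%
\]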

While on the topic of legacy performance metrics, why is load average still used by people? It tells you almost nothing about the utilization (load) of the system; a lightly loaded system with many threads may well display a large load average. In some cases it includes processes doing I/O and in other cases it doesn't.

Gregg: CPU Utilization is Wrong

Posted May 15, 2017 14:10 UTC (Mon) by mmeehan (subscriber, #72524) [Link] (2 responses)

Regarding load average, it tells you how many processes are ready to run. If it's larger than the number of cores on the host, some process that's ready to run has to wait. Frequently this points to congestion in the I/O|network|memory subsystems. It's good for ruling out a set of common problems.

Gregg: CPU Utilization is Wrong

Posted May 16, 2017 0:13 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

Well, the load average is just that, an average, so you can run into processes waiting for CPU well below a load average of 1 * core count.

IIRC, the load average also includes processes blocked for disk.

Personally, I start to care at about a load average of 1 * core, but don't really worry about it until around 2 * core. But this will depend on your exact load, so it's something you have to figure out for your own systems.

Gregg: CPU Utilization is Wrong

Posted May 17, 2017 12:16 UTC (Wed) by nix (subscriber, #2304) [Link]

It includes processes blocked in D state, so processes blocked on disk, network filesystem I/O and even stuck on some kernel lock left hanging by an oopsed process :)

For reliable utilization figures -- some idea of how much time the processor actually spent executing instructions -- you really need hardware-level info, i.e. performance counter registers, MSRs or something similar: turbostat can do it for sufficiently-recent x86 processors.

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 17:22 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (1 responses)

What are the perf counters actually measuring? I.e., the number of instructions retired per cycle? It seems like a really interesting metric would be how many instructions the CPU wanted to issue in a cycle but couldn't because an ALU/etc. was stalled.

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 17:34 UTC (Wed) by bgregg (guest, #46639) [Link]

Yes, stall cycles metrics are in the PMCs as well. Really, IPC is just the first of many one would want to use. I mentioned them but didn't cover them in detail in the interests of keeping that post short. My prior, longer, post mentioned more PMCs but no one read it. :)

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 18:23 UTC (Wed) by excors (subscriber, #95769) [Link] (3 responses)

Why does it necessarily matter whether a process spends most of its CPU time on computation or stalled on memory accesses? That time is unavailable to other processes either way (assuming no hyperthreading). If you run some program and its average CPU utilisation is 50%, that tells you that you can run two copies of it at roughly the same speed, but if you run three copies it'll go slower (assuming single core and constant CPU frequency etc). That seems a useful and meaningful and easily-interpretable metric for system performance.

If you're trying to optimise a specific application, your goal is almost always to minimise the wall-clock time it takes to complete its work. If it's a server processing requests at a constant rate, that's equivalent to minimising CPU utilisation. To achieve that goal, process-wide metrics seem generally pretty useless - a non-trivial application is likely to have multiple bottlenecks with very different performance characteristics. If program A spends 50% of its time in a function with IPC 4.0 and 50% with IPC 0.2 (so the process average is 2.1), and program B spends 100% of its time with IPC 2.0, you'll likely get much more benefit from IPC-focused optimisations on program A, but the averages won't tell you that.
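For what it's worth, the 2.1 is a cycle-weighted average: with the same number of cycles C spent in each half,

\[
\mathrm{IPC}_{\text{avg}} = \frac{4.0\,C + 0.2\,C}{2C} = 2.1
\]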

You need a profiler that tells you exactly which functions are taking up the wall-clock time, and once you've identified those functions and decided you want to optimise them, that's when you look at the IPC within each function (along with cache misses, branch mispredictions, etc) to see what optimisation strategies are likely to work in each case, then you optimise and profile and hope the wall-clock time is better (and if not then you go back to the detailed metrics and try again).

And you can't just assume high IPC means efficient code. If you find a way to make your code faster by deleting some redundant computations, without removing any memory stalls, you'll reduce IPC. If you use wider SIMD instructions, you'll reduce IPC. If you insert hundreds of nops everywhere, you'll get excellent IPC but terrible performance. And low IPC doesn't necessarily mean memory stalls, you might just be doing a long serial computation where every instruction depends on the result of the previous one. You can only interpret IPC sensibly in the context of a specific piece of code, not as a black-box metric for a whole application.

Gregg: CPU Utilization is Wrong

Posted May 10, 2017 21:07 UTC (Wed) by drag (guest, #31333) [Link] (2 responses)

It helps you know how your program is behaving once it's deployed.

Why would anybody bother to look at statistics for disk I/O for database software, when there's always lots of room for optimization in lots of different places in the database software and a busy database is going to use lots of disk I/O anyway?

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 9:48 UTC (Thu) by Wol (subscriber, #4433) [Link]

:-)

Don't get me started ...

Disk optimisation is the FIRST place I'd look to optimise a Pick application. It's usually the bottleneck ...

Cheers,
Wol

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 17:58 UTC (Thu) by excors (subscriber, #95769) [Link]

I think CPU utilisation and disk IO are much more similar to each other than to IPC.

If you have two versions of a program doing the same work, and one uses less CPU (and the same disk IO) or uses less disk IO (and the same CPU) then it's almost always the better version, because it can do more work with the same hardware or the same work with cheaper hardware. If one version just has 'better' (higher) IPC (and the same CPU usage and disk IO), it's not better in any meaningful way. (Actually it's probably worse because it's using more instructions to do the same work.)

If you try to find the maximum requests/sec throughput of some server, and then you see it has 100% CPU utilisation, giving it a faster CPU will probably let it process more requests. If it has 100% disk IO usage (by some complex combination of MB/sec and IOPS and whatever), giving it a faster disk will probably let it process more requests. But if you just see it has low IPC, that doesn't tell you anything with much confidence - maybe you could improve performance with bigger CPU caches or faster RAM to reduce memory stalls, but maybe that won't help at all, and maybe a program with high IPC would equally benefit from bigger caches and faster RAM (going from 1.5 to 3.0 is just as good as going from 0.5 to 1.0), so the measurement is not giving you any actionable insight.

For system-/process-level tools like top and iotop and 'perf stat -a', CPU utilisation and disk IO seem like widely useful stats by themselves, and IPC doesn't.

(IPC is certainly useful for function-level profilers, but that doesn't seem to be what the article is talking about. And per-function CPU utilisation is even more useful since it tells you which functions to focus on optimising.)

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 9:10 UTC (Thu) by adobriyan (subscriber, #30858) [Link]

Kernel compilation is at 2 IPC on those modern 4-wide systems.

Gregg: CPU Utilization is Wrong

Posted May 11, 2017 20:36 UTC (Thu) by NightMonkey (subscriber, #23051) [Link]

Just put this in as a feature request with "atop". atop is already fantastic, but this would make it even better.

https://github.com/Atoptool/atop/issues/4

Wish me luck! ;)

Gregg: CPU Utilization is Wrong

Posted May 14, 2017 8:28 UTC (Sun) by magnus (subscriber, #34778) [Link] (3 responses)

It's a fun exercise to try to make a program that minimizes IPC, i.e. maximizes cache misses.

With a program that malloc'ed a huge buffer and just wrote and read through it linearly byte by byte, I got a very high IPC of 3.5 or so. Then I changed it to only write/read every 128 bytes so that each access lands on a different cache line, and the IPC dropped to 0.16.

When using a pointer-chasing scheme where each element in the buffer points to the next one to read, I got it to drop down to 0.05 (on Ivy Bridge). The order of magnitude kind of makes sense if RAM runs at about half the CPU speed and has about 10 cycles of latency. Anyone have ideas for making it go even lower?

#include <stdlib.h>
#include <stdio.h>

#define NELEM (128*1024*1024)

int main(int argc, char **argv)
{
    unsigned int *v;
    unsigned int i = 0, k;

    puts("Setup");
    v = malloc(NELEM * sizeof(int));
    for (i = 0; i < NELEM; i++)
        v[i] = (i + rand()) % NELEM;

    puts("Run");
    i = 0; k = 0;
    do { k += i; i = v[i]; } while (i != 0);

    printf("%d\n", k);
    return 0;
}

Gregg: CPU Utilization is Wrong

Posted May 15, 2017 17:20 UTC (Mon) by excors (subscriber, #95769) [Link] (2 responses)

I think your randomisation isn't going to work properly - it'll produce lots of small cycles, and the pointer-chasing loop will get stuck in one and won't access the whole NELEM elements, so you're only really testing the L2 latency. RAM latency should be maybe 10x-20x slower. (Rather than an array of random numbers, maybe use an array of consecutive numbers then randomly shuffle it. And use a decent RNG, not rand().)

To increase latency further you should try a stride of 4096 bytes (1 page), so that every memory access is a TLB miss and requires a page table walk.

The page table walk has to go through four levels of page tables (on x86_64). Apparently they're usually accessed via the data cache to reduce the cost; but if you access widely-separated pages, the cached page table entries won't be much use. One cache line fits 8 page table entries, so a stride of 8 pages should be more expensive. If you use even larger strides (like, say, 2^21 bytes or 2^30 bytes) then you'll get cache misses in the earlier levels of page table too.

Unfortunately you can't reserve more than about 2^46 bytes of virtual memory, and very large strides mean you can't fit many elements into that space, and small numbers of elements are too easily cached. On my particular CPU, a decent tradeoff seemed to be looping over an array of 2^12 elements with a stride of 2^34 bytes.

Randomising the pointer chasing can help defeat any cache prefetching logic, but for me it doesn't seem to make a significant difference with strides >= 1 page. (Most of Intel's prefetchers simply never cross page boundaries. Apparently Ivy Bridge has a "next-page prefetcher" but I guess that only works in very specific situations.)

It can also be helpful to misalign your data so that each read straddles two pages. Modern Intel CPUs can do most misaligned accesses just as fast as aligned ones, but if crossing cache lines or pages increases the number of cache/TLB misses then it will still add some cost.

Also you need to unroll the pointer-chasing loop and remove the arithmetic, so it ideally does about 1 instruction per memory read.

With that approach, I can get down to an IPC of roughly 0.001:


#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

#define ELEMENTS (1 << 12)
#define STRIDE (1ull << 34)
#define ITERS (1024*1024*64)
#define OFFSET (STRIDE - 4)
#define UNROLL 16

#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a)) * (a)

int main() {
  void *m = mmap(nullptr,
    STRIDE * ELEMENTS + STRIDE + OFFSET,
    PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);

  uintptr_t base = ALIGN_UP((uintptr_t)m, STRIDE) + OFFSET;

  std::vector<int> indexes(ELEMENTS);
  std::iota(indexes.begin(), indexes.end(), 0);
//  std::shuffle(indexes.begin(), indexes.end(), std::mt19937(0));

  for (int i = 0; i < ELEMENTS; ++i)
    *(uintptr_t *)(base + STRIDE * indexes[i]) = base + STRIDE * indexes[(i+1) % ELEMENTS];

  printf("==================== Setup done ====================\n");

  uintptr_t p = base;
  for (int i = 0; i < ITERS; i += UNROLL)
    for (int j = 0; j < UNROLL; ++j)
      p = *(uintptr_t *)p;

  printf("Result: %p\n", (void *)p);
}

Measuring with "perf stat -e task-clock,cycles,instructions,LLC-load-misses,dTLB-load-misses -I 1000", looking at the stats a few seconds after the "Setup done" message (the setup is very expensive because it's creating lots of page tables):

    1000.220203      task-clock (msec)        
  3,955,098,490      cycles                   
      4,358,278      instructions             
     14,094,438      LLC-load-misses          
      5,215,481      dTLB-load-misses    

(on Core i7-4790, "gcc -O3"). Each instruction takes >1 TLB miss, >3 cache misses, and >900 cycles.

Alternatively you could simply call an instruction like "rep stosb" with a very large repetition count. On my CPU the performance counters seem to count it as 1 instruction per 64KB or thereabouts, and it gives an IPC of around 0.0004.
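A minimal sketch of that last trick (a reconstruction, not the poster's actual code; assumes x86-64 with GCC or Clang inline asm, and the variable names are illustrative): one rep stosb zeroing a large buffer is counted as a single retired instruction while it burns many millions of cycles, so perf stat reports a tiny IPC.

#include <stdlib.h>

int main(void)
{
    size_t len = 1ull << 30;            /* 1 GiB of zeroes */
    unsigned char *buf = malloc(len);
    if (!buf)
        return 1;

    unsigned char *dst = buf;
    /* One instruction, ~2^30 stores: AL is written to [RDI], RCX times. */
    __asm__ volatile ("rep stosb"
                      : "+D" (dst), "+c" (len)
                      : "a" (0)
                      : "memory");

    return buf[0];                      /* keep the buffer live */
}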

Gregg: CPU Utilization is Wrong

Posted May 15, 2017 17:32 UTC (Mon) by magnus (subscriber, #34778) [Link] (1 responses)

Thanks for a great reply! Nice hack with the rep stosb but that feels like cheating. :)

I was thinking that generating a huge binary and doing a kind of pointer chasing in code as well, to get constant ICache misses, could possibly drive down the IPC even further.

Gregg: CPU Utilization is Wrong

Posted May 15, 2017 18:27 UTC (Mon) by excors (subscriber, #95769) [Link]

That could probably work - instead of storing integers or pointers in the large-stride array, store a long chain of instructions like "jmpq *0x40000000(%rip)". For each instruction it'll first have to fetch the instruction itself (with several cache/TLB misses) and then fetch the target address from a distant location in memory (with several more misses), and everything has to run serially, so it should be at least twice as slow.
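A very rough sketch of one way to build such a chain (entirely a reconstruction, assuming x86-64 Linux and that a read-write-execute mmap is permitted): each node in a widely-strided executable mapping is the 6-byte encoding of "jmpq *0x0(%rip)" followed immediately by the 8-byte address of the next node, and the last node is a ret. Since the target address here sits right next to the jump, this mostly exercises the instruction-fetch side; keeping the pointers in a separate widely-strided array, as suggested above, would add the data-side misses as well.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define NODES  4096
#define STRIDE (1ull << 21)   /* one node per 2 MiB so every hop misses caches and TLBs */

int main(void)
{
    uint8_t *base = mmap(NULL, NODES * STRIDE,
                         PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    for (int i = 0; i < NODES; i++) {
        uint8_t *node = base + i * STRIDE;
        if (i == NODES - 1) {
            node[0] = 0xc3;                           /* ret: end of the chain */
        } else {
            static const uint8_t jmp[6] = { 0xff, 0x25, 0, 0, 0, 0 };  /* jmpq *0x0(%rip) */
            uint64_t next = (uint64_t)(base + (i + 1) * STRIDE);
            memcpy(node, jmp, sizeof jmp);
            memcpy(node + sizeof jmp, &next, sizeof next);  /* next-node address right after the jmp */
        }
    }

    for (int r = 0; r < 256; r++)                     /* walk the whole chain repeatedly */
        ((void (*)(void))base)();

    puts("done");
    return 0;
}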

