
The search for fast, scalable counters

The kernel needs to count a lot of things. There are counters for networking statistics, usage of various resources, and so on. One would ordinarily think that operating a counter would be a relatively straightforward task, but even simple things can become complicated in the kernel context, especially when the number of processors involved gets large.

In theory, a counter is just a simple integer variable. In an SMP environment, however, that variable must be protected against concurrent updates, or it will eventually get corrupted. The tool that kernel hackers reach for first in this situation is the atomic_t type. Atomic variables are simple integers with a set of atomic operations. If you have an atomic_t variable called counter, that counter can be incremented with a call like:

    atomic_inc(&counter);

and its value will be changed in an SMP-safe, interrupt-safe manner. These operations are relatively fast, being hand-coded to use the mechanisms provided by each host architecture. In many cases, an atomic_t counter is the best solution to the problem.
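
Pulling that together, a minimal atomic_t counter might look like the sketch below; the variable and function names are invented for illustration, and the header location has varied between kernel versions.

    /* A minimal atomic_t counter; the names here are purely illustrative. */
    #include <asm/atomic.h>         /* <linux/atomic.h> in later kernels */

    static atomic_t events = ATOMIC_INIT(0);

    /* The increment is SMP-safe and interrupt-safe on all architectures. */
    static inline void note_event(void)
    {
            atomic_inc(&events);
    }

    /* Reading requires no locking. */
    static inline int events_so_far(void)
    {
            return atomic_read(&events);
    }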

The problem with atomic_t counters is that they use expensive locked operations, and they require that the current CPU obtain exclusive cache access for the variable. A frequently-modified atomic counter can cause a cache line to bounce constantly between CPUs, impacting the performance of the entire system. As an example, consider this patch set from Ravikiran Thirumalai. He replaced a single counter (the memory_allocated field of the proto structure) in the networking code with a more SMP-friendly counter, and reported a 5% improvement in an Apache benchmark on an eight-processor system. A 5% gain is a nice improvement for changing a single counter, but it seems that even better results could be had.

Ravikiran replaced the atomic_t counter with the percpu_counter type. These counters use per-CPU variables to hold a CPU-local count. Modifying that count is fast, since it is local to the given CPU, no locking is required, and no cache lines need be moved from other processors. If any given processor's count exceeds a given threshold, its value is added to a (spinlock-protected) global count, and the CPU-local count is set back to zero. Queries of the counter look only at the global count. The result is a counter which is somewhat approximate, but quite fast. In many cases, an "almost right" count is entirely good enough.
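
The mechanism can be sketched in a few lines of C. This is not the kernel's percpu_counter implementation; the names, the threshold, and the flat array standing in for real per-CPU data are all invented here, and the locking is reduced to comments.

    /* Simplified illustration of the percpu_counter idea; not kernel code. */
    #define NR_CPUS_SKETCH  8       /* stand-in for the number of CPUs */
    #define FOLD_THRESHOLD  32      /* per-CPU slop allowed before folding */

    struct approx_counter {
            long    global;                         /* spinlock-protected in real code */
            long    local[NR_CPUS_SKETCH];          /* one (padded) slot per CPU */
    };

    /* Update path: called with preemption disabled so "cpu" stays valid. */
    static void approx_counter_add(struct approx_counter *c, int cpu, long amount)
    {
            c->local[cpu] += amount;
            if (c->local[cpu] > FOLD_THRESHOLD || c->local[cpu] < -FOLD_THRESHOLD) {
                    /* take the global lock here in real code */
                    c->global += c->local[cpu];
                    c->local[cpu] = 0;
                    /* ... and release it here */
            }
    }

    /* Queries look only at the global count; the answer can be off by up to
       NR_CPUS_SKETCH * FOLD_THRESHOLD, which is the price of the fast path. */
    static long approx_counter_read(struct approx_counter *c)
    {
            return c->global;
    }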

Per-CPU counters become increasingly inaccurate as the number of processors grows, however. Each processor has a certain residual count which has not yet been folded into the global count. In situations where counters tend to increase, the result will be a global count which underestimates the real value, and which is increasingly wrong on larger systems; with N processors each allowed to accumulate up to the fold threshold locally, the worst-case error approaches N times that threshold. Per-CPU counters are also memory-intensive, partly due to inefficiencies in how per-CPU variables are allocated.

So the discussion wandered toward another possibility implemented with the somewhat obscure local_t type. This type is apparently intended to function as a sort of atomic_t which is only visible to a single CPU; it is currently used in only two places in the kernel: to manage module reference counts and in the x86-64 architecture code. It supports a set of operations similar to atomic_t: local_set(), local_read(), local_add(), etc. There is also a set of variants (cpu_local_set(), ...) intended for use with a local_t declared as a per-CPU variable. The default implementation uses atomic_t for 32-bit systems and a strange three-variable structure for 64-bit systems. All architectures are encouraged to reimplement the type in a more efficient, interrupt-safe manner, however, and that has been done for several of them.
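
A CPU-local event count built on local_t might look something like the following sketch; the names are invented, and the exact headers and per-CPU helpers have varied from one kernel version to the next, so treat it as an outline rather than a drop-in example.

    /* Sketch only: a CPU-local event count using local_t. */
    #include <linux/percpu.h>
    #include <asm/local.h>

    static DEFINE_PER_CPU(local_t, local_events);

    /* Fast path: bump this CPU's copy; cheap, and interrupt-safe on
       architectures that provide an optimized local_t. */
    static void count_local_event(void)
    {
            cpu_local_inc(local_events);
    }

    /* Rare path: look at one particular CPU's copy. */
    static long local_events_on(int cpu)
    {
            return local_read(&per_cpu(local_events, cpu));
    }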

The local_t solution would set up two counters for each CPU, a flag saying which of the two is in use, and a global count. For many operations, they would behave just like percpu_counter, and they could yield the same approximate answer. Should a precise count be needed, however, the "which counter" bit would be flipped and all of the per-CPU counts summed into the global count. The result would be an exact count at the time the bit was flipped, at the cost of taking a spinlock and iterating through the array.
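
Sketched in simplified form (with invented names, an ordinary array standing in for real per-CPU data, and the locking and memory-ordering subtleties waved away), the scheme might look roughly like this:

    /* Simplified sketch of the two-counter scheme; not kernel code. */
    #define NCPUS_SKETCH    8               /* stand-in for the number of CPUs */

    struct exactish_counter {
            long    per_cpu[NCPUS_SKETCH][2];   /* two counters per CPU */
            int     active;                     /* which of the two is in use */
            long    global;                     /* folded totals */
            /* a spinlock would protect 'active' flips and 'global' */
    };

    /* Fast path: add to this CPU's currently-active counter. */
    static void counter_add(struct exactish_counter *c, int cpu, long n)
    {
            c->per_cpu[cpu][c->active] += n;
    }

    /* Approximate read: just the folded global value. */
    static long counter_read_fast(struct exactish_counter *c)
    {
            return c->global;
    }

    /* Precise read: flip the active slot, then fold the now-idle slots into
       the global count.  The caller holds the lock; waiting for updates that
       were already in flight on the old slot is glossed over here. */
    static long counter_read_exact(struct exactish_counter *c)
    {
            int old = c->active;
            int cpu;

            c->active = !old;               /* new updates go to the other slot */
            for (cpu = 0; cpu < NCPUS_SKETCH; cpu++) {
                    c->global += c->per_cpu[cpu][old];
                    c->per_cpu[cpu][old] = 0;
            }
            return c->global;               /* exact as of the flip */
    }

The attraction is that the fast path touches only CPU-local data, while the cost of an exact answer is paid only when somebody actually asks for one.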

All of this starts to look a little elaborate, however, and that may be the point where kernel developers lose interest. A counter should only be so complex, and making the code more twisted can only improve things to a point. Sooner or later, people will decide that there are more important things to be working on.



The search for fast, scalable counters

Posted Feb 2, 2006 13:23 UTC (Thu) by samj (guest, #7135) [Link]

I wrote some network quota code a few years ago (ipt_quota) which worked nicely but thinking about concurrency issues made my head hurt so I haven't done a great deal on it since. I'd have liked to have had (or at least known about) this back then. Having nice, reliable, fast counters helps everyone so time spent getting this right now isn't necessarily time wasted.

Mean value

Posted Feb 2, 2006 15:40 UTC (Thu) by man_ls (guest, #15091) [Link]

Speaking as a complete outsider...
> Per-CPU counters become increasingly inaccurate as the number of processors grows, however. Each processor has a certain residual count which has not yet been folded into the global count. In situations where counters tend to increase, the result will be a global count which underestimates the real value, and which is increasingly wrong on larger systems.
To avoid underestimation, a better approximation can be obtained by considering the mean values for every counter. If N is the number of processors and M is the max value for every counter, adding N * M / 2 to the current measure would yield the expected value. Max error bound would be N * M / 2, while standard deviation (i.e. expected error) should be around sqrt(N) * M / 2. (In fact, if I'm not mistaken the distribution followed would be Student's t-distribution, but a normal is a good approximation.)

This means that the expected error should only grow with the square root of the number of processors: with 64 processors, the error is only 8 times bigger than on a single processor system. Choosing a small enough max value M would make the measure even more precise.
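
Spelling that out, under the idealized assumption that each per-CPU residual R_i is independent and uniformly distributed on [0, M]:

    E[R_i] = \frac{M}{2}, \qquad \operatorname{Var}[R_i] = \frac{M^2}{12}

    E\!\left[\sum_{i=1}^{N} R_i\right] = \frac{N M}{2}, \qquad
    \sigma\!\left[\sum_{i=1}^{N} R_i\right] = M \sqrt{\frac{N}{12}} \approx 0.29\, M \sqrt{N}

The N * M / 2 correction and the square-root growth of the expected error both follow; the exact constant multiplying M * sqrt(N) depends on which distribution the residuals are assumed to follow.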

And while we are at it, I understand that a semaphore is needed to increment each per-CPU value, but to read it? Why not read on the fly all per-CPU counters and add them to the global counter when a count is required?

Mean value

Posted Feb 2, 2006 17:37 UTC (Thu) by madscientist (subscriber, #16861) [Link]

> And while we are at it, I understand that a semaphore is needed to
> increment each per-CPU value, but to read it? Why not read on the fly all
> per-CPU counters and add them to the global counter when a count is
> required?

Exactly what I was thinking! After all, a counter is _inherently_ inaccurate: by the time the kernel reports the value back to whoever was asking, there's an excellent chance it's changed. So why would it matter if we don't lock the counter before we read it? It seems like there are a number of options that could be considered, although certainly there are complexities as well.

Mean value

Posted Feb 3, 2006 3:28 UTC (Fri) by obobo (guest, #684) [Link]

In general, you can run into problems where the counter is being incremented as it is being read. Say you're using a (mythical) 8-bit architecture, and the 16-bit counter value is being incremented from 0x00ff to 0x0100. The intermediate state, as the counter is being updated, will be 0x0000 (or perhaps 0x01ff) because only 8 bits are being written at a time. Much worse than the off-by-one situation that I think you were imagining. And worse still if it is a 64 bit counter on a 32 bit system...

Caveat: I don't know anything about the specific types mentioned in the article (local_t in particular), so this particular problem might not be an issue there, depending on how the atomicity of it works.
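
To make that failure mode concrete, here is a small hypothetical C fragment (not taken from the kernel) showing the analogous case of a 64-bit counter on a 32-bit machine being observed half-updated by a lockless reader:

    /* Hypothetical illustration of a torn read; not taken from the kernel.
       Assumes a little-endian 32-bit machine, so half[0] is the low word. */
    #include <stdint.h>

    union split64 {
            uint64_t whole;
            uint32_t half[2];
    };

    static union split64 counter = { .whole = 0x00000000ffffffffULL };

    /* Writer: the step from 0x00000000ffffffff to 0x0000000100000000 is
       compiled as two separate 32-bit stores on a 32-bit machine. */
    static void carry_increment(void)
    {
            counter.half[0] = 0x00000000;   /* low word wraps to zero */
            /* a reader running here sees the counter as 0 */
            counter.half[1] = 0x00000001;   /* the carry reaches the high word */
    }

    /* Reader: two 32-bit loads with no lock, so it can observe the torn
       intermediate state above. */
    static uint64_t read_counter(void)
    {
            uint64_t low  = counter.half[0];
            uint64_t high = counter.half[1];
            return (high << 32) | low;
    }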

Mean value

Posted Feb 3, 2006 10:39 UTC (Fri) by man_ls (guest, #15091) [Link]

So, just stick to the size native to the architecture. On 32-bit machines, never use more than 32 bits for the intermediate values; if you happen to be on a Z80 (a not-so-mythical 8-bit architecture), use a max value of 255 for them.

Incrementing a 32-bit register on a 32-bit architecture is an atomic operation even on multi-processor systems, right?

per-CPU counters and locking

Posted Feb 5, 2006 0:35 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

Who said anything about semaphores with per-CPU counters? The article specifically said there is no locking required for updating a per-CPU counter. It says you never read a per-CPU counter.

Also, there is no locking required to read the global counter.

Now as for the orthogonal issue of getting a complete count instead of just reading the global counter: how are you going to read the per-CPU counter of another CPU? The only way is for that CPU to write it out to main memory and for you to go to main memory to get it. That takes a long time, and that's the slowness that per-CPU counters are supposed to eliminate.

per-CPU counters and locking

Posted Feb 5, 2006 2:11 UTC (Sun) by man_ls (guest, #15091) [Link]

> how are you going to read the per-CPU counter of another CPU?
Never thought of that; just another example of why kernel programming sometimes looks so hard to outsiders.
> The only way is for that CPU to write it out to main memory and for you to go to main memory to get it. That takes a long time, and that's the slowness that per-CPU counters are supposed to eliminate.
At least it would spare the "expensive locking operations" mentioned in the main article, wouldn't it?

per-CPU counters and locking

Posted Feb 5, 2006 3:13 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

> just another example of why kernel programming sometimes looks so hard to outsiders.

This is a relatively recent addition to the things performance programmers have to worry about. Used to be, electronic memory was considered fast, and it was always the same speed. Typically, it took at most 4 cycles to access it. Now, there are 4 levels of electronic memory (including the registers) and the main pool is 200 cycles away. We think of looking at memory now the way we used to think of loading a file from a disk.

I'm not ashamed to say my personal leaning these days is toward single-CPU single-core machines tied together at the network adapter. My brain can handle only so much complexity.

> At least it would spare the "expensive locking operations" mentioned in the main article, wouldn't it?

That locking operation is for updating a counter. For accessing one, the only cost is moving the cache line from one CPU to another -- through main memory. Whoops, I just remembered - on fancy modern machines, there's a shortcut to move a cache line without going out on the memory bus. So it's cheaper than what I said earlier, but still more than folks are willing to pay.

per-CPU counters and locking

Posted Feb 9, 2006 4:57 UTC (Thu) by roelofs (guest, #2599) [Link]

> Now, there are 4 levels of electronic memory (including the registers)

5 levels sometimes (at least in principle). There exist Intel chips with L3 caches (Xeon Gallatin, I think), and presumably each cache type is a different speed from the others and from main memory.

Greg

The search for fast, scalable counters

Posted Feb 9, 2006 10:55 UTC (Thu) by addw (guest, #1771) [Link]

A sweeper thread is probably needed that, once per second, consolidates the local (per-CPU) counters into the single global counter that is seen in /proc.
That way slow-moving counters don't become too inaccurate.
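
As a rough user-space sketch of that idea (plain C threads standing in for kernel facilities, and every name here invented), a sweeper could look like:

    /* User-space sketch of a once-per-second sweeper; not kernel code. */
    #include <pthread.h>
    #include <unistd.h>

    #define NCPUS 8                         /* stand-in for the CPU count */

    static long local_counts[NCPUS];        /* each slot updated by one CPU only */
    static long last_seen[NCPUS];           /* the sweeper's private snapshots */
    static long global_count;               /* the consolidated, /proc-visible value */

    /* Fast path: each CPU bumps its own slot; no locking at all. */
    static void count_event(int cpu)
    {
            local_counts[cpu]++;
    }

    /* Sweeper: assumes the counters only grow and that an aligned, word-sized
       load is atomic, so the worst it can see is a slightly stale value. */
    static void *sweeper(void *arg)
    {
            (void)arg;
            for (;;) {
                    sleep(1);                       /* once per second */
                    for (int i = 0; i < NCPUS; i++) {
                            long now = local_counts[i];
                            global_count += now - last_seen[i];
                            last_seen[i] = now;
                    }
            }
            return NULL;
    }

    static pthread_t sweeper_thread;

    static void start_sweeper(void)
    {
            pthread_create(&sweeper_thread, NULL, sweeper, NULL);
    }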

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds