LWN: Comments on "The problem with prefetch" https://lwn.net/Articles/444336/ This is a special feed containing comments posted to the individual LWN article titled "The problem with prefetch". en-us Sun, 31 Aug 2025 07:44:35 +0000 Sun, 31 Aug 2025 07:44:35 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net You just need to know and remember in your local brain cache this prefetch problem https://lwn.net/Articles/727594/ https://lwn.net/Articles/727594/ kazan417 <div class="FormattedComment"> Thanks for sharing this useful information!<br> Again, an unobvious technology behavior which you just need to know and remember in your local brain cache.<br> And we have many of them (unobvious behaviors) every day in modern technology, because modern technologies are now super complex.<br> </div> Wed, 12 Jul 2017 10:16:26 +0000 The problem with prefetch https://lwn.net/Articles/447847/ https://lwn.net/Articles/447847/ tcucinotta <div class="FormattedComment"> <font class="QuotedText">&gt; Ingo summarized his results this way:</font><br> <font class="QuotedText">&gt; </font><br> <font class="QuotedText">&gt; So the conclusion is: prefetches are absolutely toxic,</font><br> <font class="QuotedText">&gt; even if the NULL ones are excluded. </font><br> <p> If I understand correctly, the main source of the "toxicity" here is the lists being (in most cases very) short when used within hash tables.
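To make the pattern concrete, here is a minimal user-space sketch of the two traversal styles at issue — the macro names and structure are my own illustration, not the kernel's list.h, and my_prefetch merely stands in for the kernel's prefetch():

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* Stand-in for the kernel's prefetch(); a GCC/Clang builtin. */
#define my_prefetch(p) __builtin_prefetch(p)

/* Traversal that prefetches the next node -- the pattern removed from
 * the kernel helpers.  On a one-entry hash chain this issues a single
 * useless (often NULL) prefetch, which can only cost cycles. */
#define for_each_node_prefetch(pos, head) \
        for ((pos) = (head); (pos) != NULL; \
             my_prefetch((pos)->next), (pos) = (pos)->next)

/* Plain traversal, as the helpers look after the change. */
#define for_each_node(pos, head) \
        for ((pos) = (head); (pos) != NULL; (pos) = (pos)->next)
```

Both walk the same list; whether the first variant ever wins depends on the chain being long enough for the prefetch to land before the node is dereferenced.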
So, if one wanted to keep some highly reusable/useful helper macro(s) to iterate items, perhaps it would be worth having two versions of them: one for likely-long lists, the other for likely-short lists.<br> However, I also suspect that, whilst for relatively "empty" hash tables those lists will likely contain only 1 item, for highly "full" tables those lists can easily become more and more crowded, so the developer's hint (likely short vs. likely long) would easily be workload/scenario-dependent.<br> So, the final question would be whether there's a scenario that developers should consider "more likely" (or worth optimizing for) than others.<br> Just my 2 cents.<br> </div> Thu, 16 Jun 2011 08:32:29 +0000 The problem with prefetch https://lwn.net/Articles/446418/ https://lwn.net/Articles/446418/ etienne <div class="FormattedComment"> Using "register" also means that the address of that variable shall not be taken, so it should have consequences for aliasing optimisation (you get a warning if you modify that code to take the address of the variable; maybe this would make a big difference in execution speed).<br> </div> Tue, 07 Jun 2011 12:04:50 +0000 The problem with prefetch https://lwn.net/Articles/446230/ https://lwn.net/Articles/446230/ hyoshiok <div class="FormattedComment"> Prefetch causes cache pollution; it is a well-known issue.<br> </div> Mon, 06 Jun 2011 06:52:30 +0000 prefetch and buffer bloat https://lwn.net/Articles/445940/ https://lwn.net/Articles/445940/ jch <div class="FormattedComment"> Sorry if this is off topic for this discussion, but bufferbloat is not just a political issue -- it's a difficult technical one.
Just reducing the size of buffers won't do, since the right amount of buffering depends on a lot of factors, such as throughput, RTT, and the transport- and application-layer protocols being used.<br> <p> Bufferbloat is not about reducing the amount of buffering in routers; it is about designing algorithms to make sure that routers only use as much of their buffers as necessary, and getting the router vendors to deploy such algorithms.<br> <p> --jch<br> <p> </div> Thu, 02 Jun 2011 21:54:43 +0000 The problem with prefetch https://lwn.net/Articles/445799/ https://lwn.net/Articles/445799/ marcH <div class="FormattedComment"> I find a _reasonable_ use of "register" still useful for programmer-to-programmer communication anyway. It seldom hurts to express a (good) intent.<br> <p> </div> Thu, 02 Jun 2011 13:17:31 +0000 likely() https://lwn.net/Articles/445453/ https://lwn.net/Articles/445453/ AdamRichter <div class="FormattedComment"> Sometimes you want to minimize latency for the less commonly used but more important branch, such as in almost any polling loop.<br> </div> Wed, 01 Jun 2011 04:20:09 +0000 prefetch and buffer bloat https://lwn.net/Articles/445429/ https://lwn.net/Articles/445429/ dlang <div class="FormattedComment"> Something to remember is that memory comes in standard sizes; it's not always possible/reasonable to put less RAM in the device.<br> <p> Since they have the RAM anyway, and buffers that are too small can cause problems, the logic then follows: 'why not just use the RAM in the device as a buffer?'<br> <p> This causes other problems, but those other problems were not well described until recently.<br> </div> Tue, 31 May 2011 21:05:11 +0000 prefetch and buffer bloat https://lwn.net/Articles/445421/ https://lwn.net/Articles/445421/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; The problem is with the numbers that come out of the tests the manufacturers do to show how great their hardware performs.
Those get better with bigger buffers.</font><br> <p> ... up to a size after which throughput does not get better. Yet we can sometimes see buffer sizes way past this point (e.g. 1 second), which proves that some manufacturers do not bother trying to optimize anything at all.<br> <p> </div> Tue, 31 May 2011 20:43:46 +0000 likely() https://lwn.net/Articles/445277/ https://lwn.net/Articles/445277/ berkus <div class="FormattedComment"> -fprofile-arcs first!<br> </div> Mon, 30 May 2011 15:21:58 +0000 The problem with branch prediction https://lwn.net/Articles/445213/ https://lwn.net/Articles/445213/ giraffedata <blockquote> Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. </blockquote> <p> Why not? I can see there might not be any prefetching advantage because you're branching to something that is already in cache, but you can still do a lot of other execution of the instructions while still working on a prior one. <blockquote> The branch was very often far enough we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, ... </blockquote> <p> The line you'd hit is completely determined by the target in the branch instruction, isn't it? Sun, 29 May 2011 21:50:40 +0000 The problem with prefetch https://lwn.net/Articles/444909/ https://lwn.net/Articles/444909/ RogerOdle <div class="FormattedComment"> Is there information about why this is happening? I do a lot of embedded work and cache control is a big deal. Are there any metrics showing the rates of cache stalls and how these are affecting the measurements?<br> <p> I have not used Linux extensively in embedded work in the past, but the recent changes in the 2.6.39 kernel that bring more real-time support make Linux even more attractive in this area.
One thing that commercial RTOSes allow is for applications to lock portions of the cache to hold highly reused code like DSP algorithms. I have not looked into how this is done in Linux or if it can be done at all. But partitioning control of the cache is an important feature in a small set of performance-sensitive applications.<br> <p> It has not been possible to use Linux for some of the applications I have been involved with in the past because of latency issues, but Linux is constantly changing and 2.6.39 opens up possibilities that were out of reach before.<br> <p> </div> Thu, 26 May 2011 16:38:58 +0000 The problem with prefetch https://lwn.net/Articles/444799/ https://lwn.net/Articles/444799/ darthscsi <div class="FormattedComment"> Prefetch helps mainly if you can issue it far enough in advance of a reference that the reference sees a noticeable reduction in miss penalty (likewise, prefetching likely hits wastes issue slots). In this code, maybe 3 CPU cycles (generously, but it could be measured) are spent between the prefetch and the reference. This is insignificant compared to an L2 cache miss.<br> <p> There is a large literature on how far in advance prefetches need to be issued to do any good.<br> </div> Thu, 26 May 2011 02:26:39 +0000 HW prefetcher is smarter than you think https://lwn.net/Articles/444795/ https://lwn.net/Articles/444795/ csd <div class="FormattedComment"> Kudos to the HW designers. A few years back when the first Opterons came out, I did a lot of perf comparisons of code using 'prefetch' vs not using it, using various strides of walks forwards and backwards in arrays.
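A toy reconstruction of the kind of experiment described (my own sketch, not the original Opteron benchmark; the 8-element prefetch distance is an arbitrary assumption):

```c
#include <stddef.h>

/* Walk an array with a signed stride (negative = backwards), summing
 * elements, optionally issuing a software prefetch a fixed distance
 * ahead.  Timing use_prefetch=1 vs. use_prefetch=0 over various strides
 * is the comparison described above; the answer hinges on whether the
 * hardware prefetcher already recognizes the stride. */
static long strided_sum(const long *a, long n, long stride, int use_prefetch)
{
        long sum = 0;
        for (long i = (stride > 0) ? 0 : n - 1; i >= 0 && i < n; i += stride) {
                if (use_prefetch)
                        /* May point past the array; prefetches never fault. */
                        __builtin_prefetch(a + i + 8 * stride);
                sum += a[i];
        }
        return sum;
}
```

Wrapping calls like these in a timing loop with a warm and cold cache is the crude shape of such a measurement; the result is expected to differ by CPU generation, as the parent comment found.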
If I recall the results correctly, only backwards walks with varying offsets would benefit from a 'prefetch' being added (not greatly though), while in all other cases having manual prefetch either made no difference or was slower than the leave-it-to-hardware-prefetch case.<br> Needless to say, we discarded any notion of using prefetch at the time.<br> <p> <br> </div> Thu, 26 May 2011 01:22:22 +0000 The problem with prefetch https://lwn.net/Articles/444792/ https://lwn.net/Articles/444792/ cesarb <p>I think I am seeing a common thread between this article and the recent <a href="http://lwn.net/Articles/444045/">undefined behavior</a> article.</p> <p>The other article: the compiler will do things you did not expect.</p> <p>This article: the <em>hardware</em> will do things you did not expect.</p> Thu, 26 May 2011 00:49:41 +0000 The problem with prefetch https://lwn.net/Articles/444789/ https://lwn.net/Articles/444789/ dashesy <div class="FormattedComment"> I started reading kernel code from "list.h", where I saw some beautiful code and constructs (e.g. container_of), and also the prefetch mechanism. <br> So it was useful for me at least :)<br> </div> Wed, 25 May 2011 23:13:48 +0000 The problem with prefetch https://lwn.net/Articles/444784/ https://lwn.net/Articles/444784/ pphaneuf <p>Yeah, I don't know about the power consumption, but while it would put out a good deal of heat, the SX-4 was the first of the SX series to be air-cooled. I'm not sure if the SX-3 used SRAM or DRAM (it was before my time), but that one was water-cooled. <p>While the vector performance was amazing, it was pretty sluggish for scalar stuff, so we didn't even use it for compiling, it was too slow. <p>Not saying that this is the end-all, be-all (the SX-5 traded in the SRAM for DRAM, and in exchange got 128 gigabytes of it, which was a whole lot in 1999 or 2000!), but just pointing out that it's been used on supercomputers (and I've used them), and it didn't require liquid nitrogen.
<p>Here is <a href="http://www.flickr.com/photos/pphaneuf/88477734/">a photo</a> of one of the four SX-5s I helped look after. Wed, 25 May 2011 22:05:26 +0000 The problem with prefetch https://lwn.net/Articles/444779/ https://lwn.net/Articles/444779/ vonbrand <p> Somebody told me there are three kinds of C compilers: dumb ones (they just disregard <code>register</code>), smart ones (they heed <code>register</code> if possible) and very smart ones (they assign registers better than whatever the programmer could hint at, and just ignore <code>register</code>). Current compilers are mostly in the third group.</p> Wed, 25 May 2011 21:18:53 +0000 The problem with prefetch https://lwn.net/Articles/444761/ https://lwn.net/Articles/444761/ wazoox <div class="FormattedComment"> Impressive machine, not even puny by today's standards...<br> </div> Wed, 25 May 2011 19:50:59 +0000 The problem with prefetch https://lwn.net/Articles/444740/ https://lwn.net/Articles/444740/ branden <div class="FormattedComment"> <font class="QuotedText">&gt; One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September.</font><br> <p> "Sometimes, getting your patch in is just a matter of waiting for somebody else to reimplement it." -- Jonathan Corbet, <a href="https://lwn.net/Articles/51615/">https://lwn.net/Articles/51615/</a><br> </div> Wed, 25 May 2011 17:57:20 +0000 The problem with prefetch https://lwn.net/Articles/444697/ https://lwn.net/Articles/444697/ bronson <div class="FormattedComment"> <font class="QuotedText">&gt; SRAM uses 6 transistors</font><br> <p> Yep.<br> <p> <font class="QuotedText">&gt; and in idle mode continuously seeps energy</font><br> <p> Not if made out of CMOS (and these days everything is CMOS). At steady state the only power loss in SRAM is gate and substrate leakage.
Negligible.<br> <p> <font class="QuotedText">&gt; DRAM uses capacitors and only needs to be refreshed from time to time.</font><br> <p> Yeah, 50 times a second. It adds up to quite a bit of power. Plus DRAMs have all the substrate leakage of SRAM.<br> <p> I'm pretty sure you're comparing supercomputer SRAMs from the early 80s, clocked to the limits of their lives, with low-power modern DRAMs. If you compare equivalent parts you'll see that DRAMs are smaller but burn a lot more power.<br> </div> Wed, 25 May 2011 16:38:26 +0000 Yes, will older processors be further penalized? https://lwn.net/Articles/444692/ https://lwn.net/Articles/444692/ ds2horner <div class="FormattedComment"> I believe your question is more succinct than my attempt to raise the issue (above).<br> <p> In Andi Kleen's patch (the LWN article linked to in the main article), he attempted to have the prefetch remain for list processing on CPUs (namely the K7 family) that would benefit from it.<br> <p> His approach was to change the prefetch calls to list_prefetch calls and make list_prefetch a no-op on most architectures, mapping it to prefetch via CONFIG_LIST_PREFETCH for MK7 only (presumably with more architectures to be added if they would benefit).<br> <p> But my main question was: other than the K7, where there is historical "evidence" that the CPU benefited, who would now do the justification for other older processors (like the P3s)?<br> <br> </div> Wed, 25 May 2011 16:26:23 +0000 likely() https://lwn.net/Articles/444684/ https://lwn.net/Articles/444684/ SLi <div class="FormattedComment"> I have also seen in user-space code that using the GCC equivalent of likely() on branches that I knew by profiling were much more likely to be either taken or not taken actually slows down the code. I don't quite understand the reason for this. Then in other cases it does indeed cause a speedup.
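For reference, the user-space equivalent being discussed is usually spelled like this (the macro names mirror the kernel's; the parse_len() example is my own):

```c
#include <stdio.h>

/* User-space versions of the kernel's likely()/unlikely() annotations.
 * __builtin_expect() emits no instructions itself; it only biases how
 * GCC lays out the two sides of the branch, which is why a wrong hint
 * can slow code down while a right one may or may not help. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int parse_len(int len)
{
        if (unlikely(len < 0)) {            /* assumed-rare error path */
                fprintf(stderr, "bad length %d\n", len);
                return -1;
        }
        return len * 2;                     /* hot path */
}
```

As the parent comment notes, the only reliable way to know whether such a hint helps is to measure each branch on the target CPU.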
When fine-tuning performance I have had to try it branch by branch to get the best results, and even then I'm not sure it generalizes to other processor models.<br> </div> Wed, 25 May 2011 15:30:02 +0000 The problem with prefetch https://lwn.net/Articles/444650/ https://lwn.net/Articles/444650/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.</font><br> <p> <font class="QuotedText">&gt; DRAM uses capacitors and only needs to be refreshed from time to time.</font><br> <p> All of the resources I can find describing SRAM state that the power used continuously while idle is trivial compared to the power needed to constantly refresh DRAM, and hence it's only during heavy utilisation that its power consumption can get *up to* that of DRAM.<br> <p> <font class="QuotedText">&gt;Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power</font><br> <p> I'm sure what you're saying is true in your case, but how do you reconcile that with the fact that every other comparison between the two disagrees?<br> <p> The only idea I can come up with is that you're comparing SRAM running *much faster* than DRAM, so it's an apples-to-oranges comparison(?)<br> </div> Wed, 25 May 2011 13:49:02 +0000 likely() https://lwn.net/Articles/444646/ https://lwn.net/Articles/444646/ oak <div class="FormattedComment"> Thanks! So the issue with likely() was just people putting them in the wrong places, rather than un/likely() itself potentially slowing things down (that was half the issue with the explicit prefetch usage).<br> <p> I typically use unlikely() in my code just to annotate the ifs in error check&amp;logging macros.
If errors aren't unlikely, I think the performance of un/likely() is the least of my issues...<br> <p> </div> Wed, 25 May 2011 13:10:49 +0000 The problem with prefetch https://lwn.net/Articles/444645/ https://lwn.net/Articles/444645/ rilder <div class="FormattedComment"> Fair enough. However, I have a question regarding this. The whole change depends on hardware branch predictors, which are not constant across hardware. I am not sure how good the branch predictors were pre-Nehalem/pre-Sandy Bridge, so if new kernels are used on slightly older hardware, won't they suffer from a lack of both software and hardware prefetch hints? I guess they could have made this a CONFIG_xxx option, but maybe that was infeasible or would have cluttered the code further.<br> </div> Wed, 25 May 2011 13:00:32 +0000 The problem with prefetch https://lwn.net/Articles/444644/ https://lwn.net/Articles/444644/ Cyberax <div class="FormattedComment"> No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.<br> <p> DRAM uses capacitors and only needs to be refreshed from time to time.<br> <p> Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power.<br> </div> Wed, 25 May 2011 12:58:26 +0000 The problem with prefetch https://lwn.net/Articles/444641/ https://lwn.net/Articles/444641/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen</font><br> <p> Wikipedia says "SRAM is more expensive, but faster and significantly less power hungry (especially idle) than DRAM. It is therefore used where either bandwidth or low power, or both, are principal considerations".
This is also what they taught in my degree course; are you certain you're not confused?<br> </div> Wed, 25 May 2011 11:49:51 +0000 The problem with prefetch https://lwn.net/Articles/444638/ https://lwn.net/Articles/444638/ Cyberax <div class="FormattedComment"> <a href="http://www.ai.mit.edu/projects/aries/papers/vector/hammond.pdf">http://www.ai.mit.edu/projects/aries/papers/vector/hammon...</a> <br> <p> Power consumption is 123 kVA for the 32-core version; performance is around 2 GFLOPS per core for vector operations.<br> </div> Wed, 25 May 2011 11:36:24 +0000 I think it's about some particular case of branch prediction... https://lwn.net/Articles/444632/ https://lwn.net/Articles/444632/ mingo <div class="FormattedComment"> <p> It's relatively easy to measure the cost of branch misses in certain cases, such as by using 'perf stat --repeat N' (the branch miss rate is measured by default) with a testcase that uses a pseudo-RNG, so it can run the same workload in random and non-random order and compare the two.<br> <p> And yes, missing branches is crippling to performance: a 3% branch miss rate can cause a 5% total execution slowdown, and a 20% miss rate can already double the runtime of a workload. (!)<br> <p> </div> Wed, 25 May 2011 10:58:08 +0000 The problem with prefetch https://lwn.net/Articles/444628/ https://lwn.net/Articles/444628/ dgm <div class="FormattedComment"> And some performance numbers would also be appreciated.<br> </div> Wed, 25 May 2011 10:23:20 +0000 The problem with prefetch https://lwn.net/Articles/444622/ https://lwn.net/Articles/444622/ kruemelmo <div class="FormattedComment"> Now tell us something about power consumption as well, please.<br> </div> Wed, 25 May 2011 10:03:27 +0000 The problem with outguessing the CPU https://lwn.net/Articles/444612/ https://lwn.net/Articles/444612/ alex <div class="FormattedComment"> I have seen prefetch help on some architectures.
When doing DBT stuff we would often look for places where we could arrange the code to make it as efficient as possible. In the case of Itanium, prefetch was a definite win, as the architecture was structured to leave the hard stuff to the compiler. On x86, our experiments with instruction re-ordering and prefetch generally didn't yield much at all. The main difference is that the x86 expends an awful lot of silicon on logic that attempts to predict all this behaviour for you. It's pretty good at its job as well, given how hard it was for us to squeeze extra out despite having a much better view of how the code was running than a compiler usually has.<br> </div> Wed, 25 May 2011 07:16:47 +0000 The problem with prefetch https://lwn.net/Articles/444610/ https://lwn.net/Articles/444610/ bronson <div class="FormattedComment"> Reminds me of the register keyword. For a few years programmers put 'register' on just about any int that was used more than once... Turns out that in almost every case that was slowing things down. Compilers tend to choose better register variables than programmers.<br> </div> Wed, 25 May 2011 06:16:09 +0000 I think it's about some particular case of branch prediction... https://lwn.net/Articles/444609/ https://lwn.net/Articles/444609/ khim To fill the pipeline on a contemporary CPU you need 30-50 instructions in flight. Without branch prediction it's just impossible to do. If you disable branch prediction on a contemporary CPU the slowdown is crippling. Sadly only Intel engineers can give you numbers (because there is no way to disable it on a retail CPU) - and they are not telling. Wed, 25 May 2011 06:10:24 +0000 The problem with prefetch https://lwn.net/Articles/444569/ https://lwn.net/Articles/444569/ pphaneuf <p>The NEC SX-4 had <em>16 gigabytes of SRAM</em> as its <em>main memory</em>. It didn't really have any CPU cache (only something like 32KB of instruction cache). That was a nice way of not having to deal with cache coherence issues in an SMP system.
<p>It also had a 256 <em>bytes</em> wide memory bus (compared to the typical 64 <em>bits</em>). <p>Serious hardware, that. <tt>:-)</tt> Tue, 24 May 2011 23:48:34 +0000 prefetch and buffer bloat https://lwn.net/Articles/444566/ https://lwn.net/Articles/444566/ Lennie <div class="FormattedComment"> The problem is with the numbers that come out of the tests the manufacturers do to show how great their hardware performs. Those get better with bigger buffers.<br> <p> So again you have manufacturers doing non-real-world tests (which might have been good tests a long time ago) for marketing purposes and optimising for that case.<br> </div> Tue, 24 May 2011 22:58:07 +0000 likely() https://lwn.net/Articles/444550/ https://lwn.net/Articles/444550/ corbet A quick look in the <a rel="nofollow" href="https://lwn.net/Kernel/Index/">LWN kernel index</a> turns up <a rel="nofollow" href="https://lwn.net/Articles/70473/">an article from 2004</a>, <a rel="nofollow" href="https://lwn.net/Articles/182369/">an article from 2006</a>, and <a rel="nofollow" href="https://lwn.net/Articles/420019/">one from last December</a>. Tue, 24 May 2011 21:55:19 +0000 The problem with branch prediction https://lwn.net/Articles/444546/ https://lwn.net/Articles/444546/ davecb <div class="FormattedComment"> I saw a similar issue with prefetch and branch prediction back when I was doing a lot of SPARC work. <br> <p> Branch prediction gave us a bit of extra performance with a few code bases, but the older and better the code, the less we saw. My favorite example is Samba, so a Smarter Colleague[tm] and I looked at what was actually happening. Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. The branch was very often far enough we hit a different i-cache line.
Since we didn't have a way of hinting what line we'd hit, we'd slow down whenever it wasn't trivial-to-predict straight-line code.<br> <p> The better and older the code, the less we would get the next i-cache line sitting waiting for us, and the slower we'd run. Grungy straight-line FORTRAN benefited fine.<br> <p> I don't recollect ever seeing an actual slowdown, but we rarely could see the predicted benefits from branch prediction.<br> <p> I'd venture as much as a five-cent bet we'll see the same with the Intel architecture.<br> <p> --dave<br> </div> Tue, 24 May 2011 21:43:22 +0000 hopefully Andi Kleen's approach will be accepted now. https://lwn.net/Articles/444547/ https://lwn.net/Articles/444547/ ds2horner <div class="FormattedComment"> I reread the article on Andi Kleen's patch.<br> <p> His approach would leave the optimizations in place for those CPUs that could benefit; so K7 and others would have the option.<br> <p> Thanks for the foresight, Andi.<br> <p> <p> </div> Tue, 24 May 2011 21:37:51 +0000