LWN: Comments on "The problem with prefetch" https://lwn.net/Articles/444336/ This is a special feed containing comments posted to the individual LWN article titled "The problem with prefetch". en-us Sun, 31 Aug 2025 07:44:35 +0000 Sun, 31 Aug 2025 07:44:35 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net You just need to know and remember in your local brain cache this prefetch problem https://lwn.net/Articles/727594/ https://lwn.net/Articles/727594/ kazan417 <div class="FormattedComment"> Thanks for sharing this useful information!<br> Again, an unobvious technology behavior which you just need to know and remember in your local brain cache.<br> And we have many of them (unobvious behaviors) every day in modern technology, because modern technologies are now super complex.<br> </div> Wed, 12 Jul 2017 10:16:26 +0000 The problem with prefetch https://lwn.net/Articles/447847/ https://lwn.net/Articles/447847/ tcucinotta <div class="FormattedComment"> <font class="QuotedText">&gt; Ingo summarized his results this way:</font><br> <font class="QuotedText">&gt; </font><br> <font class="QuotedText">&gt; So the conclusion is: prefetches are absolutely toxic,</font><br> <font class="QuotedText">&gt; even if the NULL ones are excluded. </font><br> <p> If I understand correctly, the main source of the "toxicity" here is the lists being (in most cases very) short when used within hash tables.
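To make the pattern concrete, here is a minimal user-space sketch of the two traversal styles at issue — the macro names and structure are my own illustration, not the kernel's list.h, and my_prefetch merely stands in for the kernel's prefetch():

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* Stand-in for the kernel's prefetch(); a GCC/Clang builtin. */
#define my_prefetch(p) __builtin_prefetch(p)

/* Traversal that prefetches the next node -- the pattern removed from
 * the kernel helpers.  On a one-entry hash chain this issues a single
 * useless (often NULL) prefetch, which can only cost cycles. */
#define for_each_node_prefetch(pos, head) \
        for ((pos) = (head); (pos) != NULL; \
             my_prefetch((pos)->next), (pos) = (pos)->next)

/* Plain traversal, as the helpers look after the change. */
#define for_each_node(pos, head) \
        for ((pos) = (head); (pos) != NULL; (pos) = (pos)->next)
```

Both walk the same list; whether the first variant ever wins depends on the chain being long enough for the prefetch to land before the node is dereferenced.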
So, if one wanted to keep some highly reusable/useful helper macro(s) to iterate items, perhaps it would be worth having two versions of them: one for likely-long lists, the other for likely-short lists.<br> However, I also suspect that, whilst for relatively "empty" hash tables those lists will likely contain only 1 item, for highly "full" tables those lists can easily become more and more crowded, so the developer's hint (likely short vs. likely long) would easily be workload/scenario-dependent.<br> So, the final question would be whether there's a scenario that developers should consider "more likely" (or worth optimizing for) than others.<br> Just my 2 cents.<br> </div> Thu, 16 Jun 2011 08:32:29 +0000 The problem with prefetch https://lwn.net/Articles/446418/ https://lwn.net/Articles/446418/ etienne <div class="FormattedComment"> Using "register" also means that the address of that variable shall not be taken, so it should have consequences for aliasing optimisation (you get a warning if you modify that code to take the address of the variable; maybe this would make a big difference in execution speed).<br> </div> Tue, 07 Jun 2011 12:04:50 +0000 The problem with prefetch https://lwn.net/Articles/446230/ https://lwn.net/Articles/446230/ hyoshiok <div class="FormattedComment"> Prefetch causes cache pollution; it is a well-known issue.<br> </div> Mon, 06 Jun 2011 06:52:30 +0000 prefetch and buffer bloat https://lwn.net/Articles/445940/ https://lwn.net/Articles/445940/ jch <div class="FormattedComment"> Sorry if this is off topic for this discussion, but bufferbloat is not just a political issue -- it's a difficult technical one.
Just reducing the size of buffers won't do, since the right amount of buffering depends on a lot of factors, such as throughput, RTT, and the transport- and application-layer protocols being used.<br> <p> Bufferbloat is not about reducing the amount of buffering in routers; it is about designing algorithms to make sure that routers only use as much of their buffers as necessary, and getting the router vendors to deploy such algorithms.<br> <p> --jch<br> <p> </div> Thu, 02 Jun 2011 21:54:43 +0000 The problem with prefetch https://lwn.net/Articles/445799/ https://lwn.net/Articles/445799/ marcH <div class="FormattedComment"> I find a _reasonable_ use of "register" still useful for programmer-to-programmer communication anyway. It seldom hurts to express a (good) intent.<br> <p> </div> Thu, 02 Jun 2011 13:17:31 +0000 likely() https://lwn.net/Articles/445453/ https://lwn.net/Articles/445453/ AdamRichter <div class="FormattedComment"> Sometimes you want to minimize latency for the less commonly used but more important branch, such as in almost any polling loop.<br> </div> Wed, 01 Jun 2011 04:20:09 +0000 prefetch and buffer bloat https://lwn.net/Articles/445429/ https://lwn.net/Articles/445429/ dlang <div class="FormattedComment"> Something to remember is that memory comes in standard sizes; it's not always possible/reasonable to put less RAM in the device.<br> <p> Since they have the RAM anyway, and buffers that are too small can cause problems, the logic then follows: 'why not just use the RAM in the device as a buffer?'<br> <p> This causes other problems, but those other problems were not well described until recently.<br> </div> Tue, 31 May 2011 21:05:11 +0000 prefetch and buffer bloat https://lwn.net/Articles/445421/ https://lwn.net/Articles/445421/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; The problem is with the numbers that come out of the tests the manufacturers do to show how great their hardware performs.
Those get better with bigger buffers.</font><br> <p> ... up to a size after which throughput does not get better. Yet we can sometimes see buffer sizes way past this point (e.g. 1 second), which proves that some manufacturers do not bother trying to optimize anything at all.<br> <p> </div> Tue, 31 May 2011 20:43:46 +0000 likely() https://lwn.net/Articles/445277/ https://lwn.net/Articles/445277/ berkus <div class="FormattedComment"> -fprofile-arcs first!<br> </div> Mon, 30 May 2011 15:21:58 +0000 The problem with branch prediction https://lwn.net/Articles/445213/ https://lwn.net/Articles/445213/ giraffedata <blockquote> Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. </blockquote> <p> Why not? I can see there might not be any prefetching advantage because you're branching to something that is already in cache, but you can still do a lot of other execution of the instructions while still working on a prior one. <blockquote> The branch was very often far enough we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, ... </blockquote> <p> The line you'd hit is completely determined by the target in the branch instruction, isn't it? Sun, 29 May 2011 21:50:40 +0000 The problem with prefetch https://lwn.net/Articles/444909/ https://lwn.net/Articles/444909/ RogerOdle <div class="FormattedComment"> Is there information about why this is happening? I do a lot of embedded work and cache control is a big deal. Are there any metrics showing the rates of cache stalls and how these are affecting the measurements?<br> <p> I have not used Linux extensively in embedded work in the past, but the recent changes in the 2.6.39 kernel that bring more real-time support make Linux even more attractive in this area.
One thing that commercial RTOSes allow is for applications to lock portions of the cache to hold highly reused code like DSP algorithms. I have not looked into how this is done in Linux or if it can be done at all. But partitioning control of the cache is an important feature in a small set of performance-sensitive applications.<br> <p> It has not been possible to use Linux for some of the applications I have been involved with in the past because of latency issues, but Linux is constantly changing and 2.6.39 opens up possibilities that were out of reach before.<br> <p> </div> Thu, 26 May 2011 16:38:58 +0000 The problem with prefetch https://lwn.net/Articles/444799/ https://lwn.net/Articles/444799/ darthscsi <div class="FormattedComment"> Prefetch helps mainly if you can issue it far enough in advance of a reference that the reference sees a noticeable reduction in miss penalty (likewise, prefetching likely hits wastes issue slots). In this code, maybe 3 CPU cycles (generously, but it could be measured) are spent between the prefetch and the reference. This is insignificant compared to an L2 cache miss.<br> <p> There is a large literature on how far in advance prefetches need to be issued to do any good.<br> </div> Thu, 26 May 2011 02:26:39 +0000 HW prefetcher is smarter than you think https://lwn.net/Articles/444795/ https://lwn.net/Articles/444795/ csd <div class="FormattedComment"> Kudos to the HW designers. A few years back when the first Opterons came out, I did a lot of perf comparisons of code using 'prefetch' vs not using it, using various strides of walks forwards and backwards in arrays.
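A toy reconstruction of the kind of experiment described (my own sketch, not the original Opteron benchmark; the 8-element prefetch distance is an arbitrary assumption):

```c
#include <stddef.h>

/* Walk an array with a signed stride (negative = backwards), summing
 * elements, optionally issuing a software prefetch a fixed distance
 * ahead.  Timing use_prefetch=1 vs. use_prefetch=0 over various strides
 * is the comparison described above; the answer hinges on whether the
 * hardware prefetcher already recognizes the stride. */
static long strided_sum(const long *a, long n, long stride, int use_prefetch)
{
        long sum = 0;
        for (long i = (stride > 0) ? 0 : n - 1; i >= 0 && i < n; i += stride) {
                if (use_prefetch)
                        /* May point past the array; prefetches never fault. */
                        __builtin_prefetch(a + i + 8 * stride);
                sum += a[i];
        }
        return sum;
}
```

Wrapping calls like these in a timing loop with a warm and cold cache is the crude shape of such a measurement; the result is expected to differ by CPU generation, as the parent comment found.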
If I recall the results correctly, only backwards walks with varying offsets would benefit from a 'prefetch' being added (not greatly though), while in all other cases having manual prefetch either made no difference or was slower than the leave-it-to-hardware-prefetch case.<br> Needless to say, we discarded any notion of using prefetch at the time.<br> <p> <br> </div> Thu, 26 May 2011 01:22:22 +0000 The problem with prefetch https://lwn.net/Articles/444792/ https://lwn.net/Articles/444792/ cesarb <p>I think I am seeing a common thread between this article and the recent <a href="http://lwn.net/Articles/444045/">undefined behavior</a> article.</p> <p>The other article: the compiler will do things you did not expect.</p> <p>This article: the <em>hardware</em> will do things you did not expect.</p> Thu, 26 May 2011 00:49:41 +0000 The problem with prefetch https://lwn.net/Articles/444789/ https://lwn.net/Articles/444789/ dashesy <div class="FormattedComment"> I started reading kernel code from "list.h", where I saw some beautiful code and constructs (e.g. container_of), and also the prefetch mechanism. <br> So it was useful for me at least :)<br> </div> Wed, 25 May 2011 23:13:48 +0000 The problem with prefetch https://lwn.net/Articles/444784/ https://lwn.net/Articles/444784/ pphaneuf <p>Yeah, I don't know about the power consumption, but while it would put out a good deal of heat, the SX-4 was the first of the SX series to be air-cooled. I'm not sure if the SX-3 used SRAM or DRAM (it was before my time), but that one was water-cooled. <p>While the vector performance was amazing, it was pretty sluggish for scalar stuff, so we didn't even use it for compiling, it was too slow. <p>Not saying that this is the end-all, be-all (the SX-5 traded in the SRAM for DRAM, and in exchange got 128 gigabytes of it, which was a whole lot in 1999 or 2000!), but just pointing out that it's been used on supercomputers (and I've used them), and it didn't require liquid nitrogen.
<p>Here is <a href="http://www.flickr.com/photos/pphaneuf/88477734/">a photo</a> of one of the four SX-5s I helped look after. Wed, 25 May 2011 22:05:26 +0000 The problem with prefetch https://lwn.net/Articles/444779/ https://lwn.net/Articles/444779/ vonbrand <p> Somebody told me there are three kinds of C compilers: dumb ones (they just disregard <code>register</code>), smart ones (they heed <code>register</code> if possible) and very smart ones (they assign registers better than whatever the programmer could hint at, and just ignore <code>register</code>). Current compilers are mostly in the third group.</p> Wed, 25 May 2011 21:18:53 +0000 The problem with prefetch https://lwn.net/Articles/444761/ https://lwn.net/Articles/444761/ wazoox <div class="FormattedComment"> Impressive machine, not even puny by today's standards...<br> </div> Wed, 25 May 2011 19:50:59 +0000 The problem with prefetch https://lwn.net/Articles/444740/ https://lwn.net/Articles/444740/ branden <div class="FormattedComment"> <font class="QuotedText">&gt; One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September.</font><br> <p> "Sometimes, getting your patch in is just a matter of waiting for somebody else to reimplement it." -- Jonathan Corbet, <a href="https://lwn.net/Articles/51615/">https://lwn.net/Articles/51615/</a><br> </div> Wed, 25 May 2011 17:57:20 +0000 The problem with prefetch https://lwn.net/Articles/444697/ https://lwn.net/Articles/444697/ bronson <div class="FormattedComment"> <font class="QuotedText">&gt; SRAM uses 6 transistors</font><br> <p> Yep.<br> <p> <font class="QuotedText">&gt; and in idle mode continuously seeps energy</font><br> <p> Not if made out of CMOS (and these days everything is CMOS). At steady state the only power loss in SRAM is gate and substrate leakage.
Negligible.<br> <p> <font class="QuotedText">&gt; DRAM uses capacitors and only needs to be refreshed from time to time.</font><br> <p> Yeah, 50 times a second. It adds up to quite a bit of power. Plus DRAMs have all the substrate leakage of SRAM.<br> <p> I'm pretty sure you're comparing supercomputer SRAMs from the early 80s, clocked to the limits of their lives, with low-power modern DRAMs. If you compare equivalent parts you'll see that DRAMs are smaller but burn a lot more power.<br> </div> Wed, 25 May 2011 16:38:26 +0000 Yes, will older processors be further penalized? https://lwn.net/Articles/444692/ https://lwn.net/Articles/444692/ ds2horner <div class="FormattedComment"> I believe your question is more succinct than my attempt to raise the issue (above).<br> <p> In Andi Kleen's patch (the LWN article linked to in the main article), he attempted to have the prefetch remain for list processing on CPUs (namely the K7 family) that would benefit from it.<br> <p> His approach was to change the prefetch calls to list_prefetch calls and make list_prefetch a no-op on most architectures, mapping it to prefetch via CONFIG_LIST_PREFETCH for MK7 only (presumably with more architectures to be added if they would benefit).<br> <p> But my main question was: other than the K7, where there is historical "evidence" that the CPU benefited, who would now do the justification for other older processors (like the P3s)?<br> <br> </div> Wed, 25 May 2011 16:26:23 +0000 likely() https://lwn.net/Articles/444684/ https://lwn.net/Articles/444684/ SLi <div class="FormattedComment"> I have also seen in user-space code that using the GCC equivalent of likely() on branches that I knew by profiling were much more likely to be either taken or not taken actually slows down the code. I don't quite understand the reason for this. Then in other cases it does indeed cause a speedup.
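For reference, the user-space equivalent being discussed is usually spelled like this (the macro names mirror the kernel's; the parse_len() example is my own):

```c
#include <stdio.h>

/* User-space versions of the kernel's likely()/unlikely() annotations.
 * __builtin_expect() emits no instructions itself; it only biases how
 * GCC lays out the two sides of the branch, which is why a wrong hint
 * can slow code down while a right one may or may not help. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int parse_len(int len)
{
        if (unlikely(len < 0)) {            /* assumed-rare error path */
                fprintf(stderr, "bad length %d\n", len);
                return -1;
        }
        return len * 2;                     /* hot path */
}
```

As the parent comment notes, the only reliable way to know whether such a hint helps is to measure each branch on the target CPU.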
When fine-tuning performance I have had to try it branch by branch to get the best results, and even then I'm not sure it generalizes to other processor models.<br> </div> Wed, 25 May 2011 15:30:02 +0000 The problem with prefetch https://lwn.net/Articles/444650/ https://lwn.net/Articles/444650/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.</font><br> <p> <font class="QuotedText">&gt; DRAM uses capacitors and only needs to be refreshed from time to time.</font><br> <p> All of the resources I can find describing SRAM state that the power used continuously while idle is trivial compared to the power needed to constantly refresh DRAM, and hence it's only during heavy utilisation that its power consumption can get *up to* that of DRAM.<br> <p> <font class="QuotedText">&gt;Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power</font><br> <p> I'm sure what you're saying is true in your case, but how do you reconcile that with the fact that every other comparison between the two disagrees?<br> <p> The only idea I can come up with is that you're comparing SRAM running *much faster* than DRAM, so it's an apples-to-oranges comparison(?)<br> </div> Wed, 25 May 2011 13:49:02 +0000 likely() https://lwn.net/Articles/444646/ https://lwn.net/Articles/444646/ oak <div class="FormattedComment"> Thanks! So the issue with likely() was just people putting them in the wrong places, rather than un/likely() itself potentially slowing things down (that was half the issue with the explicit prefetch usage).<br> <p> I typically use unlikely() in my code just to annotate the ifs in error check&amp;logging macros.
If errors aren't unlikely, I think the performance of un/likely() is the least of my issues...<br> <p> </div> Wed, 25 May 2011 13:10:49 +0000 The problem with prefetch https://lwn.net/Articles/444645/ https://lwn.net/Articles/444645/ rilder <div class="FormattedComment"> Fair enough. However, I have a question regarding this. The whole change depends on hardware branch predictors, which are not constant across hardware. I am not sure how good the branch predictors were pre-Nehalem/pre-Sandy Bridge, so if new kernels are used on slightly older hardware, won't they suffer from a lack of both software and hardware prefetch hints? I guess they could have made this a CONFIG_xxx option, but maybe that was infeasible or would have cluttered the code further.<br> </div> Wed, 25 May 2011 13:00:32 +0000 The problem with prefetch https://lwn.net/Articles/444644/ https://lwn.net/Articles/444644/ Cyberax <div class="FormattedComment"> No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.<br> <p> DRAM uses capacitors and only needs to be refreshed from time to time.<br> <p> Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power.<br> </div> Wed, 25 May 2011 12:58:26 +0000 The problem with prefetch https://lwn.net/Articles/444641/ https://lwn.net/Articles/444641/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen</font><br> <p> Wikipedia says "SRAM is more expensive, but faster and significantly less power hungry (especially idle) than DRAM. It is therefore used where either bandwidth or low power, or both, are principal considerations".
This is also what they taught in my degree course; are you certain you're not confused?<br> </div> Wed, 25 May 2011 11:49:51 +0000 The problem with prefetch https://lwn.net/Articles/444638/ https://lwn.net/Articles/444638/ Cyberax <div class="FormattedComment"> <a href="http://www.ai.mit.edu/projects/aries/papers/vector/hammond.pdf">http://www.ai.mit.edu/projects/aries/papers/vector/hammon...</a> <br> <p> Power consumption is 123 kVA for the 32-core version; performance is around 2 GFLOPS per core for vector operations.<br> </div> Wed, 25 May 2011 11:36:24 +0000 I think it's about some particular case of branch prediction... https://lwn.net/Articles/444632/ https://lwn.net/Articles/444632/ mingo <div class="FormattedComment"> <p> It's relatively easy to measure the cost of branch misses in certain cases, such as by using 'perf stat --repeat N' (the branch miss rate is measured by default) with a testcase that uses a pseudo-RNG, so it can run the same workload in random and non-random order and compare the two.<br> <p> And yes, missing branches is crippling to performance: a 3% branch miss rate can cause a 5% total execution slowdown, and a 20% miss rate can already double the runtime of a workload. (!)<br> <p> </div> Wed, 25 May 2011 10:58:08 +0000 The problem with prefetch https://lwn.net/Articles/444628/ https://lwn.net/Articles/444628/ dgm <div class="FormattedComment"> And some performance numbers would also be appreciated.<br> </div> Wed, 25 May 2011 10:23:20 +0000 The problem with prefetch https://lwn.net/Articles/444622/ https://lwn.net/Articles/444622/ kruemelmo <div class="FormattedComment"> Now tell us something about power consumption as well, please.<br> </div> Wed, 25 May 2011 10:03:27 +0000 The problem with outguessing the CPU https://lwn.net/Articles/444612/ https://lwn.net/Articles/444612/ alex <div class="FormattedComment"> I have seen prefetch help on some architectures.
When doing DBT stuff we would often look for places where we could arrange the code to make it as efficient as possible. In the case of Itanium, prefetch was a definite win, as the architecture was structured to leave the hard stuff to the compiler. On x86, our experiments with instruction re-ordering and prefetch generally didn't yield much at all. The main difference is that the x86 expends an awful lot of silicon on logic that attempts to predict all this behaviour for you. It's pretty good at its job as well, given how hard it was for us to squeeze extra out despite having a much better view of how the code was running than a compiler usually has.<br> </div> Wed, 25 May 2011 07:16:47 +0000 The problem with prefetch https://lwn.net/Articles/444610/ https://lwn.net/Articles/444610/ bronson <div class="FormattedComment"> Reminds me of the register keyword. For a few years programmers put 'register' on just about any int that was used more than once... Turns out that in almost every case that was slowing things down. Compilers tend to choose better register variables than programmers.<br> </div> Wed, 25 May 2011 06:16:09 +0000 I think it's about some particular case of branch prediction... https://lwn.net/Articles/444609/ https://lwn.net/Articles/444609/ khim To fill the pipeline on a contemporary CPU you need 30-50 instructions in flight. Without branch prediction it's just impossible to do. If you disable branch prediction on a contemporary CPU the slowdown is crippling. Sadly only Intel engineers can give you numbers (because there is no way to disable it on a retail CPU) - and they are not telling. Wed, 25 May 2011 06:10:24 +0000 The problem with prefetch https://lwn.net/Articles/444569/ https://lwn.net/Articles/444569/ pphaneuf <p>The NEC SX-4 had <em>16 gigabytes of SRAM</em> as its <em>main memory</em>. It didn't really have any CPU cache (only something like 32KB of instruction cache). That was a nice way of not having to deal with cache coherence issues in an SMP system.
<p>It also had a 256 <em>bytes</em> wide memory bus (compared to the typical 64 <em>bits</em>). <p>Serious hardware, that. <tt>:-)</tt> Tue, 24 May 2011 23:48:34 +0000 prefetch and buffer bloat https://lwn.net/Articles/444566/ https://lwn.net/Articles/444566/ Lennie <div class="FormattedComment"> The problem is with the numbers that come out of the tests the manufacturers do to show how great their hardware performs. Those get better with bigger buffers.<br> <p> So again you have manufacturers doing non-real-world tests (which might have been good tests a long time ago) for marketing purposes and optimising for that case.<br> </div> Tue, 24 May 2011 22:58:07 +0000 likely() https://lwn.net/Articles/444550/ https://lwn.net/Articles/444550/ corbet A quick look in the <a rel="nofollow" href="https://lwn.net/Kernel/Index/">LWN kernel index</a> turns up <a rel="nofollow" href="https://lwn.net/Articles/70473/">an article from 2004</a>, <a rel="nofollow" href="https://lwn.net/Articles/182369/">an article from 2006</a>, and <a rel="nofollow" href="https://lwn.net/Articles/420019/">one from last December</a>. Tue, 24 May 2011 21:55:19 +0000 The problem with branch prediction https://lwn.net/Articles/444546/ https://lwn.net/Articles/444546/ davecb <div class="FormattedComment"> I saw a similar issue with prefetch and branch prediction back when I was doing a lot of SPARC work. <br> <p> Branch prediction gave us a bit of extra performance with a few code bases, but the older and better the code, the less we saw. My favorite example is Samba, so a Smarter Colleague[tm] and I looked at what was actually happening. Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. The branch was very often far enough we hit a different i-cache line.
Since we didn't have a way of hinting what line we'd hit, we'd slow down whenever it wasn't trivial-to-predict straight-line code.<br> <p> The better and older the code, the less we would get the next i-cache line sitting waiting for us, and the slower we'd run. Grungy straight-line FORTRAN benefited fine.<br> <p> I don't recollect ever seeing an actual slowdown, but we rarely could see the predicted benefits from branch prediction.<br> <p> I'd venture as much as a five-cent bet we'll see the same with the Intel architecture.<br> <p> --dave<br> </div> Tue, 24 May 2011 21:43:22 +0000 hopefully Andi Kleen's approach will be accepted now. https://lwn.net/Articles/444547/ https://lwn.net/Articles/444547/ ds2horner <div class="FormattedComment"> I reread the article on Andi Kleen's patch.<br> <p> His approach would leave the optimizations in place for those CPUs that could benefit; so K7 and others would have the option.<br> <p> Thanks for the foresight, Andi.<br> <p> <p> </div> Tue, 24 May 2011 21:37:51 +0000