The problem with prefetch

The problem with prefetch

Posted May 24, 2011 19:07 UTC (Tue) by alvieboy (subscriber, #51617)
Parent article: The problem with prefetch

The truth is that we don't have faster computers for economic reasons. If we had proper RAM (SRAM), we would mostly not need cache (except perhaps as a FIFO, because of signal propagation delays).

Cache is a very complex thing. In fact, cache + MMU + branch predictor + multimaster buses will eventually eat most of your chip, space-wise.

Now, to the subject:

Prefetching is useful when you actually know you'll be doing sequential accesses (and even then only if it happens in the background). This is almost never true for the data cache, unless you're doing block copies. Unfortunately, the way DMA engines are designed does not allow for easy off-processor copies, because setting up the DMA engine, checking for completion, and tearing it down cost too much (DMA can also work memory <-> memory, not only device <-> memory).

Now, as far as I know, the cache controller cannot fetch a cache line while another line is being fetched (DDR does not have multiple ports, and I don't think you can invalidate a cache line while it's being fetched). This means that you have to trust the logic inside your cache manager, and *eventually* give it some hints about what you are trying to do. Technically speaking, these "hints" may exist on some buses, like AMBA and Wishbone, so that latencies are mitigated by sequentially transferring large chunks. I don't know x86 internals well enough to say whether CPU<->cache uses a similar approach.
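To make the "hints" idea concrete on the software side, here is a minimal sketch of a sequential read loop that asks for data a few elements ahead. It assumes GCC or Clang, whose `__builtin_prefetch` builtin is used here; the function name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are hypothetical and chosen only for illustration. Whether such a hint helps at all depends on the hardware, which may already detect a sequential pattern on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many elements ahead to hint.
 * The right value (if any) is hardware- and workload-dependent. */
#define PREFETCH_DISTANCE 16

/* Sum an array sequentially, hinting the cache about upcoming reads.
 * The hint never changes the result; the CPU is free to ignore it. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Arguments: address, 0 = read access, 0 = no temporal locality. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
#endif
        total += data[i];
    }
    return total;
}
```

Calling `sum_with_prefetch(buf, len)` returns the same value as a plain loop; only the timing can differ, and that would need to be measured rather than assumed.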

The biggest problem is that the pipeline stalls while the cache is filling up or writing back. I believe you can still use the pipeline and execution units while the cache is being filled, but again this is so slow (both latency- and speed-wise) compared to the processor speed that you'll eventually need to access memory again, thus halting the processing.

CPU chips could be 4x smaller if we had proper RAM (faster and multiport). This is the truth. Perhaps multidimensional chips will bring us this, but I sincerely doubt it.

Alvie



The problem with prefetch

Posted May 24, 2011 21:11 UTC (Tue) by dgm (subscriber, #49227) [Link]

Correct me if I'm wrong, but I think that fast SRAM is not only expensive, but also has to be very close to the processor core (ideally on the same die) to be effective.
Expensive and fast RAM close to the processor... hummm... isn't that the cache?

The problem with prefetch

Posted May 24, 2011 21:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen.

That's why it's not used even on supercomputers.

The problem with prefetch

Posted May 24, 2011 23:48 UTC (Tue) by pphaneuf (guest, #23480) [Link]

The NEC SX-4 had 16 gigabytes of SRAM as its main memory. It didn't really have any CPU cache (only something like 32KB of instruction cache). That was a nice way of not having to deal with cache coherence issues in an SMP system.

It also had a 256-byte-wide memory bus (compared to the typical 64 bits).

Serious hardware, that. :-)

The problem with prefetch

Posted May 25, 2011 10:03 UTC (Wed) by kruemelmo (subscriber, #8279) [Link]

Now tell us something about power consumption as well please.

The problem with prefetch

Posted May 25, 2011 10:23 UTC (Wed) by dgm (subscriber, #49227) [Link]

And some performance numbers would also be appreciated.

The problem with prefetch

Posted May 25, 2011 11:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

http://www.ai.mit.edu/projects/aries/papers/vector/hammon...

Power consumption is 123 kVA for the 32-core version, and performance is around 2 GFLOPS per core for vector operations.
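For a rough sense of scale, the two figures above can be combined. This is only back-of-the-envelope arithmetic: it assumes the per-core figure simply multiplies across all 32 cores, and kVA is apparent power, so the ratio is not a true FLOPS-per-watt number. The function names are hypothetical.

```c
/* Aggregate vector throughput, assuming it scales linearly with cores. */
double aggregate_gflops(unsigned cores, double gflops_per_core)
{
    return (double)cores * gflops_per_core;
}

/* Throughput per kVA of apparent power (not real power in watts). */
double gflops_per_kva(double gflops, double kva)
{
    return gflops / kva;
}
```

With the numbers quoted, `aggregate_gflops(32, 2.0)` gives 64 GFLOPS, and `gflops_per_kva(64.0, 123.0)` comes out at roughly 0.52 GFLOPS per kVA.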

The problem with prefetch

Posted May 25, 2011 19:50 UTC (Wed) by wazoox (subscriber, #69624) [Link]

Impressive machine, not even puny by today's standards...

The problem with prefetch

Posted May 25, 2011 22:05 UTC (Wed) by pphaneuf (guest, #23480) [Link]

Yeah, I don't know about the power consumption, but while it would put out a good deal of heat, the SX-4 was the first of the SX series to be air-cooled. I'm not sure if the SX-3 used SRAM or DRAM (it was before my time), but that one was water-cooled.

While the vector performance was amazing, it was pretty sluggish for scalar stuff, so we didn't even use it for compiling, it was too slow.

Not saying that this is the end-all, be-all (the SX-5 traded in the SRAM for DRAM, and in exchange got 128 gigabytes of it, which was a whole lot in 1999 or 2000!), but just pointing out that it's been used on supercomputers (and I've used them), and it didn't require liquid nitrogen.

Here is a photo of one of the four SX-5s I helped look after.

The problem with prefetch

Posted May 25, 2011 11:49 UTC (Wed) by nye (guest, #51576) [Link]

>SRAM is not only expensive, it's also quite power-intensive. You'd have to cool it with liquid nitrogen

Wikipedia says "SRAM is more expensive, but faster and significantly less power hungry (especially idle) than DRAM. It is therefore used where either bandwidth or low power, or both, are principal considerations". This is also what they taught in my degree course; are you certain you're not confused?

The problem with prefetch

Posted May 25, 2011 12:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.

DRAM uses capacitors and only needs to be refreshed from time to time.

Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power.

The problem with prefetch

Posted May 25, 2011 13:49 UTC (Wed) by nye (guest, #51576) [Link]

>No. SRAM uses 6 transistors, and in idle mode continuously seeps energy. That's OK if you have a small cache buffer but when you have gigabytes of SRAM it adds up very quickly.

> DRAM uses capacitors and only needs to be refreshed from time to time.

All of the resources I can find describing SRAM state that the power used continuously while idle is trivial compared to the power needed to constantly refresh DRAM, and hence it's only during heavy utilisation that its power consumption can get *up to* that of DRAM.

>Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power

I'm sure what you're saying is true in your case, but how do you reconcile that with the fact that every other comparison between the two disagrees?

The only idea I can come up with is that you're comparing SRAM running *much faster* than DRAM so it's an apples-to-oranges comparison(?)

The problem with prefetch

Posted May 25, 2011 16:38 UTC (Wed) by bronson (subscriber, #4806) [Link]

> SRAM uses 6 transistors

Yep.

> and in idle mode continuously seeps energy

Not if made out of CMOS (and these days everything is CMOS). At steady state the only power loss in SRAM is gate and substrate leakage. Negligible.

> DRAM uses capacitors and only needs to be refreshed from time to time.

Yeah, 50 times a second. It adds up to quite a bit of power. Plus DRAMs have all the substrate leakage of the SRAM.

I'm pretty sure you're comparing supercomputer SRAMs from the early 80s, clocked to the limits of their lives, with low power modern DRAMs. If you compare equivalent parts you'll see that DRAMs are smaller but burn a lot more power.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds