Posted Sep 9, 2010 13:48 UTC (Thu) by ejr (subscriber, #51652)
Prefetching random accesses requires hundreds of cycles (and growing!) between the prefetch and the access. If a single 'task' runs that long, it is likely hitting memory and/or polluting the cache in the meantime. Plus, current systems have very few pipes out to memory and cannot support many outstanding loads.
Explicit prefetch seems useful only under a narrow set of conditions: you have many lightweight tasks, each with a tiny memory footprint, that run for more than tens of cycles and don't pollute the cache. If the tasks are too short, the prefetches will stomp on each other. If they're too long, you often end up needing other data in the meantime. Graph analysis algorithms benefit, but not much else seems to.
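To make the "many lightweight tasks" case concrete, here is a minimal sketch in C of a gather loop that issues an explicit prefetch a fixed number of iterations ahead. It assumes GCC or Clang (`__builtin_prefetch` is their builtin); the function name `gather_sum` and the `dist` parameter are made up for illustration, and the right distance would have to be measured on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums table[idx[i]] for i in [0, n).
 * While handling element i, it prefetches the element that will be
 * needed `dist` iterations later. dist == 0 disables prefetching. */
uint64_t gather_sum(const uint64_t *table, const size_t *idx, size_t n,
                    size_t dist)
{
    assert(n == 0 || (table != NULL && idx != NULL));

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* dist < n - i is the overflow-safe form of i + dist < n. */
        if (dist != 0 && dist < n - i) {
            /* rw = 0 (read), locality = 0 (little temporal reuse expected). */
            __builtin_prefetch(&table[idx[i + dist]], 0, 0);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The per-element work here is a single add, which is the "too short" end of the range: unless `dist` is large, the prefetch has no time to complete, and a large `dist` runs into the limit on outstanding loads. The result is the same for any `dist`; only the timing changes.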
(What may be more useful architecturally is the ability to stop a HW prefetch engine and retarget it. Consider repeatedly processing the same image/audio frame in memory. The prefetch engine is happily continuing to fetch past the end... It might be nice to say "hey, no, restart prefetching from the beginning" when you're on the last scanline, etc. I don't know if any HW supports it, or if that would even have a place in an OS kernel.)
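A rough software stand-in for that "restart from the beginning" hint can be sketched in C: while processing one scanline, explicitly prefetch the next one, wrapping to the first scanline when on the last. This is not retargeting a HW prefetch engine, which may still run past the end of the frame. The names `next_row_wrapped` and `process_frame_repeatedly` are hypothetical, the 64-byte cache line is an assumption, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_CACHE_LINE 64 /* assumption; varies by CPU */

/* Hypothetical helper: the row to prefetch while working on `row`,
 * wrapping back to row 0 after the last scanline. */
size_t next_row_wrapped(size_t row, size_t rows)
{
    assert(rows > 0 && row < rows);
    return (row + 1 == rows) ? 0 : row + 1;
}

/* Hypothetical helper: sums every byte of the frame `passes` times,
 * prefetching the following scanline (with wrap-around) as it goes. */
uint64_t process_frame_repeatedly(const uint8_t *frame, size_t rows,
                                  size_t stride, size_t passes)
{
    uint64_t sum = 0;
    for (size_t p = 0; p < passes; p++) {
        for (size_t row = 0; row < rows; row++) {
            const uint8_t *next = frame + next_row_wrapped(row, rows) * stride;
            for (size_t off = 0; off < stride; off += ASSUMED_CACHE_LINE) {
                /* rw = 0 (read), locality = 3 (keep in cache if possible). */
                __builtin_prefetch(next + off, 0, 3);
            }
            for (size_t x = 0; x < stride; x++) {
                sum += frame[row * stride + x];
            }
        }
    }
    return sum;
}
```

On the final pass the wrap-around prefetch of row 0 is wasted work; a real implementation would probably skip it.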