The truth is that we don't have faster memory mostly for economic reasons. If we had proper RAM (SRAM as main memory), we would mostly not need caches at all, except perhaps as FIFOs to absorb signal-propagation delays.
Cache is a very complex thing: in practice, the caches + MMU + branch predictor + multi-master buses will eventually eat most of your chip, area-wise.
Now, to the subject:
Prefetching is useful when you actually know in advance that you'll be doing sequential accesses (and still only if the prefetch happens in the background). This is almost never true for the data cache, unless you're doing block copies. Unfortunately, the way DMA engines are designed does not allow for easy off-processor copies, because setting up, checking for completion, and tearing down the DMA engine all cost too much (note that DMA can also work memory <-> memory, not only device <-> memory).

Also, the cache controller cannot (not that I know of) fetch one cache line while another line is still being fetched (DDR does not have multiple ports, and I don't think you can invalidate a cache line while it's being fetched). This means you have to trust the logic inside your cache controller, and *possibly* give it some hints about what you are trying to do. Technically speaking, such "hints" exist on some buses, like AMBA and Wishbone, so that latencies are mitigated by sequentially transferring large chunks (bursts). I don't know x86 internals well enough to say whether the CPU <-> cache interface takes a similar approach.
The biggest problem is that the pipeline stalls while the cache is filling up or writing back. I believe you can still use the pipeline and execution units while a cache line is being filled, but memory is so slow (in both latency and bandwidth) compared to the processor that you will soon need to access memory again, halting processing.
CPU chips could be 4x smaller if we had proper RAM (faster and multi-ported). That's the truth. Perhaps multidimensional (3D-stacked) chips will bring us this, but I sincerely doubt it.