AFAIK no kernel code is needed for operation of the CPU caches, since the BIOS does all the setup (with the exception of marking uncacheable memory ranges on some systems).
As for DMA, surely the system could manage that automatically as well?
That is, IOMMUs would map into the automatically managed 64-bit address space, and the system would move memory for DMA just as it does for CPU access. Just as PCIe DMA is cache-coherent with L1/L2/L3, it could be cache-coherent with this hypothetical L4 cache.
To clarify, a simple way to do this would be to add a few gigabytes of per-node L4 cache (in standalone chips) and use the same cache-coherency mechanism for it that is used at the L3 level.
The advantage could be that memory movement would happen by specialized hardware in parallel with CPU operation.
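To make the idea a bit more concrete, here is a toy software model of such a transparent L4 layer. It is only a sketch: the ToyL4Cache class, page-granularity tracking, and the LRU eviction policy are all my assumptions for illustration, and real hardware would do this in silicon with its own replacement policy. The point is that CPU and DMA accesses go through the same path, with no software deciding what gets moved.

```python
from collections import OrderedDict


class ToyL4Cache:
    """Hypothetical model: fixed-capacity LRU cache of pages in front of slow memory."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page number -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, page):
        """Record a CPU or DMA access to `page`; return True on a hit."""
        if page in self.pages:
            self.pages.move_to_end(page)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page] = True
        return False


# CPU and DMA accesses are treated identically by the cache.
cache = ToyL4Cache(capacity_pages=2)
results = [cache.access(p) for p in [1, 2, 1, 3, 2]]
print(results, cache.hits, cache.misses)
```

With a capacity of two pages, the third access hits and the others miss, since page 2 has been evicted by the time it is touched again.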