Well, the data cache utilization that he conveniently doesn't highlight can make a huge difference for some software.
You can probably show a 3x speedup with an ad-hoc benchmark that traverses a randomly arranged linked list that fits exactly in LLC when using 32-bit pointers (with 64-bit pointers the same list is twice as big, so roughly half of the hops go out to RAM), since on a 4 GHz Sandy Bridge accessing RAM apparently takes about 6x as long as accessing LLC.
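Something like the following is what I have in mind for that benchmark. It's only a rough sketch under a few assumptions of mine: the "pointers" are array indices (4 or 8 bytes wide) rather than real pointers, so both variants can be built in one 64-bit binary; each node is nothing but its link; and the names `chase`, `build_cycle` and `xorshift64` are made up for illustration. The list is shuffled into a single random cycle so the prefetcher can't guess the next node.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* xorshift64 step; a tiny PRNG so the shuffle does not depend on RAND_MAX. */
static uint64_t xorshift64(uint64_t *s) {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Sattolo's algorithm: fill next[] with a single random cycle over all n
   slots, so a traversal touches every node before it repeats. */
static void build_cycle(uint64_t *next, size_t n, uint64_t seed) {
    uint64_t s = seed ? seed : 1;            /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) next[i] = i;
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&s) % i);   /* 0 <= j < i, never i */
        uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
}

/* Follow `hops` links starting at node 0 through a list of n nodes whose
   links are link_bytes (4 or 8) wide. Returns CPU nanoseconds per hop, or
   -1.0 on bad arguments or allocation failure. *end receives the index of
   the final node, which also keeps the compiler from deleting the loop. */
double chase(size_t n, size_t hops, int link_bytes, uint64_t seed, uint64_t *end) {
    if (n == 0 || hops == 0 || n > UINT32_MAX ||
        (link_bytes != 4 && link_bytes != 8))
        return -1.0;

    uint64_t *wide = malloc(n * sizeof *wide);
    uint32_t *narrow = malloc(n * sizeof *narrow);
    if (!wide || !narrow) { free(wide); free(narrow); return -1.0; }

    build_cycle(wide, n, seed);
    for (size_t i = 0; i < n; i++) narrow[i] = (uint32_t)wide[i];

    uint64_t p = 0;
    clock_t t0 = clock();
    if (link_bytes == 4) {
        uint32_t q = 0;
        for (size_t h = 0; h < hops; h++) q = narrow[q];   /* dependent loads */
        p = q;
    } else {
        for (size_t h = 0; h < hops; h++) p = wide[p];     /* dependent loads */
    }
    clock_t t1 = clock();

    free(wide);
    free(narrow);
    *end = p;
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)hops;
}
```

To try it, call `chase(n, hops, 4, seed, &end)` and `chase(n, hops, 8, seed, &end)` from a `main` with the same seed, pick `n` so that `n * 4` bytes is about the size of your LLC, and make `hops` several times `n` so the cold first pass is amortized. Whether you actually see 3x will depend on the machine, the replacement policy, TLB misses and so on.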