IOgrind was released a while ago. Its advantage over live-system
profiling is that the results are deterministic, whereas live-system
performance measurements can (according to Meeks) differ by as much as
10% (on Linux) from run to run. On a properly designed system, you no
longer find bottlenecks that large; they are smaller, so they can get
lost in that run-to-run noise.
If the bottlenecks are larger, I would assume one could catch them even
with strace (just strace all applicable processes at the same time).
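
To make the strace idea a bit more concrete, here is a rough sketch (not
anything IOgrind itself does). It assumes the trace was captured with
something like `strace -f -T -o trace.log -p PID` (repeating `-p` for
each process), so that every completed call ends in `<seconds>`. The
function name `summarize` and the file name `trace.log` are made up for
illustration, and calls split into "unfinished"/"resumed" pairs are
simply skipped to keep the parsing short.

```python
import re
from collections import defaultdict

# Matches a completed syscall line from `strace -T`, e.g.
#   1234 read(3, "root", 4096) = 4 <0.002000>
# Group 1 is the syscall name, group 2 the time spent in the call.
LINE_RE = re.compile(r'(\w+)\(.*\)\s*=\s.*<(\d+\.\d+)>\s*$')


def summarize(lines):
    """Return [(syscall, total_seconds, calls)], slowest syscall first.

    Simplification (an assumption of this sketch): "<unfinished ...>" and
    "<... resumed>" fragments are ignored rather than stitched together.
    """
    total = defaultdict(float)
    calls = defaultdict(int)
    for line in lines:
        if '<unfinished' in line or '<... ' in line:
            continue  # split call, skipped in this sketch
        match = LINE_RE.search(line)
        if match is None:
            continue  # signals, exit notices, anything without a duration
        name, seconds = match.group(1), float(match.group(2))
        total[name] += seconds
        calls[name] += 1
    return sorted(
        ((name, total[name], calls[name]) for name in total),
        key=lambda row: row[1],
        reverse=True,
    )
```

Passing an open trace file to `summarize` (a file object iterates over
its lines) gives a list with the most expensive syscalls at the top. A
large bottleneck should stand out there; the small ones mentioned above
most likely will not.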