The key issue with the performance penalty is that application writers intend fsync() to apply only to their own file(s), but it actually forces a file system-wide barrier. Fixing that, so that fsync() selectively syncs just the relevant subset of the journal, should help with the performance issues.
(A possible extension then might be to have fsyncl(), which accepts a list of fds to sync at the same time, but it is not strictly required.)
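To make the shape of such a call concrete, here is a minimal userspace sketch. fsyncl() is hypothetical and does not exist in any libc or kernel; this stand-in simply loops over os.fsync(), so it shows only what the interface might look like and gives none of the batching benefit a real in-kernel version could.

```python
import os


def fsyncl(fds):
    """Hypothetical helper: flush every descriptor in fds to stable storage.

    Userspace stand-in only. It issues one os.fsync() per descriptor,
    so the file system still sees separate sync requests.
    """
    # Drop duplicate descriptors while keeping the caller's order,
    # so each file is synced at most once.
    for fd in dict.fromkeys(fds):
        os.fsync(fd)  # raises OSError if fd is invalid or the sync fails
```

A caller would pass the descriptors it wants made durable together, e.g. fsyncl([data_fd, index_fd]), where both names are placeholders.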
Or, of course, to get application writers to use more async IO.
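As a rough illustration of what that could look like from the application side, the sketch below pushes the sync onto a worker thread so the caller does not block on it. This is a thread-pool stand-in rather than real POSIX AIO (aio_fsync(), which the Python standard library does not expose), and the name fsync_async is made up for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps sync requests in submission order.
_sync_pool = ThreadPoolExecutor(max_workers=1)


def fsync_async(fd):
    """Queue an fsync of fd on a background thread and return a Future.

    The caller must keep fd open until the Future completes;
    calling result() on it blocks until the sync has finished
    and re-raises any OSError from os.fsync().
    """
    return _sync_pool.submit(os.fsync, fd)
```

The application can carry on with other work and only call result() on the returned Future at the point where it actually needs the data to be durable.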