This is why the block layer really ought to support write "threads" with thread-specific write barriers. It is trivial to fall back to converting a barrier into a full cache flush where doing something more intelligent is impractical, but thread-specific barriers are ideal for fast commits of journal entries and the like.
In a typical journalling filesystem, it is usually only the journal entries that need to be synchronously flushed at all. Most writes can be applied independently. If a mounted filesystem had two or more I/O "threads" (or flows), one for journal and synchronous data writes, and one for ordinary data writes, an intelligent lower layer could handle a barrier on one thread by flushing a small amount of journal data, while the other one takes its own sweet time - with a persistent cache, even across a system restart if necessary.
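To make the idea concrete, here is a toy sketch of what a per-flow barrier might look like. The `FlowCache` class, the flow names, and the method signatures are all hypothetical illustrations, not any real block-layer API: a barrier on the "journal" flow forces only that flow's buffered writes to stable storage, while the bulk-data flow stays cached.

```python
from collections import defaultdict

class FlowCache:
    """Toy device write cache with per-flow barriers (illustrative only).

    Each write is tagged with a flow id (e.g. "journal" or "data").
    A barrier on one flow flushes only that flow's pending writes to
    stable storage; other flows stay buffered and can be destaged at
    the device's leisure.
    """
    def __init__(self):
        self.pending = defaultdict(list)   # flow id -> buffered writes
        self.stable = []                   # writes persisted to media

    def write(self, flow, block):
        self.pending[flow].append(block)

    def barrier(self, flow):
        # Flush only this flow; ordering within the flow is preserved.
        self.stable.extend(self.pending.pop(flow, []))

cache = FlowCache()
cache.write("data", "D1")
cache.write("journal", "J1")
cache.write("data", "D2")
cache.barrier("journal")       # fast: only J1 must hit stable storage
print(cache.stable)            # ['J1']
print(dict(cache.pending))     # {'data': ['D1', 'D2']}
```

The point of the sketch is only that the journal commit does not have to wait for D1 and D2 to drain.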
Otherwise, the larger the write cache, the larger the delay when a journal commit comes along. Call it a block-layer version of bufferbloat. As with networking, the typical solution (other than simply making the buffer smaller) is to use multiple class- or flow-based queues. If you don't have flow-based queuing, you really don't want much of a buffer at all, because it causes latency to skyrocket.
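The bufferbloat analogy can be shown with back-of-the-envelope arithmetic. Assuming (hypothetically) a device that drains one buffered block per time unit, a single FIFO cache makes commit latency grow linearly with the amount of buffered data, while a separate journal queue keeps it constant:

```python
DRAIN_PER_TICK = 1  # assumed: blocks the device persists per time unit

def commit_latency_fifo(buffered_data_blocks, journal_writes=1):
    """Single FIFO cache: the commit waits behind every buffered data block."""
    return (buffered_data_blocks + journal_writes) / DRAIN_PER_TICK

def commit_latency_flows(buffered_data_blocks, journal_writes=1):
    """Per-flow queues: the journal flow is drained independently."""
    return journal_writes / DRAIN_PER_TICK

print(commit_latency_fifo(1000))   # 1001.0 ticks; grows with the cache size
print(commit_latency_flows(1000))  # 1.0 tick, regardless of the data backlog
```

This is exactly the networking trade-off: without per-flow queuing, the only way to bound commit latency is to keep the buffer small.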
As a consequence, I don't see how write-back caching can help very much here, unless all writes (or at least metadata for all out-of-order writes) are queued in the cache, so that write ordering is not broken. Am I wrong?
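One hypothetical way a write-back cache could preserve ordering is to keep a small piece of metadata per cached write, an epoch number bumped at each barrier, and only destage blocks from the oldest epoch still present. The class and names below are an illustrative sketch of that idea, not any real cache design:

```python
class OrderedWriteCache:
    """Write-back cache that preserves barrier ordering via epochs (sketch).

    Every write is stamped with the current epoch; a barrier starts a
    new epoch.  Destaging may reorder freely within an epoch but must
    drain epoch N completely before touching epoch N+1, so ordering
    across barriers is never broken.
    """
    def __init__(self):
        self.epoch = 0
        self.cached = []          # list of (epoch, block)
        self.media = []           # blocks persisted, in destage order

    def write(self, block):
        self.cached.append((self.epoch, block))

    def barrier(self):
        self.epoch += 1           # later writes destage after earlier epochs

    def destage_one(self):
        if not self.cached:
            return
        # Pick any block from the oldest epoch still in the cache.
        oldest = min(e for e, _ in self.cached)
        for i, (e, _) in enumerate(self.cached):
            if e == oldest:
                self.media.append(self.cached.pop(i)[1])
                return

cache = OrderedWriteCache()
cache.write("A"); cache.barrier(); cache.write("B")
cache.destage_one(); cache.destage_one()
print(cache.media)   # ['A', 'B']: barrier order preserved on media
```

The cost is exactly what the paragraph above suggests: the cache must remember at least this ordering metadata for every out-of-order write it holds.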