In order to make it truly fast, it had to be lockless (at least most of the time), thread-local (no cacheline ping-pong) and entirely in userspace.
I don't see how anything that involves a syscall can ever truly be "free." Even a system which just involves an mmap'ed ring buffer protected by a single mutex will tend to sharply limit scalability on the multi-core hardware of the future (and present).
Maybe journald can accumulate a bunch of messages and send them to the kernel all at once. However, that raises the question: what happens to unflushed messages on a crash? They could use a signal handler to flush the buffer on a fatal signal, but a lot of programs handle signals themselves.
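
To make the batching idea concrete, here's a rough sketch of what I mean: each thread appends to its own private buffer (so there is nothing shared to lock), and the whole batch goes to the kernel in one write() once it fills up. This is not journald's API; `log`, `flush`, `pending` and `FLUSH_THRESHOLD` are names I made up for illustration, and Python is only used to show the shape of the thing, not the performance.

```python
import os
import threading

# Hypothetical names throughout; none of this is journald's actual interface.
FLUSH_THRESHOLD = 4  # messages per batch, an arbitrary value for illustration

# Each thread sees its own attributes on this object, so appending needs no lock.
_local = threading.local()


def log(fd, message):
    """Append to the calling thread's private buffer; flush once it fills up."""
    buf = getattr(_local, "buf", None)
    if buf is None:
        buf = _local.buf = []
    buf.append(message.encode() + b"\n")
    if len(buf) >= FLUSH_THRESHOLD:
        flush(fd)


def flush(fd):
    """Hand the whole batch to the kernel in a single write() call."""
    buf = getattr(_local, "buf", None)
    if buf:
        # Assumption: the batch is small enough to be written in one go;
        # a real implementation would have to handle short writes.
        os.write(fd, b"".join(buf))
        buf.clear()


def pending():
    """Messages in this thread's buffer that would be lost on a crash right now."""
    return len(getattr(_local, "buf", []))
```

If a thread logs five messages with this threshold, the first four go out in one write and the fifth sits in userspace until the next flush. That fifth message is exactly the one that disappears if the process dies first.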