Even if kernel threads are needed only for the operations that really *need* to be asynchronous, and even if those threads are recycled rather than created and destroyed on demand, this will still be quite expensive at the kernel level.
If one thread per outstanding operation or per client is too many, there are good userspace thread pool implementations that dedicate a few threads to waiting for IO completions whilst others get on with whatever work can proceed immediately.
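To make the shape of such a pool concrete, here is a minimal sketch, not any particular library's design. It assumes a Unix system, uses pipes as stand-ins for client connections, and the names `run_pool`, `waiter` and `worker` are made up for illustration. One thread blocks waiting for readiness on all descriptors while a small number of workers process whatever has arrived.

```python
import os
import queue
import selectors
import threading


def run_pool(messages, n_workers=2):
    """Hypothetical demo: one waiter thread, n_workers processing threads.

    Each entry in `messages` must be a non-empty bytes object; it is
    written to its own pipe, which stands in for a client connection.
    """
    sel = selectors.DefaultSelector()
    ready = queue.Queue()            # hand-off from waiter to workers
    results = []
    results_lock = threading.Lock()

    pipes = [os.pipe() for _ in messages]
    for r, _ in pipes:
        sel.register(r, selectors.EVENT_READ)

    def waiter():
        # The only thread that blocks on IO readiness.
        pending = len(pipes)
        while pending:
            for key, _ in sel.select():
                data = os.read(key.fd, 4096)
                sel.unregister(key.fileobj)
                ready.put(data)
                pending -= 1
        for _ in range(n_workers):   # one shutdown sentinel per worker
            ready.put(None)

    def worker():
        # Workers never wait on IO, only on the hand-off queue.
        while True:
            data = ready.get()
            if data is None:
                return
            with results_lock:
                results.append(data.upper())   # placeholder for real work

    threads = [threading.Thread(target=waiter)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Simulate clients sending data.
    for (_, w), msg in zip(pipes, messages):
        os.write(w, msg)

    for t in threads:
        t.join()
    for r, w in pipes:
        os.close(r)
        os.close(w)
    sel.close()
    return sorted(results)
```

The number of threads here is fixed by the application (one waiter plus `n_workers`), regardless of how many descriptors are outstanding, which is exactly the kind of policy decision that is easy to make in userspace.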
I'm not convinced that pushing the thread pool down into the kernel is a performance win.
The Linux thread implementation stuck to a 1:1 relationship between userspace and kernel threads for very good reasons: multiplexing application tasks onto a smaller number of system threads is hard to do in a generic way. Those choices are best made by the application developer, so thread pool implementations belong in userspace.
I don't really see the point of supporting POSIX signal-driven AIO at the kernel level if the implementation uses threads and sits on top of the existing synchronous IO. A userspace library could do it just as reliably using select() and kill(), for those few applications that insist on the POSIX AIO interface for whatever reason.
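A rough sketch of the idea, under stated assumptions: this is not the POSIX AIO interface itself (there is no `aio_read()` here), only an illustration of the mechanism. A helper thread waits in select(), performs the ordinary synchronous read, and then uses kill() to send the process a completion signal. The function name `signal_driven_read`, the choice of SIGUSR1 and the pipe standing in for a real file descriptor are all made up for the example, and it must be called from the main thread because Python only allows signal handlers to be installed there.

```python
import os
import select
import signal
import threading
import time


def signal_driven_read(payload=b"hello", timeout=5.0):
    """Hypothetical demo of signal-driven completion built in userspace."""
    r, w = os.pipe()
    result = {}
    completed = []

    def on_complete(signum, frame):
        # Completion notification, analogous to a signal-driven AIO callback.
        completed.append(signum)

    previous = signal.signal(signal.SIGUSR1, on_complete)

    def helper():
        select.select([r], [], [])             # block until r is readable
        result["data"] = os.read(r, 4096)      # plain synchronous read
        os.kill(os.getpid(), signal.SIGUSR1)   # notify the process

    t = threading.Thread(target=helper)
    t.start()

    # The caller is free to do other work; here the data simply arrives.
    os.write(w, payload)

    # Poll until the handler has run (Python runs handlers in the main thread).
    deadline = time.monotonic() + timeout
    while not completed and time.monotonic() < deadline:
        time.sleep(0.01)

    t.join()
    if previous is not None:
        signal.signal(signal.SIGUSR1, previous)
    os.close(r)
    os.close(w)
    return result.get("data"), completed
```

Nothing in this needs kernel support beyond what select() and kill() already provide, which is the point: the thread sits in the library, not in the kernel.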
That said, the kernel handles asynchronous events all the time. Why exactly is it so hard to let userspace handle them asynchronously too at a low level, without going through the synchronous layer?