The fundamental problem with AIO is that the block drivers aren't interruptible at their core. They need to run on a thread and block waiting on certain hardware operations.
So whether you're using kernel- or user-thread AIO, a thread will be sitting there waiting to unblock. Kernel AIO can be faster only because it copies data more efficiently. So perhaps we should look for more designs like vmsplice() and worry less about disk AIO per se. But other than that, emulating AIO with user threads isn't much different from in-kernel AIO. And you can build a userland, pollable interface using eventfd().
BTW, kqueue(2) provides all the things you talk about, in particular on FreeBSD, which defines a kevent filter for AIO (EVFILT_AIO).