Fixing asynchronous I/O, again
Linux AIO does suffer from a number of ailments. The subsystem is quite complex and requires explicit code in any I/O target for it to be supported. The API is not considered to be one of our best and is not exposed by the GNU C library; indeed, the POSIX AIO support in glibc is implemented in user space and doesn't use the kernel's AIO subsystem at all. For files, only direct I/O is supported; despite various attempts over the years, buffered I/O is not supported. Even direct I/O can block in some settings. Few operations beyond basic reads and writes are supported, and those that are (fsync(), for example) are incomplete at best. Many have wished for a better AIO subsystem over the years, but what we have now still looks a lot like what was merged in 2002.
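For reference, this is roughly what using the native interface looks like today: a single O_DIRECT read submitted and reaped with io_submit() and io_getevents(), calling the system calls directly since glibc provides no wrappers. This is a minimal sketch with abbreviated error handling; the file name is arbitrary.

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <linux/aio_abi.h>          /* aio_context_t, struct iocb, struct io_event */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    void *buf;

    /* No glibc wrappers exist for the native AIO calls, so invoke them directly. */
    if (syscall(SYS_io_setup, 8, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    /* Only direct I/O is supported for files, so O_DIRECT and an aligned buffer. */
    int fd = open("data", O_RDONLY | O_DIRECT);
    if (fd < 0 || posix_memalign(&buf, 4096, 4096)) {
        perror("setup");
        return 1;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;     /* asynchronous read */
    cb.aio_fildes = fd;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = 4096;
    cb.aio_offset = 0;

    struct iocb *list[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, list) != 1) {
        perror("io_submit");
        return 1;
    }

    /* ... do other work, then reap the completion ... */
    struct io_event ev;
    syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
    printf("read completed: %lld\n", (long long)ev.res);

    syscall(SYS_io_destroy, ctx);
    return 0;
}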
Benjamin LaHaise, the original implementer of the kernel AIO subsystem, has recently returned to this area with this patch set. The core change here is to short-circuit much of the kernel code dedicated to the tracking, restarting, and cancellation of AIO requests; instead, the AIO subsystem simply fires off a kernel thread to perform the requested operation. This approach is conceptually simpler; it also has the potential to perform better and, in many cases, makes cancellation more reliable.
With that core in place, Benjamin's patch set adds a number of new operations. It starts with fsync(), which, in current kernels, only works if the operation's target supports it explicitly. A quick grep shows that, in the 4.4 kernel, there is not a single aio_fsync() method defined, so asynchronous fsync() does not work at all. With AIO based on kernel threads, it is a simple matter to just call the regular fsync() method and instantly have working asynchronous fsync() for any I/O target supporting AIO in general (though, as Dave Chinner pointed out, Benjamin's current implementation does not yet solve the whole problem).
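An asynchronous fsync() uses the same submission path as the read sketch above, just with a different opcode; the failure described here shows up at submission time. A fragmentary sketch, reusing the fd and ctx from that example:

struct iocb cb;
memset(&cb, 0, sizeof(cb));
cb.aio_lio_opcode = IOCB_CMD_FSYNC;     /* IOCB_CMD_FDSYNC for fdatasync() semantics */
cb.aio_fildes = fd;

struct iocb *list[1] = { &cb };
/* On a 4.4 kernel this fails with EINVAL: there is no aio_fsync() method to call. */
int ret = syscall(SYS_io_submit, ctx, 1, list);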
In theory, then, fsync() is supported by AIO now, even if it doesn't actually work; a number of other operations are not supported at all. Benjamin's patch set addresses some of those gaps by adding new operations, including openat() (opens are usually blocking operations), renameat(), unlinkat(), and poll(). Finally, it adds an option to request the reading of pages from a file into the page cache (readahead) with the intent that later attempts to access those pages will not block.
For the most part, adding these features is easy once the thread mechanism is in place; there is no longer any need to track partially completed operations or perform restarts. The attempts to add buffered I/O support to AIO in the past were pulled down by their own complexity; adding that support with this mechanism (not done in the current patch set) would not require much more than an internal read() or write() call. The one exception is the openat() support, which requires the addition of proper credential handling to the kernel thread.
The end result would seem to be a significant improvement to the kernel's AIO subsystem, but Linus still didn't like it. He is happy with the desired result and with much of the implementation, but he would like to see the focus be on the targeted capabilities rather than improving an AIO subsystem that, in his mind, is not really fixable. As he put it:
In other words, why is the interface not simply: "do arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread".
That's something that a lot of people might use. In fact, if they can avoid the nasty AIO interface, maybe they'll even use it for things like read() and write().
Linus suggested that the thread-based implementation in Benjamin's patch set could be adapted to this sort of use, but that the interface needs to change.
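Nothing like that interface exists today but, purely as an illustration of the shape Linus is describing, a call might look something like the following; async_submit() and async_wait() are entirely hypothetical names.

/* Hypothetical interface: run "system call X with arguments A, B, C, D"
 * asynchronously in a kernel thread, getting back a ticket to wait on. */
long ticket = async_submit(SYS_fsync, fd, 0, 0, 0);

/* ... the caller keeps working while the kernel thread does the fsync() ... */

long result = async_wait(ticket);       /* also hypothetical */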
Thread-based asynchronous system calls are not a new idea, of course; it has come around a number of times in the past under names like fibrils, threadlets, syslets, and acall. Linus even once posted an asynchronous system call patch of his own as these discussions were happening.

There are some challenges to making asynchronous system calls work properly; there would have to be, for example, a whitelist of the system calls that can be safely run in this mode. As Andy Lutomirski pointed out, "exit is bad". Linus also noted that many system calls and structures as presented by glibc differ considerably from what the kernel provides; it would be difficult to provide an asynchronous system call API that could preserve the interface as seen by programs now.
Those challenges are real, but they may not prevent developers from having another look at the old ideas. But, as Benjamin was quick to point out, none of those approaches ever got to the point where they were ready to be merged. He seemed to think that another attempt now might run into the same sorts of complexity issues; it is not hard to conclude that he would really rather continue with the approach he has taken thus far.
Chances are, though, that this kind of extension to the AIO API is unlikely to make it into the mainline until somebody shows that the more general asynchronous system call approach simply isn't workable. The advantages of the latter are significant enough — and dislike for AIO strong enough — to create a lot of pressure in that direction. Once the dust settles, we may finally see the merging of a feature that developers have been pondering for years.
Index entries for this article
Kernel: Asynchronous I/O
Posted Jan 14, 2016 1:47 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (12 responses)
Why would we add kernel support to perform syscalls asynchronously when we can already do
if (clone() == 0) perform_syscall();
(Admittedly that is an over-simplification, but does fleshing out the details make it more complex than adding new functionality to the kernel?)
Posted Jan 14, 2016 9:05 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (10 responses)
Right. clone() is a bit heavyweight, but you can just replace it with a thread pool. The benefit of AIO is the ability to submit and retrieve the results of multiple operations at a time. Unless userspace is submitting thousands of operations per second, which is pretty much the case only for read/write, there's no real benefit in asynchronous system calls. Userspace can handle what's left (such as openat, and fsync too) with pthreads.

In fact, because AIO actually blocks sometimes, userspace will usually just skip AIO and use a thread pool instead.
You can see this in QEMU, for example; it uses both AIO and a thread pool. Right now the thread pool implementation in QEMU is pretty simple, so it uses quite a lot of CPU due to cache-line bouncing on the lists of pending work items, but despite that it already has performance comparable with AIO, except with really fast backends such as FusionIO.
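As an illustration of the thread-pool alternative, here is a minimal sketch of handing a single blocking fsync() to a worker thread with pthreads; a real pool, like QEMU's, reuses its threads and queues many requests, but the shape is the same.

#include <pthread.h>
#include <unistd.h>

struct async_req {
    int fd;                     /* argument for the blocking call */
    int result;                 /* filled in by the worker        */
    int done;                   /* completion flag                */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

static void *worker(void *arg)
{
    struct async_req *req = arg;
    int ret = fsync(req->fd);           /* the blocking operation */

    pthread_mutex_lock(&req->lock);
    req->result = ret;
    req->done = 1;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* Submit: start a worker doing fsync(fd) and return immediately. */
int async_fsync(int fd, struct async_req *req, pthread_t *tid)
{
    req->fd = fd;
    req->done = 0;
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->cond, NULL);
    return pthread_create(tid, NULL, worker, req);
}

/* Complete: wait for the worker and collect fsync()'s return value. */
int async_fsync_wait(struct async_req *req, pthread_t tid)
{
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    pthread_mutex_unlock(&req->lock);
    pthread_join(tid, NULL);
    return req->result;
}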
Posted Jan 14, 2016 14:55 UTC (Thu)
by bcrl (guest, #5934)
[Link] (6 responses)
Posted Jan 14, 2016 21:05 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (5 responses)
Posted Jan 14, 2016 21:31 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
Posted Jan 15, 2016 9:12 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (3 responses)
Seriously: the number of such writebacks you can do per second is low enough that you probably won't get much benefit from using kernel threads and from batching submissions. If you need to do thousands of writebacks per second, buy yourself a UPS or a disk with non-volatile (battery-backed) cache. I would like to see numbers (# of ops per second on *real-world* use cases, CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.
Posted Jan 15, 2016 12:02 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
I rather doubt that. I mean, with a decent PCIe-attached enterprise SSD you can do a *lot* of flushes/sec. But to actually utilize the hardware, you always need several writes to be in progress in parallel. While you probably need several submission threads (ideally one per actual core) for full utilization, using a thread pool large enough to have the required number of writes in progress at the same time introduces too much context switching.

At the moment you can't even really utilize the actual potential of "prosumer" SSDs for random write workloads. Sequential I/O is fine because it's quickly bottlenecked by the bus anyway. But if you are, e.g., an RDBMS (my corner) and you want to efficiently flush victim pages from an in-memory buffer back to disk, you'll quickly end up bottlenecked on latency.

Obviously this is only really interesting for rather I/O-intensive workloads.
> I would like to see numbers
Fair enough.
> # of ops per second on *real-world* usecases
I can only speak from the PostgreSQL corner here. But 50-100k 8192-byte dirty blocks written back per second is easily achievable. At that point, in my testing, we're bottlenecked on sync_file_range(SYNC_FILE_RANGE_WRITE) latency, because it starts blocking quite soon (note we're doing a separate fsync for actual durability later; the s_f_r is just to keep the amount of work done by fsync bounded).
> CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.
To some degree that does require a decent kernelspace implementation in a usable state for comparison.
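For readers not familiar with the call under discussion, the pattern being described is roughly the following sketch (fd and offset are placeholders): SYNC_FILE_RANGE_WRITE starts writeback of already-dirty pages without waiting for it to finish, and a later fsync() provides the actual durability.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback of one dirty 8192-byte block; this returns quickly until
 * the device's request queue congests, which is the blocking behavior
 * described above. */
static void writeback_block(int fd, off_t offset)
{
    sync_file_range(fd, offset, 8192, SYNC_FILE_RANGE_WRITE);
}

/* Later, a single fsync(fd) makes the data durable; the earlier writebacks
 * just keep the amount of work that fsync() must do bounded. */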
Posted Jan 15, 2016 12:08 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link]
Posted Jan 21, 2016 18:22 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
My reaction entirely. For a database server, it's all very well saying "it won't make much of an improvement overall", but if it's applicable to 90% of the workload of a dedicated server, then it's going to make one heck of a difference to that server.
And if those dedicated servers are a class where they are typically under heavy load, then this becomes a pretty obvious scalability issue - it bites when heavy-duty hardware is under heavy load - so the option of "throwing hardware at the problem" is not available ...
Cheers,
Wol
Posted Jan 22, 2016 4:33 UTC (Fri)
by liam (guest, #84133)
[Link] (2 responses)

So, FusionIO should remain the near-unicorn that need not concern anyone (other than them).
Posted Jan 22, 2016 11:04 UTC (Fri)
by intgr (subscriber, #39733)
[Link] (1 responses)
Hard to say for sure how realistic these figures are, as no products are on the market yet, but 3D XPoint *claims* to be that technology. The numbers from a few news articles claim close to one order of magnitude improvement in IOPS compared to plain old flash memory, for the first generation of products.
> In an NVMe-based solid state drive, XPoint chips can deliver more than 95,000 I/O operations per second at a 9 microsecond latency, compared to 13,400 IOPs and 73 ms latency for flash
http://www.eetimes.com/document.asp?doc_id=1328682
http://hothardware.com/news/intel-and-micron-jointly-drop...
Posted Jan 22, 2016 22:40 UTC (Fri)
by liam (guest, #84133)
[Link]

XPoint is exactly what I had in mind, and why it makes sense to tackle this issue properly sooner rather than later.

That proper AIO keeps coming up should be an additional reason to take this seriously. It's not as though the other major kernels are missing this feature.
Posted Jan 14, 2016 16:01 UTC (Thu)
by tshow (subscriber, #6411)
[Link]
> Why would we add kernel support to perform syscalls asynchronously when we can already do
> if (clone() == 0) perform_syscall();

Well, in my case, because it would be nice for my game engine to be able to load files in the background without spinning off a disk management thread. On most game consoles there's some async version of read() that looks something like:
async_cookie_t read_async(int file, void *buffer, size_t bytes);
And a corresponding:
bool async_op_complete(async_cookie_t cookie);
There's often a corresponding write_async() if it makes sense, but with game consoles you're often using read-only storage.
Having an explicit pollable async read means that somewhere in the main loop can be a simple data loader that maintains a list of things that need to be loaded and the processing that needs to be done. All of this can happen in a single (user) thread, without having to drag pipes, mutexes or cross-thread memory management into the picture, and without bogging down the responsiveness of the UI. This matters greatly when you're (say) dragging a gigabyte of sound and graphic data off the disk while trying to keep the UI updating at 60Hz.
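A sketch of the kind of single-threaded loader loop being described, using the commenter's illustrative read_async()/async_op_complete() interface; the pending[] bookkeeping and helper functions here are made up for the example.

/* Once per frame: poll outstanding loads and finish the ones that are done. */
for (int i = 0; i < num_pending; i++) {
    struct pending_load *p = &pending[i];

    if (!async_op_complete(p->cookie))
        continue;                       /* still in flight; check again next frame */

    process_loaded_asset(p->buffer, p->bytes);  /* decode/upload the finished data */
    retire_pending(i--);                        /* drop the entry, adjust the index */
}
render_frame();                                 /* the UI keeps running at 60Hz */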
Posted Jan 14, 2016 12:01 UTC (Thu)
by HIGHGuY (subscriber, #62277)
[Link] (1 responses)
It's always easier/faster/... to write:

wait_struct = start_something_async();
return wait(wait_struct);

to perform sync calls using async primitives than to fire off threads to simulate async calls with sync primitives.
I think Linus is spot on, that performing async system calls makes for a nice system. One could start off building that in a generic way (via kernel threads), then add specializations where a subsystem is capable of it.
Posted Jan 14, 2016 15:15 UTC (Thu)
by bcrl (guest, #5934)
[Link]
Posted Jan 24, 2016 6:32 UTC (Sun)
by toyotabedzrock (guest, #88005)
[Link]