Fixing asynchronous I/O, again
Linux AIO does suffer from a number of ailments. The subsystem is quite complex and requires explicit code in any I/O target for it to be supported. The API is not considered to be one of our best and is not exposed by the GNU C library; indeed, the POSIX AIO support in glibc is implemented in user space and doesn't use the kernel's AIO subsystem at all. For files, only direct I/O is supported; despite various attempts over the years, buffered I/O is not supported. Even direct I/O can block in some settings. Few operations beyond basic reads and writes are supported, and those that are (fsync(), for example) are incomplete at best. Many have wished for a better AIO subsystem over the years, but what we have now still looks a lot like what was merged in 2002.
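For reference, this is roughly what using the native interface looks like today: a single O_DIRECT read submitted and reaped with io_submit() and io_getevents(), calling the system calls directly since glibc provides no wrappers. This is a minimal sketch with abbreviated error handling; the file name is arbitrary.

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <linux/aio_abi.h>          /* aio_context_t, struct iocb, struct io_event */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    void *buf;

    /* No glibc wrappers exist for the native AIO calls, so invoke them directly. */
    if (syscall(SYS_io_setup, 8, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    /* Only direct I/O is supported for files, so O_DIRECT and an aligned buffer. */
    int fd = open("data", O_RDONLY | O_DIRECT);
    if (fd < 0 || posix_memalign(&buf, 4096, 4096)) {
        perror("setup");
        return 1;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;     /* asynchronous read */
    cb.aio_fildes = fd;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = 4096;
    cb.aio_offset = 0;

    struct iocb *list[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, list) != 1) {
        perror("io_submit");
        return 1;
    }

    /* ... do other work, then reap the completion ... */
    struct io_event ev;
    syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
    printf("read completed: %lld\n", (long long)ev.res);

    syscall(SYS_io_destroy, ctx);
    return 0;
}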
Benjamin LaHaise, the original implementer of the kernel AIO subsystem, has recently returned to this area with this patch set. The core change here is to short-circuit much of the kernel code dedicated to the tracking, restarting, and cancellation of AIO requests; instead, the AIO subsystem simply fires off a kernel thread to perform the requested operation. This approach is conceptually simpler; it also has the potential to perform better and, in many cases, makes cancellation more reliable.
With that core in place, Benjamin's patch set adds a number of new operations. It starts with fsync(), which, in current kernels, only works if the operation's target supports it explicitly. A quick grep shows that, in the 4.4 kernel, there is not a single aio_fsync() method defined, so asynchronous fsync() does not work at all. With AIO based on kernel threads, it is a simple matter to just call the regular fsync() method and instantly have working asynchronous fsync() for any I/O target supporting AIO in general (though, as Dave Chinner pointed out, Benjamin's current implementation does not yet solve the whole problem).
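An asynchronous fsync() uses the same submission path as the read sketch above, just with a different opcode; the failure described here shows up at submission time. A fragmentary sketch, reusing the fd and ctx from that example:

struct iocb cb;
memset(&cb, 0, sizeof(cb));
cb.aio_lio_opcode = IOCB_CMD_FSYNC;     /* IOCB_CMD_FDSYNC for fdatasync() semantics */
cb.aio_fildes = fd;

struct iocb *list[1] = { &cb };
/* On a 4.4 kernel this fails with EINVAL: there is no aio_fsync() method to call. */
int ret = syscall(SYS_io_submit, ctx, 1, list);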
In theory, then, fsync() is supported by AIO now, even if it doesn't actually work; a number of other operations are not supported at all. Benjamin's patch set addresses some of those gaps by adding new operations, including openat() (opens are usually blocking operations), renameat(), unlinkat(), and poll(). Finally, it adds an option to request the reading of pages from a file into the page cache (readahead) with the intent that later attempts to access those pages will not block.
For the most part, adding these features is easy once the thread mechanism is in place; there is no longer any need to track partially completed operations or perform restarts. The attempts to add buffered I/O support to AIO in the past were pulled down by their own complexity; adding that support with this mechanism (not done in the current patch set) would not require much more than an internal read() or write() call. The one exception is the openat() support, which requires the addition of proper credential handling to the kernel thread.
The end result would seem to be a significant improvement to the kernel's AIO subsystem, but Linus still didn't like it. He is happy with the desired result and with much of the implementation, but he would like to see the focus be on the targeted capabilities rather than improving an AIO subsystem that, in his mind, is not really fixable. As he put it:
In other words, why is the interface not simply: "do arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread".
That's something that a lot of people might use. In fact, if they can avoid the nasty AIO interface, maybe they'll even use it for things like read() and write().
Linus suggested that the thread-based implementation in Benjamin's patch set could be adapted to this sort of use, but that the interface needs to change.
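Nothing like that interface exists today but, purely as an illustration of the shape Linus is describing, a call might look something like the following; async_submit() and async_wait() are entirely hypothetical names.

/* Hypothetical interface: run "system call X with arguments A, B, C, D"
 * asynchronously in a kernel thread, getting back a ticket to wait on. */
long ticket = async_submit(SYS_fsync, fd, 0, 0, 0);

/* ... the caller keeps working while the kernel thread does the fsync() ... */

long result = async_wait(ticket);       /* also hypothetical */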
Thread-based asynchronous system calls are not a new idea, of course; it has come around a number of times in the past under names like fibrils, threadlets, syslets, and acall. Linus even once posted an asynchronous system call patch of his own as these discussions were happening.

There are some challenges to making asynchronous system calls work properly; there would have to be, for example, a whitelist of the system calls that can be safely run in this mode. As Andy Lutomirski pointed out, "exit is bad". Linus also noted that many system calls and structures as presented by glibc differ considerably from what the kernel provides; it would be difficult to provide an asynchronous system call API that could preserve the interface as seen by programs now.
Those challenges are real, but they may not prevent developers from having another look at the old ideas. But, as Benjamin was quick to point out, none of those approaches ever got to the point where they were ready to be merged. He seemed to think that another attempt now might run into the same sorts of complexity issues; it is not hard to conclude that he would really rather continue with the approach he has taken thus far.
Chances are, though, that this kind of extension to the AIO API is unlikely to make it into the mainline until somebody shows that the more general asynchronous system call approach simply isn't workable. The advantages of the latter are significant enough — and dislike for AIO strong enough — to create a lot of pressure in that direction. Once the dust settles, we may finally see the merging of a feature that developers have been pondering for years.
Index entries for this article
Kernel: Asynchronous I/O
Posted Jan 14, 2016 1:47 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (12 responses)
Why would we add kernel support to perform syscalls asynchronously when we can already do
if (clone() == 0) perform_syscall();
(Admittedly that is an over-simplification, but does fleshing out the details make it more complex than adding new functionality to the kernel?)
Posted Jan 14, 2016 9:05 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (10 responses)
Right. clone() is a bit heavyweight, but you can just replace it with a thread pool. The benefit of AIO is the ability to submit and retrieve the results of multiple operations at a time. Unless userspace is submitting thousands of operations per second, which is pretty much the case only for read/write, there's no real benefit in asynchronous system calls. Userspace can handle what's left (such as openat, and fsync too) with pthreads.

In fact, because AIO actually blocks sometimes, userspace will usually just skip AIO and use a thread pool instead.
You can see this in QEMU, for example; it uses both AIO and a thread pool. Right now the thread pool implementation in QEMU is pretty simple, so it uses quite a lot of CPU due to cache-line bouncing on the lists of pending work items, but despite that it already has performance comparable with AIO, except with really fast backends such as FusionIO.
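As an illustration of the thread-pool alternative, here is a minimal sketch of handing a single blocking fsync() to a worker thread with pthreads; a real pool, like QEMU's, reuses its threads and queues many requests, but the shape is the same.

#include <pthread.h>
#include <unistd.h>

struct async_req {
    int fd;                     /* argument for the blocking call */
    int result;                 /* filled in by the worker        */
    int done;                   /* completion flag                */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
};

static void *worker(void *arg)
{
    struct async_req *req = arg;
    int ret = fsync(req->fd);           /* the blocking operation */

    pthread_mutex_lock(&req->lock);
    req->result = ret;
    req->done = 1;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* Submit: start a worker doing fsync(fd) and return immediately. */
int async_fsync(int fd, struct async_req *req, pthread_t *tid)
{
    req->fd = fd;
    req->done = 0;
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->cond, NULL);
    return pthread_create(tid, NULL, worker, req);
}

/* Complete: wait for the worker and collect fsync()'s return value. */
int async_fsync_wait(struct async_req *req, pthread_t tid)
{
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    pthread_mutex_unlock(&req->lock);
    pthread_join(tid, NULL);
    return req->result;
}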
Posted Jan 14, 2016 14:55 UTC (Thu)
by bcrl (guest, #5934)
[Link] (6 responses)
Posted Jan 14, 2016 21:05 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (5 responses)
Posted Jan 14, 2016 21:31 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
Posted Jan 15, 2016 9:12 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (3 responses)
Seriously: the number of such writebacks you can do per second is low enough that you probably won't get much benefit from using kernel threads and from batching submissions. If you need to do thousands of writebacks per second, buy yourself a UPS or a disk with non-volatile (battery-backed) cache. I would like to see numbers (# of ops per second on *real-world* use cases, CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.
Posted Jan 15, 2016 12:02 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
I rather doubt that. I mean, with a decent PCIe-attached enterprise SSD you can do a *lot* of flushes/sec. But to actually utilize the hardware, you always need several writes to be in progress in parallel. While you probably need several submission threads (ideally one per actual core) for full utilization, using a thread pool large enough to have the required number of writes in progress at the same time introduces too much context switching.

At the moment you can't even really utilize the actual potential of "prosumer" SSDs for random write workloads. Sequential I/O is fine because it's quickly bottlenecked by the bus anyway. But if you are, e.g., an RDBMS (my corner) and you want to efficiently flush victim pages from an in-memory buffer back to disk, you'll quickly end up bottlenecked on latency.

Obviously this is only really interesting for rather I/O-intensive workloads.
> I would like to see numbers
Fair enough.
> # of ops per second on *real-world* usecases
I can only speak from the PostgreSQL corner here. But 50-100k 8192-byte dirty blocks written back per second is easily achievable. At that point, in my testing, we're bottlenecked on sync_file_range(SYNC_FILE_RANGE_WRITE) latency, because it starts blocking quite soon (note we're doing a separate fsync for actual durability later; the s_f_r is just to keep the amount of work done by fsync bounded).
> CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.
To some degree that does require a decent kernelspace implementation in a usable state for comparison.
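For readers not familiar with the call under discussion, the pattern being described is roughly the following sketch (fd and offset are placeholders): SYNC_FILE_RANGE_WRITE starts writeback of already-dirty pages without waiting for it to finish, and a later fsync() provides the actual durability.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Start writeback of one dirty 8192-byte block; this returns quickly until
 * the device's request queue congests, which is the blocking behavior
 * described above. */
static void writeback_block(int fd, off_t offset)
{
    sync_file_range(fd, offset, 8192, SYNC_FILE_RANGE_WRITE);
}

/* Later, a single fsync(fd) makes the data durable; the earlier writebacks
 * just keep the amount of work that fsync() must do bounded. */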
Posted Jan 15, 2016 12:08 UTC (Fri)
by andresfreund (subscriber, #69562)
[Link]
Posted Jan 21, 2016 18:22 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
My reaction entirely. For a database server, it's all very well saying "it won't make much of an improvement overall", but if it's applicable to 90% of the workload of a dedicated server, then it's going to make one heck of a difference to that server.
And if those dedicated servers are a class where they are typically under heavy load, then this becomes a pretty obvious scalability issue - it bites when heavy-duty hardware is under heavy load - so the option of "throwing hardware at the problem" is not available ...
Cheers,
Wol
Posted Jan 22, 2016 4:33 UTC (Fri)
by liam (guest, #84133)
[Link] (2 responses)

So, FusionIO should remain the near-unicorn that need not concern anyone (other than them).
Posted Jan 22, 2016 11:04 UTC (Fri)
by intgr (subscriber, #39733)
[Link] (1 responses)
Hard to say for sure how realistic these figures are, as no products are on the market yet, but 3D XPoint *claims* to be that technology. The numbers from a few news articles claim close to one order of magnitude improvement in IOPS compared to plain old flash memory, for the first generation of products.
> In an NVMe-based solid state drive, XPoint chips can deliver more than 95,000 I/O operations per second at a 9 microsecond latency, compared to 13,400 IOPs and 73 ms latency for flash
http://www.eetimes.com/document.asp?doc_id=1328682
http://hothardware.com/news/intel-and-micron-jointly-drop...
Posted Jan 22, 2016 22:40 UTC (Fri)
by liam (guest, #84133)
[Link]

XPoint is exactly what I had in mind, and why it makes sense to tackle this issue properly sooner rather than later.

That proper AIO keeps coming up should be an additional reason to take this seriously. It's not as though the other major kernels are missing this feature.
Posted Jan 14, 2016 16:01 UTC (Thu)
by tshow (subscriber, #6411)
[Link]
> Why would we add kernel support to perform syscalls asynchronously when we can already do
> if (clone() == 0) perform_syscall();

Well, in my case, because it would be nice for my game engine to be able to load files in the background without spinning off a disk management thread. On most game consoles there's some async version of read() that looks something like:
async_cookie_t read_async(int file, void *buffer, size_t bytes);
And a corresponding:
bool async_op_complete(async_cookie_t cookie);
There's often a corresponding write_async() if it makes sense, but with game consoles you're often using read-only storage.
Having an explicit pollable async read means that somewhere in the main loop can be a simple data loader that maintains a list of things that need to be loaded and the processing that needs to be done. All of this can happen in a single (user) thread, without having to drag pipes, mutexes or cross-thread memory management into the picture, and without bogging down the responsiveness of the UI. This matters greatly when you're (say) dragging a gigabyte of sound and graphic data off the disk while trying to keep the UI updating at 60Hz.
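A sketch of the kind of single-threaded loader loop being described, using the commenter's illustrative read_async()/async_op_complete() interface; the pending[] bookkeeping and helper functions here are made up for the example.

/* Once per frame: poll outstanding loads and finish the ones that are done. */
for (int i = 0; i < num_pending; i++) {
    struct pending_load *p = &pending[i];

    if (!async_op_complete(p->cookie))
        continue;                       /* still in flight; check again next frame */

    process_loaded_asset(p->buffer, p->bytes);  /* decode/upload the finished data */
    retire_pending(i--);                        /* drop the entry, adjust the index */
}
render_frame();                                 /* the UI keeps running at 60Hz */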
Posted Jan 14, 2016 12:01 UTC (Thu)
by HIGHGuY (subscriber, #62277)
[Link] (1 responses)
It's always easier/faster/... to write:

wait_struct = start_something_async();
return wait(wait_struct);

to perform sync calls using async primitives than to fire off threads to simulate async calls with sync primitives.
I think Linus is spot on, that performing async system calls makes for a nice system. One could start off building that in a generic way (via kernel threads), then add specializations where a subsystem is capable of it.
Posted Jan 14, 2016 15:15 UTC (Thu)
by bcrl (guest, #5934)
[Link]
Posted Jan 24, 2016 6:32 UTC (Sun)
by toyotabedzrock (guest, #88005)
[Link]