LWN: Comments on "Fixing asynchronous I/O, again" https://lwn.net/Articles/671649/ This is a special feed containing comments posted to the individual LWN article titled "Fixing asynchronous I/O, again". en-us Thu, 11 Sep 2025 18:46:46 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Fixing asynchronous I/O, again https://lwn.net/Articles/673284/ toyotabedzrock <div class="FormattedComment"> Just merge the parts that would make the current interface faster, then create a new interface that suits Linus and add new features there.<br> </div> Sun, 24 Jan 2016 06:32:05 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/673066/ liam <div class="FormattedComment"> I guess the irony didn't make it through the ADC :)<br> XPoint is exactly what I had in mind, and why it makes sense to tackle this issue properly sooner rather than later.<br> That proper AIO keeps coming up should be an additional reason to take this seriously. It's not as though the other major kernels are missing this feature.<br> </div> Fri, 22 Jan 2016 22:40:03 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672946/ intgr <div class="FormattedComment"> <font class="QuotedText">&gt; there's no technology on the horizon claiming to improve IOPS by an order of magnitude</font><br> <p> Hard to say for sure how realistic these figures are, as no products are on the market yet, but 3D XPoint *claims* to be that technology. The numbers from a few news articles claim close to an order of magnitude improvement in IOPS compared to plain old flash memory, for the first generation of products.<br> <p> <font class="QuotedText">&gt; In an NVMe-based solid state drive, XPoint chips can deliver more than 95,000 I/O operations per second at a 9 microsecond latency, compared to 13,400 IOPs and 73 ms latency for flash</font><br> <p> <a href="http://www.eetimes.com/document.asp?doc_id=1328682">http://www.eetimes.com/document.asp?doc_id=1328682</a><br> <a href="http://hothardware.com/news/intel-and-micron-jointly-drop-disruptive-game-changing-3d-xpoint-cross-point-memory-1000x-faster-than-nand">http://hothardware.com/news/intel-and-micron-jointly-drop...</a><br> <p> </div> Fri, 22 Jan 2016 11:04:00 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672920/ liam <div class="FormattedComment"> Luckily, and I think I'm right about this, there's no technology on the horizon claiming to improve IOPS by an order of magnitude...<br> So, FusionIO should remain the near unicorn that need not concern anyone (other than them).<br> <p> </div> Fri, 22 Jan 2016 04:33:16 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672832/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; But if you are e.g. an RDBMS (my corner), and you want to efficiently flush victim pages from an in-memory buffer back to disk, you'll quickly end up being bottlenecked on latency.</font><br> <p> My reaction entirely.
For a database server, it's all very well saying "it won't make much of an improvement overall", but if it's applicable to 90% of the workload of a dedicated server, then it's going to make one heck of a difference to that server.<br> <p> And if those dedicated servers are a class that is typically under heavy load, then this becomes a pretty obvious scalability issue - it bites when heavy-duty hardware is under heavy load - so the option of "throwing hardware at the problem" is not available ...<br> <p> Cheers,<br> Wol<br> </div> Thu, 21 Jan 2016 18:22:55 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672019/ andresfreund <div class="FormattedComment"> <a href="https://lkml.org/lkml/2015/10/28/878">https://lkml.org/lkml/2015/10/28/878</a> has some interesting numbers. In particular, the number of fsyncs &amp; journal writes in the synchronous vs. the asynchronous case is kinda impressive.<br> </div> Fri, 15 Jan 2016 12:08:53 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672015/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Seriously: the number of such writebacks you can do per second is low enough that you probably won't get much benefit from using kernel threads and from batching submissions.</font><br> <p> I rather doubt that. I mean with a decent PCIe-attached enterprise SSD you can do a *lot* of flushes/sec. But to actually utilize the hardware, you always need several writes to be in progress in parallel. While you probably need several submission threads (ideally one per actual core) for full utilization, using a thread pool large enough to keep the required number of writes in progress at the same time introduces too much context switching. <br> <p> At the moment you can't even really utilize the actual potential of "prosumer" SSDs for random write workloads. Sequential IO is fine because it's quickly bottlenecked by the bus anyway. But if you are e.g. an RDBMS (my corner), and you want to efficiently flush victim pages from an in-memory buffer back to disk, you'll quickly end up being bottlenecked on latency.<br> <p> Obviously this is only really interesting for rather IO-intensive workloads.<br> <p> <font class="QuotedText">&gt; I would like to see numbers</font><br> <p> Fair enough.<br> <p> <font class="QuotedText">&gt; # of ops per second on *real-world* use cases</font><br> <p> I can only speak from the PostgreSQL corner here. But 50-100k 8192-byte dirty blocks written back per second is easily achievable. At that point, in my testing, we're bottlenecked on sync_file_range(SYNC_FILE_RANGE_WRITE) latency because it starts blocking quite soon (note we're doing a separate fsync for actual durability later; the s_f_r is just to keep the amount of work done by fsync bounded).<br> <p> <font class="QuotedText">&gt; CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.</font><br> <p> To some degree that does require a decent kernelspace implementation in a usable state for comparison.<br> </div> Fri, 15 Jan 2016 12:02:15 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/672010/ pbonzini <div class="FormattedComment"> What's more portable than a userspace thread pool?
:)<br> <p> Seriously: the number of such writebacks you can do per second is low enough that you probably won't get much benefit from using kernel threads and from batching submissions. If you need to do thousands of writebacks per second, buy yourself a UPS or a disk with a non-volatile (battery-backed) cache. I would like to see numbers (# of ops per second on *real-world* use cases, CPU utilization for kernel workqueue vs. userspace threadpool, etc.) before committing to a large change such as asynchronous system calls.<br> </div> Fri, 15 Jan 2016 09:12:17 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671961/ andresfreund <div class="FormattedComment"> Huh, why not? I'd give half an arm for a decent portable async range fsync/writeback interface. Sure, it's not every application, but very few new Linux features are applicable to a large portion of applications. Much of the low-hanging fruit is gone.<br> </div> Thu, 14 Jan 2016 21:31:13 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671959/ pbonzini <div class="FormattedComment"> Indeed kernel threads are faster. But what are the syscalls that happen often enough in your application, and block for long enough, that it actually matters? If it's just file I/O, then you don't need a full-blown asynchronous system call interface.<br> </div> Thu, 14 Jan 2016 21:05:59 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671909/ tshow <div class="FormattedComment"> <font class="QuotedText">&gt; Why would we add kernel support to perform syscalls asynchronously when we can already do</font><br> <font class="QuotedText">&gt; </font><br> <font class="QuotedText">&gt; if (clone() == 0) perform_syscall();</font><br> <p> Well, in my case, because it would be nice for my game engine to be able to load files in the background without spinning off a disk management thread. On most game consoles there's some async version of read() that looks something like:<br> <p> async_cookie_t read_async(int file, void *buffer, size_t bytes);<br> <p> And a corresponding:<br> <p> bool async_op_complete(async_cookie_t cookie);<br> <p> There's often a corresponding write_async() if it makes sense, but with game consoles you're often using read-only storage.<br> <p> Having an explicit pollable async read means that somewhere in the main loop there can be a simple data loader that maintains a list of things that need to be loaded and the processing that needs to be done. All of this can happen in a single (user) thread, without having to drag pipes, mutexes or cross-thread memory management into the picture, and without bogging down the responsiveness of the UI. This matters greatly when you're (say) dragging a gigabyte of sound and graphics data off the disk while trying to keep the UI updating at 60 Hz.<br> <p> </div> Thu, 14 Jan 2016 16:01:46 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671871/ bcrl <div class="FormattedComment"> Real applications that are using async operations don't wait on a specific operation; they wait for notification that any I/O has completed. A high-level overview of the application I work on is that it has multiple threads that perform various operations within a pipeline. There are parts of the system that face the network and parse data coming in over TCP.
The parser formats each request into a message (which is easier to work on and agnostic of the actual on-the-wire protocol being used), then sends those messages to various threads, which run their FSMs and potentially send messages to other threads. Some of the FSMs cause disk reads/writes to be issued. All threads are structured to spin in their main event loops, which poll the internal queues between threads and run the FSMs that have been scheduled; some threads also check the AIO ring buffer for notification of I/O completion events. Under heavy load, nothing blocks; there is always more work to do. It is highly undesirable for any thread to block in the kernel and starve other processing, as that adds latency to the response time the end user sees. Waiting on a specific AIO to complete is simply not an idiom that is used.<br> </div> Thu, 14 Jan 2016 15:15:52 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671856/ bcrl <div class="FormattedComment"> clone() is too expensive to be used in this way in the real world -- task_struct is huge and there is a lot of data that has to be touched. As for thread pools: testing at my employer shows that a userspace thread pool implementation using pthreads is 25% slower than using a pool of kernel threads. That number is for the overall performance of the application, of which the parts doing AIO are actually quite small (but in the critical path of ensuring data is persistent).<br> </div> Thu, 14 Jan 2016 14:55:09 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671840/ HIGHGuY <div class="FormattedComment"> Isn't the elephant in the room that the whole userspace interface by itself is not ready for asynchronous operation?<br> It's always easier/faster/... to write:<br> <p> wait_struct = start_something_async();<br> return wait(wait_struct);<br> <p> to perform sync calls using async primitives than to fire off threads to simulate async calls with sync primitives.<br> <p> I think Linus is spot on: performing async system calls makes for a nice system. One could start off building that in a generic way (via kernel threads), then add specializations where a subsystem is capable of it.<br> </div> Thu, 14 Jan 2016 12:01:47 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671828/ pbonzini Right. clone() is a bit heavyweight, but you can just replace it with a thread pool. The benefit of AIO is the ability to submit and retrieve the results of multiple operations at a time. Unless userspace is submitting thousands of operations per second, which is pretty much the case only for read/write, there's no real benefit in asynchronous system calls. Userspace can handle what's left (such as openat, and fsync too) with pthreads. <p>In fact, <a href="http://blog.vmsplice.net/2015/08/asynchronous-file-io-on-linux-plus-ca.html">because AIO actually blocks sometimes</a>, userspace will usually just skip AIO and use a thread pool. <p>You can see this in QEMU for example.
It uses both AIO and a thread pool, and: <ul><li>uses a thread pool to implement 9pfs (where you can have a lot of blocking operations such as openat or rename); <li>offers the choice between AIO and a thread pool for high-performance virtio-blk (and always uses the thread pool for stuff such as fsync and discard) </ul> <p>Right now the thread pool implementation in QEMU is pretty simple, so it uses quite a lot of CPU due to cache-line bouncing on the lists of pending work items, but despite that it already has performance comparable with AIO, except with really fast backends such as FusionIO.</p> Thu, 14 Jan 2016 09:05:22 +0000 Fixing asynchronous I/O, again https://lwn.net/Articles/671797/ neilbrown <div class="FormattedComment"> I must be missing something important here...<br> <p> Why would we add kernel support to perform syscalls asynchronously when we can already do<br> <p> if (clone() == 0) perform_syscall();<br> <p> (Admittedly that is an over-simplification, but does fleshing out the details make it more complex than adding new functionality to the kernel?)<br> </div> Thu, 14 Jan 2016 01:47:43 +0000
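For concreteness, here is a minimal sketch of the userspace pattern debated in this thread: emulating a pollable asynchronous read with an ordinary worker thread, in the spirit of neilbrown's clone() snippet and tshow's console-style read_async()/async_op_complete() pair. All names below (async_read_t, read_async, async_op_complete, async_read_finish) are hypothetical illustrations rather than any real kernel or libc API, and a production version would hand requests to a reusable pool of workers over a queue rather than spawning one thread per read, which is exactly the overhead bcrl's numbers point at.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical illustration: one in-flight asynchronous read,
 * backed by a worker thread instead of kernel support. */
typedef struct {
    int fd;
    void *buf;
    size_t len;
    ssize_t result;    /* read() return value; valid once done is set */
    atomic_bool done;  /* completion flag polled by the main loop */
    pthread_t worker;
} async_read_t;

static void *async_read_worker(void *arg)
{
    async_read_t *op = arg;
    /* The blocking read() happens here, in the worker, not in the
     * main loop, so the UI thread never stalls on disk latency. */
    op->result = read(op->fd, op->buf, op->len);
    atomic_store_explicit(&op->done, true, memory_order_release);
    return NULL;
}

/* Submit: returns immediately; the read proceeds in the worker. */
static int read_async(async_read_t *op, int fd, void *buf, size_t len)
{
    op->fd = fd;
    op->buf = buf;
    op->len = len;
    atomic_init(&op->done, false);
    return pthread_create(&op->worker, NULL, async_read_worker, op);
}

/* Poll from the main loop, e.g. once per frame at 60 Hz. */
static bool async_op_complete(async_read_t *op)
{
    return atomic_load_explicit(&op->done, memory_order_acquire);
}

/* Call exactly once after async_op_complete() returns true:
 * reaps the worker thread and hands back the read() result. */
static ssize_t async_read_finish(async_read_t *op)
{
    pthread_join(op->worker, NULL);
    return op->result;
}

The same shape extends to a write_async() or to the async range-writeback operation andresfreund asks for; the open question in the thread is whether doing this with threads in userspace (or via clone()) can ever match a native asynchronous submission interface on fast devices.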