
Atomic I/O operations


By Jonathan Corbet
May 30, 2013
LinuxCon Japan 2013
According to Btrfs developer Chris Mason, tuning Linux filesystems to work well on solid-state storage devices is a lot like working on an old, clunky car. Lots of work goes into just trying to make the thing run with decent performance. Old cars may have mainly hardware-related problems, but, with Linux, the bottleneck is almost always to be found in the software. It is, he said, hard to give a customer a high-performance device and expect them to actually see that performance in their application. Fixing this problem will require work in a lot of areas. One of those areas, supporting and using atomic I/O operations, shows particular potential.

The problem

To demonstrate the kind of problem that filesystem developers are grappling with, Chris started with a plot from a problematic customer workload on an ext4 filesystem; it showed alternating periods of high and low I/O throughput rates. The source of the problem, in this case, was a combination of (1) overwriting an existing file and (2) a filesystem that had been mounted in the data=ordered mode. That combination causes data blocks to be put into a special list that must get flushed to disk every time that the filesystem commits a transaction. Since the system in question had a fair amount of memory, the normal asynchronous writeback mechanism didn't kick in, so dirty blocks were not being written steadily; instead, they all had to be flushed when the transaction commit happened. The periods of low throughput corresponded to the transaction commits; everything just stopped while the filesystem caught up with its pending work.

In general, a filesystem commit operation involves a number of steps, the first of which is to write all of the relevant file data and wait for that write to complete. Then the critical metadata can be written to the log; once again, the filesystem must wait until that write is done. Finally, the commit block can be written — inevitably followed by a wait. All of those waits are critical for filesystem integrity, but they can be quite hard on performance.
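
The shape of that sequence can be sketched in plain userspace terms; this is only an illustration of where the waits fall, not real journaling code, and the descriptors, offsets, and buffer names are made up:

    /* Illustration of the wait-heavy shape of a journaled commit,
     * using ordinary system calls (error checking omitted). */
    #include <unistd.h>

    void commit_transaction(int data_fd, int log_fd,
                            const void *data, size_t data_len,
                            const void *metadata, size_t meta_len,
                            const void *commit_block, size_t cb_len)
    {
        pwrite(data_fd, data, data_len, 0);             /* write the file data... */
        fdatasync(data_fd);                             /* ...and wait */

        pwrite(log_fd, metadata, meta_len, 0);          /* write metadata to the log... */
        fdatasync(log_fd);                              /* ...and wait */

        pwrite(log_fd, commit_block, cb_len, meta_len); /* write the commit block... */
        fdatasync(log_fd);                              /* ...and wait once more */
    }

Each of those synchronization points is a full round trip to the storage device before the next step can begin; that is where the latency comes from.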

Quite a few workloads — including a lot of database workloads — are especially sensitive to the latency imposed by waits in the filesystem. If the number of waits could be somehow reduced, latency would improve. Fewer waits would also make it possible to send larger I/O operations to the device, with a couple of significant benefits: performance would improve, and, since large chunks are friendlier to a flash-based device's garbage-collection subsystem, the lifetime of the device would also improve. So reducing the number of wait operations executed in a filesystem transaction commit is an important prerequisite for getting the best performance out of contemporary drives.

Atomic I/O operations

One way to reduce waits is with atomic I/O operations — operations that are guaranteed (by the hardware) to either succeed or fail as a whole. If the system performs an atomic write of four blocks to the device, either all four blocks will be successfully written, or none of them will be. In many cases, hardware that supports atomic operations can provide the same integrity guarantees that are provided by waits now, making those waits unnecessary. The T10 (SCSI) standard committee has approved a simple specification for atomic operations; it only supports contiguous I/O operations, so it is "not very exciting." Work is proceeding on vectored atomic operations that would handle writes to multiple discontiguous areas on the disk, but that has not yet been finalized.

As an example of how atomic I/O operations can help performance, Chris looked at the special log used by Btrfs to implement the fsync() system call. The filesystem will respond to an fsync() by writing the important data to a new log block. In the current code, each commit only has to wait twice, thanks to some recent work by Josef Bacik: once for the write of the data and metadata, and once for the superblock write. That work brought a big performance boost, but atomic I/O can push things even further. By using atomic operations to eliminate one more wait, Chris was able to improve performance by 10-15%; he said he thought the improvement should be better than that, but even that level of improvement is the kind of thing database vendors send out press releases for. Getting a 15% improvement without even trying that hard, he said, was a nice thing.

At Fusion-io, work has been done to enable atomic I/O operations in the MariaDB and Percona database management systems. Currently, these operations are only enabled with the Fusion-io software development kit and its "DirectFS" filesystem. Atomic I/O operations allowed the elimination of the MySQL-derived double-buffering mode, resulting in 43% more transactions per second and half the wear on the storage device. Both improvements matter: if you have made a large investment in flash storage, getting twice the life is worth a lot of money.

Getting there

So it's one thing to hack some atomic operations into a database application; making atomic I/O operations more generally available is a larger problem. Chris has developed a set of API changes that will allow user-space programs to make use of atomic I/O operations, but there are some significant limitations, starting with the fact that only direct I/O is supported. With buffered I/O, it just is not possible for the kernel to track the various pieces through the stack and guarantee atomicity. There will also need to be some limitations on the maximum size of any given I/O operation.

An application will request atomic I/O with the new O_ATOMIC flag to the open() system call. That is all that is required; many direct I/O applications, Chris said, can benefit from atomic I/O operations nearly for free. Even at this level, there are benefits. Oracle's database, he said, pretends it has atomic I/O when it doesn't; the result can be "fractured blocks" where a system crash interrupts the writing of a data block that has been scattered across a fragmented filesystem, leading to database corruption. With atomic I/O operations, those fractured blocks will be a thing of the past.
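
A minimal sketch of the proposed usage follows. O_ATOMIC is not a mainline flag, so the definition below is only a placeholder, and the file name, buffer size, and alignment are illustrative:

    /* Sketch of the proposed O_ATOMIC interface; the flag value here is a
     * placeholder, since the real one would come from the kernel headers. */
    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef O_ATOMIC
    #define O_ATOMIC 040000000          /* placeholder for the proposed flag */
    #endif

    int main(void)
    {
        /* Only direct I/O is supported, so the buffer must be aligned. */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT | O_ATOMIC, 0644);
        if (fd < 0)
            return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 16384))          /* four 4KB blocks */
            return 1;
        memset(buf, 0, 16384);

        /* Either all 16KB reaches the device or none of it does. */
        ssize_t ret = pwrite(fd, buf, 16384, 0);

        free(buf);
        close(fd);
        return ret == 16384 ? 0 : 1;
    }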

Atomic I/O operation support can be taken a step further, though, by adding asynchronous I/O (AIO) support. The nice thing about the Linux AIO interface (which is not generally acknowledged to have many nice aspects) is that it allows an application to enqueue multiple I/O operations with a single system call. With atomic support, those multiple operations — which need not all involve the same file — can all be done as a single atomic unit. That allows multi-file atomic operations, a feature which can be used to simplify database transaction engines and improve performance. Once this functionality is in place, Chris hopes, the (currently small) number of users of the kernel's direct I/O and AIO capabilities will increase.
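
The submission side can be sketched with the existing libaio interface; the descriptors are assumed to be open with O_DIRECT and the buffers suitably aligned, and the part that would mark the whole batch as one atomic unit is the hypothetical piece, since no such flag exists today:

    /* Sketch: two writes to two different files queued with a single
     * io_submit() call.  With atomic support, the batch could be made
     * to commit (or fail) as a unit. */
    #include <libaio.h>
    #include <stddef.h>

    int submit_pair(int fd1, int fd2, void *buf1, void *buf2, size_t len)
    {
        io_context_t ctx = 0;
        struct iocb cb1, cb2;
        struct iocb *list[2] = { &cb1, &cb2 };
        struct io_event events[2];

        if (io_setup(2, &ctx) < 0)
            return -1;

        io_prep_pwrite(&cb1, fd1, buf1, len, 0);
        io_prep_pwrite(&cb2, fd2, buf2, len, 0);

        /* One system call enqueues both operations. */
        if (io_submit(ctx, 2, list) != 2) {
            io_destroy(ctx);
            return -1;
        }

        if (io_getevents(ctx, 2, 2, events, NULL) != 2) {
            io_destroy(ctx);
            return -1;
        }

        io_destroy(ctx);
        return 0;
    }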

Some block-layer changes will clearly be needed to make this all work, of course. Low-level drivers will need to advertise the maximum number of atomic segments any given device will support. The block layer's plugging infrastructure, which temporarily stops the issuing of I/O requests to allow them to accumulate and be coalesced, will need to be extended. Currently, a plugged queue is automatically unplugged when the current kernel thread schedules out of the processor; there will need to be a means to require an explicit unplug operation instead. This, Chris noted, was how plugging used to work, and it caused a lot of problems with lost unplug operations. Explicit unplugging was removed for a reason; it would have to be re-added carefully and used "with restraint." Once that feature is there, the AIO and direct I/O code will need to be modified to hold queue plugs for the creation of atomic writes.
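
In kernel terms, the idea looks roughly like the following sketch; blk_start_plug() and blk_finish_plug() are the existing interfaces (shown with the circa-2013 two-argument submit_bio()), while the ability to stay plugged across a schedule, as described above, would be new machinery and is not shown:

    /* Sketch: holding a queue plug while the bios making up an atomic
     * write are submitted, so that they can be issued as one batch. */
    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void submit_atomic_write(struct bio **bios, int nr_bios)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);          /* hold requests back while the batch is built */
        for (i = 0; i < nr_bios; i++)
            submit_bio(WRITE, bios[i]); /* queued behind the plug, not yet issued */
        blk_finish_plug(&plug);         /* unplug: the whole batch goes to the device */
    }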

The hard part, though, is, as usual, the error handling. The filesystem must stay consistent even if an atomic operation grows too large to complete. There are a number of tricky cases where this can come about. There are also challenges with deadlocks while waiting for plugged I/O. The hardest problem, though, may be related to the simple fact that the proper functioning of atomic I/O operations will only be tested when things go wrong — a system crash, for example. It is hard to know that rarely-tested code works well. So there needs to be a comprehensive test suite that can verify that the hardware's atomic I/O operations are working properly. Otherwise, it will be hard to have full confidence in the integrity guarantees provided by atomic operations.

Status and future work

The Fusion-io driver has had atomic I/O operation support for some time, but Chris would like to make this support widely available so that developers can count on its presence. Extending it to NVM Express is in progress now; SCSI support will probably wait until the specification for vectored atomic operations is complete. Btrfs can use (vectored) atomic I/O operations in its transaction commits; work on other filesystems is progressing. The changes to the plugging code are done with the small exception of the deadlock handler; that gap needs to be filled before the patches can go out, Chris said.

From here, it will be necessary to finalize the proposed changes to the kernel API and submit them for review. The review process itself could take some time; the AIO and direct I/O interfaces tend to be contentious, with lots of developers arguing about them but few being willing to actually work on that code. So a few iterations on the API can be expected there. The FIO benchmark needs to be extended to test atomic I/O. Then there is the large task of enabling atomic I/O operations in applications.

For the foreseeable future, a number of limitations will apply to atomic I/O operations. The most annoying is likely to be the small maximum I/O size: 64KB for the time being. Someday, hopefully, that maximum will be increased significantly, but for now it applies. The complexity of the AIO and direct I/O code will challenge filesystem implementations; the code is far more complex than one might expect, and each filesystem interfaces with that code in a slightly different way. There are worries about performance variations between vendors; Fusion-io devices can implement atomic I/O operations at very low cost, but that may not be the case for all hardware. Atomic I/O operations also cannot work across multiple devices; that means that the kernel's RAID implementations will require work to be able to support atomic I/O. This work, Chris said, will not be in the initial patches.

There are alternatives to atomic I/O operations, including explicit I/O ordering (used in the kernel for years) or I/O checksumming (to detect incomplete I/O operations after a crash). But, for many situations, atomic I/O operations look like a good way to let the hardware help the software get better performance from solid-state drives. Once this functionality finds its way into the mainline, taking advantage of fast drives might just feel a bit less like coaxing another trip out of that old jalopy.

[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan.]



LCJ: Atomic I/O operations

Posted May 30, 2013 16:00 UTC (Thu) by butlerm (guest, #13312) [Link]

I don't see how atomic (transactional) group writes can avoid waits in the general case. What you need to avoid waits in the general case are write barriers, preferably threaded write barriers, so that only some writes are serialized, and others can proceed without delay. Also so that writes can be serialized independently when two or more filesystems (or databases) share the same device.

Atomic writes aren't very helpful unless there is an implicit write barrier of some kind both before the write starts and after the write completes. One cannot safely commit succeeding writes in the general case until you know that the preceding write has completed successfully. And in that case you want to fail all dependent writes back to the filesystem so that the filesystem can take appropriate corrective action. Am I wrong?

LCJ: Atomic I/O operations

Posted May 30, 2013 16:44 UTC (Thu) by masoncl (subscriber, #47138) [Link]

You're not wrong; an fsync requires at least one wait.

The problem is that FS transactions usually wait two or three times as we collect the file data, log blocks, and commit blocks. The atomics can bring it down to just one.

LCJ: Atomic I/O operations

Posted Jun 1, 2013 13:52 UTC (Sat) by butlerm (guest, #13312) [Link]

I can see how atomic writes could make recovery after a failure much simpler on a device with a nonvolatile undo buffer, a copy on write implementation, or the equivalent.

The point about write barriers is that if implemented properly filesystem activity could proceed with zero waits. The only thing necessary is if a write fails subsequent queued writes must fail as well. Data blocks, barrier, log blocks, barrier, commit block, barrier, and so on. The latency could be arbitrarily high with zero effect on throughput.

LCJ: Atomic I/O operations

Posted Jun 1, 2013 23:41 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Along the same lines, I was wondering what principle makes this work. It seems like a fundamental computational problem is just moved from Linux to the disk drive. Why can the disk drive provide the transactional semantics internally with less elapsed time than the kernel can effect using the (old) SCSI interface to the drive?

LCJ: Atomic I/O operations

Posted Jun 2, 2013 0:13 UTC (Sun) by butlerm (guest, #13312) [Link]

Solid state drives are so fast that the turnaround time between the filesystem layer and the device becomes a real issue. There are at least two software layers plus device communication overhead in between. With remote block devices, the impact is worse.

LCJ: Atomic I/O operations

Posted Jun 2, 2013 1:08 UTC (Sun) by masoncl (subscriber, #47138) [Link]

Yes, for filesystem periodic commits, those can be wait free, but since nobody is really waiting on the periodic commits, the waits don't hurt.

For fsync, O_DIRECT etc, you need to wait because we're not just promising a consistent FS, we're also promising a given thing will really be there after a crash.

LCJ: Atomic I/O operations

Posted Jun 2, 2013 21:44 UTC (Sun) by butlerm (guest, #13312) [Link]

From an application perspective, absolutely. I would like to see traditional drives add bits of flash memory just so they can make that promise prior to writing blocks to their final location.

LCJ: Atomic I/O operations

Posted Jun 3, 2013 14:20 UTC (Mon) by foom (subscriber, #14868) [Link]

Yea, but many (most?) apps which use fsync don't actually need the new thing to be really there, they just want the state of their database to not be Really Corrupted.

If there were some way other than fsync to express a required ordering of writes to the kernel, many uses of fsync could be eliminated.

LCJ: Atomic I/O operations

Posted Jun 6, 2013 17:05 UTC (Thu) by sanxiyn (guest, #44599) [Link]

Previously on LWN: Featherstitch, a userspace API for expressing filesystem write-ordering requirements. http://lwn.net/Articles/354861/

LCJ: Atomic I/O operations

Posted Jun 6, 2013 21:13 UTC (Thu) by kleptog (subscriber, #1183) [Link]

While the idea is clever and looks pretty neat, I'd settle for:

begin_fs_transaction();
... do my filesystem calls ...
end_fs_transaction();

based on the idea that it's easy to understand and hard to screw up.

LCJ: Atomic I/O operations

Posted May 30, 2013 17:12 UTC (Thu) by alexl (guest, #19068) [Link]

If only AIO could deliver completion events in a sane way (i.e. wake up a mainloop sleeping in poll/epoll, rather than weird things like signals (can't use in a library) or threads (if i wanted to use threads i'd just do the i/o in a thread)). Then maybe someone could use it...

LCJ: Atomic I/O operations

Posted May 30, 2013 18:12 UTC (Thu) by mtanski (subscriber, #56423) [Link]

That's the same pain I'm experiencing now. On top of that AIO doesn't work for buffered read/writes. So you're left with O_DIRECT and having to reimplement the page cache, read-ahead, etc... in your app.

LCJ: Atomic I/O operations

Posted Jul 4, 2013 12:24 UTC (Thu) by rilder (guest, #59804) [Link]

I don't think that is right, check https://code.google.com/p/kernel/wiki/AIOUserGuide for details.

LCJ: Atomic I/O operations

Posted May 30, 2013 21:27 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

Linux AIO can use eventfd to signal completion. Perhaps you're thinking of POSIX AIO?

LCJ: Atomic I/O operations

Posted May 31, 2013 6:48 UTC (Fri) by alexl (guest, #19068) [Link]

Interesting, although it's still not very useful as it only does O_DIRECT.

LCJ: Atomic I/O operations

Posted May 31, 2013 21:55 UTC (Fri) by ssmith32 (guest, #72404) [Link]

Hah! If only this came a couple weeks ago I wouldn't have had to track down the same issue (I hope - system pauses, iostat -x reports large wait times) on servers with a large amount of memory (some over 100GB), ext4, mounted in data=ordered mode :)

Did make me feel a lot better - I fixed it by changing to data=writeback and some other mount options - but I didn't really know why, other than some hand waving by me about not having to pause so much to sync to disk. Now I know!

Thanks Mr. Corbet!!

A little O/T, I know - but just goes to show the articles help in more ways than just educating us about the chosen topic.

Take care,
-stu

Why doesn't the kernel writeback all the time?

Posted Jun 6, 2013 6:10 UTC (Thu) by gmatht (subscriber, #58961) [Link]

"Since the system in question had a fair amount of memory, the normal asynchronous writeback mechanism didn't kick in, so dirty blocks were not being written steadily"

I've noticed that behavior too. It seems strange that adding more memory for write buffers would reduce performance. One can of course tell the kernel to limit the size of the write buffers, but why would the kernel leave the disk essentially idle while buffers are filling up? Wouldn't it give better performance to utilize periods of idle-ish I/O to write dirty buffers? Or is this strategy intended to reduce power consumption?

I suspect I am missing something.

Why doesn't the kernel writeback all the time?

Posted Jun 6, 2013 7:25 UTC (Thu) by dlang (guest, #313) [Link]

> I suspect I am missing something.

The problem is that the people tuning these things are so close to the problem, and so used to conserving precious resources that they have gone too far in delaying I/O

In many cases, if you can delay I/O you can discover that you never needed to do it (the file was temporary and was deleted, the directory chunk was modified again, etc)

In addition, if the drive is busy doing something that isn't really needed, it can't instantly start working on some I/O that _is_ needed immediately.

all these things drive developers into optimizing my microbenchmark, which keeps looking better, but can end up causing counterintuitive issues like this.

Plugging is another example. Because large I/O operations are far more efficient than multiple small I/O operations, we have this entire notion of plugging the I/O to tell the system not to actually write out what you are about to tell it to write on the theory that it can combine the I/O with other things that you will write out later (and hopefully you actually unplug the I/O when you should or everything stays blocked)

These approaches result in the least utilization of the disk, so if you are keeping your disk busy continuously, you would get more done.

But since the disk really isn't being kept busy continuously, there needs to be a bit of a shift to back off the optimizations in favour of starting work sooner to reduce latency (and reduce the pileup of work like happened in this case)

The I/O subsystem should not spend effort trying to consolidate I/O if the disk is idle, just give the disk the first chunk of work that you have for it to do. It's only when the disk is busy and you have extra CPU that you should look at simplifying the future work.

If the workload is small enough to never fully saturate the disk, you never do any optimizations and the drive is very busy, but you are getting each chunk of work done with as little latency as possible.

But if the disk ever does get fully saturated, while the system is waiting, it can now combine I/O, reorder seeks (or not if it's on a SSD), etc.

The result is that the utilization of the disk will climb much more rapidly than it does now, but when the disk utilization nears the peak, it will flatten out while the throughput continues to climb (as the requests become more efficient)

I will note as well that this is a perfect example of the performance/power tradeoff, this increases performance by reducing latency and avoiding big piles of work (and also making more memory able to be freed rapidly, which can again increase performance), but it does so at the cost of utilizing the drive more

As such, this post belongs in at least three current comment threads :-)

Why doesn't the kernel writeback all the time?

Posted Jun 6, 2013 14:04 UTC (Thu) by raven667 (subscriber, #5198) [Link]

That's a good summary. What I find interesting is how similar these kind of IO problems are across subsystems/technologies, specifically network queuing and throughput vs. latency (ie. bufferbloat). A lot of queuing research which has been done on networks seems directly applicable to disk IO queuing, with little modification (intelligent coalescing isn't as difficult in networks because MTUs top out so quickly for example).

Why doesn't the kernel writeback all the time?

Posted Jun 14, 2013 13:10 UTC (Fri) by quanstro (subscriber, #77996) [Link]

+1. i wish i'd written that. :-) not delaying i/o for idle devices also should mean a smaller probability of corruption when recovering from casters-up mode.

Atomic I/O operations

Posted Feb 28, 2017 20:16 UTC (Tue) by sanjeev.trika (guest, #114375) [Link]

Hi! It's been a few years since this article was posted. Can you please advise re the status of atomic i/o support in the OS?

thx!
Sanjeev

Atomic I/O operations

Posted Feb 28, 2017 20:25 UTC (Tue) by corbet (editor, #1) [Link]

I would have said "not much is happening", but there was a new patch set posted this very day. I assume there will be discussion at LSFMM next month.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds