According to Btrfs developer Chris Mason, tuning Linux filesystems to work
well on solid-state storage devices is a lot like working on an old,
clunky car. Lots of work goes into just trying to make the thing run with
decent performance. Old cars may have mainly hardware-related problems,
but, with Linux,
the bottleneck is almost always to be found in the software. It is, he
said, hard to give a customer a high-performance device and expect them to
actually see that performance in their application. Fixing this problem
will require work in a lot of areas. One of those areas, supporting and
using atomic I/O operations, shows particular potential.
The problem
To demonstrate the kind of problem that filesystem developers are grappling
with, Chris started with a plot from a problematic customer workload on an
ext4 filesystem; it showed alternating periods of high and low I/O
throughput rates. The source of the problem, in this case, was a
combination of (1) overwriting an existing file and (2) a
filesystem that had been
mounted in the data=ordered mode. That combination causes data
blocks to be put into a special list that must get flushed to disk every
time that the filesystem commits a transaction. Since the system in
question had a fair amount of memory, the normal asynchronous writeback
mechanism didn't kick in, so dirty blocks were not being written steadily;
instead, they all had to be flushed when the transaction commit happened.
The periods of low throughput corresponded to the transaction commits;
everything just stopped while the filesystem caught up with its pending work.
In general, a filesystem commit operation involves a number of steps, the
first of which is to write all of the relevant file data and wait for that
write to complete. Then the critical metadata can be written to the log;
once again, the filesystem must wait until that write is done. Finally,
the commit block can be written — inevitably followed by a wait. All of
those waits are critical for filesystem integrity, but they can be quite
hard on performance.
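As an illustration, the sequence looks something like the sketch below. The
function names are placeholders rather than real kernel interfaces; the point
is the three separate waits:

    struct transaction;   /* opaque; stands in for the filesystem's own type */

    /* Placeholder helpers: each either submits I/O or blocks until the
       previously submitted I/O has completed. */
    int write_data_blocks(struct transaction *t);
    void wait_for_data_writes(struct transaction *t);
    int write_log_metadata(struct transaction *t);
    void wait_for_log_writes(struct transaction *t);
    int write_commit_block(struct transaction *t);
    void wait_for_commit_block(struct transaction *t);

    static int commit_transaction(struct transaction *trans)
    {
            int ret;

            /* Step 1: write all of the relevant file data, then wait. */
            ret = write_data_blocks(trans);
            if (ret)
                    return ret;
            wait_for_data_writes(trans);            /* wait #1 */

            /* Step 2: write the critical metadata to the log, then wait. */
            ret = write_log_metadata(trans);
            if (ret)
                    return ret;
            wait_for_log_writes(trans);             /* wait #2 */

            /* Step 3: the commit block makes the transaction durable; it
               must not be written until the waits above have completed. */
            ret = write_commit_block(trans);
            if (ret)
                    return ret;
            wait_for_commit_block(trans);           /* wait #3 */
            return 0;
    }

Each of those waits stalls the commit until the device confirms completion;
that is where atomic operations have room to help.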
Quite a few workloads — including a lot of database workloads — are
especially sensitive to the latency imposed by waits in the filesystem.
If the number of
waits could be somehow reduced, latency would improve. Fewer waits would
also make it possible to send larger I/O operations to the device, with a
couple of significant benefits: performance would improve, and, since large
chunks are friendlier to a flash-based device's garbage-collection
subsystem, the lifetime of the device would also improve. So reducing the
number of wait operations executed in a filesystem transaction commit is an
important prerequisite for getting the best performance out of contemporary
drives.
Atomic I/O operations
One way to reduce waits is with atomic I/O operations — operations that are
guaranteed (by the hardware) to either succeed or fail as a whole. If the
system performs an atomic write of four blocks to the device, either all
four blocks will be successfully written, or none of them will be. In
many cases, hardware that supports atomic operations can provide the same
integrity guarantees that are provided by waits now, making those waits
unnecessary. The T10 (SCSI) standard committee has approved a simple
specification for atomic operations; it only supports contiguous I/O
operations, so it is "not very exciting." Work is proceeding on vectored
atomic operations that would handle writes to multiple discontiguous areas
on the disk, but that has not yet been finalized.
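To make the distinction concrete, hypothetical descriptors for the two
flavors might look like the structures below; neither comes from the T10
specification or from any kernel API:

    #include <stdint.h>

    /* A contiguous atomic write: one extent, all-or-nothing. */
    struct atomic_write {
            uint64_t lba;          /* starting logical block address */
            uint32_t nblocks;      /* length of the single extent */
            const void *data;      /* nblocks worth of data */
    };

    /* A vectored atomic write: several discontiguous extents that must
       succeed or fail as a single unit. */
    struct atomic_extent {
            uint64_t lba;
            uint32_t nblocks;
            const void *data;
    };

    struct atomic_write_vec {
            uint32_t nr_extents;
            struct atomic_extent extents[];   /* flexible array member */
    };

The contiguous form can only describe a single extent, which is why it is of
limited interest; the vectored form is what a filesystem needs to commit
scattered blocks as one unit.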
As an example of how atomic I/O operations can help performance, Chris
looked at the special log used by Btrfs to implement the fsync()
system call. The filesystem will respond to an fsync() by writing
the important data to a new log block. In the current code, each commit
only has to wait twice, thanks to some recent work by Josef Bacik: once
for the write of the data and metadata, and once for the superblock write.
That work brought a big performance boost, but atomic I/O can push things
even further. By using atomic operations to eliminate one more wait,
Chris was able to improve performance by 10-15%;
he said he thought the improvement should be better than that, but even
that level of improvement is the kind of thing database vendors send out
press releases for. Getting a 15% improvement without even trying that
hard, he said, was a nice thing.
At Fusion-io, work has been done to enable atomic I/O operations in the MariaDB
and Percona database management systems. Currently, these operations are
only enabled with the
Fusion-io software development kit and its "DirectFS" filesystem. Atomic
I/O operations allowed the elimination of the doublewrite buffer inherited
from MySQL, resulting in 43% more transactions per second and
half the wear on the storage device. Both improvements matter: if you have
made a large investment in flash storage, getting twice the life is worth a
lot of money.
Getting there
It is one thing to hack some atomic operations into a database
application; making atomic I/O operations more generally available is a
larger problem. Chris has developed a set of API changes that will allow
user-space programs to make use of atomic I/O operations, but there are
some significant limitations, starting with the fact that only direct I/O
is supported. With buffered I/O, it is simply not possible for the kernel to
track the various pieces through the stack and guarantee atomicity. There
will also need to be some limitations on the maximum size of any given I/O
operation.
An application will request atomic I/O with the new O_ATOMIC flag
to the open() system call. That is all that is required; many
direct I/O applications, Chris said, can benefit from atomic I/O operations
nearly for free. Even at this level, there are benefits. Oracle's
database, he said, pretends it has atomic I/O when it doesn't; the result
can be "fractured blocks" where a system crash interrupts the writing of a
data block that has been scattered across a fragmented filesystem, leading to
database corruption. With atomic I/O operations, those fractured blocks
will be a thing of the past.
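A minimal sketch of the proposed usage appears below. Note that O_ATOMIC is
not in any mainline header, so the value used here is a pure placeholder, and
the buffer alignment reflects the usual direct I/O rules:

    #define _GNU_SOURCE              /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Placeholder: the proposed O_ATOMIC flag has no official value; a
       real program would get it from a kernel header. */
    #ifndef O_ATOMIC
    #define O_ATOMIC        040000000
    #endif

    #define BLOCK_SIZE      4096

    int main(void)
    {
            void *buf;
            int fd;

            /* Direct I/O requires properly aligned buffers and sizes. */
            if (posix_memalign(&buf, BLOCK_SIZE, 4 * BLOCK_SIZE))
                    return 1;
            memset(buf, 0xab, 4 * BLOCK_SIZE);

            fd = open("datafile", O_WRONLY | O_DIRECT | O_ATOMIC);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* With O_ATOMIC, either all four blocks reach the device or
               none of them do, even if they are physically scattered. */
            if (pwrite(fd, buf, 4 * BLOCK_SIZE, 0) < 0)
                    perror("pwrite");

            close(fd);
            free(buf);
            return 0;
    }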
Atomic I/O operation support can be taken a step further, though, by adding
asynchronous I/O (AIO) support. The nice thing about the Linux AIO
interface (which is not generally acknowledged to have many nice aspects)
is that it allows an application to enqueue multiple I/O operations with a
single system call. With atomic support, those multiple operations — which
need not all involve the same file — can all be done as a single atomic
unit. That allows multi-file atomic operations, a feature which can be
used to simplify database transaction engines and improve performance.
Once this functionality is in place, Chris hopes, the (currently small)
number of users of the kernel's direct I/O and AIO capabilities will
increase.
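The batching mechanics already exist in the AIO interface; only the atomic
grouping is new. The sketch below shows those existing mechanics (using raw
system calls, since glibc does not wrap them), with the all-or-nothing
semantics assumed to come from the proposed O_ATOMIC work:

    #define _GNU_SOURCE
    #include <linux/aio_abi.h>       /* struct iocb, IOCB_CMD_PWRITE */
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* glibc does not wrap the native AIO system calls. */
    static int io_setup(unsigned nr, aio_context_t *ctx)
    {
            return syscall(SYS_io_setup, nr, ctx);
    }

    static int io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
    {
            return syscall(SYS_io_submit, ctx, nr, iocbs);
    }

    /* Queue one write to each of two files (assumed to have been opened
       with O_DIRECT and the proposed O_ATOMIC flag) in a single
       io_submit() call. Under the proposed semantics the batch would
       complete as one all-or-nothing unit; current kernels provide no
       such guarantee. */
    int submit_atomic_pair(int fd1, int fd2, void *buf1, void *buf2,
                           size_t len)
    {
            aio_context_t ctx = 0;
            struct iocb cb1, cb2;
            struct iocb *cbs[2] = { &cb1, &cb2 };

            if (io_setup(2, &ctx) < 0)
                    return -1;

            memset(&cb1, 0, sizeof(cb1));
            cb1.aio_lio_opcode = IOCB_CMD_PWRITE;
            cb1.aio_fildes = fd1;
            cb1.aio_buf = (unsigned long)buf1;
            cb1.aio_nbytes = len;
            cb1.aio_offset = 0;

            cb2 = cb1;                       /* same shape, different file */
            cb2.aio_fildes = fd2;
            cb2.aio_buf = (unsigned long)buf2;

            /* One system call, two files, one (proposed) atomic unit. */
            return io_submit(ctx, 2, cbs);
    }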
Some block-layer changes will clearly be needed to make this all work, of
course. Low-level drivers will need to advertise the maximum number of
atomic segments any given device will support. The block layer's plugging
infrastructure, which temporarily stops the issuing of I/O requests to
allow them to accumulate and be coalesced, will need to be extended.
Currently, a plugged queue is automatically unplugged when the current
kernel thread schedules out of the processor; there will need to be a means
to require an explicit unplug operation instead. This, Chris noted, was
how plugging used to work, and it caused a lot of problems with lost unplug
operations. Explicit unplugging was removed for a reason; it would have to
be re-added carefully and used "with restraint." Once that feature is
there, the AIO and direct I/O code will need to be modified to hold queue
plugs for the creation of atomic writes.
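As a rough kernel-side sketch of the intended pattern: blk_start_plug() and
blk_finish_plug() are real interfaces, while holding the plug across a
reschedule is hypothetical, standing in for whatever form the extension
eventually takes:

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void submit_atomic_batch(struct bio_list *bios)
    {
            struct blk_plug plug;
            struct bio *bio;

            /* Accumulate requests rather than issuing them one at a
               time; today, scheduling away would implicitly unplug. */
            blk_start_plug(&plug);

            while ((bio = bio_list_pop(bios)))
                    submit_bio(bio);   /* signature varies by kernel version */

            /* Everything accumulated above is issued together. */
            blk_finish_plug(&plug);
    }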
As usual, though, the hard part is the error handling. The filesystem
must stay consistent even if an atomic operation grows too large to
complete. There are a number of tricky cases where this can come about.
There are also challenges with deadlocks while waiting for plugged I/O.
The hardest problem, though, may be related to the simple fact that the
proper functioning of atomic I/O operations will only be tested when things
go wrong — a system crash, for example. It is hard to know that
rarely tested code works well. So there needs to be a comprehensive test
suite that can verify that the hardware's atomic I/O operations are working
properly. Otherwise, it will be hard to have full confidence in the
integrity guarantees provided by atomic operations.
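One plausible shape for such a test (a sketch, not any existing suite) is to
stamp every block of an atomic write with a generation number before inducing
a crash, then verify afterward that no group of blocks carries a mix of
generations:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE       4096
    #define BLOCKS_PER_WRITE 4

    /* Hypothetical stamp that the writer side of the test places at the
       start of every block before the crash is induced. */
    struct block_stamp {
            uint64_t generation;   /* identical across one atomic write */
            uint32_t index;        /* block's position within the write */
    };

    /* After the crash, every group of atomically written blocks must
       carry a single generation number: all from the old write or all
       from the new one. A mix means the "atomic" write was torn. */
    static int verify_group(const unsigned char blocks[][BLOCK_SIZE])
    {
            struct block_stamp first, cur;
            int i;

            memcpy(&first, blocks[0], sizeof(first));
            for (i = 1; i < BLOCKS_PER_WRITE; i++) {
                    memcpy(&cur, blocks[i], sizeof(cur));
                    if (cur.generation != first.generation) {
                            fprintf(stderr, "torn write: block %d\n", i);
                            return -1;
                    }
            }
            return 0;
    }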
Status and future work
The Fusion-io driver has had atomic I/O operation support for some time,
but Chris would like to make this support widely available so that
developers can count on its presence.
Extending it to NVM Express is in progress now; SCSI support will probably
wait until the specification for vectored atomic operations is complete.
Btrfs can use (vectored) atomic I/O operations in its transaction commits;
work on other filesystems is progressing. The changes to the plugging code
are done with the small exception of the deadlock handler; that gap needs
to be filled before the patches can go out, Chris said.
From here, it will be necessary to finalize the proposed changes to the
kernel API and submit them for review. The review process itself could
take some time; the AIO and direct I/O interfaces tend to be contentious,
with lots of developers arguing about them but few being willing to
actually work on that code. So a few iterations on the API can be expected
there. The FIO benchmark
needs to be extended to test atomic I/O. Then
there is the large task of enabling atomic I/O operations in
applications.
For the foreseeable future, a number of limitations will apply to atomic
I/O operations. The most annoying is likely to be the small maximum I/O
size: 64KB for the time being. Someday, hopefully, that maximum will be
increased significantly, but for now it applies. The complexity of the AIO
and direct I/O code will challenge filesystem implementations; that code is
far more intricate than one might expect, and each filesystem interfaces with
it in a slightly different way. There are worries about performance
variations between vendors; Fusion-io devices can implement atomic I/O
operations at very low cost, but that may not be the case for all hardware.
Atomic I/O operations also cannot work across multiple devices; that means
that the kernel's RAID implementations will require work to be able to
support atomic I/O. This work, Chris said, will not be in the initial
patches.
There are alternatives to atomic I/O operations, including explicit I/O
ordering (used in the kernel for years) or I/O checksumming (to detect
incomplete I/O operations after a crash). But, for many situations, atomic
I/O operations look like a good way to let the hardware help the software
get better performance from solid-state drives. Once this functionality
finds its way into the mainline, taking advantage of fast drives might just
feel a bit less like coaxing another trip out of that old jalopy.
[Your editor would like to thank the Linux Foundation for supporting his
travel to LinuxCon Japan.]