By Jonathan Corbet
April 14, 2009
One might think that the ext3 filesystem, by virtue of being standard on
almost all installed Linux systems for some years now, would be reasonably
well tuned for performance. Recent events have shown, though, that some
performance problems remain in ext3, especially in places where the
fsync() system call is used. It's impressive what can happen when
attention is drawn to a problem; the 2.6.30 kernel will contain
fixes which seemingly eliminate many of the latencies experienced by ext3
users. This article will look at the changes that were made, including a
surprising change to the default journaling mode made just before the
2.6.30-rc1 release.
The problem, in short, is this: the ext3 filesystem, when running in the
default data=ordered mode, can exhibit lengthy stalls when some
process calls fsync() to flush data to disk. This issue most
famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes
beyond just Firefox. Anytime there is reasonably heavy I/O going on, an
fsync() call can bring everything to a halt for several seconds.
Some stalls on the order of minutes have been reported. This behavior has
tended to discourage the use of fsync() in applications and it
makes the Linux desktop less fun to use. It's clearly worth fixing, but nobody did so for years.
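The pattern that triggers the problem is nothing exotic; any application which writes a bit of data and then wants it safely on disk will do. A minimal sketch (the file name and contents here are arbitrary) looks like this:

    /* Minimal illustration of the pattern behind the stalls: a small write
     * followed by fsync().  The file name and data are arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "a few bytes of application state\n";
        int fd = open("appdata.db", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, buf, strlen(buf)) < 0)
            perror("write");

        /* On ext3 in data=ordered mode, this call could block for seconds
         * (or longer) whenever other processes were writing heavily. */
        if (fsync(fd) < 0)
            perror("fsync");

        close(fd);
        return 0;
    }

Firefox's much-lamented freezes came from just this sort of small, synchronous commit, issued by way of its SQLite databases.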
When Ted Ts'o looked into the issue, he noticed an obvious problem: data
sent to the disk via fsync() is put at the back of the I/O
scheduler's queue, behind all other outstanding requests. If processes on
the system are
writing a lot of data, that queue could be quite long. So it takes a long
time for fsync() to complete. While that is happening, other
parts of the filesystem can stall, eventually bringing much of the system
to a halt.
The first fix was to tag I/O requests generated by fsync() with the WRITE_SYNC operation bit, marking them as synchronous requests.
The CFQ I/O scheduler tries to run synchronous requests (which generally
have a process waiting for the results) ahead of asynchronous ones (where
nobody is waiting). Normally, reads are considered to be synchronous,
while writes are not. Once the fsync()-related requests were made
synchronous, they were able to jump ahead of normal I/O. That
makes fsync() much faster, at the expense of slowing down the
I/O-intensive tasks in the system. This is considered to be a good
tradeoff by just about everybody involved. (It's amusing to note that this
change is conceptually similar to the I/O priority patch posted by
Arjan van de Ven some time ago; some ideas take a while to reach
acceptance).
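In kernel terms, the change amounts to little more than choosing a different flag when the buffers are submitted. The helper below is a sketch only, modeled on the generic buffer-writing code rather than copied from ext3 itself; submit_bh(), WRITE, and WRITE_SYNC are the real interfaces involved:

    /* Sketch only: not the actual ext3/jbd code, but the shape of the
     * change.  Writing a buffer with WRITE_SYNC instead of plain WRITE
     * puts the request on the I/O scheduler's synchronous queue instead
     * of at the back of the asynchronous one. */
    #include <linux/buffer_head.h>
    #include <linux/fs.h>

    static void write_buffer_for_fsync(struct buffer_head *bh)
    {
        lock_buffer(bh);
        if (test_clear_buffer_dirty(bh)) {
            get_bh(bh);
            bh->b_end_io = end_buffer_write_sync;
            submit_bh(WRITE_SYNC, bh);    /* was: submit_bh(WRITE, bh) */
        } else {
            unlock_buffer(bh);
        }
    }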
Block subsystem maintainer Jens Axboe disliked
the change, though, stating that it would cause severe performance
regressions for some workloads. Linus made it clear, however, that the patch was probably going to go in, and that, if
the CFQ I/O scheduler couldn't handle it, there would soon be a change to a
different default scheduler. Jens probably would have looked further in
any case, but the extra motivation supplied by Linus is unlikely to have
slowed this process down.
The problem, as it turns out, is that WRITE_SYNC actually does two
things: putting the request onto the higher-priority synchronous queue, and
unplugging the queue. "Plugging" is the technique used by the block layer
to issue requests to the underlying disk driver in bursts. Between bursts,
the queue is "plugged," causing requests to accumulate there. This
accumulation gives the I/O scheduler an opportunity to merge adjacent
requests and issue them in some sort of reasonable order. Judicious use of
plugging improves block subsystem performance significantly.
Unplugging the
queue for a synchronous request can make sense in some situations; if
somebody is waiting for the operation, chances are they will not be
adding any adjacent requests to the queue, so there is no point in waiting
any longer.
As it happens, though, fsync() is not one of those situations.
Instead, a call to fsync() will usually generate a whole series of
synchronous requests, and the chances of those requests being adjacent to each other are fairly good. So unplugging the queue after each synchronous
request is likely to make performance worse. Upon identifying this
problem, Jens posted a series of
patches to fix it. One of them adds a new WRITE_SYNC_PLUG
operation which queues a synchronous write without unplugging the queue.
This allows operations like fsync() to create a series of
operations, then unplug the queue once at the end.
While he was at it, Jens fixed a couple of related issues. One was that
the block subsystem could still, in some situations, run synchronous requests behind asynchronous ones. The code here is a bit tricky,
since it may be desirable to let a few asynchronous requests through occasionally to
prevent them from being starved entirely. Jens changed the balance to
ensure that synchronous requests get through in a timely manner.
Beyond that, the CFQ scheduler
uses "anticipatory" logic with synchronous requests; upon executing one
such request, it will stall the queue to see if an adjacent request shows
up. The idea is that the disk head will be ideally positioned to satisfy
that request, so the best performance is obtained by not moving it away
immediately.
This logic can work well for synchronous reads, but it's not helpful
when dealing with write operations generated by fsync(). So now there's a
new internal flag that prevents anticipation when WRITE_SYNC_PLUG
operations are executed.
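Taken together, the request flags end up expressing all of this directly. Reconstructed approximately from the 2.6.30 tree (the exact bit names here should be treated as an assumption and checked against the source), the relationship looks like:

    /* Approximate 2.6.30 definitions, reconstructed from memory: both
     * variants are synchronous and tell CFQ not to idle afterward; only
     * WRITE_SYNC unplugs the queue immediately. */
    #define WRITE_SYNC_PLUG  (WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
    #define WRITE_SYNC       (WRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))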
Linus liked the changes:
Goodie. Let's just do this. After all, right now we would otherwise
have to revert the other changes as being a regression, and I
absolutely _love_ the fact that we're actually finally getting
somewhere on this fsync latency issue that has been with us for so
long.
It turns out that there's a little more,
though. Linus noticed that he was still getting stalls, even if they were
much shorter than before, and he wondered why:
One thing that I find intriguing is how the fsync time seems so
_consistent_ across a wild variety of drives. It's interesting how
you see delays that are roughly the same order of magnitude, even
though you are using an old SATA drive, and I'm using the Intel
SSD.
The obvious conclusion is that there was still something else going on.
Linus's hypothesis was that the volume of requests pending to the drive was
large enough to cause stalls even when the synchronous requests go to the
front of the queue. With a default configuration, requests can contain up
to 512KB of data; stack up a couple dozen or so of those, and it's going to
take the drive a little while to work through them. Linus experimented
with setting the maximum size (controlled by
/sys/block/drive/queue/max_sectors_kb) to 64KB, and reported that things worked a lot better. As of this writing, though, the default
has not been changed; Linus suggested that it might make sense to cap the
maximum amount of outstanding data, rather than the size of any individual
request. More experimentation is called for.
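Those wanting to try the same experiment can simply write a new value into that sysfs file; a minimal sketch in C (using "sda" as a stand-in for the drive of interest, and requiring root to run) might look like:

    /* Sketch of Linus's experiment: cap the per-request size for one
     * drive at 64KB.  "sda" is an example device name only. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/queue/max_sectors_kb";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return 1;
        }
        fprintf(f, "64\n");
        return fclose(f) ? 1 : 0;
    }

The same effect can, of course, be had by echoing the value into the file from a root shell.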
There is one other important change needed to get a truly quick
fsync() with ext3, though: the filesystem must be mounted in
data=writeback mode. This mode eliminates the requirement that data blocks be flushed to disk ahead of the associated metadata; in data=ordered mode, the data which must be written out first guarantees that fsync() will always be slower. Switching to
data=writeback eliminates those writes, but, in the process, it
also turns off the feature which made ext3 seem more robust than ext4.
Ted Ts'o has mitigated that problem somewhat, though, by adding in the same
safeguards he put into ext4. In some situations (such as when a new file
is renamed on top of an existing file), data will be forced out ahead of
metadata. As a result, data loss resulting from a system crash should be less of a
problem.
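The case the safeguard targets is the classic replace-via-rename sequence that many applications use to update a file in place; a minimal sketch (file names arbitrary) of that pattern:

    /* The replace-via-rename pattern Ted's safeguard is aimed at: write
     * the new contents to a temporary file, then rename() it over the old
     * one.  With the safeguard, the new file's data is forced out before
     * the rename, even without an explicit fsync().  File names here are
     * arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char new_contents[] = "updated configuration\n";
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, new_contents, strlen(new_contents)) < 0)
            perror("write");
        close(fd);

        /* Atomically replace the old file with the new one. */
        if (rename("config.tmp", "config") < 0) {
            perror("rename");
            return 1;
        }
        return 0;
    }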
Sidebar: data=guarded
Another alternative to data=ordered may be the data=guarded mode proposed by
Chris Mason. This mode would delay on-disk file size updates until the associated data blocks have been written, preventing information-disclosure problems. It is a very new patch, though, which won't be ready for 2.6.30.
The other potential problem with
data=writeback is that, in some
situations, a crash can leave a file with blocks allocated to it which have
not yet been written. Those blocks may contain somebody else's old data,
which is a potential security problem. That exposure is a smaller issue than it once was, for the simple reason that multiuser Linux systems are relatively
scarce in 2009. In a world where most systems are dedicated to a single
user, the potential for information disclosure in the event of a crash
seems vanishingly small. In other words, it's not clear
that the extra security provided by
data=ordered is worth the
associated performance costs anymore.
So Ted suggested that, maybe,
data=writeback should be made the default. There was some
resistance to this idea; not everybody thinks that ext3, at this stage of
its life, should see a big option change like that. Linus, however, was unswayed by the arguments. He merged a
patch which creates a configuration option for the default ext3 data mode,
and set it to "writeback." That will cause ext3 mounts to silently switch
to data=writeback mode with 2.6.30 kernels. Says Linus:
I'm expecting that within a few months, most modern distributions
will have (almost by mistake) gotten a new set of saner defaults,
and anybody who keeps their machine up-to-date will see a smoother
experience without ever even realizing _why_.
It's worth noting that this default will not change anything if
(1) the data mode is explicitly specified when the filesystem is
mounted, or (2) a different mode has been wired into the filesystem
with tune2fs. It will also be ineffective if distributors change
it back to "ordered" when configuring their kernels. Some distributors, at least, may well decide that they do not wish to push that kind of change to their users; we will not know how they choose for some months yet.
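In the meantime, users who want to pin the data mode themselves need only specify it explicitly. A minimal sketch using the mount(2) system call (the device and mount point here are examples only; the same option can, naturally, be given in /etc/fstab):

    /* Sketch: explicitly pinning the ext3 data journaling mode at mount
     * time, which overrides the kernel's new default.  The device and
     * mount point are examples only. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("/dev/sda1", "/mnt/data", "ext3", 0, "data=ordered") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }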