By Jonathan Corbet
October 4, 2010
Over the last few years, it has become clear that one of the most pressing
scalability problems faced by Linux is being driven by solid-state storage
devices (SSDs). The rapid increase in performance offered by these devices
cannot help but reveal any bottlenecks in the Linux filesystem and block
layers. What has been less clear, at times, is what we are going to do
about this problem. In his LinuxCon Japan talk, block maintainer Jens
Axboe described some of the work that has been done to improve block layer
scalability and offered a view of where things might go in the future.
While workloads will vary, Jens says, most I/O patterns are dominated by
random I/O and relatively small requests. Thus, getting the best results
requires being able to perform a large number of I/O operations per second
(IOPS). With a high-end rotating drive (running at 15,000 RPM), the
maximum rate possible is about 500 IOPS. Most real-world drives, of
course, will have significantly slower performance and lower I/O rates.
SSDs, by eliminating seeks and rotational delays, change everything; we
have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very
short period of time. A number of people have said that the massive
increase in IOPS means that the block layer will have to become more like
the networking layer, where every bit of per-packet overhead has been
squeezed out over time. But, as Jens points out, time is not in great
abundance. Networking technology went from 10Mb/s in the 1980s to 10Gb/s
now, the better part of 30 years later. SSDs have forced a similar jump
(three orders of magnitude) in a much shorter period of time - and every
indication suggests that devices with IOPS rates in the millions are not
that far away. The result, says Jens, is "a big problem."
This problem pops up in a number of places, but it usually comes down to
contention for shared resources. Locking overhead which is tolerable at
500 IOPS is crippling at 500,000. There are problems with contention
at the hardware level too; vendors of storage controllers have been caught
by surprise by SSDs and are having to scramble to get their performance up
to the required levels. The growth of multicore systems naturally makes
things worse; such systems can create contention problems throughout the
kernel, and the block layer is no exception. So, much of the necessary work
comes down to avoiding contention.
Before that, though, some work had to be done just to get the block layer
to recognize that it is dealing with an SSD and react accordingly.
Traditionally, the block layer has been driven by the need to avoid head
seeks; the use of quite a bit of CPU time could be justified if it managed
to avoid a single seek. SSDs - at least the good ones - care a lot less
about seeks, so expending a bunch of CPU time to avoid them no longer makes
sense. There are various ways of detecting SSDs in the hardware, but they
don't always work, especially with the lower-quality devices. So the block
layer exports a flag under
/sys/block/<device>/queue/rotational
which can be used to override the system's notion of what kind of storage
device it is dealing with.
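For a device that is not detected correctly, flipping that flag is just a
write to the sysfs file. The following minimal user-space sketch shows the
idea; the device name ("sdb") and the write_queue_attr() helper are invented
here for illustration and are not part of any kernel interface:

    #include <stdio.h>

    /* Write a value to /sys/block/<dev>/queue/<attr>; purely illustrative. */
    static int write_queue_attr(const char *dev, const char *attr,
                                const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void)
    {
        /* Tell the block layer that "sdb" is non-rotational (an SSD). */
        return write_queue_attr("sdb", "rotational", "0") ? 1 : 0;
    }

Running it (as root) has the same effect as echoing "0" into the file by hand.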
Improving performance with SSDs can be a challenging task. There is no
single big bottleneck which is causing performance problems; instead, there
are numerous small things to fix. Each fix yields a bit of progress, but
it mostly serves to highlight the next problem. Additionally, performance
testing is hard; results are often not reproducible and can be perturbed by
small changes. This is especially true on larger systems with more CPUs.
Power management can also get in the way of producing consistent results.
One of the first things to address on an SSD was queue plugging. On a
rotating disk, the first I/O operation to show up in the request queue will
cause the queue to be "plugged," meaning that no operations will actually
be dispatched to the hardware. The idea behind plugging is that, by
allowing a little time for additional I/O requests to arrive, the block
layer will be able to merge adjacent requests (reducing the operation
count) and sort them into an optimal order, increasing performance.
Performance on SSDs tends not to benefit from this treatment, though there
is still a little value to merging requests. Dropping (or, at least,
reducing) plugging not only
eliminates a needless delay; it also reduces the need to take the queue
lock in the process.
Then, there is the issue of request timeouts. Like most I/O code, the
block layer needs to notice when an I/O request is never completed by the
device. That detection is done with timeouts. The old implementation
involved a separate timeout for each outstanding request, but that clearly
does not scale when the number of such requests can be huge. The answer
was to go to a per-queue timer, reducing the number of running timers
considerably.
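To see why that helps, consider a purely hypothetical sketch of the per-queue
approach: a single periodic check walks the list of outstanding requests and
flags any whose deadline has passed, rather than arming (and later canceling)
one timer for every request in flight. None of these names come from the
block layer; they only show the shape of the technique:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for an outstanding I/O request. */
    struct pending_req {
        int id;
        time_t deadline;    /* time by which the request must complete */
    };

    /* One timeout check per queue: scan all outstanding requests and
     * report any that have expired, instead of running one timer each. */
    static void queue_timeout_check(const struct pending_req *reqs, int nr,
                                    time_t now)
    {
        for (int i = 0; i < nr; i++)
            if (reqs[i].deadline <= now)
                printf("request %d timed out\n", reqs[i].id);
    }

    int main(void)
    {
        time_t now = time(NULL);
        struct pending_req reqs[] = {
            { 1, now - 5 },     /* already expired */
            { 2, now + 30 },    /* still within its deadline */
        };

        queue_timeout_check(reqs, 2, now);
        return 0;
    }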
Block I/O operations, due to their inherently unpredictable execution
times, have traditionally contributed entropy to the kernel's random number
pool. There is a problem, though: the necessary call to
add_timer_randomness() has to acquire a global lock, causing
unpleasant systemwide contention. Some work was done to batch these calls
and accumulate randomness on a per-CPU basis, but, even when batching 4K
operations at a time, the performance cost was significant. On top of it
all, it's not really clear that using an SSD as an entropy source makes a
lot of sense. SSDs lack mechanical parts moving around, so their
completion times are much more predictable. Still, for the moment, SSDs
contribute to the entropy pool by default; administrators who would
like to change that behavior can do so through the queue/add_random sysfs
variable.
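Disabling that contribution is the same kind of sysfs write shown earlier;
with the hypothetical helper from that sketch, it would look like:

    write_queue_attr("sdb", "add_random", "0");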
There are other locking issues to be dealt with. Over time, the block
layer has gone from being protected by the big kernel lock to a block-level
lock, then to a per-disk lock, but lock contention is still a problem. The
I/O scheduler adds contention of its own, especially if it is performing
disk-level accounting. Interestingly, contention for the locks themselves
is not
usually the problem; it's not that the locks are being held for too long.
The big problem is the cache-line bouncing caused by moving the lock
between processors. So the traditional technique of dropping and
reacquiring locks to reduce lock contention does not help here - indeed, it
makes things worse. What's needed is to avoid taking the lock altogether.
Block requests enter the system via __make_request(), which is
responsible for getting a request (represented by a BIO structure) onto the
queue. Two lock acquisitions are required to do this job - three if the
CFQ I/O scheduler is in use. Those two acquisitions are the result of a
lock split done to reduce contention in the past; that split, when the
system is handling requests at SSD speeds, makes things worse. Eliminating
it led to a roughly 3% increase in IOPS with a reduction in CPU time on a
32-core system. It is, Jens says, a "quick hack," but it demonstrates the
kind of changes that need to be made.
The next step for this patch is to drop the I/O request allocation batching
- a mechanism added to increase throughput on rotating drives by allowing
the simultaneous submission of multiple requests. Jens also plans to drop
the allocation accounting code, which tracks the number of requests in
flight at any given time. Counting outstanding I/O operations requires
global counters, with the contention they bring, but that counting can be
dispensed with most of the time. Some accounting will still be done at the
request queue
level to ensure that some control is maintained over the number of
outstanding requests. Beyond that, there is some per-request accounting
which can be cleaned up and, Jens thinks, request completion can be made
completely lockless. He hopes that this work will be ready for merging
into 2.6.38.
Another important technique for reducing contention is keeping processing
on the same CPU as often as possible. In particular, there are a number of
costs incurred when the CPU that submits a specific I/O request is not the
CPU that handles its completion. Locks are bounced
between CPUs in an unpleasant way, and the slab allocator tends not to
respond well when memory allocated on one processor is freed elsewhere in
the system. In the networking layer, this problem has been addressed with
techniques like receive packet
steering, but, unlike some networking hardware, block I/O controllers
are not able to direct specific I/O completion interrupts to specific
CPUs. So a different solution was required.
That solution took the form of smp_call_function(), which performs
fast cross-CPU calls. Using smp_call_function(), the block I/O
completion code can direct the completion of specific requests to the CPU
where those requests were initially submitted. The result is a relatively
easy performance improvement. A dedicated administrator who is willing to
tweak the system manually can do better, but
that takes a lot of work and the solution tends to be fragile. This
code - which was merged back in 2.6.27 and made the default in 2.6.32 -
is an easier way that takes away a fair amount of the pain of cross-CPU
contention. Jens
noted with pride that the block layer was not chasing the networking code
with regard to completion steering - the block code had it first.
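The idea can be sketched in kernel-flavored C: record the submitting CPU in
the request, and at completion time push the final processing back to that
CPU with a cross-CPU call when necessary. The structure and function names
below are invented for illustration; only smp_call_function_single() and
raw_smp_processor_id() are real kernel interfaces, and the real
implementation must also cope with running in interrupt context, which this
sketch ignores:

    #include <linux/smp.h>

    /* Hypothetical request structure; only the field needed here. */
    struct my_request {
        int submit_cpu;     /* CPU that queued the request */
        /* ... */
    };

    static void my_complete_local(void *data)
    {
        /* data is the request; finish it here, on the submitting CPU */
    }

    static void my_submit(struct my_request *rq)
    {
        rq->submit_cpu = raw_smp_processor_id();
        /* ... hand the request to the hardware ... */
    }

    static void my_complete(struct my_request *rq)
    {
        if (rq->submit_cpu == raw_smp_processor_id()) {
            my_complete_local(rq);
        } else {
            /* bounce completion back to the CPU that submitted it */
            smp_call_function_single(rq->submit_cpu,
                                     my_complete_local, rq, 0);
        }
    }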
On the other hand, the blk-iopoll interrupt mitigation
code was not just inspired by the networking layer - some of the code was
"shamelessly stolen" from there. The blk-iopoll code turns off completion
interrupts when I/O traffic is high and uses polling to pick up completed
events instead. On a test system, this code reduced 20,000
interrupts/second to about 1,000. Jens says that the results are less
conclusive on real-world systems, though.
An approach which "has more merit" is "context plugging," a rework of the
queue plugging code. Currently, queue plugging is done implicitly on I/O
submission, with an explicit unplug required at a later time. That has
been the source of a lot of bugs; forgetting to unplug a queue is a common
mistake. The plan is to make plugging and unplugging fully implicit, but to
give I/O submitters a way to inform the block layer that more requests are
coming soon. That makes the code clearer and more robust; it also gets rid
of a lot of expensive per-queue state which must otherwise be maintained.
There are still
some problems to be solved, but the code works, is "tasty on many levels,"
and yields a net reduction of some 600 lines of code. Expect a merge in
2.6.38 or 2.6.39.
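From a submitter's point of view, the result is an on-stack plug bracketing
a batch of submissions, roughly as sketched below; submit_batch() is an
invented example, the individual submission calls are elided, and the
blk_start_plug()/blk_finish_plug() names match the interface this work
introduced:

    #include <linux/blkdev.h>

    /* Illustrative submitter: announce that a batch of requests is coming
     * by plugging on the stack; the block layer flushes the plug when
     * blk_finish_plug() is called (or when the task blocks). */
    static void submit_batch(void)
    {
        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... queue up a series of bios here, e.g. via submit_bio() ... */
        blk_finish_plug(&plug);
    }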
Finally, there is the "weird territory" of a multiqueue block layer - an
idea which, once again, came from the networking layer. The creation of
multiple I/O queues for a given device will allow multiple processors to
handle I/O requests simultaneously with less contention. It's currently
hard to do, though, because block I/O controllers do not (yet) have
multiqueue support. That problem will be fixed eventually, but there will
be some other challenges to overcome: I/O barriers will become
significantly more complicated, as will per-device accounting. All told,
it will require some major changes to the block layer and a special I/O
scheduler. Jens offered no guidance as to when we might see this code
merged.
The conclusion which comes from this talk is that the Linux block layer is
facing some significant challenges driven by hardware changes. These
challenges are being addressed, though, and the code is moving in the
necessary direction. By the time most of us can afford a system with one
of those massive, million-IOPS arrays on it, Linux should be able to use it
to its potential.