Supporting solid-state hybrid drives
Of course, one can simply leave everything up to the drive itself. Left to its own devices (so to speak), the drive will observe which blocks are frequently accessed and work to keep those blocks in fast storage. But the operating system — or the programs running on that system — will often have a better idea of which data will be most useful in the future. If that information is communicated to the drives, the result should be better use of fast storage, and, thus, better performance.
Enabling that communication is the goal of this patch set posted by Jason Akers. The response to that patch set from the kernel community makes it clear, though, that there is still some work to be done to figure out the best way to get the best possible performance from such drives.
This patch set uses the per-process I/O priority value as a way of signaling information about cache usage. That priority can be set by way of the ionice command. Using a few bits of the priority field, the user can specify one of a number of policies (listed here in symbolic form):
- IOPRIO_ADV_EVICT says that the data involved in I/O operations should be actively removed from the cache, should it be found there. It's a way of saying that the data will, with certainty, not be used again in the near future.
- IOPRIO_ADV_DONTNEED says that the data should not be cached, but that there is no need to actively evict it from the cache if it's already there.
- IOPRIO_ADV_NORMAL leaves caching policy up to the drive, as if no advice had been provided at all.
- IOPRIO_ADV_WILLNEED indicates that the data will be needed again in the near future and, thus, should be stored in the cache.
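As a rough illustration of how such a per-process policy might be set from user space, here is a minimal sketch built on the real (wrapper-less) ioprio_set() system call. The IOPRIO_ADV_* names come from the patch set as described above, but the bit layout — IOPRIO_ADV_SHIFT and the numeric advice values — is an assumption made purely for illustration, not the patch's actual encoding.

```c
/*
 * Sketch only: ioprio_set() is a real system call, but the advice-bit
 * encoding below (IOPRIO_ADV_SHIFT and the IOPRIO_ADV_* values) is an
 * assumption for illustration.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_BE		2
#define IOPRIO_WHO_PROCESS	1

/* Assumed placement of the advice bits within the priority value. */
#define IOPRIO_ADV_SHIFT	8
#define IOPRIO_ADV_NORMAL	0
#define IOPRIO_ADV_DONTNEED	1
#define IOPRIO_ADV_EVICT	2
#define IOPRIO_ADV_WILLNEED	3

static int set_cache_advice(int advice)
{
	/* Best-effort class, priority level 4, plus the advice bits. */
	int prio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | 4 |
		   (advice << IOPRIO_ADV_SHIFT);

	/* A "who" of 0 means the calling process. */
	return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio);
}

int main(void)
{
	if (set_cache_advice(IOPRIO_ADV_WILLNEED))
		perror("ioprio_set");
	return 0;
}
```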
This patch set is unlikely to be merged in anything close to its current form for a few reasons. One of those is that, as a few developers pointed out, associating I/O caching policy with a process is a bit strange. Any given process may want different caching policies for different files it works with; indeed, it may want different policies for different parts of the same file. Creating a single, per-process policy makes this kind of use nearly impossible.
Beyond that, as Dave Chinner pointed out, the process that generates an I/O operation in user space may not be the process that submits the I/O to the block subsystem. Many filesystems use worker threads to perform actual submission; that breaks the link with the process that originally created the I/O operation. Filesystems, too, may wish to adjust caching policy; giving metadata a higher priority for the cache than data is one obvious possibility. As it happens, there is a way for filesystems to adjust the I/O priority value on individual requests, but it is not the most elegant of APIs.
For these reasons, some developers have suggested that the caching policy should be set on a per-file basis with a system call like fadvise() rather than on a per-process basis. Even better, as Jens Axboe noted, would be to add a mechanism by which processes could provide hints on a per-operation basis. The approach used in the non-blocking buffered read proposal might be applicable for that type of use.
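For reference, the per-file interface the developers have in mind already has a close analog in posix_fadvise(), which advises the kernel's page cache rather than the drive's flash cache. The sketch below shows only that existing call; a hybrid-drive variant would presumably add new advice values that do not exist today, and the file name used is just an example.

```c
/* Existing per-file advice interface; affects the page cache only. */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int fd = open("/var/lib/db/hot.dat", O_RDONLY);	/* example file */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* The whole file (offset 0, length 0) will be needed soon. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	/* ... read and process the file ... */

	/* The data will not be needed again; let the cache drop it. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	return 0;
}
```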
There is another problem with this patch set, though: the types of "advice" that can be provided are tied tightly to the specifics of how the current generation of hybrid drives operates. It offers low-level control over a single level of cache and not much else. Future drives may operate in different ways that do not correspond well to the above-described operations. Beyond that, hybrid drives are not the only place where this kind of advice can be provided; it can also be useful over NFS 4.2, with persistent memory devices, and with the upcoming T10/T13 "logical block markup descriptors." There is a strong desire to avoid merging a solution that works with one type of current technology, but that will lack relevance with other technologies.
Martin Petersen has put some time into trying to find an optimal way to
provide advice to storage devices in general. His approach is to avoid
specific instructions ("evict this data from the cache") in favor of a
description of why the I/O is being performed. He described his results
as "a huge twisted mess of a table with ponies of various sizes", but
it's not all that complicated in the end.
That table consists of a set of I/O classes, along with the performance
implications of each class. There is a "transaction" class with stringent
completion-time and latency requirements and a high likelihood that the
data will be accessed again in the near future. The "streaming" class also
wants fast command completion, but the chances of needing the data again
soon are quite low. Other classes include "metadata" (which is like
transactions but with a lower likelihood of needing the data again),
"paging," "data," and "background" (which has low urgency and no need for
caching).
Given an (unspecified) API that uses these I/O classes, the low-level
driver code can map the class of any specific I/O operation onto the
proper advice for the hardware. That mapping might be a bit trickier than
one might imagine, though, as the hardware gets more complex. There is
also the problem of consistency across devices; if drivers interpret the
classes differently, the result could be visible performance differences
that create unhappy users.
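As a purely illustrative sketch of what a class-based interface and a driver-side mapping could look like: none of the types or names below exist in the kernel; they are assumptions based on the classes described above, and a different driver could map them differently, which is exactly the consistency problem just mentioned.

```c
/*
 * Hypothetical sketch: the classes follow the descriptions in the
 * text, but the enums and the mapping function are invented for
 * illustration; no such API has been merged.
 */

/* I/O classes describing why an operation is being performed. */
enum io_class {
	IO_CLASS_TRANSACTION,	/* low latency, data likely reused soon */
	IO_CLASS_METADATA,	/* low latency, reuse less likely */
	IO_CLASS_STREAMING,	/* fast completion, little chance of reuse */
	IO_CLASS_PAGING,
	IO_CLASS_DATA,
	IO_CLASS_BACKGROUND,	/* low urgency, no need for caching */
};

/* Hypothetical cache hints a driver might pass to a hybrid drive. */
enum cache_hint {
	CACHE_HINT_NONE,	/* leave placement up to the device */
	CACHE_HINT_PIN,		/* keep the data in the fast tier */
	CACHE_HINT_BYPASS,	/* do not pollute the fast tier */
};

/* One possible driver-side mapping from I/O class to hardware advice. */
enum cache_hint map_class(enum io_class class)
{
	switch (class) {
	case IO_CLASS_TRANSACTION:
	case IO_CLASS_METADATA:
		return CACHE_HINT_PIN;
	case IO_CLASS_STREAMING:
	case IO_CLASS_BACKGROUND:
		return CACHE_HINT_BYPASS;
	default:
		return CACHE_HINT_NONE;
	}
}
```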
These issues will need to be worked out, though, if Linux systems are to
drive hybrid devices in anything other than the default, device-managed
mode. Given a suitable kernel and user-space API, the class-based
approach looks like it should be flexible enough to get the most out of
near-future hardware. Getting there, though, means a trip back to the
drawing board for the authors of the current hybrid-drive patches.
Supporting solid-state hybrid drives
Posted Nov 9, 2014 17:47 UTC (Sun) by marcH (subscriber, #57642) [Link]
Sounds very similar to network QoS. Any potential for overlap/mapping/re-use? "tc" for disks some time soon?
Network QoS has generally not been a great success... except in closed, tightly controlled environments, which should be the case here.
Supporting solid-state hybrid drives
Posted Nov 9, 2014 17:51 UTC (Sun) by marcH (subscriber, #57642) [Link]
(No pony this time - thanks)
Methinks we need to be closer to the primitives
Posted Nov 16, 2014 2:16 UTC (Sun) by davecb (subscriber, #1574) [Link]
From where I was, it looked like stable memory could be used first for write cache, so that what was written to disk in a critical order would be honoured. The cache was a circular buffer, with the most recently written data being what was to be returned, and the buffer never allowed to overflow. Bottleneck, yes, at the ultimate speed of the underlying disk, but that would be an insanely rare event with current speeds.
Once we know that writes are ordered and atomic, we can look at reads, and cache in a less permanent form everything we've read in the last two minutes. Recent writes, of course, take precedence over old cache entries.
We had two different devices: a sequential-oriented write cache, and a random-oriented but invalidatable read cache. The first was the most important. The second could be done with cheaper, non-stable memory, or with slower flash.
That takes the problem into a different regime, and one that can be handled far more easily than the I/O of a process.
