Brief items
The current development kernel is 3.10-rc3,
released on May 26 with some grumbles from Linus Torvalds
about how it contains more changes than he would have liked. "I can
pretty much guarantee that -rc4 is going to be smaller, because (a) I'm
going to be grumpy if people try to push as much to rc4 as happened to rc3,
and (b) I'm going to be traveling for most of next week (and part of the
week after). I'll have internet, but I really really hope and expect that
things should be calmer coming up. Right? RIGHT GUYS?"
Stable updates: 3.9.4, 3.4.47,
and 3.0.80 were released on May 24.
Sarah Sharp
reports
on the response to the availability of a set of Outreach Program for
Women internships focused on the Linux kernel. "As coordinator for
the Linux kernel OPW project, I was really worried about whether applicants
would be able to get patches into the kernel. Everyone knows that kernel
maintainers are the pickiest bastards^Wperfectionists about coding style,
getting the proper Signed-off-by, sending plain text email, etc. I thought
a couple applicants would be able to complete maybe one or two patches,
tops. Boy was I wrong!" In the end, 41 people applied
and 18 of them submitted 374
patches to the kernel, of which 137 were accepted.
Kernel development news
By Jake Edge
May 30, 2013
As part of the developer track at this year's Automotive
Linux Summit Spring, Greg
Kroah-Hartman talked about interprocess communication (IPC) in the kernel with
an eye toward the motivations behind kdbus. The work on kdbus is
progressing well and Kroah-Hartman expressed optimism that it would be
merged before the end of the year. Beyond just providing a faster D-Bus
(which could be accomplished without moving it into the kernel, he said),
it is his hope that kdbus can eventually replace Android's binder IPC
mechanism.
Survey of IPC
There are a lot of different ways to communicate between processes
available in Linux (and, for many of the mechanisms, more widely in Unix).
Kroah-Hartman
strongly recommended Michael Kerrisk's book, The Linux Programming Interface, as
a reference to these IPC mechanisms (and most other things in the Linux
API). Several of his slides
[PDF] were taken directly from the book.
All of the different IPC mechanisms fall into one of three categories, he
said: signals, synchronization, or communication. He used diagrams from Kerrisk's book (page 878) to show the
categories and their members.
There are two types of
signals in the kernel, standard and realtime, though the latter doesn't see much
use, he said.
Synchronization methods are numerous, including futexes and
eventfd(), which
are both relatively new. Semaphores are also available, both as the
"old style" System V semaphores and as "fixed up" by
POSIX. The latter come in both named and unnamed varieties. There is also
file locking, which has two flavors: record locks to lock a
portion of a file and file locks to prevent access to the whole
file. However, the code that implements file locking is "scary", he said.
Threads have four separate types of synchronization methods (mutex,
condition variables, barriers, and read/write locks) available as well.
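As a small illustration of one of the newer mechanisms, here is a
minimal (and deliberately simplified) use of eventfd() as a wakeup
counter; in real code the descriptor would be shared with another
thread or a child process rather than written and read in one place:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t val = 1;
        int efd = eventfd(0, 0);        /* counter starts at zero */

        if (efd < 0) {
            perror("eventfd");
            return 1;
        }
        /* A writer (normally another thread, or a child that
         * inherited efd across fork()) adds to the counter... */
        write(efd, &val, sizeof(val));
        /* ...and a reader blocks until the counter is nonzero,
         * then consumes and resets it. */
        read(efd, &val, sizeof(val));
        printf("woke up, counter was %llu\n", (unsigned long long)val);
        close(efd);
        return 0;
    }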
For communication, there are many different kernel services available too.
For data transfer, one can use pseudo-terminals. For byte-stream-oriented
data, there are pipes, FIFOs, and stream sockets. For communicating via
messages, there are
both POSIX and System V flavored message queues. Lastly, there is
shared memory, which also comes in POSIX and System V varieties, along
with mmap() for anonymous and file mappings. Anonymous mappings with
mmap() were not something Kroah-Hartman knew about until recently; he
and the other kdbus developers ended up using them in kdbus.
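A minimal sketch of that anonymous-mapping technique: a MAP_SHARED |
MAP_ANONYMOUS mapping created before fork() is visible to both parent
and child, giving a region of shared memory with no file or System V
key behind it:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous shared mapping: no backing file (fd is -1). */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (shared == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        *shared = 0;
        if (fork() == 0) {      /* the child writes... */
            *shared = 42;
            _exit(0);
        }
        wait(NULL);             /* ...and the parent sees the value */
        printf("child wrote %d\n", *shared);
        munmap(shared, sizeof(int));
        return 0;
    }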
Android IPC
"That is everything we have today, except for Android", Kroah-Hartman said.
All of the existing IPC mechanisms were "not enough for Android", so that
project
added ashmem, pmem, and binder. Ashmem is "POSIX shared memory for the
lazy" in his estimation. The Android developers decided to write kernel
code rather than user-space code, he said. Ashmem uses virtual memory and
can discard memory segments when the system is under memory pressure.
Currently, ashmem lives in the staging tree, but he thinks that Google is
moving to other methods, so it may get deleted from the tree soon.
Pmem is a mechanism to share physical memory. It was used to talk to GPUs.
Newer versions of Android don't use pmem, so it may also go away. Instead,
Android is using the ION memory allocator now.
Binder is "weird", Kroah-Hartman said. It came from BeOS and its
developers were from academia. It was developed and used on systems
without the System V IPC APIs available and, via Palm and Danger, came
to Android. It is "kind of like D-Bus", and some (including him) would
argue that Android should have used D-Bus, but it didn't. It has a large
user-space library that must be used to perform IPC with binder.
Binder has a number of serious security issues when used outside of an
Android environment, he said, so he stressed that it should never be
used by other Linux-based systems.
In Android, binder is used for intents and app
separation; it is good for passing around small messages, not pictures or
streams of data. You can also use it to pass file descriptors to other
processes. It is not particularly efficient, as sending a message
makes lots of hops through the library. A presentation
[YouTube] at this year's Android
Builders Summit showed that one message required eight kernel-to-user-space
transitions.
More IPC
A lot of developers in the automotive world have used QNX, which has a nice
message-passing model. You can send a message and pass control to another
process, which is good for realtime and single processor systems,
Kroah-Hartman said.
Large automotive companies have built huge systems on top of QNX messages,
creating large libraries used by their applications. They would like to be
able to use those libraries on Linux, but often don't know that there is a
way to get the QNX message API for Linux. It is called SIMPL and it works well.
Another solution, though it is not merged into the kernel, is KBUS, which was created by some
students in England. It provides simple message passing through the
kernel, but cannot pass file descriptors. Its implementation involves
multiple data copies, but for 99% of use cases, that's just fine, he said.
Multiple copies are still fast on today's fast processors. The KBUS
developers never asked for it to be merged, as far as he knows, but if
they did, there would be "no
reason not to take it".
D-Bus is a user-space messaging solution with strong typing and process
lifecycle handling. Applications subscribe to messages or message types
they are interested in. They can also create an application bus to listen
for messages sent to them. It is widely used on Linux desktops and
servers, is well-tested, and well-documented too. It uses the operating system
IPC services and can run on Unix-like systems as well as Windows.
The D-Bus developers have always said that it is not optimized for speed.
The original developer, Havoc Pennington, created a list of ideas on how to
speed it up if that was of interest, but speed was not the motivation
behind its development. In the automotive industry, there have been
numerous efforts to speed D-Bus up.
One of those efforts was the AF_BUS address
family, which came about because in-vehicle infotainment (IVI) systems
needed better D-Bus performance. Collabora was sponsored by GENIVI to come
up with a solution and AF_BUS was the result. Instead of the four
system calls required for a D-Bus message, AF_BUS reduced that to
two, which made it "much faster". But that solution was rejected by the
kernel network maintainers.
The systemd project rewrote libdbus in an effort to simplify the
code, but it turned out to significantly increase the performance of D-Bus
as well. In preliminary benchmarks, BMW found
[PPT] that the systemd D-Bus library increased performance by 360%.
That was unexpected, but the rewrite did take some shortcuts and listened
to what Pennington had said about D-Bus performance. Kroah-Hartman's
conclusion is that "if you want a faster D-Bus, rewrite the daemon, don't
mess with the kernel". For example, there is a Go implementation of D-Bus
that is "really fast". The Linux kernel IPC mechanisms are faster than any
other operating system, he said, though it may "fight" with some of the
BSDs for performance supremacy on some IPC types.
kdbus
In the GNOME project, there is a plan for something called "portals" that
will containerize GNOME applications. That would allow running
applications from multiple versions of GNOME at the same time while also
providing application separation so that misbehaving or malicious
applications could not affect others. Eventually, something like Android's
intents will also be part of portals, but the feature is still a long way
out, he said. The portals work provides one of the main motivations behind kdbus.
So there is a need for an enhanced D-Bus that has some additional
features. At a recent GNOME hackfest, Kay Sievers, Lennart Poettering,
Kroah-Hartman,
and some other GNOME developers sat down to discuss a new messaging
scheme, which is what kdbus is. It will support multicast and
single-endpoint messages, without any extra wakeups from the kernel, he said.
There will be no blocking calls to kdbus, unlike binder, which can
sleep; the kdbus API is completely asynchronous.
Instead of
doing the message filtering in user space, kdbus will do it in the kernel
using Bloom
filters, which will allow the kernel to only wake up the destination
process, unlike D-Bus. Bloom filters have been publicized by Google
engineers recently,
and they are an "all math" scheme that uses hashes to make searching very
fast. There are hash collisions, so there is still some searching that
needs to be done, but the vast majority of the non-matches are eliminated
immediately.
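To make the idea concrete, here is a toy Bloom filter over message
names; the filter size, hash functions, and two-hash scheme here are
illustrative choices, not what kdbus actually uses:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS 1024

    /* Caller zero-initializes the structure before use. */
    struct bloom {
        uint8_t bits[BLOOM_BITS / 8];
    };

    /* Two cheap, roughly independent hashes of a message name. */
    static unsigned hash1(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 ^ (unsigned char)*s++;
        return h % BLOOM_BITS;
    }

    static unsigned hash2(const char *s)
    {
        unsigned h = 0;
        while (*s)
            h = h * 131 + (unsigned char)*s++;
        return h % BLOOM_BITS;
    }

    static void bloom_add(struct bloom *b, const char *name)
    {
        b->bits[hash1(name) / 8] |= 1 << (hash1(name) % 8);
        b->bits[hash2(name) / 8] |= 1 << (hash2(name) % 8);
    }

    /* false means "definitely not subscribed"; true means "maybe":
     * a hash collision forces a real check, but most non-matches
     * are rejected without waking anyone up. */
    static bool bloom_may_contain(const struct bloom *b, const char *name)
    {
        return (b->bits[hash1(name) / 8] & (1 << (hash1(name) % 8))) &&
               (b->bits[hash2(name) / 8] & (1 << (hash2(name) % 8)));
    }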
Kdbus ended up with a naming database in the kernel to track the message
types and bus names, which "scared the heck out of me", Kroah-Hartman
said. But it turned out to be "tiny" and worked quite well. In some ways, it
is similar to DNS, he said.
Kdbus will provide reliable order guarantees, so that messages will be
received in the order they were sent. Only the kernel can make that
guarantee, he said, and the current D-Bus does a lot of extra work to try to
ensure the ordering. The guarantee only applies to messages sent from a
single process; the order of "simultaneous" messages from multiple
processes is not guaranteed.
Passing file descriptors over kdbus will be supported. There is also a
one-copy message passing mechanism that Tejun Heo and Sievers came up
with. Heo actually got zero-copy working, but it was "even scarier", so
they decided against using it. Effectively, with one-copy, the kernel
copies the message from
user space directly into the receive buffer for the destination process.
Kdbus might be fast enough to handle data streams as well as messages, but
Kroah-Hartman does not know if that will be implemented.
Because it is in the kernel, kdbus gets a number of attributes almost for
free. It is namespace aware, which was easy to add because the namespace
developers have made it straightforward to do so. It is also integrated
with the audit subsystem,
which is important to the enterprise distributions. For D-Bus,
getting SELinux support was a lot of work, but kdbus is Linux Security
Module (LSM) aware, so it got SELinux (Smack, TOMOYO, AppArmor, ...)
support for free.
Current kdbus status
As a way to test kdbus, the systemd team has replaced D-Bus in systemd with
kdbus. The code is available in the systemd tree, but it is still a work
in progress. The kdbus developers are not even looking at speed yet, but
some rudimentary tests suggest that it is "very fast". Kdbus will require
a recent kernel as it uses control groups (cgroups); it also requires some
patches that were only merged into 3.10-rc kernels.
The plan is to merge kdbus when it is "ready", which he hopes will be
before the end of the year. His goal, though it is not a general project
goal, is to replace Android's binder with kdbus. He has talked to the
binder people at Google and they are amenable to that, as it would allow
them to delete a bunch of code they are currently carrying in their trees.
Kdbus will not "scale to the cloud", Kroah-Hartman said in answer to a
question from the audience, because it only sends
messages on a single system. There are already inter-system messaging
protocols that can be used for that use case. In addition, the network
maintainers placed a restriction on kdbus: don't touch the networking
code. That makes sense because it is an IPC mechanism, and that is where
AF_BUS ran aground.
The automotive industry will be particularly interested because it is used
to using the QNX message passing, which it mapped to libdbus. It chose
D-Bus because it is well-documented, well-understood, and is as easy to use
as QNX. But the industry doesn't just want a faster D-Bus (which could
be achieved by rewriting it); it wants more: namespace support, audit
support, SELinux, application separation, and so on.
Finally, someone asked whether Linus Torvalds was "on board" with kdbus.
Kroah-Hartman said that he didn't know, but that kdbus is self-contained,
so he doesn't think Torvalds will block it. Marcel Holtmann said that
Torvalds was "fine with it" six years ago when another, similar idea had
been proposed. Kroah-Hartman noted that getting it past Al Viro might be
more difficult than getting it past Torvalds, but binder is "hairy code"
and Viro is the one who found the security problems there.
Right now, they are working on getting the system to boot with systemd
using kdbus. There are some tests for kdbus, but booting with systemd will
give them a lot of confidence in the feature. The kernel side of the code
is done, he thinks, but they thought that earlier too, and then Heo came
up with zero- and one-copy. He would be happy if it is merged by the end of the
year, but if it isn't, it shouldn't stretch much past that, and he
encouraged people to start looking at kdbus for their messaging needs in
the future.
[ I would like to thank the Linux Foundation for travel assistance so that I
could attend
the Automotive Linux Summit Spring and LinuxCon Japan. ]
According to Btrfs developer Chris Mason, tuning Linux filesystems to work
well on solid-state storage devices is a lot like working on an old,
clunky car. Lots of work goes into just trying to make the thing run with
decent performance. Old cars may have mainly hardware-related problems,
but, with Linux,
the bottleneck is almost always to be found in the software. It is, he
said, hard to give a customer a high-performance device and expect them to
actually see that performance in their application. Fixing this problem
will require work in a lot of areas. One of those areas, supporting and
using atomic I/O operations, shows particular potential.
The problem
To demonstrate the kind of problem that filesystem developers are grappling
with, Chris started with a plot from a problematic customer workload on an
ext4 filesystem; it showed alternating periods of high and low I/O
throughput rates. The source of the problem, in this case, was a
combination of (1) overwriting an existing file and (2) a
filesystem that had been
mounted in the data=ordered mode. That combination causes data
blocks to be put into a special list that must get flushed to disk every
time that the filesystem commits a transaction. Since the system in
question had a fair amount of memory, the normal asynchronous writeback
mechanism didn't kick in, so dirty blocks were not being written steadily;
instead, they all had to be flushed when the transaction commit happened.
The periods of low throughput corresponded to the transaction commits;
everything just stopped while the filesystem caught up with its pending work.
In general, a filesystem commit operation involves a number of steps, the
first of which is to write all of the relevant file data and wait for that
write to complete. Then the critical metadata can be written to the log;
once again, the filesystem must wait until that write is done. Finally,
the commit block can be written — inevitably followed by a wait. All of
those waits are critical for filesystem integrity, but they can be quite
hard on performance.
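Schematically, the sequence looks like the sketch below; every helper
here is a hypothetical stand-in for filesystem internals, not a real
kernel function:

    void write_data_blocks(void);       /* hypothetical helpers */
    void write_metadata_to_log(void);
    void write_commit_block(void);
    void wait_for_io(void);             /* one device round trip */

    void commit_transaction(void)
    {
        write_data_blocks();
        wait_for_io();          /* wait #1: file data is durable */

        write_metadata_to_log();
        wait_for_io();          /* wait #2: log entries are durable */

        write_commit_block();
        wait_for_io();          /* wait #3: commit record is durable */
    }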
Quite a few workloads — including a lot of database workloads — are
especially sensitive to the latency imposed by waits in the filesystem.
If the number of
waits could be somehow reduced, latency would improve. Fewer waits would
also make it possible to send larger I/O operations to the device, with a
couple of significant benefits: performance would improve, and, since large
chunks are friendlier to a flash-based device's garbage-collection
subsystem, the lifetime of the device would also improve. So reducing the
number of wait operations executed in a filesystem transaction commit is an
important prerequisite for getting the best performance out of contemporary
drives.
Atomic I/O operations
One way to reduce waits is with atomic I/O operations — operations that are
guaranteed (by the hardware) to either succeed or fail as a whole. If the
system performs an atomic write of four blocks to the device, either all
four blocks will be successfully written, or none of them will be. In
many cases, hardware that supports atomic operations can provide the same
integrity guarantees that are provided by waits now, making those waits
unnecessary. The T10 (SCSI) standard committee has approved a simple
specification for atomic operations; it only supports contiguous I/O
operations, so it is "not very exciting." Work is proceeding on vectored
atomic operations that would handle writes to multiple discontiguous areas
on the disk, but that has not yet been finalized.
As an example of how atomic I/O operations can help performance, Chris
looked at the special log used by Btrfs to implement the fsync()
system call. The filesystem will respond to an fsync() by writing
the important data to a new log block. In the current code, each commit
only has to wait twice, thanks to some recent work by Josef Bacik: once
for the write of the data and metadata, and once for the superblock write.
That work brought a big performance boost, but atomic I/O can push things
even further. By using atomic operations to eliminate one more wait,
Chris was able to improve performance by 10-15%;
he said he thought the improvement should be better than that, but even
that level of improvement is the kind of thing database vendors send out
press releases for. Getting a 15% improvement without even trying that
hard, he said, was a nice thing.
At Fusion-io, work has been done to enable atomic I/O operations in the MariaDB
and Percona database management systems. Currently, these operations are
only enabled with the
Fusion-io software development kit and its "DirectFS" filesystem. Atomic
I/O operations allowed the elimination of the InnoDB doublewrite buffer
used by these MySQL-derived systems, resulting in 43% more transactions
per second and
half the wear on the storage device. Both improvements matter: if you have
made a large investment in flash storage, getting twice the life is worth a
lot of money.
Getting there
So it's one thing to hack some atomic operations into a database
application; making atomic I/O operations more generally available is a
larger problem. Chris has developed a set of API changes that will allow
user-space programs to make use of atomic I/O operations, but there are
some significant limitations, starting with the fact that only direct I/O
is supported. With buffered I/O, it just is not possible for the kernel to
track the various pieces through the stack and guarantee atomicity. There
will also need to be some limitations on the maximum size of any given I/O
operation.
An application will request atomic I/O with the new O_ATOMIC flag
to the open() system call. That is all that is required; many
direct I/O applications, Chris said, can benefit from atomic I/O operations
nearly for free. Even at this level, there are benefits. Oracle's
database, he said, pretends it has atomic I/O when it doesn't; the result
can be "fractured blocks" where a system crash interrupts the writing of a
data block that been scattered across a fragmented filesystem, leading to
database corruption. With atomic I/O operations, those fragmented blocks
will be a thing of the past.
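A hypothetical sketch of how an application might use the proposed
interface; O_ATOMIC is not a mainline flag, so the definition below is
a placeholder purely for illustration, and direct I/O imposes the usual
alignment requirements on the buffer:

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #ifndef O_ATOMIC
    #define O_ATOMIC 040000000      /* placeholder value, illustration only */
    #endif

    int main(void)
    {
        void *buf;
        int fd = open("datafile", O_WRONLY | O_DIRECT | O_ATOMIC);

        if (fd < 0)
            return 1;
        if (posix_memalign(&buf, 4096, 16384))  /* aligned for O_DIRECT */
            return 1;
        /* ... fill buf with four 4KB database blocks ... */

        /* With atomic support, either all 16KB reaches the media
         * or none of it does -- no fractured blocks. */
        if (write(fd, buf, 16384) != 16384)
            return 1;
        close(fd);
        free(buf);
        return 0;
    }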
Atomic I/O operation support can be taken a step further, though, by adding
asynchronous I/O (AIO) support. The nice thing about the Linux AIO
interface (which is not generally acknowledged to have many nice aspects)
is that it allows an application to enqueue multiple I/O operations with a
single system call. With atomic support, those multiple operations — which
need not all involve the same file — can all be done as a single atomic
unit. That allows multi-file atomic operations, a feature which can be
used to simplify database transaction engines and improve performance.
Once this functionality is in place, Chris hopes, the (currently small)
number of users of the kernel's direct I/O and AIO capabilities will
increase.
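For a sense of the interface, here is a sketch using libaio that queues
writes to two different files with a single io_submit() call. Note
that the atomic grouping of such a batch is the proposed extension, not
something mainline AIO guarantees today; completion handling via
io_getevents() is omitted:

    #include <stddef.h>
    #include <libaio.h>     /* link with -laio */

    int submit_pair(int fd1, int fd2, void *buf1, void *buf2, size_t len)
    {
        io_context_t ctx = 0;
        struct iocb cb1, cb2;
        struct iocb *list[2] = { &cb1, &cb2 };

        if (io_setup(2, &ctx) < 0)
            return -1;
        io_prep_pwrite(&cb1, fd1, buf1, len, 0);
        io_prep_pwrite(&cb2, fd2, buf2, len, 0);

        /* One system call enqueues both operations; the proposed
         * atomic mode would commit them as a single unit. */
        return io_submit(ctx, 2, list);
    }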
Some block-layer changes will clearly be needed to make this all work, of
course. Low-level drivers will need to advertise the maximum number of
atomic segments any given device will support. The block layer's plugging
infrastructure, which temporarily stops the issuing of I/O requests to
allow them to accumulate and be coalesced, will need to be extended.
Currently, a plugged queue is automatically unplugged when the current
kernel thread schedules out of the processor; there will need to be a means
to require an explicit unplug operation instead. This, Chris noted, was
how plugging used to work, and it caused a lot of problems with lost unplug
operations. Explicit unplugging was removed for a reason; it would have to
be re-added carefully and used "with restraint." Once that feature is
there, the AIO and direct I/O code will need to be modified to hold queue
plugs for the creation of atomic writes.
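For reference, the current in-kernel plugging idiom looks roughly like
this simplified, 3.10-era sketch; today the plug is also flushed
implicitly if the task schedules away, which is the behavior the
proposed explicit-unplug mode would suppress:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/fs.h>

    static void submit_batch(struct bio **bios, int nr)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);
        for (i = 0; i < nr; i++)
            submit_bio(WRITE, bios[i]);  /* held back and coalesced */
        blk_finish_plug(&plug);          /* requests actually issued */
    }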
The hard part, though, is, as usual, the error handling. The filesystem
must stay consistent even if an atomic operation grows too large to
complete. There are a number of tricky cases where this can come about.
There are also challenges with deadlocks while waiting for plugged I/O.
The hardest problem, though, may be related to the simple fact that the
proper functioning of atomic I/O operations will only be tested when things
go wrong — a system crash, for example. It is hard to know that
rarely-tested code works well. So there needs to be a comprehensive test
suite that can verify that the hardware's atomic I/O operations are working
properly. Otherwise, it will be hard to have full confidence in the
integrity guarantees provided by atomic operations.
Status and future work
The Fusion-io driver has had atomic I/O operation support for some time,
but Chris would like to make this support widely available so that
developers can count on its presence.
Extending it to NVM Express is in progress now; SCSI support will probably
wait until the specification for vectored atomic operations is complete.
Btrfs can use (vectored) atomic I/O operations in its transaction commits;
work on other filesystems is progressing. The changes to the plugging code
are done with the small exception of the deadlock handler; that gap needs
to be filled before the patches can go out, Chris said.
From here, it will be necessary to finalize the proposed changes to the
kernel API and submit them for review. The review process itself could
take some time; the AIO and direct I/O interfaces tend to be contentious,
with lots of developers arguing about them but few being willing to
actually work on that code. So a few iterations on the API can be expected
there. The FIO benchmark
needs to be extended to test atomic I/O. Then
there is the large task of enabling atomic I/O operations in
applications.
For the foreseeable future, a number of limitations will apply to atomic
I/O operations. The most annoying is likely to be the small maximum I/O
size: 64KB for the time being. Someday, hopefully, that maximum will be
increased significantly, but for now it applies. The complexity of the AIO
and direct I/O code will challenge filesystem implementations; the code is
far more complex than one might expect, and each filesystem interfaces with
that code in a slightly different way. There are worries about performance
variations between vendors; Fusion-io devices can implement atomic I/O
operations at very low cost, but that may not be the case for all hardware.
Atomic I/O operations also cannot work across multiple devices; that means
that the kernel's RAID implementations will require work to be able to
support atomic I/O. This work, Chris said, will not be in the initial
patches.
There are alternatives to atomic I/O operations, including explicit I/O
ordering (used in the kernel for years) or I/O checksumming (to detect
incomplete I/O operations after a crash). But, for many situations, atomic
I/O operations look like a good way to let the hardware help the software
get better performance from solid-state drives. Once this functionality
finds its way into the mainline, taking advantage of fast drives might just
feel a bit less like coaxing another trip out of that old jalopy.
[Your editor would like to thank the Linux Foundation for supporting his
travel to LinuxCon Japan.]
May 30, 2013
This article was contributed by Andrew Shewmaker
The Linux kernel manages sparse sets of index ranges for various subsystems. For
instance, the I/O memory management unit (IOMMU) keeps track of the
addresses of outstanding device memory mappings for each
PCIe device's domain and tracks holes for new allocations. File systems cache
pending operations, extent state, free extents, and more. A simple linked
list fails to provide good performance for search, insertion, and
deletion operations as it grows, so more advanced abstract data types like
red-black trees (rbtrees) were implemented. The
kernel's rbtrees generally provide good performance as long as they are only
being accessed by one reader/writer, but they suffer when multiple readers and
writers are contending for locks.
There are variants of rbtrees that allow efficient
read-copy-update (RCU) based reads, but not fine-grained
concurrent writes. So, Chris Mason of Btrfs fame is developing a
skiplist implementation that will allow file systems like Btrfs
and XFS to perform many concurrent updates to their extent indexes.
This article will describe the basic skiplist and what makes this skiplist variant
cache-friendly. In part two, I'll describe the skiplist API and compare the
performance of skiplists to rbtrees.
Basic skiplist
William Pugh first described skiplists in 1990 as a probabilistic alternative
to balanced trees (such as rbtrees) that is simpler, faster, and more
space-efficient. A skiplist is composed of a hierarchy of ordered linked lists, where
each higher level contains a sparser subset of the list below it. The size of
the skiplist decreases at a constant rate for each higher level. For instance,
if level zero has 32 elements and a probability p=0.5 that an element
appears in the next higher level, then level one has 16 elements, level two has eight,
etc. The subset selected for the next higher level is random, but the quantity
of items in the subset isn't random.
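The per-element coin flipping is simple to express in code. Here is a
minimal user-space sketch with p=0.5; a kernel implementation would use
something like prandom_u32() rather than rand():

    #include <stdlib.h>

    #define MAX_LEVEL 32

    /* Flip a coin per level: an element appears in level n+1 with
     * half the probability that it appears in level n. */
    static int random_level(void)
    {
        int level = 0;

        while (level < MAX_LEVEL - 1 && (rand() & 1))
            level++;
        return level;
    }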
By way of analogy, consider an old-fashioned printed dictionary with multiple
search aids built into it. This diagram shows how the different aids are similar
to levels of a skiplist:
At the highest level there are grooves on the edge
of the book that mark sections for each letter, A-Z. The search can continue by
looking at word prefixes centered at the top of each page. Next, guide
words at the top left and right of
each page show the span of words for each page. At the lowest level, a page is
scanned word by word.
A skiplist replicates something like this index structure in a digital form.
History of kernel skiplists
Josh MacDonald investigated
cache-friendly skiplists
on Linux 2.4 more than a decade ago. His skiplists are more space-efficient
than rbtrees beyond 100 keys, their searches are faster beyond 1,000 keys,
and their insertions and deletions are faster beyond 10,000 keys. He also
performed various concurrency
experiments. When Josh posted an
RFC to
the kernel mailing list he presented them as a "solution waiting for
the right problem to come along". At the time, no discussion ensued, but
Chris later became aware of Josh's work during his research.
Unfortunately, Josh's implementation focused on reader/writer locks, which only
allow one writer at a time, and Chris told me that he wanted to use RCU.
Also, Josh's coding style differs from the kernel's and his macros make his code
a little difficult to work with.
Other well-known Linux kernel developers have experimented with skiplists
as well. Con
Kolivas initially thought they would be useful for his
BFS CPU scheduler. Ultimately,
BFS
experienced worse behavior in general when using skiplists,
so Con returned to his previous list structure.
This was due in part to the BFS scheduler's common case of having few tasks to
schedule, so it didn't benefit from theoretically faster lookups for larger
numbers of items in the list. Also, the scheduler itself wasn't parallel,
so it didn't benefit from a skiplist's more easily exploited concurrency.
Andi Kleen recalled the discouraging results from his own
skiplist
investigation to Con. He found that the variable number of pointers needed
to link different levels of a skiplist made it difficult to achieve efficient
use of memory and cache without limiting the size of the skiplist.
In 2011, Chris asked Liu Bo to try replacing Btrfs's rbtree-based
extent_map code, which maps logical file offsets to physical disk
offsets, with a skiplist-based implementation. The mappings are
read-mostly, until a random write workload triggers
a copy-on-write operation. Liu created an
initial skiplist
patch for Btrfs beginning with Con's implementation and adding support for
concurrency with RCU. Results were mixed, and Chris's user-space experimentation
led him to start work to make Liu's skiplists more cache-friendly.
You may be saying to yourself, "I thought the whole point of Btrfs was to use
B-trees for everything." Btrfs does use its copy-on-write B-trees for any
structure read from or written to the disk. However, it also uses rbtrees for
in-memory caches. Some rbtrees batch pending inode and extent operations.
Other caches are: the extent state cache — tracking whether extents are
locked, damaged, dirty, etc.; the free space cache — remembering free
extents for quicker allocations; and the previously mentioned extent_map.
In addition to the rbtrees, a radix tree manages the extent buffer cache,
which is used like the page cache, but for blocks of metadata that might be larger
than a page.
All of these data structures have multiple threads contending for access to them
and might benefit from skiplists, though the delayed operation trees and free
space cache have the most contention. However, Chris's real target for this
skiplist is some pending RAID5/6 parity logging code. It needs to enable
"fast concurrent reads and writes into an exception store with new
locations for some extents." Ideally, Chris hopes to make his skiplist
general purpose enough for others to use. If skiplists can provide lockless
lookups and be used for both buffer cache indexes and extent trees, then
Dave Chinner would consider using them in XFS.
Cache-friendly skiplist
When reading the following description of Chris Mason's skiplist implementation, keep in mind
that it is a skiplist for range indexes, or extents. The extents being managed
are not identified by a single number. Each has an index/key pointing to its
beginning and a range/size. Furthermore, each element of the skiplist is
composed of multiple extents, which will be referred to as slots.
This new implementation of a cache-friendly skiplist is a bit more
complicated than the picture of the basic skiplist may suggest; it is best
examined in pieces. The first of those pieces is described by this
diagram (a subset of the full data structure
diagram):
This figure shows a skiplist anchored by an sl_list structure
pointing to the initial entry (represented by struct sl_node) in
the list. That sl_node structure has an array of pointers (called
ptrs), indexed by level, to the head sl_node_ptr
structures of each skiplist level. The sl_node_ptr structure
functions like a typical kernel list_head structure, except that
each level's head sl_node_ptr->prev is used, possibly
confusingly, to point to the item with the greatest key at that level of
the skiplist. All locking is done at the sl_node level and will be
described in part two of this article.
The skiplist grows from the head sl_node_ptr array on the right of
the diagram above
into an array of linked lists as shown below:
Note that none of the structures listed so far contain any keys or
data, unlike traditional skiplists. That's because Chris's skiplist items
are associated with more than one key.
Another difference is that previous skiplist implementations lack prev
pointers. Pugh's original skiplist didn't need something like prev
pointers because it didn't support concurrency. MacDonald's skiplist does
support concurrency, but its protocol uses context structures that combine
parent/child pointer information with lock states. Mason's skiplists use a
different concurrency protocol that uses prev pointers.
A superficial difference between this skiplist and others is its apparent lack
of down pointers (allowing movement from one level of index to the
next), although they exist as the ptrs[] array
in sl_node. While the pointer array name is unimportant, its position
is, because different level nodes are created with different sizes of arrays.
Note the ptrs array at the end of sl_node and the sl_node
at the end of struct sl_leaf, which holds the actual leaf data.
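A simplified sketch of those structures, reconstructed from the
descriptions in this article rather than copied from Mason's patch
(field names and details may well differ):

    struct sl_node_ptr {                /* like list_head, per level */
        struct sl_node *prev;
        struct sl_node *next;
    };

    struct sl_node {
        /* ... per-node locking state lives here ... */
        int level;
        struct sl_node_ptr ptrs[];      /* flexible array: one entry
                                           per level of this node */
    };

    struct sl_slot {                    /* one extent */
        unsigned long key;              /* start of the range */
        unsigned long size;             /* length of the range */
    };

    #define SKIP_KEYS_PER_NODE 32

    struct sl_leaf {
        int nr;                                   /* slots in use */
        unsigned long keys[SKIP_KEYS_PER_NODE];   /* cached copies */
        struct sl_slot *ptrs[SKIP_KEYS_PER_NODE]; /* the real slots */
        /* Must be last: node's flexible ptrs[] array extends past
         * the leaf (embedding it this way is a GCC extension the
         * kernel relies on). */
        struct sl_node node;
    };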
The variable size of nodes could
cause memory fragmentation if they were allocated from the same pool of memory,
so Chris designed his skiplist with one slab cache for each level. This
effectively addresses Andi Kleen's concerns mentioned earlier regarding wasting
memory while keeping good performance.
The way in which keys are handled is cache-friendly in a couple of ways. First,
keys are not associated directly with individual items in this skiplist, but each
leaf manages a set of keys. SKIP_KEYS_PER_NODE is currently set to 32
as a balance between increasing cache hits and providing more chances for
concurrency. Second, an sl_leaf contains an array of keys.
Technically a keys array is unnecessary, since the keys are part of
the slots linked to a leaf. However, a search only needs the keys and
not the rest of the sl_slot structures until the end of the search. So,
the array helps skiplist operations avoid thrashing the cache.
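Building on the simplified structures sketched above, a leaf-level
search might look like this hypothetical helper, which scans the packed
key array and touches an sl_slot only for the final range check:

    /* Hypothetical: find the slot in this leaf whose extent
     * contains 'key', or return NULL. keys[] is sorted. */
    static struct sl_slot *leaf_lookup(struct sl_leaf *leaf,
                                       unsigned long key)
    {
        int i;

        /* The packed key array stays hot in the cache. */
        for (i = 0; i < leaf->nr; i++) {
            if (key < leaf->keys[i])
                break;
        }
        if (i == 0)
            return NULL;    /* below every extent in this leaf */

        /* Only now dereference the slot, for the range check. */
        if (key < leaf->keys[i - 1] + leaf->ptrs[i - 1]->size)
            return leaf->ptrs[i - 1];
        return NULL;
    }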
Each sl_slot is an extent defined by its key and size
(index range). A developer can adapt the skiplist for their own use by embedding
sl_slot into their own data structure. The job of reference counting
belongs to the container of sl_slot.
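Embedding works in the usual kernel style: a hypothetical user wraps
sl_slot in its own structure, recovers it with container_of(), and
handles the reference counting itself:

    #include <linux/kernel.h>   /* container_of() */
    #include <linux/kref.h>

    struct my_extent {                  /* hypothetical container */
        struct kref refs;               /* refcounting is our job */
        struct sl_slot slot;            /* embedded: key and size */
        /* ... caller-specific data ... */
    };

    static struct my_extent *extent_from_slot(struct sl_slot *slot)
    {
        return container_of(slot, struct my_extent, slot);
    }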
See the full diagram for a picture of how
all the pieces described above fit together.
With 32 keys per leaf, a maximum level of 32, and half the
number of items in each higher level, Chris's skiplist should be able to
efficiently manage around 137 billion extents (index ranges): 32 keys in
each of up to 2^32 leaves works out to 32 × 2^32 ≈ 137 billion.
The statistically balanced nature of the skiplist allows it to handle sparse
sets of indexes in such a way that modifications avoid the need for an rbtree's
rebalancing operations, and so they lend themselves to simpler concurrency
protocols. Chris also designed his skiplist to be cache-friendly by
having each element represent a set of keys/slots, keeping duplicates of
the slot's keys in an array, and creating a slab cache for each level of
the skiplist.
Part two of this series will offer a description of the skiplist API and a
discussion of the performance of skiplists vs. rbtrees.
Page editor: Jonathan Corbet