Kernel development

Brief items

Kernel release status

The current development kernel is 3.10-rc3, released by Linus Torvalds on May 26 with some grumbles about how it contains more changes than he would have liked. "I can pretty much guarantee that -rc4 is going to be smaller, because (a) I'm going to be grumpy if people try to push as much to rc4 as happened to rc3, and (b) I'm going to be traveling for most of next week (and part of the week after). I'll have internet, but I really really hope and expect that things should be calmer coming up. Right? RIGHT GUYS?"

Stable updates: 3.9.4, 3.4.47, and 3.0.80 were released on May 24.

Sharp: Linux Kernel Internships (OPW) Update

Sarah Sharp reports on the response to the availability of a set of Outreach Program for Women internships working on the Linux kernel. "As coordinator for the Linux kernel OPW project, I was really worried about whether applicants would be able to get patches into the kernel. Everyone knows that kernel maintainers are the pickiest bastards^Wperfectionists about coding style, getting the proper Signed-off-by, sending plain text email, etc. I thought a couple applicants would be able to complete maybe one or two patches, tops. Boy was I wrong!" In the end, 41 people applied, and 18 of them submitted a total of 374 patches to the kernel, 137 of which were accepted.

Kernel development news

ALS: Linux interprocess communication and kdbus

By Jake Edge
May 30, 2013

As part of the developer track at this year's Automotive Linux Summit Spring, Greg Kroah-Hartman talked about interprocess communication (IPC) in the kernel with an eye toward the motivations behind kdbus. The work on kdbus is progressing well and Kroah-Hartman expressed optimism that it would be merged before the end of the year. Beyond just providing a faster D-Bus (which could be accomplished without moving it into the kernel, he said), it is his hope that kdbus can eventually replace Android's binder IPC mechanism.

Survey of IPC

There are many different ways to communicate between processes in Linux (and, for many of the mechanisms, in Unix more widely). Kroah-Hartman strongly recommended Michael Kerrisk's book, The Linux Programming Interface, as a reference to these IPC mechanisms (and most other things in the Linux API). Several of his slides [PDF] were taken directly from the book. All of the different IPC mechanisms fall into one of three categories, he said: signals, synchronization, or communication. He used diagrams from Kerrisk's book (page 878) to show the categories and their members.

There are two types of signals in the kernel, standard and realtime, though the latter doesn't see much use, he said.

Synchronization methods are numerous, including futexes and eventfd(), which are both relatively new. Semaphores are also available, both as the "old style" System V semaphores and as "fixed up" by POSIX. The latter come in both named and unnamed varieties. There is also file locking, which has two flavors: record locks to lock a portion of a file and file locks to prevent access to the whole file. However, the code that implements file locking is "scary", he said. Threads have four separate types of synchronization methods (mutex, condition variables, barriers, and read/write locks) available as well.
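
By way of illustration (this example is not from the talk), a minimal use of eventfd() to let a child process wake its parent might look like the following; error handling is omitted for brevity:

    #include <sys/eventfd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);            /* counter starts at zero */
        uint64_t val = 1;

        if (fork() == 0) {
            write(efd, &val, sizeof(val));  /* child: add one to the counter */
            _exit(0);
        }
        read(efd, &val, sizeof(val));       /* parent: blocks until the counter is nonzero */
        printf("parent woke with count %llu\n", (unsigned long long)val);
        return 0;
    }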

For communication, there are many different kernel services available too. For data transfer, one can use pseudo-terminals. For byte-stream-oriented data, there are pipes, FIFOs, and stream sockets. For communicating via messages, there are both POSIX and System V flavored message queues. Lastly, there is shared memory which also comes in POSIX and System V varieties along with mmap() for anonymous and file mappings. Anonymous mappings with mmap() were not something Kroah-Hartman knew about until recently; they ended up using them in kdbus.
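
An anonymous shared mapping is simple to set up; the following sketch (again ours, not the speaker's) shows a parent and child sharing a page created with mmap():

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (fork() == 0) {
            *counter = 42;                      /* child writes into the shared page */
            _exit(0);
        }
        wait(NULL);
        printf("parent sees %d\n", *counter);   /* prints 42 */
        return 0;
    }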

Android IPC

"That is everything we have today, except for Android", Kroah-Hartman said. All of the existing IPC mechanisms were "not enough for Android", so that project added ashmem, pmem, and binder. Ashmem is "POSIX shared memory for the lazy" in his estimation. The Android developers decided to write kernel code rather than user-space code, he said. Ashmem uses virtual memory and can discard memory segments when the system is under memory pressure. Currently, ashmem lives in the staging tree, but he thinks that Google is moving to other methods, so it may get deleted from the tree soon.

Pmem is a mechanism to share physical memory. It was used to talk to GPUs. Newer versions of Android don't use pmem, so it may also go away. Instead, Android is using the ION memory allocator now.

Binder is "weird", Kroah-Hartman said. It came from BeOS and its developers were from academia. It was developed and used on systems without the System V IPC APIs available and, via Palm and Danger, came to Android. It is "kind of like D-Bus", and some (including him) would argue that Android should have used D-Bus, but it didn't. It has a large user-space library that must be used to perform IPC with binder.

Binder has a number of serious security issues when used outside of an Android environment, he said, so he stressed that it should never be used by other Linux-based systems.

In Android, binder is used for intents and app separation; it is good for passing around small messages, not pictures or streams of data. You can also use it to pass file descriptors to other processes. It is not particularly efficient, as sending a message makes lots of hops through the library. A presentation [YouTube] at this year's Android Builders Summit showed that one message required eight kernel-to-user-space transitions.

More IPC

A lot of developers in the automotive world have used QNX, which has a nice message-passing model. You can send a message and pass control to another process, which is good for realtime and single processor systems, Kroah-Hartman said. Large automotive companies have built huge systems on top of QNX messages, creating large libraries used by their applications. They would like to be able to use those libraries on Linux, but often don't know that there is a way to get the QNX message API for Linux. It is called SIMPL and it works well.

Another solution, though it is not merged into the kernel, is KBUS, which was created by some students in England. It provides simple message passing through the kernel, but cannot pass file descriptors. Its implementation involves multiple data copies, but for 99% of use cases, that's just fine, he said. Multiple copies are still fast on today's fast processors. The KBUS developers never asked for it to be merged, as far as he knows, but if they did, there is "no reason not to take it".

D-Bus is a user-space messaging solution with strong typing and process lifecycle handling. Applications subscribe to messages or message types they are interested in. They can also create an application bus to listen for messages sent to them. It is widely used on Linux desktops and servers, is well-tested, and well-documented too. It uses the operating system IPC services and can run on Unix-like systems as well as Windows.

The D-Bus developers have always said that it is not optimized for speed. The original developer, Havoc Pennington, created a list of ideas on how to speed it up if that was of interest, but speed was not the motivation behind its development. In the automotive industry, there have been numerous efforts to speed D-Bus up.

One of those efforts was the AF_BUS address family, which came about because in-vehicle infotainment (IVI) systems needed better D-Bus performance. Collabora was sponsored by GENIVI to come up with a solution and AF_BUS was the result. Instead of the four system calls required for a D-Bus message, AF_BUS reduced that to two, which made it "much faster". But that solution was rejected by the kernel network maintainers.

The systemd project rewrote libdbus in an effort to simplify the code, but it turned out to significantly increase the performance of D-Bus as well. In preliminary benchmarks, BMW found [PPT] that the systemd D-Bus library increased performance by 360%. That was unexpected, but the rewrite did take some shortcuts and paid attention to what Pennington had said about D-Bus performance. Kroah-Hartman's conclusion is that "if you want a faster D-Bus, rewrite the daemon, don't mess with the kernel". For example, there is a Go implementation of D-Bus that is "really fast". The Linux kernel's IPC mechanisms are faster than those of any other operating system, he said, though Linux may "fight" with some of the BSDs for performance supremacy on some IPC types.

kdbus

In the GNOME project, there is a plan for something called "portals" that will containerize GNOME applications. That would allow running applications from multiple versions of GNOME at the same time while also providing application separation so that misbehaving or malicious applications could not affect others. Eventually, something like Android's intents will also be part of portals, but the feature is still a long way out, he said. Portals provides one of the main motivations behind kdbus.

So there is a need for an enhanced D-Bus that has some additional features. At a recent GNOME hackfest, Kay Sievers, Lennart Poettering, Kroah-Hartman, and some other GNOME developers sat down to discuss a new messaging scheme, which is what kdbus is. It will support multicast and single-endpoint messages, without any extra wakeups from the kernel, he said. There will be no blocking calls to kdbus, unlike binder which can sleep, as the API for kdbus is completely asynchronous.

Instead of doing the message filtering in user space, kdbus will do it in the kernel using Bloom filters, which will allow the kernel to only wake up the destination process, unlike D-Bus. Bloom filters have been publicized by Google engineers recently, and they are an "all math" scheme that uses hashes to make searching very fast. There are hash collisions, so there is still some searching that needs to be done, but the vast majority of the non-matches are eliminated immediately.
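
A Bloom filter is easy to sketch in a few lines of C; the toy version below (purely illustrative, not the kdbus code) sets and tests a handful of hashed bits per name, so a negative answer is definitive while a positive answer must still be verified:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS 1024
    static uint8_t bloom[BLOOM_BITS / 8];

    static uint32_t name_hash(const char *s, uint32_t seed)
    {
        uint32_t h = seed;
        while (*s)
            h = h * 31 + (uint8_t)*s++;
        return h % BLOOM_BITS;
    }

    static void bloom_add(const char *name)
    {
        uint32_t seed;
        for (seed = 1; seed <= 3; seed++) {         /* set three bits per name */
            uint32_t bit = name_hash(name, seed);
            bloom[bit / 8] |= 1 << (bit % 8);
        }
    }

    static bool bloom_maybe_contains(const char *name)
    {
        uint32_t seed;
        for (seed = 1; seed <= 3; seed++) {
            uint32_t bit = name_hash(name, seed);
            if (!(bloom[bit / 8] & (1 << (bit % 8))))
                return false;   /* definitely not a match: nobody gets woken up */
        }
        return true;            /* possibly a match; a real lookup is still needed */
    }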

Kdbus ended up with a naming database in the kernel to track the message types and bus names, which "scared the heck out of me", Kroah-Hartman said. But it turned out to be "tiny" and worked quite well. In some ways, it is similar to DNS, he said.

Kdbus will provide reliable ordering guarantees, so that messages will be received in the order they were sent. Only the kernel can make that guarantee, he said, and the current D-Bus does a lot of extra work to try to ensure the ordering. The guarantee only applies to messages sent from a single process; the order of "simultaneous" messages from multiple processes is not guaranteed.

Passing file descriptors over kdbus will be supported. There is also a one-copy message passing mechanism that Tejun Heo and Sievers came up with. Heo actually got zero-copy working, but it was "even scarier", so they decided against using it. Effectively, with one-copy, the kernel copies the message from user space directly into the receive buffer for the destination process. Kdbus might be fast enough to handle data streams as well as messages, but Kroah-Hartman does not know if that will be implemented.

Because it is in the kernel, kdbus gets a number of attributes almost for free. It is namespace aware, which was easy to add because the namespace developers have made it straightforward to do so. It is also integrated with the audit subsystem, which is important to the enterprise distributions. For D-Bus, getting SELinux support was a lot of work, but kdbus is Linux Security Module (LSM) aware, so it got SELinux (Smack, TOMOYO, AppArmor, ...) support for free.

Current kdbus status

As a way to test kdbus, the systemd team has replaced D-Bus in systemd with kdbus. The code is available in the systemd tree, but it is still a work in progress. The kdbus developers are not even looking at speed yet, but some rudimentary tests suggest that it is "very fast". Kdbus will require a recent kernel as it uses control groups (cgroups); it also requires some patches that were only merged into 3.10-rc kernels.

The plan is to merge kdbus when it is "ready", which he hopes will be before the end of the year. His goal, though it is not a general project goal, is to replace Android's binder with kdbus. He has talked to the binder people at Google and they are amenable to that, as it would allow them to delete a bunch of code they are currently carrying in their trees.

Kdbus will not "scale to the cloud", Kroah-Hartman said in answer to a question from the audience, because it only sends messages on a single system. There are already inter-system messaging protocols that can be used for that use case. In addition, the network maintainers placed a restriction on kdbus: don't touch the networking code. That makes sense because it is an IPC mechanism, and that is where AF_BUS ran aground.

The automotive industry will be particularly interested because it is accustomed to QNX message passing, which it has mapped onto libdbus. It chose D-Bus because it is well-documented, well-understood, and as easy to use as QNX. But it doesn't just want a faster D-Bus (which could be achieved by rewriting it); it wants more: namespace support, audit support, SELinux, application separation, and so on.

Finally, someone asked whether Linus Torvalds was "on board" with kdbus. Kroah-Hartman said that he didn't know, but that kdbus is self-contained, so he doesn't think Torvalds will block it. Marcel Holtmann said that Torvalds was "fine with it" six years ago when another, similar idea had been proposed. Kroah-Hartman noted that getting it past Al Viro might be more difficult than getting it past Torvalds, but binder is "hairy code" and Viro is the one who found the security problems there.

Right now, they are working on getting the system to boot with systemd using kdbus. There are some tests for kdbus, but booting with systemd will give them a lot of confidence in the feature. The kernel side of the code is done, he thinks, but they thought that earlier and then Heo came up with zero and one-copy. He would be happy if it is merged by the end of the year, but if it isn't, it shouldn't stretch much past that, and he encouraged people to start looking at kdbus for their messaging needs in the future.

[ I would like to thank the Linux Foundation for travel assistance so that I could attend the Automotive Linux Summit Spring and LinuxCon Japan. ]

Atomic I/O operations

By Jonathan Corbet
May 30, 2013
LinuxCon Japan 2013
According to Btrfs developer Chris Mason, tuning Linux filesystems to work well on solid-state storage devices is a lot like working on an old, clunky car. Lots of work goes into just trying to make the thing run with decent performance. Old cars may have mainly hardware-related problems, but, with Linux, the bottleneck is almost always to be found in the software. It is, he said, hard to give a customer a high-performance device and expect them to actually see that performance in their application. Fixing this problem will require work in a lot of areas. One of those areas, supporting and using atomic I/O operations, shows particular potential.

The problem

To demonstrate the kind of problem that filesystem developers are grappling with, Chris started with a plot from a problematic customer workload on an ext4 filesystem; it showed alternating periods of high and low I/O throughput rates. The source of the problem, in this case, was a combination of (1) overwriting an existing file and (2) a filesystem that had been mounted in the data=ordered mode. That combination causes data blocks to be put into a special list that must get flushed to disk every time that the filesystem commits a transaction. Since the system in question had a fair amount of memory, the normal asynchronous writeback mechanism didn't kick in, so dirty blocks were not being written steadily; instead, they all had to be flushed when the transaction commit happened. The periods of low throughput corresponded to the transaction commits; everything just stopped while the filesystem caught up with its pending work.

In general, a filesystem commit operation involves a number of steps, the first of which is to write all of the relevant file data and wait for that write to complete. Then the critical metadata can be written to the log; once again, the filesystem must wait until that write is done. Finally, the commit block can be written — inevitably followed by a wait. All of those waits are critical for filesystem integrity, but they can be quite hard on performance.
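
The pattern is easy to mimic in user space; the sketch below is a rough analogue (not any filesystem's actual commit code, and the file descriptors are hypothetical) in which each fsync() plays the role of one of those waits:

    #include <unistd.h>

    static void toy_commit(int data_fd, int log_fd, const char *data, size_t len)
    {
        static const char commit_block[512] = "COMMIT";

        write(data_fd, data, len);
        fsync(data_fd);             /* wait #1: file data on stable storage */

        write(log_fd, data, len);   /* stand-in for the metadata log entry */
        fsync(log_fd);              /* wait #2: the log write is done */

        write(log_fd, commit_block, sizeof(commit_block));
        fsync(log_fd);              /* wait #3: the commit block is down */
    }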

Quite a few workloads — including a lot of database workloads — are especially sensitive to the latency imposed by waits in the filesystem. If the number of waits could be somehow reduced, latency would improve. Fewer waits would also make it possible to send larger I/O operations to the device, with a couple of significant benefits: performance would improve, and, since large chunks are friendlier to a flash-based device's garbage-collection subsystem, the lifetime of the device would also improve. So reducing the number of wait operations executed in a filesystem transaction commit is an important prerequisite for getting the best performance out of contemporary drives.

Atomic I/O operations

One way to reduce waits is with atomic I/O operations — operations that are guaranteed (by the hardware) to either succeed or fail as a whole. If the system performs an atomic write of four blocks to the device, either all four blocks will be successfully written, or none of them will be. In many cases, hardware that supports atomic operations can provide the same integrity guarantees that are provided by waits now, making those waits unnecessary. The T10 (SCSI) standard committee has approved a simple specification for atomic operations; it only supports contiguous I/O operations, so it is "not very exciting." Work is proceeding on vectored atomic operations that would handle writes to multiple discontiguous areas on the disk, but that has not yet been finalized.

As an example of how atomic I/O operations can help performance, Chris looked at the special log used by Btrfs to implement the fsync() system call. The filesystem will respond to an fsync() by writing the important data to a new log block. In the current code, each commit only has to wait twice, thanks to some recent work by Josef Bacik: once for the write of the data and metadata, and once for the superblock write. That work brought a big performance boost, but atomic I/O can push things even further. By using atomic operations to eliminate one more wait, Chris was able to improve performance by 10-15%; he said he thought the improvement should be better than that, but even that level of improvement is the kind of thing database vendors send out press releases for. Getting a 15% improvement without even trying that hard, he said, was a nice thing.

At Fusion-io, work has been done to enable atomic I/O operations in the MariaDB and Percona database management systems. Currently, these operations are only enabled with the Fusion-io software development kit and its "DirectFS" filesystem. Atomic I/O operations allowed the elimination of the MySQL-derived double-buffering mode, resulting in 43% more transactions per second and half the wear on the storage device. Both improvements matter: if you have made a large investment in flash storage, getting twice the life is worth a lot of money.

Getting there

So it's one thing to hack some atomic operations into a database application; making atomic I/O operations more generally available is a larger problem. Chris has developed a set of API changes that will allow user-space programs to make use of atomic I/O operations, but there are some significant limitations, starting with the fact that only direct I/O is supported. With buffered I/O, it just is not possible for the kernel to track the various pieces through the stack and guarantee atomicity. There will also need to be some limitations on the maximum size of any given I/O operation.

An application will request atomic I/O with the new O_ATOMIC flag to the open() system call. That is all that is required; many direct I/O applications, Chris said, can benefit from atomic I/O operations nearly for free. Even at this level, there are benefits. Oracle's database, he said, pretends it has atomic I/O when it doesn't; the result can be "fractured blocks" where a system crash interrupts the writing of a data block that has been scattered across a fragmented filesystem, leading to database corruption. With atomic I/O operations, those fractured blocks will be a thing of the past.
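
A minimal sketch of what that might look like follows. Note that O_ATOMIC is the proposed flag from Chris's patches, not something found in mainline kernel headers, so the placeholder value below exists only so the sketch compiles; direct I/O also requires a suitably aligned buffer:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef O_ATOMIC
    #define O_ATOMIC 040000000      /* placeholder value, for illustration only */
    #endif

    int main(void)
    {
        void *buf;
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT | O_ATOMIC, 0644);

        posix_memalign(&buf, 4096, 16384);  /* 4KB-aligned, 16KB buffer */
        memset(buf, 0xaa, 16384);

        /* With atomic I/O, either all 16KB reaches the device or none of it does. */
        write(fd, buf, 16384);

        close(fd);
        free(buf);
        return 0;
    }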

Atomic I/O operation support can be taken a step further, though, by adding asynchronous I/O (AIO) support. The nice thing about the Linux AIO interface (which is not generally acknowledged to have many nice aspects) is that it allows an application to enqueue multiple I/O operations with a single system call. With atomic support, those multiple operations — which need not all involve the same file — can all be done as a single atomic unit. That allows multi-file atomic operations, a feature which can be used to simplify database transaction engines and improve performance. Once this functionality is in place, Chris hopes, the (currently small) number of users of the kernel's direct I/O and AIO capabilities will increase.
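
With libaio, queuing several operations in one system call looks roughly like the sketch below; the all-or-nothing treatment of the batch is part of the proposed extension, not something current kernels provide, and the helper function shown is purely illustrative:

    #include <libaio.h>
    #include <stddef.h>

    /* Queue one write to each of two files with a single io_submit() call.
     * The caller is assumed to have opened both fds with O_DIRECT (and, in
     * the proposed API, O_ATOMIC) and to pass suitably aligned buffers. */
    static int submit_pair(int fd1, int fd2, void *buf1, void *buf2, size_t len)
    {
        io_context_t ctx = 0;
        struct iocb cb1, cb2;
        struct iocb *cbs[2] = { &cb1, &cb2 };

        if (io_setup(2, &ctx) < 0)
            return -1;

        io_prep_pwrite(&cb1, fd1, buf1, len, 0);
        io_prep_pwrite(&cb2, fd2, buf2, len, 0);

        /* One system call enqueues both operations. */
        return io_submit(ctx, 2, cbs);
    }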

Some block-layer changes will clearly be needed to make this all work, of course. Low-level drivers will need to advertise the maximum number of atomic segments any given device will support. The block layer's plugging infrastructure, which temporarily stops the issuing of I/O requests to allow them to accumulate and be coalesced, will need to be extended. Currently, a plugged queue is automatically unplugged when the current kernel thread schedules out of the processor; there will need to be a means to require an explicit unplug operation instead. This, Chris noted, was how plugging used to work, and it caused a lot of problems with lost unplug operations. Explicit unplugging was removed for a reason; it would have to be re-added carefully and used "with restraint." Once that feature is there, the AIO and direct I/O code will need to be modified to hold queue plugs for the creation of atomic writes.
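
For reference, the current in-kernel plugging pattern looks something like the sketch below (illustrative only); the explicit-unplug variant Chris described would let the plug stay in place across a schedule rather than being released implicitly:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/fs.h>

    /* Requests submitted between blk_start_plug() and blk_finish_plug() can
     * accumulate and be merged before they are sent to the device. */
    static void issue_batch(struct bio **bios, int nr)
    {
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);
        for (i = 0; i < nr; i++)
            submit_bio(WRITE, bios[i]);     /* queued while the plug is held */
        blk_finish_plug(&plug);             /* unplug: the whole batch goes out */
    }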

The hard part, though, is, as usual, the error handling. The filesystem must stay consistent even if an atomic operation grows too large to complete. There are a number of tricky cases where this can come about. There are also challenges with deadlocks while waiting for plugged I/O. The hardest problem, though, may be related to the simple fact that the proper functioning of atomic I/O operations will only be tested when things go wrong — a system crash, for example. It is hard to know that rarely-tested code works well. So there needs to be a comprehensive test suite that can verify that the hardware's atomic I/O operations are working properly. Otherwise, it will be hard to have full confidence in the integrity guarantees provided by atomic operations.

Status and future work

The Fusion-io driver has had atomic I/O operation support for some time, but Chris would like to make this support widely available so that developers can count on its presence. Extending it to NVM Express is in progress now; SCSI support will probably wait until the specification for vectored atomic operations is complete. Btrfs can use (vectored) atomic I/O operations in its transaction commits; work on other filesystems is progressing. The changes to the plugging code are done with the small exception of the deadlock handler; that gap needs to be filled before the patches can go out, Chris said.

From here, it will be necessary to finalize the proposed changes to the kernel API and submit them for review. The review process itself could take some time; the AIO and direct I/O interfaces tend to be contentious, with lots of developers arguing about them but few being willing to actually work on that code. So a few iterations on the API can be expected there. The FIO benchmark needs to be extended to test atomic I/O. Then there is the large task of enabling atomic I/O operations in applications.

For the foreseeable future, a number of limitations will apply to atomic I/O operations. The most annoying is likely to be the small maximum I/O size: 64KB for the time being. Someday, hopefully, that maximum will be increased significantly, but for now it applies. The complexity of the AIO and direct I/O code will challenge filesystem implementations; the code is far more complex than one might expect, and each filesystem interfaces with that code in a slightly different way. There are worries about performance variations between vendors; Fusion-io devices can implement atomic I/O operations at very low cost, but that may not be the case for all hardware. Atomic I/O operations also cannot work across multiple devices; that means that the kernel's RAID implementations will require work to be able to support atomic I/O. This work, Chris said, will not be in the initial patches.

There are alternatives to atomic I/O operations, including explicit I/O ordering (used in the kernel for years) or I/O checksumming (to detect incomplete I/O operations after a crash). But, for many situations, atomic I/O operations look like a good way to let the hardware help the software get better performance from solid-state drives. Once this functionality finds its way into the mainline, taking advantage of fast drives might just feel a bit less like coaxing another trip out of that old jalopy.

[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan.]

A kernel skiplist implementation (Part 1)

May 30, 2013

This article was contributed by Andrew Shewmaker

The Linux kernel manages sparse sets of index ranges for various subsystems. For instance, the I/O memory management unit (IOMMU) keeps track of the addresses of outstanding device memory mappings for each PCIe device's domain and tracks holes for new allocations. File systems cache pending operations, extent state, free extents, and more. A simple linked list fails to provide good performance for search, insertion, and deletion operations as it grows, so more advanced abstract data types like red-black trees (rbtrees) were implemented. The kernel's rbtrees generally provide good performance as long as they are only being accessed by one reader/writer, but they suffer when multiple readers and writers are contending for locks.

There are variants of rbtrees that allow efficient read-copy-update (RCU) based reads, but not fine-grained concurrent writes. So, Chris Mason of Btrfs fame is developing a skiplist implementation that will allow file systems like Btrfs and XFS to perform many concurrent updates to their extent indexes. This article will describe the basic skiplist and what makes this skiplist variant cache-friendly. In part two, I'll describe the skiplist API and compare the performance of skiplists to rbtrees.

Basic skiplist

William Pugh first described skiplists in 1990 as a probabilistic alternative to balanced trees (such as rbtrees) that is simpler, faster, and more space-efficient. A skiplist is composed of a hierarchy of ordered linked lists, where each higher level contains a sparser subset of the list below it. The size of each level decreases by a constant factor relative to the level below. For instance, if level zero has 32 elements and a probability p=0.5 that an element appears in the next higher level, then level one has 16 elements, level two has eight, etc. The subset selected for the next higher level is random, but the quantity of items in the subset isn't random.
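
The level for a newly inserted element is traditionally chosen with repeated coin flips; a small sketch (not Chris's code) of Pugh's approach with p=0.5 looks like this:

    #include <stdlib.h>

    #define MAX_LEVEL 32

    /* Keep promoting the new element while the coin comes up heads: about
     * half of all elements reach level 1 or above, a quarter reach level 2
     * or above, and so on. */
    static int random_level(void)
    {
        int level = 0;

        while (level < MAX_LEVEL - 1 && (rand() & 1))
            level++;
        return level;
    }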

By way of analogy, consider an old-fashioned printed dictionary with multiple search aids built into it. This diagram shows how the different aids are similar to levels of a skiplist:

[Skiplist diagram]

At the highest level there are grooves on the edge of the book that mark sections for each letter, A-Z. The search can continue by looking at word prefixes centered at the top of each page. Next, guide words at the top left and right of each page show the span of words for each page. At the lowest level, a page is scanned word by word. A skiplist replicates something like this index structure in a digital form.

History of kernel skiplists

Josh MacDonald investigated cache-friendly skiplists on Linux 2.4 more than a decade ago. His skiplists are more space-efficient than rbtrees above 100 keys, searches become faster after 1,000 keys, and deletions and insertions become faster at around 10,000 keys. He also performed various concurrency experiments. When Josh posted an RFC to the kernel mailing list, he presented them as a "solution waiting for the right problem to come along". At the time, no discussion ensued, but Chris later became aware of Josh's work during his research. Unfortunately, Josh's implementation focused on reader/writer locks, which only allow one writer at a time, and Chris told me that he wanted to use RCU. Also, Josh's coding style differs from the kernel's and his macros make his code a little difficult to work with.

Other well-known Linux kernel developers have experimented with skiplists as well. Con Kolivas initially thought they would be useful for his BFS CPU scheduler. Ultimately, BFS experienced worse behavior in general when using skiplists, so Con returned to his previous list structure. This was due in part to the BFS scheduler's common case of having few tasks to schedule, so it didn't benefit from theoretically faster lookups for larger numbers of items in the list. Also, the scheduler process wasn't parallel itself, so it didn't benefit from a skiplist's more easily exploited concurrency.

Andi Kleen recounted to Con the discouraging results of his own skiplist investigation. He found that the variable number of pointers needed to link different levels of a skiplist made it difficult to achieve efficient use of memory and cache without limiting the size of the skiplist.

In 2011, Chris asked Liu Bo to try replacing Btrfs's rbtree-based extent_map code, which maps logical file offsets to physical disk offsets, with a skiplist-based implementation. The mappings are read-mostly, until a random write workload triggers a copy-on-write operation. Liu created an initial skiplist patch for Btrfs beginning with Con's implementation and adding support for concurrency with RCU. Results were mixed, and Chris's user-space experimentation led him to start work to make Liu's skiplists more cache-friendly.

You may be saying to yourself, "I thought the whole point of Btrfs was to use B-trees for everything." Btrfs does use its copy-on-write B-trees for any structure read from or written to the disk. However, it also uses rbtrees for in-memory caches. Some rbtrees batch pending inode and extent operations. Other caches are: the extent state cache — tracking whether extents are locked, damaged, dirty, etc.; the free space cache — remembering free extents for quicker allocations; and the previously mentioned extent_map. In addition to the rbtrees, a radix tree manages the extent buffer cache, which is used like the page cache, but for blocks of metadata that might be larger than a page.

All of these data structures have multiple threads contending for access to them and might benefit from skiplists, though the delayed operation trees and free space cache have the most contention. However, Chris's real target for this skiplist is some pending RAID5/6 parity logging code. It needs to enable "fast concurrent reads and writes into an exception store with new locations for some extents." Ideally, Chris hopes to make his skiplist general purpose enough for others to use. If skiplists can provide lockless lookups and be used for both buffer cache indexes and extent trees, then Dave Chinner would consider using them in XFS.

Cache-friendly skiplist

When reading the following description of Chris Mason's skiplist implementation, keep in mind that it is a skiplist for range indexes, or extents. The extents being managed are not identified by a single number. Each has an index/key pointing to its beginning and a range/size. Furthermore, each element of the skiplist is composed of multiple extents, which will be referred to as slots.

This new implementation of a cache-friendly skiplist is a bit more complicated than the picture of the basic skiplist may suggest; it is best examined in pieces. The first of those pieces is described by this diagram (a subset of the full data structure diagram):

[skiplist head]

This figure shows a skiplist anchored by an sl_list structure pointing to the initial entry (represented by struct sl_node) in the list. That sl_node structure has an array of pointers (called ptrs), indexed by level, to the head sl_node_ptr structures of each skiplist level. The sl_node_ptr structure functions like a typical kernel list_head structure, except that each level's head sl_node_ptr->prev is used, possibly confusingly, to point to the item with the greatest key at that level of the skiplist. All locking is done at the sl_node level and will be described in part two of this article.
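
A much-simplified sketch of these structures, based only on the description above (the code in Chris's tree has additional fields for locking and other bookkeeping), might look like:

    struct sl_node_ptr {
        struct sl_node_ptr *next;
        struct sl_node_ptr *prev;   /* at the head, points to the largest key on this level */
    };

    struct sl_node {
        /* locking state lives here (covered in part two) */
        int level;                  /* number of levels this node participates in */
        struct sl_node_ptr ptrs[];  /* one list linkage per level, sized at allocation time */
    };

    struct sl_list {
        struct sl_node *head;       /* the initial entry in the skiplist */
    };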

The skiplist grows from the head sl_node_ptr array on the right of the diagram above into an array of linked lists as shown below:

[skiplist node pointers]

Note that none of the structures listed so far contain any keys or data, unlike traditional skiplists. That's because Chris's skiplist items are associated with more than one key. Another difference is that previous skiplist implementations lack prev pointers. Pugh's original skiplist didn't need something like prev pointers because it didn't support concurrency. MacDonald's skiplist does support concurrency, but its protocol uses context structures that combine parent/child pointer information with lock states. Mason's skiplists use a different concurrency protocol that uses prev pointers.

A superficial difference between this skiplist and others is its apparent lack of down pointers (allowing movement from one level of index to the next), although they exist as the ptrs[] array in sl_node. While the pointer array name is unimportant, its position is, because different level nodes are created with different sizes of arrays. See the ptrs array at the end of sl_node and sl_node at the end of struct sl_leaf (which holds the actual leaf data):

[Skiplist leaf]

The variable size of nodes could cause memory fragmentation if they were allocated from the same pool of memory, so Chris designed his skiplist with one slab cache for each level. This effectively addresses Andi Kleen's concerns mentioned earlier regarding wasting memory while keeping good performance.

The way in which keys are handled is cache-friendly in a couple of ways. First, keys are not associated directly with individual items in this skiplist, but each leaf manages a set of keys. SKIP_KEYS_PER_NODE is currently set to 32 as a balance between increasing cache hits and providing more chances for concurrency. Second, an sl_leaf contains an array of keys. Technically a keys array is unnecessary, since the keys are part of the slots linked to a leaf. However, a search only needs the keys and not the rest of the sl_slot structures until the end of the search. So, the array helps skiplist operations avoid thrashing the cache.

Each sl_slot is an extent defined by its key and size (index range). A developer can adapt the skiplist for their own use by embedding sl_slot into their own data structure. The job of reference counting belongs to the container of sl_slot.
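
Embedding works the same way as with other intrusive kernel data structures; a hypothetical user might wrap sl_slot as shown below (my_extent and its fields are made up for the example, and the sl_slot layout is only what the description above implies):

    #include <linux/kernel.h>
    #include <linux/kref.h>
    #include <linux/types.h>

    struct sl_slot {
        unsigned long key;          /* start of the index range */
        unsigned long size;         /* length of the range */
    };

    struct my_extent {
        struct sl_slot slot;        /* embedded skiplist slot */
        struct kref refs;           /* reference counting is the container's job */
        u64 physical_start;         /* hypothetical payload */
    };

    /* Recover the containing structure from a slot returned by a search. */
    static inline struct my_extent *slot_to_extent(struct sl_slot *slot)
    {
        return container_of(slot, struct my_extent, slot);
    }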

See the full diagram for a picture of how all the pieces described above fit together. With 32 keys per item, a maximum level of 32, and half the number of items in each higher level, Chris's skiplist should be able to efficiently manage around 137 billion extents (index ranges): roughly 2^32 level-zero leaves times 32 keys each.

The statistically balanced nature of the skiplist allows it to handle sparse sets of indexes in such a way that modifications avoid the need for an rbtree's rebalancing operations, and so they lend themselves to simpler concurrency protocols. Chris also designed his skiplist to be cache-friendly by having each element represent a set of keys/slots, keeping duplicates of the slots' keys in an array, and creating a slab cache for each level of the skiplist. Part two of this series will offer a description of the skiplist API and a discussion of the performance of skiplists vs. rbtrees.
