Kernel development
Brief items
Kernel release status
The current development kernel is 3.10-rc5, released on June 8. In the announcement, Linus Torvalds made it clear he was not completely pleased with the patches he was getting this late in the cycle: "Guys, guys, guys. I'm going to have to start cursing again unless you stop sending me non-critical stuff. So the next pull request I get that has "cleanups" or just pointless churn, I'm going to call you guys out on, and try to come up with new ways to insult you, your mother, and your deceased pet hamster."
Stable updates: Five stable kernels were released on June 7: 3.9.5, 3.4.48, and 3.0.81 by Greg Kroah-Hartman; from Canonical's extended stable trees, Kamal Mostafa released 3.8.13.2 and Luis Henriques released 3.5.7.14.
The 3.9.6, 3.4.49, and 3.0.82 stable kernels are currently under review. They can be expected June 13 or soon after.
Quotes of the week
Kernel development news
Skiplists II: API and benchmarks
A skiplist is composed of a hierarchy of ordered linked lists, where each higher level contains a sparser subset of the list below it. In part one, I described the basic idea of a skiplist, a little history of various attempts to use it in the Linux kernel, and Chris Mason's new cache-friendly skiplist for index ranges. This article will continue with a description of the current state of Chris's skiplist API and his future plans for it. I'll also discuss the performance of skiplists and rbtrees in a simple RAM test, as well as Chris's more interesting IOMMU comparison.
Skiplist API
A skiplist can be declared and initialized to an empty state with lines like:
    #include <linux/skiplist.h>
    struct sl_list list;
    sl_init_list(&list, GFP_ATOMIC);
Once the list exists, the next step is to populate it with data. As is shown in the data structure diagram, each structure to be placed in the list should embed an sl_slot structure; pointers to this embedded structure are used with the skiplist API.
Insertion into the skiplist requires the programmer to get a "preload token": skiplist_preload() ensures that the necessary memory is available and disables preemption. With the token in hand, it's possible to actually insert the item, then re-enable preemption. Preloading helps avoid the need for atomic allocations and minimizes the time spent holding a leaf's lock during insertion. The preload function takes a pointer to a skiplist and a "get free page" mask describing the type of allocation to be performed; it returns an integer token to be used later:
    int skiplist_preload(struct sl_list *list, gfp_t gfp_mask);
Note that preemption is disabled by skiplist_preload() and must not be re-enabled during insertion because the function is holding an RCU read lock and working with per-CPU data structures.
The function that actually adds an item to the list, skiplist_insert(), is called with that list, a slot to be inserted, and a token returned by skiplist_preload():
    int skiplist_insert(struct sl_list *list, struct sl_slot *slot,
                        int preload_token);
Here's an example insertion into a skiplist:
    int preload_token, ret;

    preload_token = skiplist_preload(skiplist, GFP_KERNEL);
    if (preload_token < 0)
            return preload_token;   /* allocation failed */
    ret = skiplist_insert(skiplist, slot, preload_token);
    /* the preload disabled preemption; re-enable it now */
    preempt_enable();
Deletion requires only one function call, though it is implemented in two phases if a leaf becomes empty: the leaf is first marked "dead," then it is unlinked from the skiplist level by level. In either case, skiplist_delete() returns a pointer to the slot it removed from the list:
    struct sl_slot *skiplist_delete(struct sl_list *list, unsigned long key,
                                    unsigned long size);
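For example, removing the entry for a given range and freeing the structure that embeds its slot might look like this (a sketch: mystruct and free_mystruct_mem() are hypothetical stand-ins, as in the teardown example below):

    struct sl_slot *slot;

    slot = skiplist_delete(skiplist, key, size);  /* presumably NULL on a miss */
    if (slot)
            free_mystruct_mem(container_of(slot, struct mystruct, slot));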
Adding or removing a key to/from an existing leaf is simple and only requires a lock on that leaf. If a leaf must be created or destroyed, though, more locking is required: for leaves with higher levels, locks must be taken on neighboring nodes all the way down to level zero so that everything can be re-linked without a neighbor being deleted out from under the operation. The list of affected leaves is kept track of in a temporary sl_node list referred to as a cursor (Chris is reworking his code to get rid of cursors). The best-case scenario is a modification at level zero, where only a couple of locks are required; both the preallocation and the insertion code are biased in favor of creating level-zero leaves. Regardless, the locking is only required for a small window of time.
Unlike an rbtree, rebalancing of the skiplist is not required, even when simultaneous insertions and deletions are being performed in different parts of the skiplist.
A specialized insertion function is provided that finds a free index range of a given size and alignment in the skiplist. This isn't required by filesystems, but Chris implemented it so that he could directly compare rbtrees to skiplists in the IOMMU code; the IOMMU needs this functionality because each PCIe device's domain requires an aligned range of memory addresses.
Calls to skiplist_insert_hole() take a hint of where a hole might be inserted and must be retried with a new hint if the return value is -EAGAIN. That error happens when holes are being created simultaneously and the one you hinted at was taken by another caller before you could use it. On successful insertion, the slot passed in is updated with the location of the hole.
    int skiplist_insert_hole(struct sl_list *list, unsigned long hint,
                             unsigned long limit, unsigned long size,
                             unsigned long align, struct sl_slot *slot,
                             gfp_t gfp_mask);
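A caller might handle the retry case with a loop along these lines (a sketch: pick_new_hint() is a hypothetical stand-in for whatever policy the caller uses to choose a new starting point):

    int ret;

    do {
            ret = skiplist_insert_hole(skiplist, hint, limit, size,
                                       align, slot, GFP_KERNEL);
            if (ret == -EAGAIN)
                    hint = pick_new_hint(skiplist, limit, size, align);
    } while (ret == -EAGAIN);
    /* on success, slot has been updated with the hole's location */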
Tearing down a whole skiplist requires a fair amount of work. First free the structures embedding the slots of each leaf, then use sl_free_leaf(), and finally, zero the pointers in the head of the skiplist. Wrappers around container_of() for obtaining the leaf embedding a node or the structure embedding a slot are provided by sl_entry(ptr) and sl_slot_entry(ptr, type, member), respectively. Comments in the code indicate future plans to add skiplist zeroing helpers, but for now you must roll your own as Chris did for his IOMMU patch.
Here's a generic example of destroying a skiplist:
    struct sl_node *p;
    struct sl_leaf *leaf;
    struct sl_slot *slot;
    struct mystruct *mystruct;
    int i;

    sl_lock_node(skiplist->head);
    p = skiplist->head->ptrs[0].next;
    while (p) {
            leaf = sl_entry(p);
            /* free the structure embedding each slot in this leaf */
            for (i = 0; i < leaf->nr; i++) {
                    slot = leaf->ptrs[i];
                    mystruct = sl_slot_entry(slot, struct mystruct, slot);
                    free_mystruct_mem(mystruct);
            }
            /* advance along level zero before freeing the leaf itself */
            p = leaf->node.ptrs[0].next;
            sl_free_leaf(leaf);
    }
    memset(skiplist->head->ptrs, 0, sl_node_size(SKIP_MAXLEVEL));
    sl_unlock_node(skiplist->head);
Chris considered including slot iterators equivalent to rb_next() and rb_prev(), but decided against it because of the overhead involved in validating a slot with each call. Instead, skiplist_next() and skiplist_prev() are leaf iterators that allow a caller to more efficiently operate on slots in bulk. Chris hasn't posted the updated API yet, but it seems likely that the iterators will resemble the existing sl_next_leaf() and friends.
Calls to sl_first_leaf() and sl_last_leaf() return pointers to the first and last leaves of the skiplist. The sl_next_leaf() call is a little different: you must provide it with an sl_node (embedded in your current leaf) and, since each node potentially has many next pointers, you must also provide the level l you want to traverse.
    struct sl_leaf *sl_first_leaf(struct sl_list *list);
    struct sl_leaf *sl_last_leaf(struct sl_list *list);
    struct sl_leaf *sl_next_leaf(struct sl_list *list, struct sl_node *p, int l);
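For instance, a bulk operation could walk every leaf along level zero like this (a sketch built from the calls above and the leaf fields used in the teardown example; process_slot() is a hypothetical callback, and any needed locking is omitted):

    struct sl_leaf *leaf;
    int i;

    for (leaf = sl_first_leaf(skiplist); leaf;
         leaf = sl_next_leaf(skiplist, &leaf->node, 0))
            for (i = 0; i < leaf->nr; i++)
                    process_slot(leaf->ptrs[i]);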
Since this skiplist implementation focuses on index ranges (or extents) defined by key and size parameters, it can provide search functions; rbtree usage is more diverse, so rbtree users must roll their own. Each of the skiplist search functions takes a pointer to the skiplist, the key you are looking for, and the size of the range. If successful, they return a pointer to the slot matching the key.
    struct sl_slot *skiplist_lookup(struct sl_list *list, unsigned long key,
                                    unsigned long size);
    struct sl_slot *skiplist_lookup_rcu(struct sl_list *list, unsigned long key,
                                        unsigned long size);
The first, skiplist_lookup(), is appropriate when a skiplist is experiencing high read/write contention; it handles all the locking for you. It protects the skiplist with read-copy-update (RCU) while it finds the correct leaf, then protects the leaf with a spinlock during a binary search for the slot. If no slot corresponds to the key, a NULL pointer is returned.
If skiplist contention is low or you need more control, use the second variant. Before calling skiplist_lookup_rcu(), you must call rcu_read_lock(), and you must take care of details such as reference counting yourself. The search for the leaf uses the same helper function as skiplist_lookup(), but the leaf spinlock is not taken; instead, the function depends on the skiplist's RCU read lock to also protect the slots in a leaf while it performs a sequential search. The search is sequential because Chris does not do the copy part of RCU. He does order the operations of insertion and deletion to try to make the sequential search safe, and that should usually work. However, it might not return the slot of interest, so it is the responsibility of the caller to verify the key of the returned slot and call skiplist_lookup_rcu() again if that key doesn't match the key being searched for.
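Put together, an RCU-side lookup might look roughly like this (a sketch: it assumes the slot's key field is the value to verify, and it omits the reference counting a real caller would need):

    struct sl_slot *slot;

    rcu_read_lock();
    do {
            slot = skiplist_lookup_rcu(skiplist, key, size);
            /* the relaxed search can return the wrong slot, so verify it */
    } while (slot && slot->key != key);
    /* take any needed reference on the containing object here */
    rcu_read_unlock();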
Chris elaborated on his future plans for the API in a private email:
It'll get split out into search and insert steps the caller can control, and you'll be able to call insert with just a locked leaf from any level...
The searching API will also improve, returning both the leaf and the slot. This allows skiplist versions of rb_next() and rb_prev().
The skiplist code also indicates that there is work to be done to make lockdep understand Chris's skiplist locking. It needs to be taught that holding multiple locks on the same level of a skiplist is allowed as long as they are taken left to right.
Testing
In addition to the IOMMU comparison between rbtrees and skiplists that Chris posted numbers for, his patch also includes a simple RAM-only comparison in the form of a kernel module called skiplist_test. I tested 100,000 items for 100,000 rounds with varying numbers of threads.
This table shows the results:

    ADT           Threads   Fill time   Check time   Delete time   Avg. thread
                               (ms)        (ms)          (ms)        time (s)
    rbtree           1          37           9            12          0.752
    skiplist-rcu     1          18          15            23          2.615
    rbtree           2          36           8            12          2.766
    skiplist-rcu     2          19          19            27          2.713
    rbtree           4          36          11            10          6.660
    skiplist-rcu     4          23          24            21          3.161

These results show skiplists beating rbtrees in fill time, but losing on check and delete times. The skiplist average thread time is only slightly better with two threads, and beats rbtree soundly with four threads (they take half the time). However, rbtree wins the single-threaded case, which surprises Chris because it doesn't match what he sees in user-space testing. He told me, "Most of the difference is the cost of calling spin_lock (even single threaded)."
The more interesting numbers are from Chris's IOMMU comparison. Even though he is mostly interested in using skiplists for Btrfs extents, he chose to use the IOMMU because it is easier to isolate the performance of the two data structures, which makes it both easier for non-Btrfs people to understand and more meaningful to them. He also says, "... with the IOMMU, it is trivial to consume 100% system time on the rbtree lock."
The rbtree lock is, in effect, a global lock held once at the start and once at
the end of an IO.
Chris kept the basic structure of the IOMMU code so that he could compare skiplists to rbtrees. He was not trying to design a better IOMMU that looked for free ranges of addresses differently or fix the IOMMU contention, though he told me he would work with David Woodhouse on a proper solution that tracks free extents later this year.
His benchmarks were run on a single socket server with two SSD cards. He used a few FIO jobs doing relatively large (20MB) asynchronous/direct IOs with 16 concurrent threads and 10 pending IOs each (160 total). Here are his results for streaming and random writes:
    Streaming writes
        IOMMU off    2,575MB/s
        skiplist     1,715MB/s
        rbtree       1,659MB/s

    Not a huge improvement, but the CPU time was lower. [...]

    16 threads, iodepth 10, 20MB random writes
        IOMMU off    2,548MB/s
        skiplist     1,649MB/s
        rbtree          33MB/s
The existing rbtree-based IOMMU slows streaming writes to 64.4% of the maximum, and the skiplist's throughput is slightly better at 66.6% while using less CPU time. Evidently the skiplist's advantages in concurrency and in maintaining a balanced overall structure give it only a modest edge in the streaming write case. Random writes, however, drop rbtree performance to just 1.3% of the maximum throughput. Here the skiplist fares much better, falling only to 64.7% of the maximum, because different threads can hold locks simultaneously in different parts of the skiplist and no costly rebalancing operation like the rbtree's is needed.
    16 threads, iodepth 10, 20MB random reads
        IOMMU off    2,861MB/s   (mostly idle)
        skiplist     2,484MB/s   (100% system time)
        rbtree          99MB/s   (100% system time)

    ... lowering the thread count did improve the rbtree performance, but the
    best I could do was around 300MB/s ...
Reads are easier than writes, and we could expect streaming read results to all be close and relatively uninteresting. Certainly both the rbtree and skiplist do better at random reads than random writes. In fact, the skiplist achieves higher throughput for random reads than it does for streaming writes although it has to work hard to do so. And in case anyone thought the thread count was particularly unfair for rbtree in these tests, Chris points out that the best he got for random IOs with rbtree was around 300MB/s. That's still only 10% of the maximum throughput. Furthermore, Chris noted that all of the CPU time spent in the skiplist was in skiplist_insert_hole(), which isn't optimized.
In a recent discussion on the Linux filesystems mailing list, Mathieu Desnoyers proposed another data structure that he is calling RCU Judy arrays. They can't be compared with skiplists just yet since the Judy arrays are only implemented in user space so far, but the competition between the two ideas should improve them both.
Even though there are plenty of opportunities for refinement, this is a promising start for a cache-friendly skiplist for the Linux kernel. It should provide better performance for any subsystem that has high levels of contention between concurrent accesses of its rbtrees: various filesystem indexes, virtual memory areas (VMAs), the high-resolution timer code, etc. CPU schedulers will probably not see any benefit from skiplists because only one thread is making the scheduling decision, but perhaps multiqueue schedulers for the network or block layer might in the case where they have one queue per NUMA node.
Plans for hot adding and removing memory
At LinuxCon Japan, Yasuaki Ishimatsu of Fujitsu talked about the status of memory hotplug, with a focus on what still needs to be done to fully support both hot adding and hot removing memory. If a memory device is broken in a laptop or desktop, you can just replace that memory, but for servers, especially ones that need to stay running, it is more difficult. In addition, having a way to add and remove memory would allow for dynamic reconfiguration on systems where the hardware has been partitioned into two or more virtual machines.
The focus of the memory hotplug work is on both scenarios: broken memory hardware and dynamic reconfiguration. Memory hotplug will be supported in KVM, Ishimatsu said. It is currently supported by several operating systems, but Linux does not completely support it yet; this work aims to fix that.
There are two phases to memory hotplug: physically adding or removing memory (hot add or hot remove) and logically changing the amount of memory available to the system (onlining or offlining memory). Both phases have to be completed before Linux can use any new memory, and taking the memory offline (so that Linux is no longer using it) is required before it can be removed.
The memory management subsystem manages physical memory using two structures, he said: the page tables hold the direct mapping from virtual to physical addresses, and the virtual memory map manages page structures. In order to offline memory, any data needs to be moved out of that memory and those data structures need to be updated. Likewise, when adding memory, new page table and virtual memory map entries must be added.
Pages are managed in zones and, when using the sparse memory model that is needed for memory hotplug systems, zones are broken up into sections that are 128M in size. Sections can be switched from online to offline and vice versa using the /sys/devices/system/memory/memoryX/state file. By echoing offline or online into that file, the pages in that section have their state changed to unusable or usable respectively.
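As a minimal illustration, a program could take a section offline by writing to its state file (a sketch: the section number 32 is made up, and real use needs root privileges and error handling for busy sections):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/devices/system/memory/memory32/state", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* writing "online" reverses this; see also online_movable below */
            fputs("offline", f);
            fclose(f);
            return 0;
    }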
In the 3.2 kernel, hot adding memory and onlining it were fully supported. Offlining memory was supported with limitations, and hot removing it was not supported at all. Work started in July 2012 to remove the offline limitations and to add support for hot remove, Ishimatsu said.
The work for hot remove has been merged for the 3.9 kernel. It will invalidate page table and virtual memory map entries that correspond to the memory being removed. But, since the memory must be taken offline before it is removed, the limitations on memory offline still make it impossible to remove arbitrary memory hardware from the system.
When memory that is to be offlined has data in it, that data is migrated to other memory in the system. But the only pages that can be migrated this way are page-cache and anonymous pages, which are known as "movable" pages. If a section contains non-movable memory, which Ishimatsu called "kernel memory", it cannot be offlined.
There are two ways to handle that problem that are being considered. The first is to support moving kernel memory when offlining pages that contain it. The advantages to that are that all memory can be offlined and there is no additional performance impact for NUMA systems since there are no restrictions on the types of allocations that can be made. On the downside, though, the kernel physical to virtual address relationship will need to change completely. The other alternative is to make all of a node's memory movable, which would reuse the existing movable memory feature, but means that only page cache and anonymous pages can be stored there, which will impact the performance of that NUMA node.
Ishimatsu said that he prefers the first solution personally, but, as a first step they are implementing the second: creating a node that consists only of movable memory. Linux has the idea of a movable zone (i.e. ZONE_MOVABLE), but zones of that type are not created automatically. If a node consists only of movable memory, all of it can be migrated elsewhere so that the node can be taken offline.
A new boot option, movablecore=acpi, is under development that will use the memory affinity structure in the ACPI static resource affinity table (SRAT) to choose which nodes will be constructed of movable memory. The existing use for movablecore allows setting aside a certain amount of memory that will be movable in the system, but it spreads it evenly across all of the nodes rather than concentrating it only on the nodes of interest. The "hotpluggable" bit for a node in the SRAT will be used to choose the target nodes in the new mode.
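For illustration, the two forms of the option on a kernel command line might look like this (the acpi form's final syntax is still under development):

    movablecore=512M     (existing form: a fixed amount, spread across all nodes)
    movablecore=acpi     (new form: movable nodes chosen via the SRAT)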
Using the online_movable flag to the sysfs memory state file (rather than just online) allows an administrator to tell the system to make that memory movable. Without that, the onlined memory is treated as ZONE_NORMAL, so it may contain kernel memory and thus not be able to be offlined. The online_movable feature was merged for 3.8. That reduces the limitations on taking memory offline, but there is still work to do.
Beyond adding the movablecore=acpi boot option (and possibly a vm.hotadd_memory_treat_as_movable sysctl), there are some other plans. Finding a way to put the page tables and virtual memory map into the hot-added memory is something Ishimatsu would like to see, because it would help performance on that node, but would not allow that memory to be offlined unless those data structures can be moved. He is thinking about solutions for that. Migrating vmalloc() data to other nodes when offlining a node is another feature under consideration.
Eventually, being able to migrate any kernel memory out of a node is something he would like to see, but solutions to that problem are still somewhat elusive. He encouraged those in attendance to participate in the discussions and to help find solutions for these problems.
[I would like to thank the Linux Foundation for travel assistance to Tokyo for LinuxCon Japan.]
Outreach program for women—kernel edition
While three kernel internships for women were originally announced in late April, the size of the program has more than doubled since then. Seven internships have been established for kernel work through the Outreach Program for Women (OPW); each comes with a $5000 stipend and a $500 travel grant. The program officially kicks off on June 17, but the application process already brought in several hundred patch submissions from eighteen applicants, 137 of which were accepted into the staging and Xen trees—all in thirteen days.
The program was initiated by the Linux Foundation, which found sponsors for the first three slots, but Intel's Open Source Technology Center added three more while the OPW itself came up with funding for another. The OPW has expanded well beyond its GNOME project roots, with eighteen different organizations (e.g. Debian, KDE, Mozilla, Perl, Twisted, and many more) participating in this round.
The program pairs the interns with a mentor from a participating project to assist the intern with whatever planned work she has taken on for the three months of each program round. OPW is patterned after the Google Summer of Code project, but is not only for students and programmers, as other kinds of projects (and applicants) are explicitly allowed. As the name would imply, it restricts applicants to those who self-identify as women.
The kernel effort has been guided by Sarah Sharp, who is a USB 3.0 kernel hacker for Intel. She is also one of the mentors for this round. In late May, she put together a blog post that described the application process and the patches it brought in. Sharp filled us in on the chosen interns. In addition, most of the patches accepted can be seen in her cherry-picked kernel git tree.
The interns
Sharp will be mentoring Ksenia (Xenia) Ragiadakou who will be working on the USB 3.0 host driver. Ragiadakou is currently studying for her bachelor's degree in computer science at the University of Crete in Greece. In addition to her cleanup patches for the rtl8192u wireless staging driver, Ragiadakou has already found a bug in Sharp's host controller driver.
Two of the interns will be working on the Xen subsystem of the kernel with mentors Konrad Wilk of Oracle and Stefano Stabellini of Citrix. They are Lisa T. Nguyen, who received a bachelor's degree in computer science from the University of Washington in 2007, and Elena Ufimtseva, who got a master's degree in computer science from St. Petersburg University of Information Technologies in 2006. Nguyen did several cleanup patches for Xen (along with various other cleanups) as part of the application process, while Ufimtseva focused on cleanups in the ced1401 (Cambridge Electronics 1401 USB device) driver in staging.
Lidza Louina will be working with Greg Kroah-Hartman as a mentor on further cleanups in staging drivers. She was working on a bachelor's degree in computer science at the University of Massachusetts but had to take time off to work full-time. Her contributions were to the csr wireless driver in the staging tree.
Tülin İzer is working on parallelizing the x86 boot process with mentor PJ Waskiewicz of Intel. She is currently pursuing a bachelor's degree in computer engineering at Galatasaray University in Istanbul, Turkey. Her application included fixes for several staging drivers.
Two other Intel-mentored interns are in the mix: Hema Prathaban will be working with Jacob Pan on an Ivy Bridge temperature sensor driver, while Laura Mihaela Vasilescu will be working on Intel Ethernet drivers, mentored by Carolyn Wyborny and Anjali Singhai. Prathaban graduated in 2011 from KLN College of Engineering in India with a bachelor's degree in computer science. She has been a full-time mother for the last year, so the internship provides her a way to get back into the industry. Vasilescu is a master's student at the University of Politehnica of Bucharest, Romania and is also the student president of ROSEdu, an organization for Romanian open source education. Both did a number of patches; Prathaban in the staging tree (including fixing a bug in one driver) and Vasilescu in Intel Ethernet drivers.
Getting started
As with many budding kernel developers, most of the applicants' patches were to various staging drivers. There was a short application window as the kernel portion didn't get announced until a little under two weeks before the deadline. But that didn't seem to slow anything down as there were 41 applicants for the internships, with eighteen submitting patches and eleven having those patches accepted into the mainline.
That level of interest—and success—is partly attributable to a first patch tutorial that she wrote, Sharp said. The tutorial helps anyone get started with kernel development from a fresh Ubuntu 12.04 install. It looks at setting up email, getting a kernel tree, using git, building the kernel, creating a patch, and more. The success was also due to strong applicants and mentors that were "patient and encouraging", she said.
The kernel OPW program was mentioned multiple times at the recently held Linux Foundation conferences in Japan as a helpful step toward making the gender balance of kernel developers better represent the world we live in (as Dirk Hohndel put it). It is also nice to see the geographical diversity of the interns, with Asia, Europe, and North America all represented. Hopefully South America, Africa, and Oceania will appear in follow-on rounds of the program—Antarctica may not make the list for some time to come.
Another round of the OPW, including kernel internships, is planned for January through March 2014 (with application deadlines in December). The program is seeking more interested projects, mentors, and financial backers for the internships. While there are certainly critics of these types of efforts, they have so far proved to be both popular and effective. Other experiments, using different parameters or criteria, are definitely welcome, but reaching out and making an effort to bring more women into the free-software fold is something that will hopefully be with us for some time—until that hoped-for day when it isn't needed at all anymore.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jake Edge