Brief items
The current development kernel is 3.10-rc5, released on June 8. In the announcement Linus
Torvalds made it clear he was not completely pleased with the patches he
was getting this
late in the cycle: "Guys, guys, guys. I'm going to have to start
cursing again unless you stop sending me non-critical stuff. So the next
pull request I get that has "cleanups" or just pointless churn, I'm going
to call you guys out on, and try to come up with new ways to insult you,
your mother, and your deceased pet hamster."
Stable updates:
Five stable kernels were released on June 7: 3.9.5, 3.4.48, and
3.0.81 by Greg Kroah-Hartman;
from Canonical's extended stable trees Kamal Mostafa released 3.8.13.2 and Luis Henriques released 3.5.7.14.
The 3.9.6, 3.4.49, and 3.0.82 stable kernels are currently under
review. They can be expected June 13 or soon after.
If companies are going to go off and invent the square wheel, and
that makes *them* suffer the loss of being able to merge back into
the mainline kernel, thereby making *their* job of moving forward
with their kernel versions much harder, then yes, we *are* happy.
—
Russell King
Randomly not being able to connect AT ALL to wireless networks is
not a valid "rate control" model.
—
Linus Torvalds
First of all, we really need to stop thinking about choosing [CPU]
frequency (at least for x86). that concept basically died for x86
6 years ago.
—
Arjan van de Ven
Kernel development news
June 12, 2013
This article was contributed by Andrew Shewmaker
A skiplist is composed of a hierarchy of ordered linked lists, where each higher
level contains a sparser subset of the list below it. In
part
one, I described the basic idea of a skiplist, a little history of
various attempts to use it in the Linux kernel, and Chris Mason's new
cache-friendly skiplist for index ranges.
This article will continue with a description of the current state of Chris's skiplist API and his
future plans for it. I'll also discuss the performance of skiplists and rbtrees
in a simple RAM test, as well as Chris's more interesting IOMMU comparison.
Skiplist API
A skiplist can be declared and initialized to an empty state with lines like:
#include <linux/skiplist.h>
struct sl_list list;
sl_init_list(&list, GFP_ATOMIC);
Once the list exists, the next step is to populate it with data. As is
shown in the data structure diagram, each
structure to be placed in the list should embed an sl_slot
structure; pointers to this embedded structure are used with the skiplist
API.
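As a minimal sketch (mystruct and its payload are hypothetical names, not
part of Chris's patch), an indexed structure might look like:

struct mystruct {
        struct sl_slot slot;    /* embedded slot handed to the skiplist API */
        /* caller-specific fields describing the indexed range ... */
};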
Insertion into the skiplist requires the programmer to get a "preload
token" — skiplist_preload() ensures that the necessary
memory is available and
disables preemption. With the token in hand, it's possible to actually
insert the item, then re-enable preemption. Preloading helps
avoid the need
for atomic allocations and minimizes the time spent inside a leaf's lock
during insertion. The preload function takes a pointer to a skiplist and a "get
free page" mask describing the type of allocation to be performed, and it
returns an integer token to be used later:
int skiplist_preload(struct sl_list *list, gfp_t gfp_mask);
Note that preemption is disabled by skiplist_preload() and must
not be re-enabled during insertion because the function is holding an RCU read lock
and working with per-CPU data structures.
The function that actually adds an item to the list,
skiplist_insert(), is called with that list, a slot to be inserted, and
a token returned by skiplist_preload():
int skiplist_insert(struct sl_list *list, struct sl_slot *slot,
int preload_token);
Here's an example insertion into a skiplist:
int preload_token, ret;

preload_token = skiplist_preload(skiplist, GFP_KERNEL);
if (preload_token < 0)
        return preload_token;
ret = skiplist_insert(skiplist, slot, preload_token);
preempt_enable();
Deletion requires only one function call, though it is implemented in two
phases when a leaf becomes empty: the leaf is first marked "dead," then it
is unlinked from the skiplist level by level. In either case,
skiplist_delete() returns a pointer to the slot it removed from
the list.
struct sl_slot *skiplist_delete(struct sl_list *list, unsigned long key,
unsigned long size);
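A minimal usage sketch, assuming key and size describe the range to remove
and that a NULL return means nothing matched, might look like:

struct sl_slot *slot;

slot = skiplist_delete(skiplist, key, size);
if (slot) {
        /* the embedding structure is now the caller's to free; see
         * the sl_slot_entry() helper described below */
        free_mystruct_mem(sl_slot_entry(slot, struct mystruct, slot));
}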
Adding or removing a key to/from an existing leaf is simple and only requires a
lock at the leaf. However, if a leaf is created or destroyed, then more locking
is required. Leaves with higher levels require locks to be taken on neighboring
nodes all the way down to level zero so that everything can be re-linked
without a neighbor being deleted out from under them. The list of affected
leaves is tracked in a temporary sl_node list referred to as a
cursor.
(Chris is reworking his code to get rid of cursors.)
The best-case scenario is a modification at level zero where only
a couple of locks are required. Both the preallocation and the insertion code are
biased in favor of creating a level-zero leaf. Regardless, the locking is only
required for a small window of time.
Unlike an rbtree, rebalancing of the skiplist is not required, even when
simultaneous insertions and deletions are being performed in different
parts of the skiplist.
A specialized insertion function is provided that finds a free index range
in the skiplist that is aligned and of a given size. This isn't required by
filesystems, but Chris implemented it so that he could directly
compare rbtrees to skiplists in the IOMMU code. The IOMMU needs this
functionality because each PCIe device's domain requires an aligned
range of memory addresses.
Calls to
skiplist_insert_hole() take a hint of where a hole might be inserted,
and must be retried with a new hint if the return value is -EAGAIN.
That error return happens when holes are being created concurrently and the
one you hinted at was valid, but was claimed before you could use it. On
successful insertion,
the slot passed in is updated
with the location of the hole.
int skiplist_insert_hole(struct sl_list *list, unsigned long hint,
unsigned long limit, unsigned long size, unsigned long align,
struct sl_slot *slot, gfp_t gfp_mask);
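Its use might look something like the following sketch, where hint, limit,
size, and align are hypothetical values chosen by the caller:

int ret;

do {
        ret = skiplist_insert_hole(skiplist, hint, limit, size,
                                   align, slot, GFP_KERNEL);
        /* on -EAGAIN another thread claimed the hole; choose a
         * new hint and retry */
} while (ret == -EAGAIN);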
Tearing down a whole skiplist requires a fair amount of work. First free
the structures embedding the slots of
each leaf, then use sl_free_leaf(), and finally, zero the pointers in
the head of the skiplist. Wrappers around container_of() for obtaining
the leaf embedding a node or the structure embedding a slot are provided by
sl_entry(ptr) and sl_slot_entry(ptr, type, member),
respectively. Comments in the code indicate future plans to add skiplist
zeroing
helpers, but for now you must roll your own as Chris did for his IOMMU patch.
Here's a generic example of destroying a skiplist:
struct sl_node *p;
struct sl_leaf *leaf;
struct sl_slot *slot;
struct mystruct *mystruct;
int i;

sl_lock_node(skiplist->head);
p = skiplist->head->ptrs[0].next;
while (p) {
        leaf = sl_entry(p);
        for (i = 0; i < leaf->nr; i++) {
                slot = leaf->ptrs[i];
                mystruct = sl_slot_entry(slot, struct mystruct, slot);
                free_mystruct_mem(mystruct);
        }
        p = leaf->node.ptrs[0].next;
        sl_free_leaf(leaf);
}
memset(skiplist->head->ptrs, 0, sl_node_size(SKIP_MAXLEVEL));
sl_unlock_node(skiplist->head);
Chris considered including slot iterators equivalent to rb_next() and
rb_prev(), but decided against it because of the overhead involved in
validating a slot with each call. Instead, skiplist_next() and
skiplist_prev() are leaf iterators that allow a caller to more
efficiently operate on slots in bulk. Chris hasn't posted the updated API
yet, but it seems likely that the iterators will resemble the existing
sl_next_leaf() and
friends.
Calls to
sl_first_leaf() and sl_last_leaf() return pointers to the
first and last leaves of the skiplist. The sl_next_leaf() call is a
little different in that you must provide it with an sl_node (embedded
in your current leaf), and since each node potentially has many next entries, you
must also provide the level l you want to traverse.
struct sl_leaf *sl_first_leaf(struct sl_list *list);
struct sl_leaf *sl_last_leaf(struct sl_list *list);
struct sl_leaf *sl_next_leaf(struct sl_list *list, struct sl_node *p, int l);
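As an illustration, a walk over every leaf at level zero might look like
this sketch (locking and leaf validation are omitted for brevity):

struct sl_leaf *leaf;

for (leaf = sl_first_leaf(skiplist); leaf;
     leaf = sl_next_leaf(skiplist, &leaf->node, 0)) {
        /* operate in bulk on the slots in leaf->ptrs[0 .. leaf->nr - 1] */
}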
Since this skiplist implementation focuses on index ranges (or extents) defined
by key and size parameters, it can provide search functions. This is in contrast
to rbtrees—they are more diverse, so users must roll their own search
functions. Each of the skiplist search functions needs to be passed a pointer to
the skiplist, the key you are looking for, and the size of the range. If
successful, they return a pointer to the slot matching the key.
struct sl_slot *skiplist_lookup(struct sl_list *list, unsigned long key,
unsigned long size);
struct sl_slot *skiplist_lookup_rcu(struct sl_list *list, unsigned long key,
unsigned long size);
The first, skiplist_lookup(), is appropriate for when a skiplist is
experiencing high read/write contention. It handles all the locking for you.
It protects the skiplist with read-copy-update
(RCU) while it finds the correct leaf and then it protects the leaf with a
spinlock during a binary search to find the slot. If no slot corresponds to the
key, then a NULL pointer is returned.
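A minimal sketch of its use (with hypothetical key and size values) might be:

struct sl_slot *slot;
struct mystruct *mystruct;

slot = skiplist_lookup(skiplist, key, size);
if (!slot)
        return -ENOENT;         /* no slot corresponds to the key */
mystruct = sl_slot_entry(slot, struct mystruct, slot);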
If skiplist contention is low or you need more control, then use the second
variant. Before calling skiplist_lookup_rcu(), you must call
rcu_read_lock() and you must take care of details such as reference
counting yourself. The search for the leaf uses the same helper function
as skiplist_lookup(), but the leaf spinlock is not held. Instead, it
depends on the skiplist's RCU read lock being held to also protect the slots in
a leaf while it performs a sequential search. This search is sequential because
Chris does not do the copy part of RCU. He does order the operations of
insertion/deletion to try to make the sequential search safe, and that should
usually work. However, it might not return the slot of interest, so it is
the caller's responsibility to verify the key of the returned slot and to
call skiplist_lookup_rcu() again if that key doesn't match the key
being searched for.
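A sketch of that retry pattern, assuming the slot carries its key (as the
verification requirement suggests) and with reference counting omitted,
might look like:

struct sl_slot *slot;

rcu_read_lock();
do {
        slot = skiplist_lookup_rcu(skiplist, key, size);
        /* the unlocked search may return a neighboring slot; re-check */
} while (slot && slot->key != key);
/* take a reference on the embedding structure before unlocking */
rcu_read_unlock();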
Chris elaborated on his future plans for the API in a private email:
In terms of changes coming to the patches, the biggest will be in the
insert code. Right now skiplist_insert does the search, cursor
maintenance, and the insert, but that won't work for XFS because they
need more control over the
EEXIST condition.
It'll get split out into search and insert steps the caller can control,
and you'll be able to call insert with just a locked leaf from any
level...
The searching API will also improve, returning both the leaf and the
slot. This allows skiplist versions of rb_next() and rb_prev().
The skiplist code also indicates that there is work to be done to make lockdep
understand Chris's skiplist locking. It needs to be taught that holding multiple
locks on the same level of a skiplist is allowed as long as they are taken left
to right.
Testing
In addition to the IOMMU comparison between rbtrees and skiplists that
Chris posted
numbers for, his patch also includes a simple RAM-only comparison in the
form of a
kernel module called skiplist_test.
I tested 100,000 items for 100,000 rounds with varying numbers of threads.
This table shows the results:
| ADT          | Threads | Fill Time (ms) | Check Time (ms) | Delete Time (ms) | Avg. Thread Time (s) |
| rbtree       | 1       | 37             | 9               | 12               | 0.752                |
| skiplist-rcu | 1       | 18             | 15              | 23               | 2.615                |
| rbtree       | 2       | 36             | 8               | 12               | 2.766                |
| skiplist-rcu | 2       | 19             | 19              | 27               | 2.713                |
| rbtree       | 4       | 36             | 11              | 10               | 6.660                |
| skiplist-rcu | 4       | 23             | 24              | 21               | 3.161                |
These results show skiplists beating rbtrees in fill time, but losing on
check and delete times. The skiplist average thread time is only slightly
better with two threads, and beats rbtree soundly with four threads (they
take half the time). However, rbtree wins the single-threaded case, which
surprised Chris because it doesn't match what he sees in user-space
testing. He told me: "Most of the difference is the cost of calling
spin_lock (even single threaded)."
The more interesting numbers are from Chris's IOMMU comparison.
Even though he is mostly interested in using skiplists for Btrfs extents, he
chose to use the IOMMU because it is easier to isolate the performance of the
two data structures, which makes it both easier for non-Btrfs people to
understand and more meaningful to them. He also says, "... with the IOMMU,
it is trivial to consume 100% system time on the rbtree lock."
The rbtree lock is, in effect, a global lock held once at the start and once at
the end of an IO.
Chris kept the basic structure of the IOMMU code so that he could compare
skiplists to rbtrees. He was not trying to design a better IOMMU that
looked for free ranges of addresses differently or fix the IOMMU contention,
though he told me he would work with David Woodhouse on a proper solution that
tracks free extents later this year.
His benchmarks were run on a single socket server with two SSD cards.
He used a few FIO jobs doing
relatively large (20MB) asynchronous/direct IOs with 16
concurrent threads and 10 pending IOs each (160 total).
Here are his results for streaming and random writes:
Streaming writes
    IOMMU off   2,575MB/s
    skiplist    1,715MB/s
    rbtree      1,659MB/s

Not a huge improvement, but the CPU time was lower.
[...]

16 threads, iodepth 10, 20MB random writes
    IOMMU off   2,548MB/s
    skiplist    1,649MB/s
    rbtree         33MB/s
The existing rbtree-based IOMMU slows streaming writes to 64.4% of the
maximum throughput; the skiplist does slightly better at 66.6% while
using less CPU time. Evidently the skiplist's advantages in concurrency and
in maintaining a balanced overall structure give it only a modest edge in
the streaming-write case. Random writes, however, drop rbtree performance
to just 1.3% of the maximum throughput. Here the skiplist fares much
better, falling only to 64.7% of the maximum, because different threads can
hold locks simultaneously in different parts of the skiplist and there is
no costly rebalancing operation like the rbtree's.
16 threads, iodepth 10, 20MB random reads
    IOMMU off   2,861MB/s  (mostly idle)
    skiplist    2,484MB/s  (100% system time)
    rbtree         99MB/s  (100% system time)

... lowering the thread count did improve the rbtree performance, but the
best I could do was around 300MB/s ...
Reads are easier than writes, and we could expect streaming read results to all be close and relatively uninteresting. Certainly both the rbtree and skiplist do better at random reads than random writes. In fact, the skiplist achieves higher throughput for random reads than it does for streaming writes although it has to work hard to do so. And in case anyone thought the thread count was particularly unfair for rbtree in these tests, Chris points out that the best he got for random IOs with rbtree was around 300MB/s. That's still only 10% of the maximum throughput. Furthermore, Chris noted that all of the CPU time spent in the skiplist was in skiplist_insert_hole(), which isn't optimized.
In a recent discussion on the Linux filesystems mailing list, Mathieu
Desnoyers proposed another data structure that he is calling
RCU Judy
arrays. They can't be compared with skiplists just yet since the Judy
arrays are
only implemented in user space so far, but the competition between the two
ideas should improve them both.
Even though there are plenty of opportunities for refinement, this is a
promising start for a cache-friendly skiplist for the Linux kernel. It should
provide better performance for any subsystem that has high levels of contention
between concurrent accesses of its rbtrees:
various filesystem indexes, virtual memory areas (VMAs), the high-resolution
timer code, etc. CPU schedulers will probably not see any benefit from skiplists
because only one thread is making the scheduling decision, but perhaps
multiqueue schedulers for the network or block layer might benefit when
they have one queue per NUMA node.
At LinuxCon
Japan, Yasuaki Ishimatsu of Fujitsu talked about the status of memory
hotplug, with a focus on what still needs to be done to fully support both
hot adding and hot removing memory. If a memory device is broken in a
laptop or desktop, you can just replace that memory, but for servers, especially
ones that need to stay running, it is more difficult. In addition,
having a way to add and remove memory would allow for dynamic
reconfiguration on systems where the hardware has been partitioned into two
or more virtual machines.
The focus of the memory hotplug work is for both scenarios: broken memory
hardware and dynamic reconfiguration. Memory hotplug will be supported in
KVM, Ishimatsu said. It is currently supported by several operating
systems, but Linux does not completely support it yet. Fixing that is the
focus of this work.
There are two phases to memory hotplug: physically adding or removing
memory (hot add or hot remove) and logically changing the amount of memory
available to the system (onlining or offlining memory). Both phases have
to be completed before Linux can use any new memory, and taking the memory
offline (so that Linux is no longer using it) is required before it can be
removed.
The memory management subsystem manages physical memory by using two
structures, he said. The page tables hold a direct mapping for virtual to
physical addresses. The virtual memory map manages page structures. In
order to offline memory, any data needs to be moved out of the memory and
those data structures need to be updated. Likewise, when adding memory,
new page table and
virtual memory map entries must be added.
Pages are managed in zones and, when using the sparse memory model that is
needed for memory hotplug systems, zones are broken up into sections that are
128M in size. Sections can be switched from online to offline and vice
versa using the /sys/devices/system/memory/memoryX/state file. Echoing
offline or online into that file changes the state of the pages in that
section to unusable or usable, respectively.
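For example, taking a section offline might look like this (the section
number memory32 is an arbitrary illustration):

# cat /sys/devices/system/memory/memory32/state
online
# echo offline > /sys/devices/system/memory/memory32/state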
In the 3.2 kernel, hot adding memory and onlining it were fully
supported. Offlining memory was supported with limitations, and hot
removing it was not supported at all. Work started in July 2012 to remove
the offline limitations and to add support for hot remove, Ishimatsu said.
The work for hot remove has been merged for the 3.9 kernel. It will invalidate
page table and virtual memory map entries that correspond to the memory
being removed. But, since the memory must be taken offline before it is
removed, the limitations on memory offline still make it impossible to
remove arbitrary memory hardware from the system.
When memory that is to be offlined has data in it, that data is migrated to
other memory in the system. But the only pages that are migratable this
way are the page cache and anonymous pages, which are known as "movable"
pages. If a section contains non-movable pages, which Ishimatsu called
"kernel memory," it cannot be offlined.
Two ways of handling that problem are being considered. The
first is to support moving kernel memory when offlining pages
that contain it. The advantages to that are that all memory can be
offlined and there is no additional performance impact for NUMA systems
since there are no restrictions on the types of allocations that can be
made. On the
downside, though, the kernel physical to virtual address relationship will
need to change completely. The other alternative is to make all of a
node's memory movable, which would reuse the existing movable memory
feature, but means that only page cache and anonymous pages can be stored
there, which will impact the performance of that NUMA node.
Ishimatsu said that he prefers the first solution personally, but, as a
first step they are implementing the second: creating a node that consists
only of movable memory. Linux has the idea of a movable zone
(i.e. ZONE_MOVABLE), but zones of that type are not created
automatically. If a node consists only of movable memory, all of it can be
migrated elsewhere so that the node can be taken offline.
A new boot option, movablecore=acpi, is under development that will use
the memory affinity structure in the ACPI static resource affinity table
(SRAT) to choose which nodes will
be constructed of movable memory. The existing use for
movablecore allows setting aside a certain amount of memory that
will be movable in the system, but that memory is spread evenly across all
of the nodes rather than concentrated on the nodes of interest. The "hotpluggable"
bit for a node in the SRAT will be used to choose the target nodes in the
new mode.
Using the online_movable flag to the sysfs memory state
file (rather than just online) allows an administrator to tell the
system to make that memory movable. Without that, the onlined memory is
treated as ZONE_NORMAL, so it may contain kernel memory and thus
not be able to be offlined. The online_movable feature was merged
for 3.8. That reduces the limitations on taking memory offline, but there
is still work to do.
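For example, onlining a section as movable memory might look like (again
with an arbitrary section number):

# echo online_movable > /sys/devices/system/memory/memory32/state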
Beyond adding the movablecore=acpi boot option (and possibly a
vm.hotadd_memory_treat_as_movable sysctl), there are some other
plans. Finding a way to put the page tables and virtual memory map into the
hot-added memory is something Ishimatsu would like to see, because it would
help performance on that node, but would not allow that memory to be
offlined unless those data structures can be moved. He is thinking about
solutions for
that. Migrating vmalloc() data to other nodes when offlining a
node is another feature under consideration.
Eventually, being able to migrate any kernel memory out of a node is
something he would like to see, but solutions to that problem are still somewhat
elusive. He encouraged those in attendance to participate in the
discussions and to help find solutions for these problems.
[I would like to thank the Linux Foundation for travel assistance to Tokyo
for LinuxCon Japan.]
By Jake Edge
June 12, 2013
While three kernel internships for women were originally
announced in late April, the size of the program has more than doubled
since then. Seven internships have been established for kernel work
through the Outreach Program for
Women (OPW); each comes with a $5000 stipend and a $500 travel grant. The
program officially kicks off on June 17, but the application process
already brought in several hundred patch submissions from eighteen
applicants, 137 of which were
accepted into the staging and Xen trees—all in thirteen days.
The program was initiated by the Linux Foundation, which found sponsors for
the first three slots, but Intel's Open Source Technology Center added
three more
while the OPW itself came up with funding for another. The OPW has
expanded well beyond its GNOME project roots, with eighteen different
organizations (e.g. Debian, KDE, Mozilla, Perl, Twisted, and many more)
participating in this round.
The program pairs the interns with a mentor from
a participating project to assist the intern with whatever planned work
she has
taken on for the three months of each program round. OPW is patterned
after the Google Summer of Code project, but it is not restricted to
students or programmers; other kinds of projects (and applicants) are
explicitly allowed. As the name would imply, it does restrict applicants to
those who self-identify as women.
The kernel effort has been guided by Sarah Sharp, who is a USB 3.0 kernel
hacker for Intel. She is also one of the mentors for this round. In late
May, she put together a blog
post that described the application process and the patches it brought
in. Sharp filled us in on the chosen interns. In addition, most of
the patches accepted can be seen in her cherry-picked kernel
git tree.
The interns
Sharp will be mentoring Ksenia (Xenia) Ragiadakou, who will be working on
the USB 3.0 host driver. Ragiadakou is currently studying for her bachelor's degree in
computer science at the University of Crete in Greece. In addition to her
cleanup patches for the rtl8192u wireless staging driver,
Ragiadakou has already found a bug
in Sharp's host controller driver.
Two of the interns will be working on the Xen subsystem of the kernel with
mentors Konrad Wilk of Oracle and Stefano Stabellini of Citrix. They are Lisa
T. Nguyen, who received a bachelor's degree in computer science from the
University of Washington in 2007, and Elena Ufimtseva, who got a master's
degree in
computer science from St. Petersburg University of Information
Technologies in 2006. Nguyen did several cleanup
patches for Xen (along with various other cleanups) as part of the
application process, while Ufimtseva focused on cleanups in the ced1401
(Cambridge Electronics 1401 USB device) driver in staging.
Lidza Louina will be working with Greg Kroah-Hartman as a mentor on further
cleanups in staging drivers. She was working on a bachelor's degree in
computer science at the
University of Massachusetts but had to take time off to work full-time.
Her contributions were to the csr wireless driver in the staging tree.
Tülin İzer is working on parallelizing the x86 boot process with mentor PJ
Waskiewicz of Intel. She is currently pursuing a bachelor's
degree in computer engineering at Galatasaray University in Istanbul,
Turkey. Her application included fixes for several staging drivers.
Two other Intel-mentored interns are in the mix: Hema Prathaban will be
working with Jacob Pan on an Ivy Bridge temperature sensor driver, while
Laura Mihaela Vasilescu will be working on Intel Ethernet drivers, mentored
by Carolyn Wyborny and Anjali Singhai. Prathaban graduated in 2011 from KLN College
of Engineering in India with a bachelor's degree in computer science. She
has been a full-time mother for the last year, so the internship provides
her a way to get back into the industry. Vasilescu is a master's student at
the University of Politehnica of Bucharest, Romania and is also the student
president of ROSEdu, an organization
for Romanian open source education. Both did a number of patches: Prathaban
in the staging tree (including fixing a bug
in one driver) and Vasilescu in Intel Ethernet drivers.
Getting started
As with many budding kernel developers, most of the applicants' patches
were to various staging drivers. There was a short application window as
the kernel portion didn't get announced until a little under two weeks before
the deadline. But that didn't seem to slow anything down as there were 41
applicants for the internships, with eighteen submitting patches and eleven
having those patches accepted into the mainline.
That level of interest—and success—is partly attributable to a first
patch tutorial that she wrote, Sharp said.
The tutorial
helps anyone get started with kernel development from a fresh Ubuntu 12.04
install. It looks at setting up email, getting a kernel tree, using git,
building the kernel, creating a patch, and more. The success was also due to strong
applicants and to mentors who were "patient and encouraging,"
she said.
The kernel OPW program was mentioned multiple times at the recently held
Linux Foundation conferences in Japan as a helpful step toward making the
gender balance of kernel developers better represent the world we live in
(as Dirk Hohndel put it). It is also nice to see the geographical diversity
of the interns, with Asia, Europe, and North America all represented.
Hopefully South America, Africa, and Oceania will appear in follow-on
rounds of the program—Antarctica may not make the list for some time to come.
Another round of the OPW, including kernel internships, is planned for
January through March 2014 (with application deadlines in December). The
program is seeking more interested projects, mentors, and financial backers
for the internships. While there are certainly critics of these types of
efforts, they have so far proved to be both popular and effective. Other
experiments, using different parameters or criteria, are definitely
welcome, but reaching out and making an effort to bring more women into the
free-software fold is something that will hopefully be with us for some
time—until that hoped-for day when it isn't needed at all anymore.