Kernel development
Brief items
Kernel release status
The current development kernel is 3.18-rc4, released on November 9. "Hey, things are finally calming down. In fact, it looked *really* calm until yesterday, at which point some people clearly realized 'hey, I should push my stuff to Linus so that it makes it into -rc4', and then a third of all changes came in the last day, but despite that, rc4 finally looks like things are falling into place, and we'll get to stabilize this release after all."
Stable updates: no stable kernel updates have been released in the last week. The 3.17.3 (319 patches!), 3.14.24, and 3.10.60 updates are in the review process as of this writing; they can be expected on or after November 14.
Kernel development news
An introduction to compound pages
Your editor was digging through a patch set that makes changes involving compound pages when he realized that his understanding of these pages was a bit on the weak side. After some time digging through the source to rectify that situation, a thought surfaced: the world must be full of people wishing they knew more about compound pages. For all of you whose list of desired lifetime accomplishments includes a better understanding of this subject, here is a quick introduction to compound pages in the Linux kernel.

A compound page is simply a grouping of two or more physically contiguous pages into a unit that can, in many ways, be treated as a single, larger page. They are most commonly used to create huge pages, used within hugetlbfs or the transparent huge pages subsystem, but they show up in other contexts as well. Compound pages can serve as anonymous memory or be used as buffers within the kernel; they cannot, however, appear in the page cache, which is only prepared to deal with singleton pages.
Allocating a compound page is a matter of calling a normal memory allocation function like alloc_pages() with the __GFP_COMP allocation flag set and an order of at least one. It is not possible to create an order-zero (single-page) compound page due to the way compound pages are implemented. (The "order" of an allocation is the base-2 logarithm of the number of pages to allocate; zero thus corresponds to a single page, one to two pages, etc.).
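As a concrete illustration, here is a minimal, hedged sketch (not taken from any particular driver) of allocating and later freeing an order-2 compound page; the variable names are invented and error handling is kept to a bare minimum:

    struct page *page;

    /* Allocate four physically contiguous pages as a single compound page. */
    page = alloc_pages(GFP_KERNEL | __GFP_COMP, 2);
    if (!page)
            return -ENOMEM;

    /* ... use the memory, for example via page_address(page) ... */

    __free_pages(page, 2);    /* the order must match the allocation */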
Note that a compound page differs from the pages returned from a normal higher-order allocation request. A call like:
pages = alloc_pages(GFP_KERNEL, 2); /* no __GFP_COMP */
will return four physically contiguous pages, but they will not be a compound page. The difference is that creating a compound page involves the creation of a fair amount of metadata; much of the time, that metadata is unneeded so the expense of creating it can be avoided.
So what does that metadata look like? Since most of it is stored in the associated page structures, one can assume that it's complicated. Let's start with the page flags. The first (normal) page in a compound page is called the "head page"; it has the PG_head flag set. All other pages are "tail pages"; they are marked with PG_tail. At least, that is the case on systems where page flags are not in short supply — 64-bit systems, in other words. On 32-bit systems, there are no page flags to spare, so a different scheme is used; all pages in a compound page have the PG_compound flag set, and the tail pages have PG_reclaim set as well. The PG_reclaim bit is normally used by the page cache code, but, since compound pages cannot be represented in the page cache, that flag can be reused here.
Code dealing with compound pages need not worry about the different marking conventions, though. No matter which convention is in use, a call to PageCompound() will return a true value if the passed-in page is a compound page. Head and tail pages can be distinguished, should the need arise, with PageHead() and PageTail().
Every tail page has a pointer to the head page stored in the first_page field of struct page. This field occupies the same storage as the private field, the spinlock used when the page holds page table entries, or the slab_cache pointer used when the page is owned by a slab allocator. The compound_head() helper function can be used to find the head page associated with any tail page.
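A quick usage sketch may help tie these helpers together; the function name here is invented purely for illustration:

    /* Given an arbitrary page, return the head of its compound page, or
     * the page itself if it is not part of a compound page at all. */
    static struct page *my_resolve_head(struct page *page)
    {
            if (!PageCompound(page))
                    return page;             /* ordinary, standalone page */
            if (PageHead(page))
                    return page;             /* already the head page */
            return compound_head(page);      /* tail page: follow first_page */
    }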
There is a bit of information describing the compound page as a whole: the order (size) of the page, and a destructor used to return the page to the system when it is no longer needed. One might first think to store that information in the head page's struct page, but there is no room for it there. Instead, the order is stored in the lru.prev field in the page structure for the first tail page. While unions are used for many of the overlaid fields in struct page, here the order is simply cast into a pointer type before being stored in a pointer field. Similarly, a pointer to the destructor is stored in the lru.next field of the first tail page's struct page. This extension of compound-page metadata into the second page structure is why compound pages must consist of at least two pages.
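In kernels of this era, the helpers in <linux/mm.h> that implement this trick look roughly like the following (lightly simplified here; consult the actual source for the details):

    static inline int compound_order(struct page *page)
    {
            if (!PageHead(page))
                    return 0;
            /* The order was cast to a pointer and stored in the first tail
             * page's lru.prev field. */
            return (int)(unsigned long)page[1].lru.prev;
    }

    static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
    {
            /* The destructor pointer lives in the first tail page's lru.next. */
            return (compound_page_dtor *)page[1].lru.next;
    }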
Incidentally, there are only two compound page destructors declared in the kernel. By default, free_compound_page() is used; all it does is return the memory to the page allocator. The hugetlbfs subsystem, though, uses free_huge_page() to keep its accounting up to date.
In most cases, compound pages are unnecessary and ordinary allocations can be used; calling code needs to remember how many pages it allocated, but otherwise the metadata that would be stored in a compound page is unneeded. A compound page is indicated, though, whenever it is important to treat the group of pages as a whole even if somebody references a single page within it. Transparent huge pages are a classic example; if user space attempts to change the protections on a portion of a huge page, the entire huge page will need to be located and broken up. Various drivers also use compound pages to ease the management of larger buffers.
And that is pretty much everything that distinguishes a compound page from an ordinary, higher-order allocation. Most developers will not encounter compound pages in their area of the kernel. In cases where it is truly necessary to treat a set of pages as a single unit, though, compound pages may well be part of the solution toolkit.
Transparent huge page reference counting
While most architectures supported by Linux use a 4KB page size, most of them also are able to work with much larger pages, varying from 2MB to 1GB in size. These "huge pages" offer significant performance advantages for many workloads, the biggest of which is usually the reduction of pressure on the translation lookaside buffer which short-circuits the process of turning a virtual address into a physical address. The kernel's transparent huge pages (THP) feature enables the use of huge pages without the need for any sort of developer or user intervention. THP suffers from some limitations that prevent it from being used as fully as one might like, though; a patch set from Kirill A. Shutemov aims to reduce those limitations, at the cost of making changes to some fairly complex code.
Transparent huge pages
THP works by quietly substituting huge pages into a process's address space when (1) those pages are available and (2) it appears that the process would benefit from the change. When Andrea Arcangeli first added this feature to the 2.6.38 kernel, he had a formidable problem to face: there were many places in the kernel's memory-management code that were not prepared to cope with huge pages scattered randomly in a process's address space. Andrea dealt with some of those problems by avoiding them entirely; that is why, for example, page-cache pages (those backed by files on disk) cannot be huge pages in current kernels.
In many other situations, Andrea placed a call to split_huge_page(), a function which breaks a huge page down into its component small pages. Whenever he encountered code that could not cope with a huge page, and that he wasn't able to fix at the time, he put in a split_huge_page() call to simply make the huge page go away. These calls have a clear performance cost, since they undo the work that put the huge page into place to begin with. But they reduced the problem to something more tractable; split_huge_page() can thus be thought of as a crutch similar to the big kernel lock. It is almost never the optimal solution to the problem, but it is a solution that can be made to work now, deferring a hard problem for a later time.
Over time, some of the split_huge_page() calls in the kernel have been replaced with code that can handle huge pages. But they still exist in the page migration code, in the implementation of kernel samepage merging (KSM), in the bad-memory poisoning code, in the implementation of mprotect() and mlock(), and in the swap code, among other places. Some of these cases are unlikely to change; KSM will probably never be able to merge duplicate pages if it cannot split them out of huge pages. Others might seem just as resistant to change; what does it mean to change the protection of, say, half of a huge page with mprotect()? It turns out that this latter case can be optimized, though, as part of a bigger effort to rework and simplify how the management of transparent huge pages and their references counts is done.
PMD- and PTE-level mappings
To understand Kirill's patch set, it is good to remember that huge pages are represented in the kernel as compound pages. Readers who are unfamiliar with how compound pages work may want to take a look at this article for some basic background and terminology.
Kirill's ultimate goal is to enable the use of transparent huge pages with the page cache; currently, only anonymous pages can be replaced with huge pages, limiting their use to a fraction of total memory. That is an ambitious goal, and the current patch set does not even try to approach it. Instead, Kirill has worked to simplify the management of transparent huge pages and make them more flexible.
In particular, he has eliminated the hard separation between normal and huge pages in the system. In current kernels, a specific 4KB page can be treated as an individual page, or it can be part of a huge page, but not both. If a huge page must be split into individual pages, it is split completely for all users, the compound page structure is torn down, and the huge page no longer exists. The fundamental change in Kirill's patch set is to allow a huge page to be split in one process's address space, while remaining a huge page in any other address space where it is found.
Time for a quick reminder of how page tables are structured on Linux systems; the diagram in this 2005 LWN article shows the hierarchy, running from the page global directory (PGD) through the page upper and middle directories down to the individual page-table entries (PTEs).
A huge page is represented in a process's page table with a single entry at the page middle directory (PMD) level. Individual pages, instead, have entries at the bottom page-table entry (PTE) level, as shown in the diagram. But there is nothing that says that the same memory must be mapped in the same way in all processes; it is perfectly legitimate for one process to see a 2MB range as a single huge page while another has it mapped as 512 individual PTEs. If this type of different mapping were supported, one process could call mprotect() to change the protections on a portion of a huge page (causing the mapping to be split in that process's address space) while not disturbing the huge page mapping in other processes, which are not affected by the protection change.
In other words, if split_huge_page() could be replaced by a new function, call it split_huge_pmd(), that would only split up a single process's mapping of a huge page, code needing to deal with individual pages could often be accommodated while preserving the benefits of the huge page for other processes. But, as noted above, the kernel currently does not support different mappings of huge pages; all processes must map the memory in the same way. This restriction comes down to how various parameters — reference counts in particular — are represented in huge pages.
Huge-page reference counting
A reference count tracks the number of users an object (such as a page in memory) has, allowing the kernel to determine when the object is free and can be deleted. There are actually two types of reference counts for a normal page. The first, stored in the _count field of struct page, is the total number of references held to the page. The second, kept in _mapcount, is the number of page table entries referring to this page. A page-table mapping is a reference, so every such reference counted in _mapcount is also tracked in _count; the latter should thus always be greater than or equal to the former. Situations where _count can exceed _mapcount include pages mapped for DMA and pages mapped into the kernel's address space with a function like get_user_pages(). Locking a page into memory with mlock() will also increase _count. The relative value of these two counters is important; if _count equals _mapcount, the page can be reclaimed by locating and removing all of the page table entries. But if _count is greater than _mapcount, the page is "pinned" and cannot be reclaimed until the extra references are removed.
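As a rough illustration only (this is not the kernel's actual reclaim test), the comparison described above might be written as follows; the helper name is hypothetical:

    /* True if a (non-compound) page holds references beyond its page-table
     * mappings and thus cannot be reclaimed by simply unmapping it. */
    static bool my_page_is_pinned(struct page *page)
    {
            /* page_count() reads _count; page_mapcount() reads _mapcount,
             * adjusting for the -1 bias used in its storage. */
            return page_count(page) > page_mapcount(page);
    }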
The reference count rules change with compound pages, though. In that case, the value of _count is held at zero for all tail pages to avoid confusing other parts of the memory management code. Reference counts are, as a rule, tracked in the head page for the compound page as a whole, but it is still necessary to track references on individual (small) pages within the compound page. Imagine a situation where one small page within the compound page is used for an I/O operation, for example. The trick that is used is to keep that reference in _mapcount instead and to have the various helper functions that access reference counts pick the right field depending on whether a given page is a tail page or not.
This trick works because huge pages are mapped and unmapped as a unit, so there is no need to track the mapping of tail pages separately. If, however, one wants to allow the mapping of individual pages within a huge page, things fall apart, because it will become necessary to track the mappings to those individual pages. So this trick must go; it must be replaced by a scheme that can track both the mappings to the huge page as a whole and the individual pages that make up that huge page.
The first part, tracking mappings to the huge page as a whole, is done with another increasingly familiar (to those who read the article on compound pages) trick: this count is placed into the mapping field of the first tail page in the huge page. This field is normally used to track the file that has been mapped into this page of memory; since huge pages are not used in the page cache, there is no file mapping and this space is available for other uses. The count is an atomic_t, while mapping is a pointer to struct address_space, so yet another cast is used. One might argue that another union field should be added to make the overloading explicit, but that was not done here.
There is still the matter of tracking non-mapping reference counts to tail pages — the reason _mapcount was used in those pages to begin with. This tracking is done so that, should the huge page be split, the individual pages with elevated reference counts can be properly marked. Kirill's approach is to give up on that objective and, in the process, change how non-mapping references to tail pages are tracked. Instead, a call to get_page() on a tail page will increment the reference count on the head page. So the knowledge that a specific tail page is pinned is replaced with the knowledge that some page, somewhere within the compound page, is pinned.
That, naturally, will destroy the ability to mark individual pages as being pinned if the huge page is split. Kirill's answer to that is to simply cause split_huge_page() to fail if any pages within the huge page are pinned. Note that it will not cause split_huge_pmd(), which splits only a single process's mapping, to fail. Most code that formerly had to call split_huge_page() can thus keep working, since the huge page itself no longer needs to be split. The cases where a full split is absolutely necessary will simply fail, but, Kirill says, the code is prepared for that eventuality in all cases.
Removing tail page reference counting enables the removal of a surprising amount of special-case code. It also frees up _mapcount for its original purpose: to track the number of page-table mappings to the page. At that point, it becomes possible for one process to map a huge page as a single page, while another maps it as a set of individual pages.
Kirill claims that the code simplifications yield performance improvements on their own, though no benchmark results have been posted. Allowing some processes to retain a huge-page mapping when others need a split view should also make things go faster in cases where memory is shared. This feature should make a bigger difference, though, in the future when THP can be used with the page cache, where sharing of pages happens rather more often. That day will not come soon, though; first this set of patches must find its way into the mainline. Given that the work is being done in low-level memory-management code, and given that Kirill thinks there are probably a few surprises (code that doesn't expect an individual page to be a tail page, for example) yet to be found, that probably is not going to happen in the immediate future.
Recent read-mostly research
One of my more unusual hobbies is occasionally reviewing academic papers. It should come as no surprise that I pay special attention to papers related to read-copy update (RCU), which is a category that I am happy to see has been growing significantly over the past few years. This article gives a quick summary of some recent papers in this category.
But first, what is RCU? It is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently described as a replacement for reader-writer locking, but has also been used in a number of other ways. It is notable in that RCU readers do not directly synchronize with updaters, which makes RCU read paths extremely fast. It also permits readers to accomplish useful work even when running concurrently with updaters—and vice versa.
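For readers who have not encountered it before, here is a minimal, generic sketch of the kernel's RCU API; it is not specific to any of the work described below, and the structure and function names are invented for illustration:

    struct foo {
            int value;
            struct rcu_head rcu;
    };

    static struct foo __rcu *global_foo;

    /* Reader: runs concurrently with updaters and never blocks them. */
    static int read_value(void)
    {
            struct foo *p;
            int v = -1;

            rcu_read_lock();
            p = rcu_dereference(global_foo);
            if (p)
                    v = p->value;
            rcu_read_unlock();
            return v;
    }

    /* Updater: publish a new version, then free the old one after a grace
     * period; assumes the caller serializes updaters with its own lock. */
    static void update_value(struct foo *newp)
    {
            struct foo *old;

            old = rcu_dereference_protected(global_foo, 1);
            rcu_assign_pointer(global_foo, newp);
            if (old)
                    kfree_rcu(old, rcu);
    }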
Although the earliest known work on mechanisms resembling RCU was carried out by academics (see for example Kung and Lehman's landmark 1980 paper), by the early 1990s, with a few notable exceptions (Ben Gamsa, University of Toronto, et al. [PDF], Robert Olsson, Uppsala University, et al. [PDF], Thomas E. Hart, University of Toronto, et al. [PDF], and Jonathan Appavoo, IBM T.J. Watson Research Center, et al. [PDF]), much of the work in this area was carried out by practitioners. By the early 2000s, the initiative had passed to open-source projects, most notably to the Linux kernel community.
Quick Quiz 1: But what can I do if I work with production-quality code but nevertheless would like to discover something new?
Answer
However, there are welcome signs that some in academia are showing interest in RCU; see, for example, Michael L. Scott's textbook that includes a chapter on RCU. One reason that this interest is welcome is that academics, unlike large open-source projects, are unconstrained by the need to develop production-quality code. This allows them to try crazier and more experimental ideas.
Although many of these ideas might seem to have no value outside of academia, the fact is that many of the best yet-undiscovered ideas are protected by wide moats of insanity—with RCU itself being a case in point. Therefore, if you are unwilling to deal with insanity, the odds on your discovering something new decrease, and, of course, those of us who deal with production-quality code must be quite careful with any insanity we encounter. Furthermore, even an otherwise useless idea might inspire a developer to come up with something that is both valuable and useful, or failing that, to avoid a pitfall. Although there are no guarantees, the hope is that increased academic work in the area of read-mostly concurrency will result in new breakthroughs.
Validation
Validation is one important area in which breakthroughs would be quite welcome. Peter Sewell's and Susmit Sarkar's groups at the University of Cambridge have done some important work in validating memory barriers and atomic instructions, as was reported earlier. This work has been extended with greatly increased performance by some of Sewell's and Sarkar's collaborators at University College London (Jade Alglave), INRIA (Luc Maranget), and Queen Mary University of London (Michael Tautschnig), most notably in this paper [PDF], which was described in this LWN article. Researchers at Stony Brook University have produced an RCU-aware data-race detector, described by Abhinav Duggal [PDF] and Justin Seyster [PDF]. Alexey Gotsman of IMDEA, Noam Rinetzky of Tel Aviv University, and Hongseok Yang of the University of Oxford have published a paper [PDF] expressing the formal semantics of RCU in terms of separation logic, and have continued with other aspects of concurrency. With some luck, all of this validation work will eventually result in more and better tools for validating concurrent code.
Using RCU
Phil Howard and Jon Walpole of Portland State University (PSU) have applied RCU to red-black trees [PDF] combined with updates synchronized using software transactional memory. Josh Triplett and Jon Walpole (again of PSU) applied RCU to resizable hash tables [PDF], reported on in two parts: Part 1 and Part 2. (Other RCU-protected resizable hash tables have been created by Herbert Xu and by Mathieu Desnoyers.)
Quick Quiz 2: Why don't more academic research projects make full use of open-source projects such as the Linux kernel?
Answer
Austin Clements, Frans Kaashoek, and Nickolai Zeldovich of MIT created an RCU-optimized balanced binary tree (Bonsai) [PDF], and applied this tree to the Linux kernel's VM subsystem (Git trees) in order to reduce read-side contention on mmap_sem. This work resulted in order-of-magnitude speedups and scalability up to at least 80 CPUs for a microbenchmark featuring large numbers of minor page faults. This is similar to a patch developed earlier by Peter Zijlstra, and both were limited by the fact that, at the time, filesystem data structures were not safe for RCU readers. Clements et al. avoided this limitation by optimizing the page-fault path for anonymous pages only. More recently, filesystem data structures have been made safe for RCU readers (covered in a 2010 article and another from 2011), so perhaps this work can be implemented for all page types, not just anonymous pages—Peter Zijlstra has, in fact, recently prototyped exactly this. In any case, this use of the Linux kernel itself as a testbed for cutting-edge academic research is a welcome development.
Yandong Mao and Robert Morris of MIT and Eddie Kohler of Harvard University created another RCU-protected tree named Masstree [PDF] that combines ideas from B+ trees and tries. Although this tree is about 2.5x slower than an RCU-protected hash table, it supports operations on key ranges, unlike hash tables. In addition, Masstree supports efficient storage of objects with long shared key prefixes and, furthermore, provides persistence via logging to mass storage.
The paper notes that Masstree's performance rivals that of memcached, even given that Masstree is persistently storing updates and memcached is not. The paper also compares Masstree's performance to the persistent datastores MongoDB, VoltDB, and Redis, reporting significant performance advantages for Masstree, in some cases exceeding two orders of magnitude. Another paper [PDF], by Stephen Tu, Wenting Zheng, Barbara Liskov, and Samuel Madden of MIT and Kohler, applies Masstree to an in-memory database named Silo, achieving 700K transactions per second (42M transactions per minute) on a well-known transaction-processing benchmark. Interestingly enough, Silo guarantees linearizability without incurring the overhead of grace periods while holding locks.
For my part, I have been doing some work enabling complex atomic updates to RCU-protected data structures with minimal copying, which I recently presented [PDF] at the C++ Conference (CppCon). (Next step: Deal with bottlenecks in user-mode memory allocators.)
Using and implementing RCU
Maya Arbel and Hagit Attiya of Technion took a more rigorous approach [PDF] to an RCU-protected search tree that, like Masstree, allows concurrent updates. This paper includes a proof of correctness, including proof that all operations on this tree are linearizable. Unfortunately, this implementation achieves linearizability by incurring the full latency of grace-period waits while holding locks, which degrades scalability of update-only workloads. One way around this problem is to abandon linearizability (as advocated here [PDF] and here [PDF]); Arbel and Attiya instead created an RCU variant that reduces low-end grace-period latency. Of course, nothing comes for free, and this RCU variant appears to hit a scalability limit at about 32 CPUs. Although I personally favor dropping linearizability, thus gaining both performance and scalability, it is very good to see academics experimenting with alternative RCU implementations. Researchers at Charles University in Prague have also been working on RCU implementations, including dissertations by Andrej Podzimek [PDF] and Adam Hraska [PDF].

RCU-like mechanisms are also finding their way into Java. Sivaramakrishnan et al. use an RCU-like mechanism to eliminate the read barriers that are otherwise required when interacting with Java's garbage collector, resulting in significant performance improvements.
A specialized RCU implementation and its use
The final piece of academic work that is described in this article was carried out at the Shanghai Jiao Tong University by Ran Liu, Heng Zhang, and Haibo Chen, who created a specialized variant of RCU that they used for an optimized reader-writer lock. This is important because traditional reader-writer lock implementations, such as the Linux kernel's rwlock_t, do not scale well. The reason for this is that all readers must atomically update the single memory location that represents the rwlock_t, both when entering and when leaving their read-side critical sections. The cache line containing this memory location will shuttle among all of these CPUs. Each access to this memory location will therefore result in a cache miss, which will in turn result in the accessing CPU stalling for hundreds or even thousands of clock cycles.
The more CPUs attempting to acquire or release the lock, the longer the stalls. Unless the read-side critical sections are extremely long (hundreds of thousands or millions of instructions), the result will be poor performance and scalability. Furthermore, some of the Linux kernel's lock implementations favor readers over writers (which can be OK), but to the extent of starving waiting writers (which can be a big problem).
This is, of course, why RCU is often used, but RCU has different semantics than do reader-writer locks. In a surprisingly large number of situations, this difference in semantics is not a problem. However, there are some parts of the Linux kernel that are notoriously difficult to apply RCU to. This situation has motivated a number of attempts to find more scalable implementations of reader-writer locking, going all the way back to brlock in the 2.4 Linux kernel (which has since been replaced by lglocks). These implementations provide a lock for each CPU. To read-acquire a lock, a CPU acquires its own lock, which still requires an atomic instruction and memory barrier on most architectures, but does not incur a cache miss in the absence of writers. Writers must acquire all CPUs' locks, which requires an atomic instruction and usually a cache miss per CPU. This overhead increases linearly with the number of CPUs, which can result in high overhead for updates.
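The per-CPU-lock idea is simple enough to sketch in a few lines of user-space C. The following is an illustration of the general brlock/lglock pattern described above, not the kernel's implementation; all names are invented, and the kernel versions use per-CPU spinlocks rather than pthread mutexes:

    #include <pthread.h>

    #define NR_SLOTS 64                     /* stand-in for the number of CPUs */

    static pthread_mutex_t slot_lock[NR_SLOTS];

    static void br_init(void)
    {
            for (int i = 0; i < NR_SLOTS; i++)
                    pthread_mutex_init(&slot_lock[i], NULL);
    }

    /* Readers touch only their own lock, so readers on different CPUs do
     * not bounce cache lines among themselves. */
    static void br_read_lock(int slot)   { pthread_mutex_lock(&slot_lock[slot]); }
    static void br_read_unlock(int slot) { pthread_mutex_unlock(&slot_lock[slot]); }

    /* Writers must acquire every per-CPU lock; the cost grows linearly
     * with NR_SLOTS. */
    static void br_write_lock(void)
    {
            for (int i = 0; i < NR_SLOTS; i++)
                    pthread_mutex_lock(&slot_lock[i]);
    }

    static void br_write_unlock(void)
    {
            for (int i = NR_SLOTS - 1; i >= 0; i--)
                    pthread_mutex_unlock(&slot_lock[i]);
    }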
In some cases, decreased read-side overhead is required. One scheme was put forward by Gautham Shenoy, and a similar approach was put forward by Srivatsa Bhat. The idea behind both of these patches is that readers check a flag before entering their read-side critical sections. If that flag indicates that there are no writers, then the readers proceed without executing any atomic instructions or memory barriers, and, in the common case, without incurring any cache misses. However, nothing comes for free, and for these approaches, the penalty is paid by the writers.
A writer must first set the flag, then wait for all the pre-existing readers, who might have loaded the flag before the writer set it. This is handled by requiring readers to check the flag within an RCU read-side critical section, and, if the flag is clear, to execute their entire read-side critical section under RCU protection. This allows the writer to use synchronize_rcu() after setting the flag to wait for pre-existing readers. Of course, the resulting grace-period latency can be problematic in many cases. This latency can be reduced by using expedited grace-period primitives such as synchronize_rcu_expedited(), but these result in high CPU overhead as well as a set of inter-processor interrupts (IPIs) to other CPUs. The high overhead can reduce throughput, and the IPIs can spell trouble for real-time applications.
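A heavily simplified, kernel-flavored sketch of this flag-plus-RCU pattern follows. The names are invented, and the actual patches from Shenoy and Bhat are considerably more careful (and more scalable) than this; the sketch also requires that read-side critical sections not sleep:

    static bool writer_active;
    static DECLARE_RWSEM(slow_rwsem);           /* fallback sleeping lock */

    /* Returns true if the fast (RCU-only) path was taken. */
    static bool my_read_lock(void)
    {
            rcu_read_lock();
            if (likely(!ACCESS_ONCE(writer_active)))
                    return true;                /* fast path: no writer present */
            rcu_read_unlock();
            down_read(&slow_rwsem);             /* slow path: a writer is active */
            return false;
    }

    static void my_read_unlock(bool fast)
    {
            if (fast)
                    rcu_read_unlock();
            else
                    up_read(&slow_rwsem);
    }

    static void my_write_lock(void)
    {
            down_write(&slow_rwsem);
            ACCESS_ONCE(writer_active) = true;
            synchronize_rcu();                  /* wait out all fast-path readers */
    }

    static void my_write_unlock(void)
    {
            ACCESS_ONCE(writer_active) = false;
            up_write(&slow_rwsem);
    }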
Liu, Zhang, and Chen take a slightly different approach in their USENIX paper. As noted earlier, they use a special-purpose RCU implementation tailored specifically for reader-writer locking, which they call a "passive reader-writer lock" or prwlock. In a manner roughly similar to signal-based user-space RCU, writers increment a version number, and then wait for all CPUs to acknowledge the new version. Readers automatically acknowledge when acquiring or releasing the lock, but CPUs that are not using the lock will not acknowledge the new version.
One way of handling this would be to place full memory barriers in the read-side code, but this group decided to strive for ultimate read-side performance, which means that the read-to-write transition involves IPIs. The idea is that if a given CPU has not either entered or exited a read-side critical section in a timely fashion, it is sent an IPI, which causes it to respond. The group also looked at a few optimizations, including scoping the IPIs on a per-process basis for mmap_sem, optimizing the wakeup path, and, for user-space implementations, using an in-kernel helper similar in some ways to sys_membarrier(). This optimized reader-writer lock provided significant increases in performance for systems of up to 64 CPUs on a number of workloads (see pages 11-13 of the paper for details).
Like most conferences, USENIX imposes strict length limits, which means that this paper does leave some unanswered questions.
One question is whether the number of IPIs could be reduced by taking advantage of the Linux kernel's per-CPU dyntick-idle state. This seems possible, and it seems especially beneficial for NO_HZ_FULL kernels, where it would prevent IPIing time-critical user-mode realtime and HPC applications. It would be interesting to see how this approach plays out.
A second question involves Figure 17 on page 12, which shows that an RCU-based resizable hash table takes considerably longer to resize than does one protected by the prwlock. This is quite true, but it is also true that when using RCU, reads and updates can proceed concurrently with the resizing. It would be interesting to directly measure the actual effect of a resize operation on read and update throughput and latency.
A third question involves the paper's comparison of a user-mode implementation of the reader-writer lock to signal-based user-space RCU. In this case, the reader-writer lock used an in-kernel optimization. However, user-space RCU also has an in-kernel optimization in the form of sys_membarrier(). It would therefore be interesting to compare both mechanisms using their in-kernel optimizations. Similarly, it would be interesting to compare prwlock to RCU expedited grace periods as well as the slower normal grace periods.
A fourth question involves configuration of the benchmark runs. For example, the comparisons of prwlock to user-space RCU set user-space RCU's batch size to one. This is of course an attractive setting if you wish to highlight prwlock, but it would be more interesting to compare against the much larger default setting of 4096 for the batch size. It also raises the question of how and to what extent prwlock can amortize single sets of IPIs over the write acquisitions of multiple prwlocks.
Quick Quiz 3: Given the need to examine per-CPU data for each and every CPU, isn't prwlock inherently non-scalable?
Answer
A final question involves scalability, both in terms of numbers of prwlocks and in terms of system size. Exploring these areas should provide much interesting future work. All these questions aside, and in part because of these questions, this was an extremely interesting paper that might well have significant practical applicability.
Conclusions
It is very good to see increased academic interest in read-mostly techniques such as RCU. We can all look forward to much interesting (and hopefully some useful) work along the way.
Acknowledgments
We all owe thanks to Davidlohr Bueso, Austin Clements, Hagit Attiya, Maya Arbel, Eddie Kohler, and Haibo Chen for their help making this human-readable. I am grateful to Jim Wasko for his support of this effort.
Answers to Quick Quizzes
Quick Quiz 1: But what can I do if I work with production-quality code but nevertheless would like to discover something new?
Answer: There are no guarantees. That said, here are some things you can try:
- Set up automated testing for your code, and make this testing as vicious as you possibly can. Otherwise, you will be spending all your time dealing with bugs and will therefore have no time to discover something new. (Yes, it is quite possible that one of your bugs will lead to something new, but the odds are stacked against you.)
- Learn as much as you can about the area of your interest. However, emphasize learning by doing, and especially try doing things wrong to see what happens.
- Study things completely unrelated to your area of interest. It is surprising how often standard practice in one area inspires something new in another area.
- Seek out trouble, especially when the trouble involves doing something that has never been done before. (Important safety tip: It is almost as easy to overdo this particular piece of advice as it is to avoid it completely.)
- Make heavy use of source-code control systems such as Git in order to manage the resulting chaos—or at least to keep the chaos safely removed from your production systems.
With some luck, persistence, and hard work, who knows what you might find?
Quick Quiz 2: Why don't more academic research projects make full use of open-source projects such as the Linux kernel?
Answer: As you can see from this article, a goodly number of researchers do use Linux. That said, the size and complexity of the Linux kernel can be problematic for some research projects—considerable courage and talent are required to take on Linux-kernel internals.
Quick Quiz 3: Given the need to examine per-CPU data for each and every CPU, isn't prwlock inherently non-scalable?
Answer: Not necessarily.
For example, there is no reason why multiple kernel threads could not be brought to bear in order to parallelize the scan of the per-CPU data. In fact, if examining one CPU's data required 100 ns and waking up a helper kthread required 5 us, then it might make sense to provision a helper kthread for each group of (say) 100 CPUs.
That said, this approach would get quite “interesting” when invoked from irq-disabled regions of code.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet