|
|
Log in / Subscribe / Register

Kernel development

Brief items

Kernel release status

The current development kernel is 4.4-rc5, released on December 13. "If you have all your Christmas shopping done, I would heartily recommend giving rc5 a whirl in between the eggnogs and the decorations. And if you're not celebrating the holidays, you have no excuse for not testing it all out." Linus expects the 4.4 kernel to come out on time at the beginning of January, but is contemplating delaying the 4.5 merge window to allow developers to recover from the holidays.

Stable updates: 4.3.3, 4.2.8, and 4.1.15 were released on December 14. Note that 4.2.8 is the end of the line for the 4.2.x series from Greg Kroah-Hartman, though the Canonical kernel team has announced plans to maintain this release through August 2016.

Comments (none posted)

Quotes of the week

I can't reasonably expect people to get ready for the merge window while being drunk on eggnog or whatever.
Linus Torvalds

Maybe people can think of ftrace as a poem. But people will never think that I give such a short speech.
Steve Rostedt

All's fair in love, war, and code defense.
Dan Williams

Comments (none posted)

Luu: Files are hard

Here is a lengthy posting from Dan Luu on why it is so hard to safely write files on Unix-like systems. It comes down to a combination of POSIX semantics and filesystem bugs. "Something to note here is that while btrfs’s semantics aren’t inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren’t used to coding against filesystems that allow directory operations to be reordered (ext2 was the only other filesystem that allowed that reordering). We’ll probably see a similar level of bug exposure when people start using NVRAM drives that only have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem."

Comments (141 posted)

AMD's 2016 Linux driver plans (AnandTech)

AnandTech reports on AMD's plans for Linux graphics driver support. In short: more open code, but some proprietary components will remain. "The significant change here is that by having the RTG closed source driver based around the open source driver, the company is now only maintaining a single code base, is pushing as much as possible into open source, and that the open source driver is receiving these features far sooner than it was previously. This greatly improves the quality of life for open source driver users, but it’s also reciprocal for RTG: it’s a lot easier to keep up to date with Linux kernel changes with an open source kernel mode driver than a closed source driver, and quickly integrate improvements submitted by other developers."

Comments (34 posted)

Kernel development news

Toward more predictable and reliable out-of-memory handling

By Jonathan Corbet
December 16, 2015
The kernel's out-of-memory (OOM) behavior has been a topic of discussion almost since the inception of the linux-kernel mailing list. Opinions abound on how the kernel should account for memory, whether it should allow memory to be overcommitted, what it means to be out of memory, and what should be done when that situation comes about. There seems to be general agreement on only one thing: OOM situations are bad, and the kernel's handling of OOM situations is even worse. Over the years, numerous developers have tried to improve the situation; the latest attempt can be seen in two patch sets from Michal Hocko.

OOM detection

The first patch set tries to answer a question that one might think would be obvious: how do we know when the system is out of memory? The problem is that a running system is highly dynamic. The lack of a free page to allocate at the moment does not mean that such pages could not be created; given the high cost of invoking the OOM killer, it is best not to declare an OOM situation if the kernel might be able to scrounge some memory from somewhere. Current kernels, though, are a bit unpredictable regarding when they give up and, in some cases, might wait too long.

If there are no pages to satisfy an allocation request, the kernel will perform direct reclaim to try to free some memory. In some cases, direct reclaim will be successful; that happens, for example, if it finds clean pages that can be immediately repurposed. In other cases, though, reclaiming pages requires writing them back to backing store; those pages will not be available for what is, from a computer's perspective, a long time. Still, they should become available eventually, so the kernel is justifiably reluctant to declare an OOM situation for as long as reclaimable pages exist.

The problem is that there are no real bounds on how long it might take for "reclaimable" pages to actually be reclaimed, for a number of reasons. Additionally, the allocator can conceivably find itself endlessly retrying if a single page is reclaimed, even if that page cannot be used for the current allocation request. As a result, the kernel can find itself hung up in allocation attempts that do not succeed, but which do not push the system into OOM handling.

Michal's patch defines a new heuristic for deciding when the system is truly out of memory. When an allocation attempt initially fails, the logic is similar to what is done in current kernels: a retry will be attempted (after an I/O wait) if there is a memory zone in the system where the sum of free and reclaimable pages is at least as large as the allocation request. If the retries continue to fail, though, a couple of changes come into play.

The first of those is that there is an upper bound of sixteen retries; after that, the kernel gives up and goes into OOM-handling mode. That may bring about an OOM situation sooner than current kernels (which can loop indefinitely) will, but, as Michal put it: "OOM killer would be more appropriate than looping without any progress for unbounded amount of time". Beyond that, the kernel's count of the number of reclaimable pages is discounted more heavily after each unsuccessful retry; after eight retries, that number will be cut in half. That makes it increasingly unlikely that the estimate of reclaimable pages will motivate the kernel to keep retrying.

The result of these changes is that the kernel will go into OOM handling in a more predictable manner when memory gets tight. Users will still curse the results, but the system as a whole should more reliably survive OOM situations.

The OOM reaper

At least, that should be the case if the OOM killer is actually able to free pages when the kernel invokes it. As has been seen in recent years, it is not that hard to create a situation where the OOM killer is unable to make any progress, usually because the targeted process is blocked on a lock and the OOM situation itself prevents that lock from being released. If an OOM-killed process cannot run, it cannot exit and, thus, it cannot free its memory; as a result, the entire OOM-killing mechanism fails.

The observation (credited to Mel Gorman and Oleg Nesterov) at the core of Michal's OOM reaper patch set is that it is not necessary to wait for the targeted process to die before stripping it of much of its memory. That process has received a SIGKILL signal, meaning it will not run again in user mode. That, in turn, means that it will no longer access any of its anonymous pages. Those pages can be reclaimed immediately without changing the end result.

The OOM reaper is implemented as a separate thread; this is done because the reaper must be able to run when it is called upon to do its work. Other kernel execution mechanisms, such as workqueues, might themselves be blocked by the OOM situation, so they cannot be counted upon. If this patch is merged, the oom_reaper thread will sit unused on the majority of Linux systems out there, but it will be certain to be available on the systems where it is needed.

The reaper is not without its rough edges. It must still take the mmap_sem lock to free the pages, meaning that it could be blocked if mmap_sem is held elsewhere. Still, Michal says that the probability of trouble "is reduced considerably" compared to current kernels. One other potential problem is that, if the targeted process is dumping core at the time it is killed, removing its pages may corrupt the dump. This tradeoff is worthwhile, though, Michal says, since keeping the system running is more important in such situations.

Memory-management patches are notoriously difficult to get merged into the kernel. With regard to the OOM detection patch, Michal said the work "has been sitting and waiting for the fundamental objections for quite some time and there were none". He would like to see it merged in 4.6 or thereafter. Objections to the OOM reaper have also been hard to find, but there has been no talk yet as to when that patch might head for the mainline. Once these patches get there, the OOM-handling subsystem may work a little better, but it seems unlikely that users will appreciate it any more than they do now.

Comments (10 posted)

Encrypted file backup for ext4

By Jake Edge
December 16, 2015

Backing up files from an encrypted filesystem without having access to the key is a more complicated process than one might think. For one thing, the directory containing the files is likely encrypted as well, so some mechanism to extract an encrypted version of the file name would be needed. A recent patch set posted to the linux-ext4 mailing list would provide support for extracting files from an encrypted ext4 filesystems for backup or other purposes.

Ted Ts'o posted the patches on December 10. They add some ioctl() commands and a mount option to allow administrators to extract the information needed to store a copy of the file without having a copy of the encryption key. It would not, however, allow the restoration of the file without the key. According to Ts'o, manipulating both the encrypted directory and file without the user's key is tricky: "establishing a link means that we have to manipulate both the encrypted directory and the encrypted file, and doing this through the VFS interface is non-trivial"

There are three chunks of information that need to be stored so that a file can be restored: the encrypted file data, an encrypted version of the file name, and the metadata that contains the encryption context that is stored as an extended attribute (xattr). Before a privileged process (i.e. one with CAP_SYS_ADMIN) can read the encrypted file contents, though, the filesystem must be mounted with a newly added option (ciphertext_access).

The patches allow the encrypted file data to be read using the O_DIRECT mechanism. But there is a snag. The file size might not be a multiple of the AES block size (which is used for doing the encryption); reading the short chunk at the end of such a file is not supported by O_DIRECT, Ts'o said. Adding a way to do that would require changes to the direct I/O code, which may be controversial. So instead, a "shadow" copy of the inode is used with a file size that works and the encrypted file is padded with zeroes to that size when read. That is something of a hack, Ts'o acknowledged. In addition, the original file size will need to be added to the list of data that gets stored as part of the backup (i.e. along with the encrypted file name and the encryption metadata).

For the encryption metadata, two ioctl() commands have been added: EXT4_IOC_GET_ENCRYPTION_METADATA and EXT4_IOC_SET_ENCRYPTION_METADATA. As the names imply, they get and set the metadata needed to decrypt the file and file name when the user's key is available. Lastly, the EXT4_IOC_GET_ENCRYPTED_FILENAME command will retrieve the encrypted file name.

Andreas Dilger asked why ioctl() and O_DIRECT were used, rather than virtual xattrs and regular reads. If the key is not present, reading the file could simply return the encrypted contents. That, coupled with virtual xattrs for the other needed pieces, would allow existing tools (e.g. tar --xattrs) to back up the files, he said.

But Ts'o pointed out a few problems with that approach. The problem with the file size will need to be dealt with by any tool that is used. And he is unconvinced that the effort needed to handle file restoration without the key will be worth it. Beyond that, though, his choice of O_DIRECT was deliberate: direct I/O does not use the page cache, so the cache cannot get "poisoned" with pages of encrypted data.

In our current design, the page cache for an encrypted file always contains the plaintext. This is necessary so that mmap(2) will work correctly. So if user B is logged onto a machine where user A is active, and user A's keys are in the kernel, we don't determine access control [decisions] for user B based on whether or not the user has the appropriate keyring in a keyring accessible to them. We use Unix permissions to provide security isolation between two users which are logged in.

He continued by noting that user space should drop the plain text pages from the page cache when user A logs out (and A's keys are removed from the kernel), but that he doesn't want the integrity of an encrypted backup to depend on that. Also, it should still be possible to do an encrypted backup while A is logged in, which can be done when using direct I/O to read the encrypted data. In summary, he said that there are enough reasons that regular backup tools can't be used for this purpose, so "it's simpler all around to use specific ioctls".

Jan Kara expressed concern about the shadow inode "hack". He thought that it might not be needed, since direct I/O reads will return any partial block at the end of the file. In addition, the locking implications of effectively having two copies of the inode for a single file were worrisome, he said. So far, Ts'o has not replied, but removing the shadow inode, if that turns out to be possible, will only simplify matters.

Ts'o also posted a C program and script that he has been using to test the changes. While a mechanism to backup encrypted files is certainly needed, the interface proposed feels a little clunky overall, though there may be no real way around that. Ts'o also alluded to another use case that he has for the feature, which he was unable to disclose, that may partly be guiding the interface as well. Given that he is the ext4 maintainer, though, it seems likely we will see some kind of change along these lines before long.

Comments (2 posted)

Read-mostly research in 2015

December 16, 2015

This article was contributed by Paul McKenney

[Editor's note: this article was co-written by Paul McKenney and Aravinda Prasad].

It has been just over one year since the last LWN article on read-mostly research. However, it is good to see that there have been a number of interesting papers since then related to RCU and other read-mostly topics. This article gives a quick summary of them.

Validation

Joseph Tassarotti (Carnegie-Mellon University), Derek Dreyer (Max Planck Institute for Software Systems), and Viktor Vafeiadis (also MPI-SWS) carried out a manual proof of correctness of the quiescent-state-based reclamation (QSBR) variant of user-space RCU. This paper modeled rcu_dereference() with C11 memory_order_acquire loads rather than the memory_order_consume loads intended for that purpose. However, given the current parlous state of memory_order_consume [PDF], this shortcut is quite understandable. Their paper is titled “Verifying Read-Copy-Update in a Logic for Weak Memory [PDF]” and appeared in the 2015 Proceedings of the 36th annual ACM SIGPLAN conference on Programming Language Design and Implementation. Another researcher asked me if I felt that their assumptions were adequate, and as previously reported here, I replied that, since they found no bugs, their assumptions clearly must be unrealistic. In the authors' defense, the proof is highly non-trivial, so the lack of bugs was not due to lack of effort.

Iftekhar Ahmed of Oregon State University gave a presentation to the Scalability microconference of the 2015 Linux Plumbers Conference describing a “mutant” technique for verifying test suites that he is applying to Linux-kernel RCU. Each mutant is a copy of the RCU code, but with a single change, or “mutation”. For example, a given statement might be deleted, a comparison operator might be changed, or a constant might be changed. The question is then whether rcutorture will complain about the mutant. Mutants that rcutorture does not complain about are said to have “survived”.

Of course, some mutants will never produce a complaint. For example, a mutation that removes a BUG_ON() or that changes (say) “while (1)” to “while (2)” produces a correct program. Some manual inspection of surviving mutants is therefore required. However, such manual inspection has been rewarded by the finding of not one but two holes in rcutorture, one of which was hiding a real bug—in Tiny RCU, believe it or not. [Full disclosure: I am collaborating with Iftekhar and his professors, Alex Groce and Carlos Jensen, and am co-author on a resulting paper [PDF].]

Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program???
Answer
The past year also saw the release of version 5 of the University of Oxford C Bounded Model Checker (CBMC), led by Daniel Kröning. CBMC takes a C program as input, and creates a logic expression that takes the program's inputs as inputs, and that evaluates to true if some combination of those inputs can trigger an assert or result in an array-out-of-bounds condition. This new version introduces multiprocessor capability (including some weak-memory support), and is capable of automatically verifying some trivial RCU implementations, including the Linux kernel's Tiny RCU. It is early days for CBMC's multiprocessor feature, but its appearance is a welcome development.

Using and implementing RCU

Mike Ash posted a description of an RCU-like primitive in Apple's Objective-C runtime. Interestingly enough, this approach identifies read-side critical sections via designated code ranges. This of course means that it scans all CPUs' program counters in order to identify grace periods. This approach is interesting in being a distinct method of achieving zero-overhead read-side critical sections, albeit one that poses some interesting practical challenges for large read-side critical sections that call many functions and for reliable sampling of other CPUs' program counters without undue degradation of realtime response.

Editor's note: for those wondering how an algorithm that takes locks can be "lock-free," you're not alone. According to Paul, in common academic usage, "lock-free" means "at least one thread is guaranteed to make progress." Algorithms that do not take locks at all, instead, are "lockless."
Pedro Ramalhete and Andreia Correia took a much simpler approach, using reader-writer locking to implement RCU, resulting in what they call poor-man's URCU [PDF]. Although this is by no means the first lock-based implementation of RCU, it is, as far as I know, the first lock-based implementation that boasts lock-free readers. Their trick is to use not one but two reader-writer locks. Readers loop doing a read-side trylock on each of these two locks in turn, exiting the loop upon successful acquisition of either of the two locks. Writers are serialized by another lock, and simply write-acquire and write-release the two reader-writer locks in succession. If a writer is delayed while write-holding one of these two locks, readers can still make progress by acquiring the other lock. Although poor man's RCU seems unlikely to become a production-quality implementation of RCU, it is worth studying in its own right. After all, it is not every day that someone achieves non-blocking forward-progress guarantees with what is usually considered to be a blocking primitive!

Maya Arbel and Adam Morrison, both of Technion, wrote a paper titled “Predicate RCU: An RCU for Scalable Concurrent Updates [PDF]”. This title might come as quite a surprise to those of us for whom RCU has long provided eminently scalable updates. However, Arbel and Morrison were working with an internal tree whose non-leaf nodes can contain data, described here [PDF]. It turns out that deleting an item from an internal tree is more complex than in an external tree where only leaf nodes contain data. This added complexity is especially vexing because RCU-protected readers cannot be excluded, which is a particular problem when those readers are finding the spot in which to do an insertion. This interaction between complex deletions and RCU lookup-insertions is handled by holding locks across grace periods, which can degrade both performance and scalability. Of course, holding locks across grace periods places those grace periods on the critical path, which motivated the authors to work hard to reduce grace-period duration, hence predicate RCU.

Predicate RCU allows updaters to wait only on those readers that are involved with the update in question, which should shorten grace periods. This should in turn reduce the penalty for holding locks across grace periods, however, there is no free lunch: Shorter grace periods mean fewer updates per grace period and thus higher per-update overhead. This effect is anything but subtle: a 2004 USENIX paper notes that RCU has been observed satisfying more than 1,000 updates with a single grace period while running an unremarkable workload. Nevertheless, if you must hold a lock across a grace period, a shorter grace period is going to be a good thing, as can be seen in the Linux kernel's synchronize_net() function, which uses synchronize_rcu_expedited() when called with the networking layer's RTNL lock held.

Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?
Answer
Therefore, given that Arbel and Morrison need to hold locks across RCU grace periods, they need short grace periods. To this end, predicate-RCU readers associate themselves with an algorithm-specific value. For example, readers of a hash table might associate themselves with the lookup key, which would allow updaters to wait only for those readers whose keys hashed to the hash bucket being updated. This is in some ways similar to the Linux kernel's SRCU, where the srcu_struct serves as the value associating readers and updaters. Either way, given that updaters only need to wait on a small subset of the readers, one would expect grace periods to elapse more quickly, which would in turn be expected to reduce the penalty for waiting for a grace period while holding a lock. Their performance results meet this expectation: Although they do not achieve linear scalability, shorter grace periods do improve their performance. Of course, avoiding waiting for grace periods while holding locks would likely improve performance and scalability even more!

Developers who assume that academics ignore their work will be happy to see that this paper cites a couple of LWN articles.

Yujie Liu (Lehigh University), Victor Luchangco (Oracle Labs), and Michael Spear (also Lehigh) wrote a 2013 paper titled “Mindicators: A Scalable Approach to Quiescence [PDF]”. This paper presses the scalable non-zero indicator (SNZI) [$PDF] technique into service as a grace-period mechanism, and compares it to several other approaches. The intent appears to be to use this mechanism to implement transactional memory (TM), which also appears to require low-latency grace periods, in part courtesy of TM's linearizability requirements, which in turn seems to limit scalability. They do call out the relationship to RCU.

Alexander Matveev (MIT), Nir Shavit (MIT and Tel-Aviv University), Pascal Felber (University of Neuchâtel), and Patrick Marlier (also University of Neuchâtel) recently published a Symposium on Operating Systems Principles (SOSP) paper titled “Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming [PDF]”, which can be thought of as a software transactional memory (STM) extension that includes explicitly marked read-only transactions. However, unlike RCU readers, these read-only transactions are guaranteed to see a point-in-time snapshot of the union of all RLU-protected data structures across multiple traversals. Of course, this does impose significant additional overhead on RLU updaters, as they acknowledge in their Figure 7, which shows RLU updaters being 2-5 times slower than RCU updaters. That said, this figure shows a benchmark that favors RCU rather heavily. With or without the point-in-time snapshots, I believe that their realization of the importance of explicitly marking read-only operations is a great step forward. Updates to shared variables are also explicitly marked, providing performance benefits over pure STM similar to those of SwissTM [PDF].

Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it?
Answer
Their technique scales reasonably well up to 16 CPUs, however, this is a very small system for modern non-mobile workloads. They do have one graph (uppermost graph in Figure 8) that goes up to 80 CPUs, but this shows poor scalability. My first thought was that this poor scalability was due to their single global counter that is atomically incremented on each update, but this did not make sense because RLU is outperforming RCU. Instead, the culprit seems to be the need to hold locks across grace periods. Because RCU optimizes for low per-update overhead at the expense of grace-period latency, and because synchronize_rcu() was used (instead of call_rcu() or synchronize_rcu_expedited()), holding locks across grace periods hurts RCU even more than it hurts RLU. Once again, I recommend either releasing locks before waiting for grace periods or using the asynchronous call_rcu() primitive: Both approaches avoid degraded scalability. In (thankfully rare) Linux-kernel cases where it is absolutely necessary to wait for grace periods while holding a mutex, the synchronize_rcu_expedited() APIs can be used, though these are not particularly good for realtime applications (with the exception of synchronize_srcu_expedited()).

Their performance testing includes both user-space and Linux-kernel scenarios. In the Linux-kernel scenarios shown in Figure 9, their list-traversal code beat that of the Linux kernel by a surprisingly large margin. In an impressive display of good sportsmanship, one of the authors (Marlier) located the Linux-kernel performance bottleneck and submitted a fix that causes the Linux kernel's lists to outperform those of the paper. The problem was a single non-atomic store and load to an unshared location in the running task's stack, with no memory barriers. It appears that current microprocessors' pipelines can be a bit slow to handle a load from a location that the current CPU just stored to. Patrick eliminated this store and load, and his patch was accepted during the v4.4 merge window.

As far as I know, this is the first academic use of the rcutorture test suite to test an alternative RCU-like implementation, which is a nice milestone. Those wanting some more detailed discussion on RLU, including graphics showing scalability issues on larger systems, can find it on this page.

At the end of the paper, the authors express hope that RLU will be used both in kernel and in user-space programs. This of course raises the question of what situations would be best suited for an RLU solution. Two possibilities come to mind:

Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?
Answer
  • Situations where RCU readers are used in conjunction with sequence locking. These situations are already paying the complexity and performance costs of retries, so these RLU disadvantages might be less of a problem in this case.

  • Situations where a problematic reader-writer lock has proven difficult to convert to RCU. Perhaps RLU's snapshotted readers might better handle some of these situations.

That said, I strongly suspect that successful application of RLU will require the authors to carefully separate those semantics that are actually required from those semantics that are merely fashionable.

It is easy to get irritated at academics' insistence on linearizability, given the large performance and scalability penalties they pay to achieve it. On the other hand, they do use deferral to improve performance of RLU, and perhaps further work along these lines will persuade them to let go of linearizability.

Frans Kaashoek presented on the history of parallelism and operating systems [PDF] during the SOSP 2015 History Day Workshop to mark SOSP's 50th anniversary. Frans devoted a full slide of a 33-slide deck to RCU, which should be a point of pride for the many developers who have applied RCU within the Linux kernel. (Yes, I am happy on behalf of my work on the RCU infrastructure, but let's face it, the infrastructure is profoundly uninteresting without its many uses.)

Peter Denning also presented in SOSP 2015 History Day, but on OS Foundations [PDF]. Those who know me will not be surprised to hear that Peter's last bullet on his slide 39 resonated with me: “Theory follows practice”.

So what have I been doing? For my part, I have continued my work enabling complex atomic updates to RCU-protected data structures with minimal copying, which I recently presented [PDF] at linux.conf.au. This round dealt with some bottlenecks in user-mode memory allocators, relearning the lesson that Glibc malloc() is not at all scalable. I have also been working to repair C11's and C++11's memory_order_consume feature, which just might be nearly done. I rewrote the Linux kernel's synchronize_rcu_expedited() and synchronize_sched_expedited() to reduce OS jitter and, for good measure, added kernel parameters to suppress expedited grace periods entirely. I also started work on design documentation for the Linux kernel's RCU implementation, starting with its requirements. Finally, I have upgraded rcutorture in an attempt to keep up with the Linux kernel's huge installed base.

Conclusions

It is very good to see continued academic interest in read-mostly techniques such as RCU. I very much hope that the mutual learning process continues, and that it benefits Linux developers and their users!

Acknowledgments

We all owe thanks to Srivatsa Bhat, K. Gopinath, Orran Krieger, Pedro Ramalhete, Mike Ash, Derek Dreyer, Joe Tassarotti, Iftekhar Ahmed, Adam Morrison, and Hagit Attiya for helping to make this human-readable. I am grateful to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program?

Answer: CBMC converts the parallel program into single static assignment (SSA) form, which results in each variable having a separate instance for each assignment to it. Each read from that variable is then wired up to one of the instances that it might have read its value from, and all combinations are represented in the resulting logic expression.

Of course the resulting logic expression might be quite large, but modern computers and modern SAT solvers have grown quite capable. For example, in 1990, a world-class SAT solver might be able to handle 100 variables, that is, three 32-bit variables with four bits left over. In contrast, in 2015, I have solved a 1.8 million variable SAT problem on my laptop.

The solution took a full ten seconds.

Back to Quick Quiz 1.

Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?

Answer: No.

First, please note that holding locks (or, in the Linux kernel, mutexes) across grace periods makes perfect sense in some cases, for example, in the synchronize_net() example given earlier. In addition, in some cases, holding locks across grace periods on rarely executed slow paths can greatly reduce complexity. Second, even though I recommend strongly against holding locks across grace periods on fast paths, it is quite possible that researchers exploring this technique will nevertheless come up with something useful. That said, I do indeed suspect that they would make better progress if they moved their grace periods outside of locks. Third, waiting for highly optimized grace periods is not likely to be a big problem on small systems. (Pretty amazing that a 64-CPU system can now be considered “small”, isn't it?) Fourth and finally, it is their job to figure out what they work on. I would no more wish to dictate what they do than I would wish them to dictate what I do.

Back to Quick Quiz 2.

Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it???

Answer: It turns out that RCU does in fact guarantee a consistent snapshot—at zero cost—in some important special cases. Perhaps the most common case is a data structure where updaters are restricted to adding new elements or removing old ones, in other words, where in-place updates are prohibited. If each reader only ever looks up a single element, then readers will automatically see a consistent snapshot of that single element—even if a given reader looks up a different element each time it executes.

There are many other use cases where RCU provides consistent snapshots for free, and quite a few of them may be found in the Linux kernel. However, it is also the case that consistency guarantees are overrated. After all, the finite speed of light sharply limits the degree of consistency that a given data structure can usefully maintain with its corresponding external real-world state. Given that the whole point of many data structures is to track external state, internal consistency all too often becomes nothing more than an expensive fiction.

Back to Quick Quiz 3.

Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?

Answer: That might well be the case. I therefore have already pointed one of the authors (Matveev) at Neil Brown's excellent LWN series here, here, and here. Matveev's initial response was as follows: “The dcache/dentry + RCU + various locks is really a complex structure... ” I take this as a good sign, in that Matveev should not be blinded by overconfidence.

Back to Quick Quiz 4.

Comments (10 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.4-rc5 ?
Greg KH Linux 4.3.3 ?
Greg KH Linux 4.3.2 ?
Greg KH Linux 4.2.8 ?
Greg KH Linux 4.1.15 ?
Luis Henriques Linux 3.16.7-ckt21 ?

Architecture-specific

Core kernel code

Frederic Weisbecker nohz: Tick dependency mask v4 ?

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds