Kernel development
Brief items
Kernel release status
The current development kernel is 4.7-rc5, released on June 26. Linus said: "I think things are calming down, although with almost two thirds of the commits coming in since Friday morning, it doesn't feel that way - my Fridays end up feeling very busy. But looking at the numbers, we're pretty much where we normally are at this time of the rc series."
Stable updates: 4.6.3, 4.4.14, and 3.14.73 were released on June 24. Among other things, these updates contain an important security fix.
Quote of the week
Bottom line: if you're developing a business critical application on the Internet, you cannot assume that the OSes or the network provide adequate security; you need to take ownership of security for your application. TOU is a step in that direction.
Summary of the Glungezer realtime summit
Daniel Wagner has posted a summary writeup from the realtime summit recently held at a remote location in the Alps. "So Christoph [Hellwig] and I teamed up to organise a small conference on a mountain. With Christoph’s familiarity with the Innsbruck mountains, the Glungezer lodge was chosen. The lodge is located almost on the top of the Glungezer summit, which is reachable by no road or cable car. John Kacur voiced the suspicion that Christoph had a secret plan to promote physical fitness in real-time developers by making them hike through the Austrian Alps. As it turns out, he was right."
WireGuard: a new VPN tunnel
Jason Donenfeld has announced the availability of "WireGuard" for the Linux kernel. "WireGuard is an extremely simple yet fast and modern VPN that utilizes state-of-the-art cryptography. It aims to be faster, simpler, leaner, and more useful than IPSec, while avoiding the massive headache. It intends to be considerably more performant than OpenVPN." The code is in an early state and has never been seen by the community before; indeed, it has not yet even been posted as a patch series. As a result, it may be a while before it shows up in a mainline release. For now, the code and more information can be found at wireguard.io.
Kernel development news
Virtually mapped stacks 2: thread_info strikes back
In last week's episode, Andy Lutomirski had posted a set of patches moving kernel stacks to the kernel's vmalloc() area. There are a number of advantages to doing so, including elimination of some high-order memory allocations, improved security, and better diagnostic output when a stack overflow occurs. There was just one little problem: an additional 1.5µs of kernel overhead during process creation, a cost that Linus was unwilling to accept. The first attempt to address that cost ran afoul of an obscure kernel data structure, but the end result looks to be a substantial cleanup of the kernel's management of process information.

The performance regression comes from using vmalloc() to allocate kernel stacks; vmalloc() is a relatively expensive way to allocate memory and has not been subjected to the same sort of optimization work that the slab allocators have seen. One suggestion that had been made was to cache a small number of kernel stacks, allowing the quick reuse of cache-hot stacks after processes exit. The hope was that, by eliminating vmalloc() calls, the code could be made faster in general.
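For a concrete picture of the two approaches, here is a minimal sketch (not the actual patch, which also sets up guard pages and uses architecture-specific helpers) contrasting the traditional, physically contiguous stack allocation with a virtually mapped one; THREAD_SIZE and THREAD_SIZE_ORDER are the usual kernel definitions:

    static void *alloc_stack_pages(void)
    {
            /* traditional approach: a physically contiguous, high-order allocation */
            return (void *)__get_free_pages(GFP_KERNEL, THREAD_SIZE_ORDER);
    }

    static void *alloc_stack_vmalloc(void)
    {
            /* virtually mapped approach: no high-order allocation is needed, but
               the page-table setup must be paid for on every call */
            return vmalloc(THREAD_SIZE);
    }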
A useless cache
Andy went off to implement this idea and reported a discouraging result: "I implemented a percpu cache, and it's useless." The problem, in short, is that a process's resources (including its kernel stack) are not cleaned up immediately when the process exits. Instead, the read-copy-update (RCU) mechanism is used to ensure that no references to those resources remain before they are freed. That means (1) the freeing of the kernel stack will be delayed until the end of the next RCU grace period, and (2) the resources for all processes that exited during that grace period will be freed together. So the cache for kernel stacks will almost always be empty, then will occasionally be hit with large numbers of released stacks, most of which will not fit into the cache and, thus, will be simply freed. In other words, the cache hit rate, especially with fork-heavy loads, will be close to zero.
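The deferral happens because task teardown goes through RCU; a rough sketch of the pattern (the mainline code reaches delayed_put_task_struct() from release_task(), simplified here into a hypothetical helper) looks like this:

    /*
     * Rough sketch: the task, and thus its stack, is only released after a
     * full RCU grace period, so the stacks of many exiting processes tend
     * to be freed in a single burst.
     */
    static void release_task_resources(struct task_struct *p)
    {
            /* ... unhook the task from every place it can be found ... */
            call_rcu(&p->rcu, delayed_put_task_struct);
    }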
In theory there should be no need for a process's kernel stack after that process has died, so one might think that the stack could be released immediately, even if the other data structures need to stay around. The problem is that the core information the kernel maintains about processes lives in two different places:
- The massive task_struct structure, found in <linux/sched.h>. This structure, which is (modulo a disturbing number of #ifdef blocks) architecture-independent, contains most of the information the kernel needs to know about a running process.
- The small thread_info structure, which is architecture-specific.
The task_struct structure is allocated from the heap like most other kernel data structures. The thread_info structure, though, lives at the bottom of the kernel stack, making it impossible to reuse the kernel stack as long as something might try to reference that structure. For a brief period, Linus pursued changes that would allow the thread_info structure to be freed quickly, even while the task_struct structure persisted, but it quickly became clear that no easy solutions were to be found in that direction. Some information in thread_info, the flags field in particular, can be accessed at surprising times and needs to remain available as long as the kernel has any record of the associated process at all.
The existence of these two structures is something of a historical artifact. In the early days of Linux, only the task_struct existed, and the whole thing lived on the kernel stack; as that structure grew, though, it became too large to store there. But placement on the kernel stack conferred a significant advantage: the structure could be quickly located by masking some bits out of the stack pointer, meaning there was no need to dedicate a scarce register to storing its location. For certain heavily used fields, this was not an optimization that the kernel developers wanted to lose. So, when the task_struct was moved out of the kernel-stack area, a handful of important structure fields were left there, in the newly created thread_info structure. The resulting two-structure solution is still present in current kernels, but it doesn't necessarily have to be that way.
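As an illustration of the trick described above, this is roughly how the on-stack thread_info has traditionally been located; this is a simplified, x86-64-flavored sketch rather than the exact kernel code:

    /* With the stack THREAD_SIZE bytes long and THREAD_SIZE-aligned, masking
       the low bits of the stack pointer yields the base of the stack, which
       is where thread_info lives. */
    static inline struct thread_info *current_thread_info(void)
    {
            unsigned long sp;

            asm ("mov %%rsp, %0" : "=r" (sp));
            return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
    }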
Getting rid of thread_info
In relatively recent times, the kernel has moved to using per-CPU variables to store many types of frequently-needed information. The scheduler caches some crucial process-related information about the currently running process in the per-CPU area; that turns out to be faster than looking at the bottom of the kernel stack. So the amount of useful data stored in struct thread_info has decreased over time. Given that, an obvious question comes to mind: could the thread_info structure be removed altogether? The question is not exactly new; moving struct thread_info away from the kernel stack was one of Andy's objectives from the beginning. But the performance issue gave this question a higher level of urgency.
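On x86, for example, the pointer to the current task has long been kept in a per-CPU variable rather than being derived from the stack pointer; in rough outline (a sketch of the idea, not a verbatim copy of the header):

    DECLARE_PER_CPU(struct task_struct *, current_task);

    static __always_inline struct task_struct *get_current(void)
    {
            /* one per-CPU read, no stack-pointer arithmetic needed */
            return this_cpu_read_stable(current_task);
    }

    #define current get_current()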
Linus quickly discovered that some users of the thread_info structure have no need of it, even with no other changes. Mostly they used it to find the task_struct structure — which, typically, they already had a pointer to. He fixed those, and committed the result for the 4.7-rc5 release. This kind of change might not qualify as the sort of bug fix that late-cycle patches are normally restricted to, but he made it clear that he considered them acceptable: "Those are the 'purely legacy reasons for a bad calling convention', and I'm ok with those during the rc series to make it easier for people to play around with this."
He pushed the work further, getting to a point where he could move the (somewhat reduced) thread_info structure off the stack, and embed it within the task_struct instead. That work progressed to where it would boot on a test system; Andy then picked it up and integrated it into his larger patch series.
As of this writing, that series is in its fourth revision. It moves many thread_info fields into task_struct, changing the users of those fields along the way. At the end, the thread_info structure, now containing only the flags field, is moved over to the task_struct as well. Getting there requires a number of changes to the low-level architecture code, so it is an x86-only change at the moment. It seems likely that other architectures will follow along, though; getting rid of the separate thread_info structure is a useful cleanup and security enhancement, even without the rest of the work.
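The end state the series is working toward looks, in rough outline, something like the following; this is a sketch based on the description above, and the configuration symbol shown is illustrative:

    struct thread_info {
            unsigned long flags;    /* the only field left at the end of the series */
    };

    struct task_struct {
    #ifdef CONFIG_THREAD_INFO_IN_TASK
            /* keeping thread_info as the first member lets code convert
               cheaply between the two pointers */
            struct thread_info thread_info;
    #endif
            /* ... the rest of the (massive) structure ... */
    };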
With regard to the real objective of the patch set (moving kernel stacks to the vmalloc() area): the removal of the thread_info structure makes it possible to free the kernel stack as soon as the owning process exits — no RCU grace period required. That, in turn, makes it sensible to add a small per-CPU cache holding up to two free kernel stacks. With the cache, Andy says, the 1.5µs performance regression becomes a 0.5–1µs performance gain.
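A minimal sketch of such a cache might look like this (simplified; the real code must also handle freeing stacks back into the cache, NUMA placement, and accounting):

    #define NR_CACHED_STACKS 2
    static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

    static void *alloc_thread_stack(void)
    {
            int i;

            /* Try to reuse a recently freed, cache-hot stack first. */
            for (i = 0; i < NR_CACHED_STACKS; i++) {
                    struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

                    if (s)
                            return s->addr;
            }
            /* Slow path: fall back to a fresh virtually mapped allocation. */
            return vmalloc(THREAD_SIZE);
    }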
So, at this point, Andy has a patch series that simplifies some core code, provides immediate detection of kernel-stack overruns, gives better diagnostics when that occurs, improves the security of the kernel, and even makes things run faster. Unsurprisingly, objections are now becoming difficult to find. The only remaining question might be when to merge this code, and the answer appears to be during the 4.8 merge window. That is arguably aggressive, given the fundamental nature of the changes and the fact that there must certainly be a surprise or two still lurking there; indeed, one such surprise is still being worked out as of this writing. But the 4.8 cycle should give time to work through those surprises, and the end result should be worth having.
Parallel pathname lookups and the importance of testing
Parallel pathname lookup is a new development that aims to improve some aspects of Linux filesystem performance. It was discussed at the 2016 Linux Storage, Filesystem, and Memory-Management Summit and, as we reported at the time, it required two key changes, both of which have subtle consequences making them worthy of closer examination.
The first of those changes was to introduce a new state for entries in the directory cache (dcache). As well as being positive ("this name does exist") or negative ("this name doesn't currently exist"), they can now also be in a "don't know" state, called "in-lookup" in the code. If a dentry (dcache entry) is ever found in this state, the filesystem lookup is still in progress and the caller must wait for it to complete. The design of this change favored performance over simplicity, and the resulting complexity makes bugs harder to see.
The second change was to replace the per-directory mutex with a read/write semaphore that allows read operations to proceed in parallel. While simple in principle, this change has had performance implications that can be educational.
As has been described previously, the dcache allows lookups for cached pathnames to proceed quickly, often with only RCU protection that already allows a high degree of parallelism. The recent work doesn't change this but, instead, handles the case where components of a pathname are not present in the cache. Prior to Linux 4.7-rc1, a per-directory mutex would be held while looking up any name in that directory. For a small directory in a local filesystem, this forced serialization is unlikely to be a problem; looking up one file name is likely to bring the directory block containing the name into the page cache, from which subsequent lookups can be performed with no further delay. For large directories, or directories on network-attached filesystems, it is more likely that every directory access will incur a non-trivial latency and the serialization imposed by the mutex can hurt.
While parallel lookups within a single directory make sense, parallel lookups of a single name do not. Thus, the two changes mentioned can be described as adding per-name locking, and then removing per-directory locking, for lookups at least. The "don't know" state for a dentry could also be described as a "locked" dentry.
The idea of a cache lookup returning an incomplete (but locked) object is far from new. It was back in 2002, during the 2.5 development series, that the kernel gained the iget_locked() interface that allows the reading of an inode from disk to be separated from the task of adding the inode to the icache (inode cache). At a coarse level, what we are now seeing is the same improvement being added to the dcache. Looking up names in the dcache happens far more frequently than looking up inodes in the icache, so, given that hotter paths tend to be more heavily optimized, it shouldn't be surprising that the dcache version is not as straightforward as the icache version.
A "don't know" state for dcache entries
The sequence of steps for a lookup with the possibility of "don't know" entries is conceptually straightforward:
1. See if the object is already in the cache.
2. If not:
   a. allocate a new object, flagged as "locked";
   b. repeat the lookup, but this time insert the new object if none was found.
3. If an existing object was found, free the new version (if we allocated it), then wait if the found object is locked.
4. If no existing object was found, initialize the new object completely and unlock it, waking up any process waiting for it.
All of these steps can be seen in the new code, particularly in d_alloc_parallel(), which covers 2a, 2b, and 3. Step 4 can be found in lookup_slow(). Step 1 is separate; it is part of the "fast path" executed when everything is in cache. It is embodied in various calls to lookup_fast(), such as the one in walk_component(). The main source of extra complexity in this code is that a new hash table has been introduced to hold the "in-lookup" dentries. The primary hash table, dentry_hashtable, only holds entries on which lookup has completed and are thus known to be positive or negative; entries are added to the new in_lookup_hash using a separate linkage field (d_u.d_in_lookup_hash) in the dentry so that it can be transiently in both tables. When filesystem lookup completes, the entry is added to the primary hash table and then removed from the in-lookup hash table.
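The separate linkage can be seen in the dentry structure itself; here is an abridged excerpt of the relevant fields, roughly as they appear in include/linux/dcache.h for 4.7, with most members omitted:

    struct dentry {
            /* ... */
            struct hlist_bl_node d_hash;            /* linkage in dentry_hashtable */
            /* ... */
            union {
                    struct list_head d_lru;         /* LRU list */
                    wait_queue_head_t *d_wait;      /* used only by in-lookup dentries */
            };
            /* ... */
            union {
                    struct hlist_node d_alias;
                    struct hlist_bl_node d_in_lookup_hash;  /* linkage in in_lookup_hash */
                    struct rcu_head d_rcu;
            } d_u;
    };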
The lookup in step 2b needs to look in the primary hash table and then the in-lookup hash table, and it needs to be careful of possible races with the entry being moved from the latter to the former once lookup completes. To enable detection of these races a new "bit-seq-lock" is introduced — like a seqlock but with a single bit used as the spinlock.
The value of the secondary hash table is that it allows the insertion of new entries without the need to search the hash chain in the primary table under an exclusive lock. An exclusive lock (obtained with hlist_bl_lock()) is needed to search the hash chain in the secondary table, but that can be expected to be a much shorter chain that is accessed much less often. The exclusive lock on the primary hash chain is only held long enough to attach the dentry once it is ready.
With these concerns in mind, step 2b above can be expanded to:
A. Find the current value of the new per-directory bit-seq-lock.
B. Search the primary hash table with only RCU protection — exit if found.
C. Get an exclusive lock on the in_lookup_hash chain.
D. Check whether the bit-seq-lock has changed. If it has, retry from A. If it hasn't, then we have not yet raced with the target dentry being moved between tables, and the lock we just took will stop the race from happening after this point.
E. Search the in_lookup_hash chain; if nothing is found, insert the new entry that was allocated in 2a.
If the newly allocated dentry was inserted, a waitqueue provided by the caller is stored in the entry, in otherwise unused space, so a wakeup can be sent when the dentry is ready. If an existing, in-lookup dentry was found, then d_alloc_parallel() waits on that waitqueue for the wakeup, and then double checks to ensure that the dentry still looks correct: as no locks were held while waiting, the dentry could already have been renamed or unlinked.
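Putting the steps together, the overall shape of the protocol is roughly as follows. This is a heavily simplified sketch, not the kernel's d_alloc_parallel(); the helper names (read_dir_seq(), lookup_primary_rcu(), and friends) are invented for illustration, and error handling is omitted:

    struct dentry *lookup_or_create(struct dentry *parent, const struct qstr *name,
                                    wait_queue_head_t *wq)
    {
            struct dentry *dnew = d_alloc(parent, name);    /* step 2a */
            struct dentry *found;
            unsigned int seq;

    retry:
            seq = read_dir_seq(parent);                     /* step A */
            found = lookup_primary_rcu(parent, name);       /* step B */
            if (found) {
                    dput(dnew);
                    return found;           /* lookup already complete */
            }

            lock_in_lookup_chain(name);                     /* step C */
            if (read_dir_seq(parent) != seq) {              /* step D: raced, retry */
                    unlock_in_lookup_chain(name);
                    goto retry;
            }
            found = lookup_in_lookup_chain(parent, name);   /* step E */
            if (!found) {
                    dnew->d_wait = wq;      /* stash the caller's waitqueue */
                    add_to_in_lookup_chain(dnew);
                    unlock_in_lookup_chain(name);
                    return dnew;            /* caller now performs the filesystem lookup */
            }
            unlock_in_lookup_chain(name);

            dput(dnew);                                     /* step 3: discard our copy */
            wait_for_lookup_completion(found, wq);          /* sleep until d_in_lookup() clears */
            /* recheck that the dentry wasn't renamed or unlinked while we slept */
            return found;
    }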
With this understanding, it becomes possible to look through d_alloc_parallel() and most of it starts to make sense, though a particularly critical eye might notice
if (d_unhashed(dentry)) continue;
in the middle of the loop performing the search in in_lookup_hash. A similar test appears in other loops that search the primary hash table, so it looks unremarkable at first; it is only surprising if you remember that the two hash tables use different linkages. Since d_unhashed() tests the linkage used by the primary hash table, it really doesn't belong here.
This strangeness is particularly easy to notice with hindsight once you know that J. R. Okajima had been doing some testing and reported problems with this code; together with Al Viro he had narrowed down the problem to exactly this line of code. Fortunately, it will now be gone before 4.7-final is released.
Replacing the exclusive lock with a shared lock
Once per-name locking is in place, replacing the per-directory mutex with a per-directory read/write semaphore and only taking a read (or shared) lock for lookup is fairly straightforward. It has had some interesting consequences though.
As previously reported, Jan Kara expressed some concern at LSFMM about the performance of semaphores. They are not widely used in the kernel and read/write semaphores are inherently more complex than mutexes, so performance regressions seemed at least a theoretical possibility. At the time, Viro reported that he hadn't been able to measure any, but more recently Dave Hansen has found a small "blip" in unlink performance that he was able to narrow down to exactly the change from a mutex to a semaphore. Both mutexes and semaphores take an adaptive approach to waiting for the lock; first they spin for a little while, then they go to sleep and let the scheduler use the CPU for something else. They adapt slightly differently though, with mutexes spinning for longer in some cases. Consequently, using a mutex will waste more CPU time (reducing idle time) but often react more quickly (reducing latency).
Hansen wasn't really sure if this was an important regression or a puzzling inconsistency: "Should we be making rwsems spin more, or mutexes spin less?" he asked. Linus Torvalds felt that the mutex was probably the right approach, since performance matters and "Being slow under lock contention just tends to make for more lock contention".
Meanwhile Waiman Long has a patch set that makes a number of improvements to semaphores that may well address this issue too. So while the change was not as transparent as had been hoped, it appears that the performance of semaphores won't be a cause for concern for long.
In discussions following the original posting of this change, Viro observed that the locking could eventually be reduced further. So, if all goes well, the semaphore might eventually not be needed and any remaining measured regression will go along with it.
The change from exclusive to shared locking brought up another performance issue of a different kind. This issue affects directory reads ("readdir") rather than lookup; readdir was changed to use shared locking at the same time that lookup was changed, and for many of the same reasons. In particular, it affects dcache_readdir(), which is used by filesystems that keep all entries in a directory in the dcache. Specifically, it affects tmpfs.
dcache_readdir() acquires the d_lock spinlock for the directory, and similar locks on the entries in the directory. Previously, when readdir held an exclusive lock on the directory's mutex, these locks would mostly be uncontended and so impose minimal cost. With only a shared lock, it is possible for parallel readdir operations to experience much more contention on these locks. Usually, finer-grained locking improves performance, but when those locks result in more contention events, it can work the other way. As Viro described it when he reported the problem, there is now "an obscene amount of grabbing/releasing ->d_lock [...] And unlike mutex (or rwsem exclusive), contention on ->d_lock chews a lot of cycles."
This difficulty seems well on the way to being resolved with a proposed patch that reduces the number of times that d_lock is claimed. It would not be fair to say that the shared-locking changes created this problem, but it does highlight that, when you make changes to locking rules, strange and unexpected results can certainly appear. This is why ongoing performance testing that looks for regressions, especially in unusual workloads, is so important; it is encouraging to see that happening.
There is clearly a lot of testing happening; even so, as Viro observed separately in the context of some NFS-related races, "we really need a consolidated regression testsuite". Full coverage for network filesystems is more challenging than for local filesystems, in part because it really requires multiple machines. Ad-hoc testing by the community certainly does find bugs, as we have seen here, but it seems that, though we have much more structured testing than we once did, we would still benefit from having more.
How many -stable patches introduce new bugs?
The -stable kernel release process faces a contradictory set of constraints. Developers naturally want to get as many fixes into -stable as possible but, at the same time, there is a strong desire to avoid introducing new regressions there. Each -stable release is, after all, intended to be more stable than its predecessor. At times there have been complaints that -stable is too accepting and too prone to regressions, but not many specifics. But, it turns out, this is an area where at least a little bit of objective research can be done.
Worries about -stable regressions
Back in April, Sasha Levin announced the creation of a new extra-stable tree that would only accept security fixes. That proposal was controversial, and, after an initial set of releases, Sasha would appear to have stepped away from this project. While he was defending it, though, he claimed that some -stable patches introduce their own bugs, and offered a suggestion for doubtful developers: "Take a look at how many commits in the stable tree have a 'Fixes:' tag that points to a commit that's also in the stable tree." That is exactly what your editor set out to do.
While kernel changelogs have a fairly well-defined structure, they are still free-form text, so investigating this area requires a certain amount of heuristics and regular-expression work, but it can be done. The first step is to look at the Fixes: tag as suggested by Sasha. Any kernel patch that fixes a bug introduced by another patch is meant to carry a tag like:
Fixes: 76929ab51f0ee ("kselftests/ftrace: Add hist trigger testcases")
In reality there is some variation in the format, the most common of which is putting the word "commit" before the ID. One would think that the -stable tree, which is supposed to contain (almost) exclusively fixes, would have a Fixes: tag on almost every commit. In truth, fewer than half of the commits there carry such tags. A few of those without tags are, in fact, straightforward reverts of buggy patches. Git adds a recognizable line to the changelog of reverts, so, unless the developer has significantly changed that line, it is easy to determine which patch is being "fixed" when a revert is done.
Either way, though, the ID for the patch that introduced the bug is almost invariably the ID used in the mainline tree — not the ID of the patch as it appears in the stable tree. Fortunately, stable-tree patches are required to carry a line like:
commit d7591f0c41ce3e67600a982bab6989ef0f07b3ce upstream.
The format of that line tends to vary too but, once that is coped with, it turns out that something around 99% of the changesets in the stable tree can be mapped to their mainline equivalent. Or, more to the point, the mapping can be done in the other direction, allowing Fixes: tags to be associated with commits in a specific -stable series. So, when a Fixes: line exists, one can, as a rule, fairly easily determine whether the patch fixes a bug introduced by another -stable patch.
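As a rough illustration of the kind of scanning involved, the following sketch reads "git log" output for a stable branch, recording the stable-to-mainline mapping from the "upstream" lines and noting any Fixes: tags. The branch range is only an example, and the real analysis in gitdm is considerably more careful about format variations:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[1024], stable[41] = "", upstream[41], fixes[41];
            FILE *log = popen("git log --no-merges v4.4..linux-4.4.y", "r");

            if (!log)
                    return 1;
            while (fgets(line, sizeof(line), log)) {
                    if (strncmp(line, "commit ", 7) == 0) {
                            /* start of a new stable changeset */
                            sscanf(line, "commit %40s", stable);
                    } else if (strstr(line, " upstream.")) {
                            /* "commit <id> upstream." maps this patch to mainline */
                            if (sscanf(line, " commit %40s", upstream) == 1)
                                    printf("%s -> mainline %s\n", stable, upstream);
                    } else if (sscanf(line, " Fixes: %40s", fixes) == 1) {
                            /* record which commit this patch claims to fix */
                            printf("%s fixes %s\n", stable, fixes);
                    }
            }
            pclose(log);
            return 0;
    }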
The results
The most recent long-term support kernel is 4.4, which has had 14 stable updates thus far. Those updates contained 1,712 changesets, 632 of which contained some form of Fixes: tag. Of those, it turns out that 39 were fixes for other patches that had already appeared in 4.4-stable. So just over 2% of the patches going into 4.4-stable have proved (so far) to contain bugs of their own requiring further fixes.
For the curious, here's the full set:
There are a couple of things worth noting in these results. One is that nine of the bugs introduced into 4.4-stable were fixed in the same -stable release — and some were arguably not bugs at all. So those problems almost certainly did not actually affect any -stable users; taking those out reduces the number of actual -stable regressions in 4.4 (so far) to 30. On the other hand, two-thirds of the changes in 4.4-stable carry no Fixes: tag, but the bulk of them are almost certainly bug fixes as well. Some of them, undoubtedly, fix regressions that appeared in -stable, but, in the absence of somebody with the time, patience, and alcohol required to manually examine nearly 1,100 patches, there is no way to say for sure how many do.
To get some sort of vague sense of the regression rate, one can start with the fact that the number found here constitutes a hard floor — the rate must be at least that high. If one assumes that the regression rate in patches without Fixes: tags is no higher than in those with tags, a simple ratio gives a ceiling for the overall rate. For 4.4, that places the regression rate somewhere in the range of 2.3-6.2%.
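As a worked example using the 4.4 numbers above (39 regressions, 1,712 total patches, 632 patches carrying Fixes: tags):

    floor   = 39 / 1,712 ≈ 2.3%   (regressions as a share of all patches)
    ceiling = 39 /   632 ≈ 6.2%   (regressions as a share of tagged patches)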
Results from some of the other -stable trees are:

    Series    Patches    With Fixes: tags    Regressions    Regression rate
    4.6           314                 144              2           0.6-1.4%
    4.5           973                 437              9           0.9-2.1%
    4.4         1,712                 632             39           2.3-6.2%
    3.14        4,779               1,098            105           2.2-9.6%
In the end, the results are clearly noisy. There are regressions that appear in the -stable tree, and one can make some estimates as to just how many there are. There is no target regression rate for -stable (assuming that a target of zero is unrealistic), so whether the numbers shown above are acceptable or not is probably a matter of perspective — and whether one has been personally burned by a -stable regression or not.
One conclusion that can tentatively be drawn is that the regression rates for more recent kernels seem to be lower. Some portion of that reduction certainly comes from the youth of those kernels — there just hasn't been time to find all of the bugs yet. But it may also be that the efforts that have been made to reduce regressions in -stable (in particular, holding -stable patches until after they have appeared in a mainline -rc release) are having some effect.
In the end, nobody wants to see regressions in the -stable trees. But tightening down on patch acceptance to the point that regressions no longer appear there will almost certainly result in buggier kernels overall, since many good fixes will not be accepted. As with many things in engineering, picking stable patches involves tradeoffs; hopefully the addition of some metrics can help the community to decide whether those tradeoffs are being made correctly.
The code used to generate these results can be found as part of the gitdm collection of cheesy data-mining tools, located at git://git.lwn.net/gitdm.git.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet