Kernel development
Brief items
Kernel release status
The 4.6 kernel was released on May 15; Linus said: "It's just as well I didn't cut the rc cycle short, since the last week ended up getting a few more fixes than expected, but nothing in there feels all that odd or out of line." Some of the more significant changes in this release are: post-init read-only memory as a bare beginning of the effort to harden the kernel, support for memory protection keys, the preadv2() and pwritev2() system calls, the kernel connection multiplexer, the OrangeFS distributed filesystem, compile-time stack validation, the OOM reaper, and many more. See the KernelNewbies 4.6 page for an amazing amount of detail.
The 4.7 merge window is open; see the separate article below for a summary of what has been merged so far.
Stable updates: none have been released in the last week. The 4.5.5, 4.4.11, and 3.14.70 updates are in the review process as of this writing; they can be expected on or after May 19.
Quotes of the week
Kernel development news
4.7 Merge window, part 1
The 4.7 merge window opened on May 15, after the release of the 4.6 kernel. Since then, Linus has pulled 3,345 non-merge changesets into the mainline repository. A significant chunk of that total came via the networking tree, but some other big trees (including the virtual filesystem tree with a number of fundamental, mostly under-the-hood changes) have been pulled as well. Linus would appear to have decided to start with mostly core-kernel changes this time around; few device-driver trees have been pulled as of this writing.The most significant user-visible changes merged so far include:
- The schedutil CPU-frequency governor
has been merged. This is the first governor that takes load
information directly from the scheduler, ushering in the new era where
CPU-related power management and the scheduler actually work
together. It is in a relatively simple form for 4.7, but will be
enhanced in the future. See this
changelog for more information on the current state of schedutil.
- The sigaltstack()
system call now supports a new flag called SS_AUTODISARM.
When this flag is provided, the alternate signal stack will be
disabled while the signal handler itself is running. That allows the
application to call swapcontext()
without corrupting the signal state, a feature that, evidently, is
especially useful for dosemu.
- The kernel now supports an EFI "capsule loader," accessible via
/dev/efi_capsule_loader. It can be used to load firmware
updates in the EFI capsule format; see this
blog entry for information on why this can be useful.
- The arm64 architecture has gained support for non-uniform memory
architecture (NUMA) systems. Arm64 also now supports hibernation
(suspend-to-disk).
- Two new flags have been added to the preadv2() and
pwritev2() system calls (which were merged in 4.6), though
they are only really applicable to write operations.
RWF_SYNC causes data and metadata to be flushed to persistent
media after the operation, while RWF_DSYNC causes only data
to be flushed.
- The ability to attach BPF programs to
tracepoints has been added. This significantly increases the
dynamic tracing functionality available in mainline kernels.
- BPF programs used with the cls_bpf and act_bpf
traffic-control modules may now access packet content directly without
having to call special load functions. The result is a significant
increase in performance at the cost of possibly exposing kernel data
to user space. These programs can only be loaded by a privileged
user, though, so data leaks should not normally be a problem.
- The BPF just-in-time compiler can do "constant blinding": scrambling
constant values in BPF programs so that they cannot be used to load
arbitrary instructions into kernel space. See this
changelog for more information.
- A patch from Airbus adds support for v1 of the high-availability
seamless redundancy protocol to the network stack.
- The TCP code has been reworked to make it much more preemptible; that
should help to reduce latency spikes when large numbers of packets
need to be processed.
- The GPRS
tunneling protocol GTP-U protocol is now supported by the kernel.
- New hardware support includes:
- Systems and processors:
SGI Ultraviolet UV4 systems.
- Cryptographic:
Freescale security controllers and
Hisilicon random-number generators.
- Miscellaneous: Maxim integrated MAX31722/MAX31723 SPI temperature sensors, TI LP873X power regulators, Powerventure Semiconductor PV88080 voltage regulators, devices using the Qualcomm IPC router protocol, I2C-connected NXP PN533 NFC interfaces, Asus X205TA keyboards, and Loongson 1 GPIO controllers.
- Systems and processors:
SGI Ultraviolet UV4 systems.
Changes visible to kernel developers include:
- Reader/writer semaphores can now be locked for writing with
down_write_killable(), which allows the locking process to be
killed by a fatal signal while waiting.
- The first steps in Thomas Gleixner's grand
plan to rationalize the CPU hotplug subsystem have been merged.
The big state machine envisioned by Thomas isn't there yet, but the
process of getting the hotplug notifiers ready for that step is moving
forward.
- The "floating proportions" code, described in this 2007 article, has been removed. Few
developers will notice, though: it was determined that these functions
were not being used anywhere in the kernel.
- In a change that Linus called
"
a big deal
", the virtual filesystem layer can now do multiple lookups within a directory in parallel, eliminating a significant source of contention. As part of this work, the file_operations structure has gained a new method:
int (*iterate_shared) (struct file *file, struct dir_context *context);It works like the existing iterate(), except that multiple calls can be made simultaneously within the same directory. The plan is to remove iterate() once all filesystems have switched over; in many cases, the existing iterate() implementation works just fine as iterate_shared().
A two-week merge window would be expected to end on May 29. Linus has occasionally been known to close the merge window early, though. Given that the 29th lands in the middle of a holiday weekend in the US, one might conclude that the temptation to wrap up the merge window a little early might be stronger than usual this time around.
Generic hashing functions and the platform problem
What is a kernel developer to do when they need a simple hashing function to use in a new hash table, and they find that the obvious choice provided by the kernel works poorly? The "right" option is to fix the common code. The "easy" option is to write a replacement or a workaround. The "best" option, it seems, is to make sure Linus Torvalds finds out, because this is just the sort of thing that he cares about.
Linux has a fairly simple and efficient set of hashing functions in include/linux/hash.h that work on simple input values: 32-bit or 64-bit numbers, or pointers that are cast to one of those depending on the architecture. The hash is computed by multiplying the input by one large number and ignoring any overflow, thus effectively taking the remainder of a division by another large number: 232 or 264, depending on the word size. The required number of bits are then extracted from the result.
When Thomas Gleixner was testing his second patch set for making futexes faster, he discovered that the hash values returned weren't particularly evenly distributed and, as a result, he was receiving more collisions than expected. He addressed this problem by writing a simple alternative, sparking a sub-thread exploring the problems with these hash functions. As Torvalds found during subsequent investigations, Gleixner was not the first developer to decide it was easier not to fix the broken function. When working on the RAID 5/6 code for Btrfs, David Woodhouse, or possibly Chris Mason, discovered problems with the "hash_64()" function and helpfully provided a comment:
* we shift down quite a bit. We're using byte
* addressing, and most of the lower bits are zeros.
* This tends to upset hash_64, and it consistently
* returns just one or two different values.
It was determined that zeros in the lower-order bits result in the remaining bits not being mixed well, so there is not much variation in the output — not a particularly useful property for a hash function. The chosen response was not to fix hash_64() but to shift down the input to hide the problem.
Some eight years ago, Matthew Wilcox made some changes to the interfaces for the hash function and noted in the commit message:
So he, too, knew that there were problems with the hashing function, particularly the 64-bit version. It would be naive to think that these three are the only developers to have noticed a problem but none made the effort to fix it. In the 14 years since this code was introduced, several people have noticed a problem and no one has bothered to fix it. This is an example of what has become known as the "platform problem".
Fortunately Torvalds was interested in the futex changes and so paid attention to the second patch set from Gleixner. He made no comment on the futex changes, but did take interest in the hash changes, researching the problem and orchestrating the fix. The key issue turned out to be the number that the input is multiplied by. Supposedly based on recommendations from Donald Knuth, this number was chosen to be a prime and to be approximately the fractional part of the golden ratio multiplied by 264 (in the 64-bit case). A prime was chosen that was "bit sparse", having long runs of ones or zeros in the binary representation.
This latter choice was guided by the desire for the multiplication to be fast. On some hardware, shifts and addition or subtraction can be faster than general multiplication and with fewer runs in the binary pattern, fewer additions or subtractions are needed. For the 64-bit case, the pattern is particularly sparse, having a run of 33 one bits.
/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
This sparseness was the problem. Torvalds's research suggested that the primality was barely interesting and a misunderstanding. As George Spelvin later explained it:
It also appears that the golden ratio is not all that magical. It is
just a convenient number that is, as Torvalds put it, "an
irrational number that is not near to 0 or 1
".
The important property was to have "roughly 50% of the bits set
in a random pattern
" so the resultant mixing of bits would hide
any patterns in the input. The current hash function, it seems,
emphasizes exactly the wrong qualities.
Once the problem was understood, fixing it was relatively straightforward: just pick a better multiplier with more "randomness" in the bit pattern. This was done by approximating an irrational value and not worrying about primality or bit-sparseness. On hardware without 64-bit multiplication, the two 32-bit halves of the 64-bit input are hashed sequentially so two 32-bit multiplications are used. This doesn't return a 64-bit number, but no code ever wants more than 32 bits of hash, so not providing the full 64 is no loss.
Spelvin made good use of the opportunity provided by this need to adjust the hash functions and examined other hashing code, fixing up the hash_mem() and hash_str() function in the sunrpc code (used by nfsd) and suggesting improvements for some simplistic hashing used in the inode cache.
A minimal fix has landed for Linux 4.6-rc7 with a more complete fix expected for v4.7. It is good to know that this problem with the Linux kernel platform has been addressed, but it does lead one to wonder what other problems there are that have been conveniently ignored by many of us, and just need a little light to shine on them at the right time for the problem to be quickly fixed.
Threadable NAPI polling, softirqs, and proper fixes
Sometimes, when trying to make the kernel work better for specific workloads, one runs into problems originating in genuinely old code. Delving into such code can be a daunting task — it has worked for many years, and tweaking it could have surprising consequences in distant parts of the kernel. So it is not surprising that developers can be tempted to work around a problem in other ways. Such a situation recently came up with regard to high-rate networking on small systems; a look at the problem and its resolution provides a good excuse to dig into an ancient kernel mechanism: software interrupts or "softirqs".
Software interrupts
Processing of hardware interrupts is considered to be one of the most urgent tasks for the kernel; when an interrupt is signaled, the system drops what it is doing and calls the interrupt handler. Often, an interrupt will tell the system that there is more work — such as processing a completed I/O operation — to be done, but that work should not be done in the interrupt handler itself. To handle such cases, the kernel provides a number of mechanisms for deferred execution; these include workqueues, tasklets, timers, and software interrupts.
Of these mechanisms, software interrupts must be about the oldest; kernel/softirq.c starts with a copyright statement from Linus dated 1992 (though, in truth, that comment may be the only thing that survives from the kernel/softirq.c that appeared in the 1.2 release). It is also a mechanism that many developers will be relatively unfamiliar with; most will never interact directly with software interrupts. But they play an important role in how many urgent tasks in the kernel are handled.
Essentially, a software interrupt is meant to look like a hardware interrupt, except that it runs at a lower priority. Software interrupts can execute asynchronously and are considered more important than most other things the kernel can do. They are, in short, a way to say "do some work as soon as possible after we finish handling hardware interrupts."
All known softirqs are hard-coded in the kernel source; there is no mechanism to add them dynamically. There are currently ten softirqs defined, though one of them is not used. The ones that are used are for network processing, block request processing, block interrupt polling, read-copy-update processing, scheduler housekeeping, and tasklet processing — the tasklet mechanism is a sort of wrapper around the softirq layer. Each subsystem that uses softirqs registers a handler to be called when an interrupt is pending; signaling of software interrupts is done with a call to raise_softirq().
There are a few places where software interrupt handlers actually get run, including immediately after a hardware interrupt handler completes. So softirqs signaled from interrupt handlers will often be handled almost immediately. Softirq processing can be disabled by kernel code (taking a spinlock will do it, for example); softirq handlers can be run once they are enabled again (once the spinlock is released, for example). That means that, in practice, there are many points throughout the kernel where softirqs can execute.
One problem with the softirq mechanism is that it can create arbitrary latencies at almost any point in the kernel where softirq handlers can be run; there is often a lot of work that is done in those handlers. Thus, developers working on the realtime patches have grumbled about softirqs for years and have changed how they are processed in realtime kernels. In the mainline kernel, this problem has been partially addressed by trying to limit the amount of time spent processing softirqs. If any call to __do_softirq() (the function that actually calls softirq handlers) finds itself running for more than 2ms, processing will stop and the ksoftirqd kernel thread will be woken up to finish the job. So, if the softirq load gets to high, it gets pushed into a thread where it has to compete with the rest of the work the system is doing.
For the most part, this mechanism works well enough that it has not been removed, even though a number of developers would like to see it go. On the other hand, any attempt to add more softirq sources would almost certainly encounter strong pushback, which is why new ones are almost never added. Anybody thinking about doing so will be directed toward workqueues, or, if nothing else will suffice, tasklets.
Softirqs and NAPI polling
The networking subsystem uses two software interrupts, one each for transmit and receive processing. The receive softirq is where NAPI processing (the polling of interfaces for new packets) is done. If there are a lot of packets to handle, this processing can take a long time. As with the softirq subsystem itself, the networking stack imposes a limit on how much time it will spend accepting packets in the softirq handler. That limit, though, is set to two jiffies — as much as 20ms, depending on the system's clock speed. In other words, the networking code increases the maximum time spent continuously handling softirqs by as much as a factor of ten.
Of course, one wants the networking stack to have the CPU time it needs to deal with the flow of packets into the system. But it is also necessary to leave sufficient time for user space to actually do something with those packets. As Paolo Abeni reported in the introduction to his threadable NAPI poll loop patch set, that doesn't always happen:
His solution was to (at the system administrator's option) move NAPI processing out of softirqs entirely, and into its own dedicated kernel thread. Even if no other configuration is done, this change allows the scheduler to balance NAPI processing against the needs of the other runnable threads on the system, giving user space a chance to run. The results, Paolo said, are good:
Paolo's solution looks impressive, but it failed to impress the networking developers. Eric Dumazet rejected the patch, saying:
Paolo was not pleased with this response, but he was able to come back with a bit of important information: why ksoftirqd, as it is implemented in current kernels, is not a sufficient solution to the problem.
It seems that ksoftirqd will process packets for the two jiffies allowed by the networking code, then yield so that user space can run; Paolo estimated that, on his system, about 640 packets would be processed during this time. Once user space gets a chance to run, it will make a system call to retrieve a packet; the code implementing the system call will, at some point, find itself taking and releasing locks. But the point where a lock is released is an opportunity for the kernel to do softirq processing, and that is exactly what happens: softirq processing is done in the calling process's context. In Paolo's case, that in-context softirq processing would handle another 640 packets before returning a single packet to user space — the kernel, in other words, was processing nearly 1,300 packets before letting user space have even one of them. That is truly a system that is out of balance.
To Paolo, this behavior was a good reason to move NAPI processing into a
more controllable context, but Eric's response was, simply: "Looks you found
the bug then. Have you tried to fix it?
". After a bit, Eric came up
with a simple fix of his own. With the
resulting
patch, waking up the ksoftirqd thread will set a per-CPU flag.
Any code that would fire off softirq processing in process context first checks
that flag; if it is set, the processing is skipped. This has the effect of
forcing all softirq processing over to ksoftirqd once it has been
invoked.
Paolo did some testing and reported performance figures close to what were obtained with his patch. Even so, he preferred his patch, but indicated acceptance that it was probably not going to be merged. Eric's patch, after some further tweaking, looks likely to land in 4.7.
One of Paolo's reasons for taking the approach he did was to avoid any
possible effects on other softirq users. As Eric put it, this was a natural tendency:
"Right, we are networking guys, and we feel that messing with such
core infra is not for us.
" But the long-term maintainability of the
kernel depends on fixing problems where they are found rather than adding
new mechanisms to work around those problems. In this case, a small
problem would appear to have been correctly fixed. The rather larger
problem (as seen by some developers) represented by the simple existence of
the softirq code remains untouched, though.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page:
Distributions>>
