Kernel development
Brief items
Kernel release status
The current development kernel remains 4.7-rc7; there is no new -rc release this week due to Linus's travel plans. The final 4.7 release is still expected on July 24. The current 4.7 regression report shows that a handful of regressions remain unfixed.
Stable updates: none have been released in the last week.
Quotes of the week
An honorary degree for Alan Cox
Congratulations are due to Alan Cox, who was awarded an honorary degree by Swansea University for his work with Linux. "Alan started working on Version 0. There were bugs and problems he could correct. He put Linux on a machine in the Swansea University computer network, which revealed many problems in networking which he sorted out; later he rewrote the networking software. Alan brought to Linux software engineering discipline: Linux software releases that were tested, corrected and above all stable. On graduating, Alan worked at Swansea University, set up the UK Linux server and distributed thousands of systems."
Kernel development news
Controlling access to the memory cache
Access to main memory from the processor is mediated (and accelerated) by the L2 and L3 memory caches; developers working on performance-critical code quickly learn that cache utilization can have a huge effect on how quickly an application (or a kernel) runs. But, as Fenghua Yu noted in his LinuxCon Japan 2016 talk, the caches are a shared resource, so even a cache-optimal application can be slowed by an unrelated task, possibly running on a different CPU. Intel has been working on a mechanism that allows a system administrator to set cache-sharing policies; the talk described the need for this mechanism and how access to it is implemented in the current patch set.
Control over cache usage
Yu started off by saying that a shared cache is subject to the "noisy neighbor" problem; a program that uses a lot of cache entries can cause the eviction of entries used by others, hurting their performance. The L3 cache is shared by all CPUs on the same socket, so the annoying neighbor need not be running on the same processor; a cache-noisy program can create problems for others running on any CPU in the socket. A low-priority process that causes cache churn can slow down a much higher-priority process; increased interrupt-response latency is another problem that often results.
The solution to the problem is to eliminate cache sharing between parts of the system that should be isolated from each other; this is done by partitioning the available cache. Each partition is shared between fewer processes and, thus, has fewer conflicts. There is an associated cost, clearly, in that a process running on a partitioned cache has a smaller cache. That, Yu said, can affect the overall throughput of the system, but that is a separate concern.
Intel's cache-partitioning mechanism is called "cache allocation technology," or CAT. Haswell-generation (and later) server chips have support for CAT at the L3 (socket) level. The documentation also describes L2 (core-level) support, but that feature is not available in any existing hardware.
In a CAT-enabled processor, it is possible to set up one or more cache bitmaps ("CBMs") describing the portion of the cache that may be used. If, on a particular CPU, the L3 cache is divided into 20 slices, then a CBM of 0xfffff describes the entire cache, while 0xf8000 and 0x7c00 describe two disjoint regions, each covering 25% of the cache.
The CBMs are kept in a small table, indexed by a "class of service ID" or CLOSID. The CLOSID will eventually control multiple resources (L2 cache, for example, or something entirely different) but, in current processors, it only selects the active CBM for the L3 cache. At any given time, a specific CLOSID will be active in each CPU, controlling which portion of the cache that CPU can make use of. Each CPU has its own set of CLOSIDs; they are not a system-wide resource.
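To make the bitmap arithmetic concrete, here is a minimal C sketch (not part of the patch set) that checks the example masks above and reports how much of the cache each one covers. The 20-slice cache and the two partial masks come from the example in the text; the contiguity check reflects the requirement, described in Intel's documentation, that the set bits in a CBM form a single contiguous block.

    /* Illustrative only: validate a CBM and compute its share of a
     * 20-slice L3 cache, using the example values from the text. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_WAYS  20                        /* slices in the example L3 */
    #define FULL_CBM    ((1u << CACHE_WAYS) - 1)  /* 0xfffff: the whole cache */

    /* A CBM is usable if it is non-zero, fits within the cache, and its
     * set bits form one contiguous run. */
    static bool cbm_valid(uint32_t cbm)
    {
        if (cbm == 0 || (cbm & ~FULL_CBM))
            return false;
        while (!(cbm & 1))          /* drop trailing zeros */
            cbm >>= 1;
        return (cbm & (cbm + 1)) == 0;  /* true only for a contiguous run */
    }

    static double cbm_share(uint32_t cbm)
    {
        return 100.0 * __builtin_popcount(cbm) / CACHE_WAYS;
    }

    int main(void)
    {
        uint32_t cbms[] = { FULL_CBM, 0xf8000, 0x07c00 };

        for (unsigned i = 0; i < sizeof(cbms) / sizeof(cbms[0]); i++)
            printf("CBM 0x%05x: %s, %.0f%% of the cache\n", cbms[i],
                   cbm_valid(cbms[i]) ? "valid" : "invalid",
                   cbm_share(cbms[i]));

        /* The two partial masks are disjoint: no cache slice is shared. */
        printf("0xf8000 & 0x07c00 = 0x%x\n", 0xf8000 & 0x07c00);
        return 0;
    }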
Kernel support is needed to make proper use of the CAT functionality. The number of CLOSIDs available is relatively small, so the kernel must arbitrate access to them. Like any resource-allocation technology, CAT control must be limited to privileged users or it will be circumvented. Yu described how CAT policies can be controlled via the interfaces implemented in the current patch set but, before getting into that, it's worthwhile to step away from the talk for a moment and look at the history of this interface.
Unsuccessful kittens
New hardware features often present interesting problems when the time comes to add support to the kernel. It is relatively easy to add that support as a simple wrapper and access-control layer around the feature, but care must be taken to avoid tying the interface to the hardware itself. A vendor's idea of how the feature should work can change over time, and other manufacturers may have ideas of their own. Any interface that is unable to evolve with the hardware will become unsupportable over time and have to be replaced. So it is important to provide an interface that abstracts away the details of how the hardware works to the greatest extent possible. At the same time, though, the interface cannot be so abstract that it makes some important functionality unavailable.
The first public attempt at CAT support in the kernel appears to be this patch set posted by Vikas Shivappa in late 2014. The approach taken was to use the control-group mechanism to set the CBM for groups of processes; the CLOSID mechanism was hidden by the kernel and not visible to user space at all. The initial review discussion focused on some of the more glaring deficiencies in the patch set, so it took a while before developers started to point out that, perhaps, control groups were not the right solution to this problem; it seems that they abstracted things a little too much.
There were a few complaints about the control-group interface, but by far the loudest was that it failed to reflect the fact that CAT works on a per-CPU basis — each processor has its own set of CLOSIDs and its own active policy at any given time. The proposed interface was tied to processes rather than processors, and it forced the use of a single policy across the entire system. There are plenty of real-world use cases that want to have different cache-utilization policies on different CPUs in the same system, but the control-group mechanism could not express those policies. This problem was exacerbated by the fact that the number of CLOSIDs is severely limited; making it impossible for each CPU to use its own CLOSID-to-CBM mappings made that limitation much more painful.
Beyond setting up different policies on different CPUs, many users would like to use the CPU as the primary determinant for cache policy. For example, a specific CPU running an important task could be given exclusive access to a large portion of the cache. If the task in question is bound to that processor, it will automatically get access to that cache reservation; any related processes — kernel threads performing work related to that task, for example — will also be able to use that cache space. This mode, too, is not well supported by an interface based on control groups. In its absence, users would have to track down each helper process and manually add it to the correct control group, a tedious and error-prone task.
The problem was discussed repeatedly as new versions of the patch set came out during much of early 2015. At one point, Marcelo Tosatti posted an interface based on ioctl() calls that was meant to address some of the concerns, but it seems there was little interest in bringing ioctl() into the mix. In November, Thomas Gleixner posted, for discussion, a description of how he thought the interface should work. He said that a single, system-wide configuration was not workable and that "we need to expose this very close to the hardware implementation as there are really no abstractions which allow us to express the various bitmap combinations." His overall suggestion was to create a new virtual filesystem for the control of the CAT mechanism; that is the approach taken by Yu's current patch set.
Herding the CAT
Returning to Yu's talk: he noted that a new patch set had been posted just prior to the conference; it shows the implementation of the new control interface. It is all based on a virtual filesystem, as Gleixner had suggested. Naturally enough, the name of that filesystem (/sys/fs/rscctrl) became the first topic of debate, with Gleixner complaining that it was too cryptic. Tony Luck's suggestion that it could instead be called:
/sys/fs/Intel(R) Resource Director Technology(TM)/
seems unlikely to be adopted; "/sys/fs/resctrl" may emerge as the most acceptable name in the end.
The top level of this filesystem contains three files: tasks, cpus, and schemas. The tasks file contains a list of all processes whose cache access is controlled by the bitmap found in the schemas file; similarly, cpus can be used to attach a bitmap to one or more CPUs. Initially the tasks file holds the IDs for all processes in the system, and cpus is all zeroes; the schemas file contains all ones. The default policy, thus, is to allow all processes in the system the full use of the cache.
Normal usage will involve the system administrator creating subdirectories to create new policies; each subdirectory will contain the same set of three files. A different CBM can be written to the schemas file in the subdirectory, changing the cache-access policy for any affected process. A process can be tied to that new policy by writing its ID to the tasks file. It is also possible to tie the policy to one or more CPUs by writing a CPU mask to the cpus file. A CPU-based policy will override a process-ID-based one — if a process is running on a CPU with a specific policy, that is the policy that will be used regardless of whether the process has been explicitly set to use a different one.
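For illustration, here is a rough sketch of how a privileged program might drive this interface to set up a policy like the one just described. The directory name, the PID, the CPU mask, and the exact strings accepted by the schemas and cpus files are all placeholders; the real formats are spelled out in the patch set's documentation.

    /* Sketch only: create a policy group under the proposed rscctrl
     * filesystem and attach a CBM, a CPU, and a task to it.  The
     * "rt-partition" name and the values written below are hypothetical. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f || fprintf(f, "%s\n", val) < 0 || fclose(f)) {
            perror(path);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        /* Create a new policy group; the kernel populates it with its
         * own tasks, cpus, and schemas files. */
        if (mkdir("/sys/fs/rscctrl/rt-partition", 0755) && errno != EEXIST) {
            perror("mkdir");
            return EXIT_FAILURE;
        }

        /* Give this group 25% of the cache (value and format are placeholders). */
        write_str("/sys/fs/rscctrl/rt-partition/schemas", "0xf8000");

        /* Tie the policy to CPU 2 via a CPU mask (placeholder format)... */
        write_str("/sys/fs/rscctrl/rt-partition/cpus", "4");

        /* ...and to one specific process by its ID. */
        write_str("/sys/fs/rscctrl/rt-partition/tasks", "1234");

        return 0;
    }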
Yu's talk glossed over a number of details on exactly how these control files work, as one might expect; the documentation file from the patch set contains those details and some usage examples as well. He did discuss some benchmark results (which can be seen at the end of his slides [PDF]) showing significant improvements for specific workloads that were affected by heavy cache contention. This feature may not be needed by everybody, but it seems that some users will have a lot to gain from it. Realtime workloads, in particular, would appear to stand to benefit from dedicated cache space.
As for where things stand: the current patch set is out for review, with the hope that the most significant obstacles have been overcome at this point. Assuming that the user-space interface issues have now been resolved, this code, which has been under development for well over a year, should be getting close to being ready for merging into the mainline.
[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan].
LTSI and Fuego
It has now been nearly five years since Tsugikazu Shibata announced the launch of the long-term support initiative (LTSI) project. LTSI's objective is to provide extended support for specific kernel releases that can serve as a rallying point for embedded-system vendors and a means by which those vendors can get their patches upstream. At LinuxCon Japan 2016, Shibata-san provided an update on LTSI; he was followed by Tim Bird, who discussed the "Fuego" test framework that is now being used to help validate LTSI releases.
An LTSI update
The core process for LTSI kernels has not changed much since the project's inception. LTSI releases are based on the long-term support releases maintained by Greg Kroah-Hartman, and are maintained for the same time period. They do, however, include a significant set of extra patches in the form of vendor-contributed features and backports from more recent kernels; they also go through more extensive testing than ordinary stable-kernel releases.
Five years in, LTSI is seeing some significant adoption. The Yocto meta-distribution has had an option to use LTSI kernels since 2012. The Automotive Grade Linux project is using LTSI kernels (via Yocto). The relatively new Civil Infrastructure Platform (CIP) project is also using LTSI, with an interesting twist. Systems based on CIP are likely to be deployed in situations where they are expected to run for a long time, so there is a need for long-term support with a different value of "long-term": 10-15 years. CIP will itself be providing that support by taking over responsibility for LTSI kernels after LTSI itself has moved on. Shibata-san noted that supporting a kernel for that long is going to be an interesting challenge; he wished the project luck.
The LTSI release process starts with one of the regular long-term support releases. There is a four- or five-month period in which patches to this kernel are prepared; these include backports and other features that are useful to the LTSI community. That is followed by a two-month merge window in which all those patches are applied. One month of validation follows; all contributors to the LTSI kernel are expected to ensure that things work properly in the final release. This process, Shibata-san claimed, leads to the production of one of the most stable and secure kernels available.
That said, there are concerns that the current seven-month process takes too long; the latency is felt especially acutely with the 4.4 kernel, which came a bit sooner than had been expected. So the project is talking about shortening the release process this time around: there would be a two-month preparation period and a one-month merge window. The final decision on that change, it seems, will come in the near future.
Fuego
One of the significant changes in the LTSI release process mentioned by Shibata-san was the adoption of a new testing framework called "Fuego." Bird used the next session to talk about Fuego and how it works. In short, Fuego is the combination of the Jenkins continuous-integration tool, a set of scripts, and a collection of tests, all packaged within a Docker container.
Jenkins is used to run tests based on various triggers and collect the results. It is widely used and features hundreds of extensions to handle things like email notifications or integration with source-code management systems. The big customization that Fuego has added is to separate host and target configuration; testing can be directed from a host, but it runs on the specific embedded target of interest.
There is a set of "abstraction scripts" designed to make Fuego work with any specific target board; these scripts are driven by variables describing how to interact with the board, functions to get or put files and run commands, etc. The end result is a generated script to run the actual tests. There are about fifty tests integrated into the system so far; most of those are existing tests from elsewhere, but the plan is to add a bunch of new tests as well.
The whole system is designed to be packaged up into a Docker container. The end result should be runnable on any Linux distribution without modification.
Fuego was designed to be easy for embedded engineers to set up and run. It comes with configurations for specific target systems, including Yocto, Buildroot, OpenWrt, and more. Various target types and transports are supported; Fuego can talk to a target using a serial port, SSH, or Android's adb tool, among others. It is designed to send test results to a centralized repository. The end goal is to enable the creation of a decentralized test network, allowing the testing of changes on a wide variety of hardware and getting past the "I don't have that particular board" problem.
Future plans include the decluttering of the Jenkins interface, which is rather busy at the moment. The project would like to add handling for USB connections, making it easier to use tools like adb to talk to handset-like devices. More documentation and more tests are on the list, as is integration with the kernelci.org project.
More users and contributors would certainly be welcome. The project is using the ltsi-dev mailing list for its communications for now; more information on the project, including pointers to the repositories, can be found on elinux.org. See this page for more information on how to install and use Fuego.
[Your editor would like to thank the Linux Foundation for supporting his travel to LinuxCon Japan].
Coding-style exceptionalism
As I was analyzing the behavioral details of various drivers as part of my research for a recent article on USB battery charging in Linux, I was struck by the thought that code doesn't exist just to make certain hardware perform certain functions. Important though that is, the code in Linux, and in other open projects, also exists as a cultural artifact through which we programmers communicate and from which we learn. The disappointment I felt at the haphazard quality I found was not simply because some hardware somewhere might not perform optimally. It was because some other programmer tasked with writing a similar driver for new hardware might look to some of these drivers for inspiration, and might copy something that was barely functional and not use the best interfaces available.
With these thoughts floating around my mind I was interested to find a recent thread on the linux-kernel mailing list that was more concerned about how a block of code looked than about what it did.
The code in question handles SHA-256 hash generation on Intel x86 platforms. The thread started because Dan Carpenter's smatch tool had found some unusual code:
if ((ctx->partial_block_buffer_length) | (len < SHA256_BLOCK_SIZE)) {
The "|" here looks like it was probably meant to be
"||" — so there was a bit-wise "or" where a logical
"or" is more common. Carpenter went to some pains to be clear
that he knew the code would produce the same result no matter which
operator was used, but observed that "it's hard to tell the
intent.
" Intent doesn't matter to a compiler, but it does to a
human reader. Even well-written code can be a challenge to read due to the
enormous amount of detail embedded in it. When there is an unusual
construct that you need to stop and think about, that doesn't make it any
easier.
There were a couple of suggestions that this was an intentional optimization, and there is some justification for that view: with both GCC 4.8 and 5.3 compiling for x86_64, the "|" version produces one fewer instruction, avoiding a jump. In some cases that small performance difference might be worth the small extra burden on the reader, though, as Joe Perches observed, "It's probably useful to add a comment for the specific intent here"; that would not only make the code easier to read, but would ensure that nobody broke the optimization in the future. Further, the value of such optimizations can easily vary from compiler to compiler.
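A comment along the lines Perches suggested might look something like the following sketch. The context structure and constant here are simplified stand-ins for the driver's real definitions, and the comment wording is only one way the intent could be recorded.

    /* Sketch only: the struct and constant are placeholders for the real
     * definitions in the SHA-256 driver. */
    #include <stdint.h>
    #include <stdbool.h>

    #define SHA256_BLOCK_SIZE 64

    struct sha256_ctx {
        uint32_t partial_block_buffer_length;
    };

    bool needs_partial_block_handling(const struct sha256_ctx *ctx, uint32_t len)
    {
        /*
         * Deliberate bitwise OR, not a typo for "||": the operands are a
         * length and a 0/1 comparison result, so the bitwise OR is nonzero
         * exactly when the logical OR would be true, and skipping the
         * short-circuit lets some compilers drop a conditional branch on
         * this hot path.
         */
        if ((ctx->partial_block_buffer_length) | (len < SHA256_BLOCK_SIZE))
            return true;

        return false;
    }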
Once a little attention was focused on this code, other complaints arose, with Ingo Molnar complaining about the excessive use of parentheses, the unusually long field name partial_block_buffer_length, and, responding to what is clearly a sore spot for some, requesting that the "customary" style be used for multi-line comments.
Documentation/CodingStyle explains that:
The preferred style for long (multi-line) comments is:

    /*
     * This is the preferred style for multi-line
     * comments in the Linux kernel source code.
     * Please use it consistently.
     *
     * Description:  A column of asterisks on the left side,
     * with beginning and ending almost-blank lines.
     */

For files in net/ and drivers/net/ the preferred style for long (multi-line) comments is a little different.

    /* The preferred comment style for files in net/ and drivers/net
     * looks like this.
     *
     * It is nearly the same as the generally preferred comment style,
     * but there is no initial almost-blank line.
     */
The code under the microscope is in arch/x86/crypto — not strictly part of the networking subsystem — but it uses the net/ and drivers/net/ style in at least one place. Herbert Xu, the crypto subsystem maintainer, asserted that the crypto API uses the same style as networking, but Molnar wasn't convinced and neither, it turned out, was Linus Torvalds. I won't try to summarize Torvalds's rant (which he promised he would not follow up on), but I will examine a concrete and testable assertion made by Molnar: "That 'standard' is not being enforced consistently at all".
Looking at the ".c" and ".h" files in linux 4.7-rc7 and using fairly simple regular expressions (which might have occasional false positives), the string "/*" appears 1,308,166 times, suggesting the presence of over 1.3 million comments. Of those, 981,168 are followed by "*/" on the same line, leaving 326,998 multi-line comments. 200,737 of these have nothing (apart from the occasional space) following the opening "/*" on the first line, and 51,366 start with "/**" which indicates a "kernel-doc" formatted comment, leaving 74,895 multi-line comments in the non-customary format with text on the first line.
These three groups are present in a ratio of approximately 8:2:3. The kernel-doc comments have to be in the expected format to be properly functional, leaving the developer no discretion; it thus isn't reasonable to include them when looking at the choices developers have made. Of the multi-line comments where the programmer has some discretion, we find an 8:3 ratio of customary format, in the sense Molnar meant it, to others. So 27% are non-standard.
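As a rough sketch of that classification (not the actual scripts used to produce these numbers), the following program applies the same four-way split to a single source file; like the regular expressions, it can be fooled by the occasional false positive, such as a "/*" inside a string literal.

    /* Approximate, per-file version of the comment classification
     * described in the text; illustrative only. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char line[4096];
        long single = 0, customary = 0, kerneldoc = 0, netstyle = 0;
        FILE *f;

        if (argc != 2 || !(f = fopen(argv[1], "r"))) {
            fprintf(stderr, "usage: %s file.c\n", argv[0]);
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            char *start = strstr(line, "/*");

            if (!start)
                continue;
            if (strstr(start, "*/"))
                single++;              /* opened and closed on one line */
            else if (start[2] == '*')
                kerneldoc++;           /* "/**": kernel-doc comment */
            else {
                /* Any text after the opening "/*" marks the net/ style. */
                char *p = start + 2;
                while (*p == ' ' || *p == '\t')
                    p++;
                if (*p == '\n' || *p == '\0')
                    customary++;
                else
                    netstyle++;
            }
        }
        fclose(f);

        printf("single-line %ld, customary %ld, kernel-doc %ld, net-style %ld\n",
               single, customary, kerneldoc, netstyle);
        return 0;
    }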
If we repeat these measurements for net/, drivers/net/, crypto/, and drivers/crypto/, the numbers of non-standard multi-line comments are:
    Subsystem          Total comments    Net-style    Percent
    net/                       13,441        6,423        48%
    drivers/net/               36,599       19,516        54%
    crypto/                       593          171        29%
    drivers/crypto/               706          178        25%
So broadly, the evidence does seem to support Molnar's claim. While "text-on-the-first-line" comments are more common in the networking code, they just barely constitute a majority of multi-line comments there and they are not significantly more common in the crypto code. This statistic doesn't tell us a lot, but it does suggest that the supposed "preferred" style for the networking code is not consistently preferred in practice, and that sticking to it for new comments wouldn't actually improve the overall consistency of that code.
Some of us may think this is all a storm in a teacup and that empty lines in comments, much like empty lines in code, are a matter of personal taste and nothing more. For many people this may be true. But open-source code benefits particularly from being read by people who pay close attention to detail, who will notice things that look a bit out of place, and who can spot bugs that compilers or static analyzers will miss. These people are likely to notice, and so be burdened by, irrelevant details like non-standard comments.
For Molnar at least, "the networking code's 'exceptionalism' regarding the standard comment style is super distracting", and there is evidence that he is not alone in this. To get the greatest value from other people reading our code, it makes sense to keep it as easy to read as possible. The code doesn't just belong to its author; it belongs to the community, which very much includes those who will read it, whether to fix bugs, to write documentation, or as a basis for writing new drivers. We serve that community, and so indirectly ourselves, best when we make our code uniform and easy for others to read.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet