LWN.net Weekly Edition for September 21, 2017
Welcome to the LWN.net Weekly Edition for September 21, 2017
This edition contains the following feature content, most of which comes from the Open Source Summit North America and Linux Plumbers Conference events:
- Linking commits to reviews: A new tool to locate email discussions relevant to specific kernel commits.
- Building the kernel with Clang: The ability to build the kernel with the LLVM Clang compiler has been a wishlist item for years; it is getting closer to reality.
- Building an ARM64 laptop: Why would one want to build a laptop around an ARM64 processor, and how would one go about doing it?
- The rest of the 4.14 merge window: The 4.14 merge window closed on September 16; here's a summary of the last set of changes to be merged.
- Notes from the LPC scheduler microconference: Several sessions on various aspects of CPU-scheduler development.
- Testing kernels: How can the testing of pre-release kernels be improved?
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Linking commits to reviews
In a talk in the refereed track of the 2017 Linux Plumbers Conference, Alexandre Courouble presented the email2git tool that links kernel commits to their review discussion on the mailing lists. Email2git is a plugin for cregit, which implements token-level history for a Git repository; we covered a talk on cregit just over one year ago. Email2git combines cregit with Patchwork to link the commit to a patch and its discussion threads from any of the mailing lists that are scanned by patchwork.kernel.org. The result is a way to easily find the discussion that led to a piece of code—or even just a token—changing in the kernel source tree.
Courouble began with a short demo of the tool. It can be accessed by typing (or pasting) in a commit ID on this web page, which brings up a list of postings of the patch to various mailing lists; following those links shows the thread where it was posted (and, often, discussed). Another way to get there is to use cregit; navigating to a particular file then clicking on a token will bring up a similar list that relates to the patch where the symbol was changed. Note that the Patchwork data only goes back to 2009, so commits before that time will not produce any results.
So, email2git allows those interested to get a look into the design decisions that went into a particular chunk of code. Without it, doing so manually is not particularly easy. He presented several use cases, starting with security researchers who want to understand the thinking behind a patch. It can also be used in bug fixing and by newcomers to the kernel community. In addition, email2git is being used as part of a recently announced Linux Foundation project: Community Health Analytics Open Source Software (CHAOSS).
Email2git takes commits from the mainline Git repository and tries to match them up to patches that Patchwork has picked up. Patchwork scans around 70 mailing lists to extract patches and the discussion threads that follow. It provides a user-friendly online interface, though email2git accesses the Patchwork database directly. Cregit is used to find changes at the token level, which is more accurate than git blame, he said.
There is not any kind of direct mapping from patches posted on a mailing list to commits in the mainline Git tree, so email2git needs to find those matches. Initially, he used a method from some research papers that effectively did an exhaustive search comparing the diff output in a commit to that in all of the posted patches until a match is found. That did not scale once he started working with the entire data set, which is some 500K mainline commits and 1.4 million patches posted since 2009. Some kind of heuristics were needed to narrow down the search space.
Courouble ended up using three pieces of information from each patch to match it to a Git commit. The first is the subject of the email, which is often carried over into the Git commit summary. That heuristic alone finds 55% of the commits directly. Step two is to look at the patch author; he has created a map of all patches from a given author, so those can be tried next. The third heuristic is to match up the files that are affected by the patches. Each of the last two steps compares the diff in the patch with the diff in the commit to make a matching decision.
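The three heuristics can be sketched roughly as follows. This is only an illustration under stated assumptions, not email2git's actual code: the `Change` record, the subject normalization, and the `diffs_match()` comparison are all hypothetical stand-ins, since the talk did not describe those details.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A posted patch or a mainline commit, reduced to the fields compared here."""
    subject: str
    author: str
    files: frozenset
    diff: str


def normalize_subject(subject):
    """Drop leading '[PATCH v2 3/7]'-style tags, then fold case and whitespace."""
    subject = re.sub(r"^(\s*\[[^\]]*\])+", "", subject)
    return " ".join(subject.split()).lower()


def changed_lines(diff):
    """Added and removed lines of a unified diff, ignoring the file headers."""
    return {
        line for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    }


def diffs_match(a, b):
    # Hypothetical stand-in for the real diff comparison.
    return changed_lines(a) == changed_lines(b)


def match_commit(commit, patches):
    """Return (heuristic_name, matching_patches) for one commit."""
    # Step 1: the email subject often survives as the commit summary.
    wanted = normalize_subject(commit.subject)
    hits = [p for p in patches if normalize_subject(p.subject) == wanted]
    if hits:
        return "subject", hits
    # Step 2: only compare diffs against patches from the same author.
    hits = [p for p in patches
            if p.author == commit.author and diffs_match(p.diff, commit.diff)]
    if hits:
        return "author", hits
    # Step 3: only compare diffs against patches touching the same files.
    hits = [p for p in patches
            if p.files & commit.files and diffs_match(p.diff, commit.diff)]
    if hits:
        return "files", hits
    return None, []
```

The sketch scans the whole patch list at each step for brevity; with 1.4 million patches, a real tool would presumably look candidates up in prebuilt maps keyed by subject, author, and file name, so that the expensive diff comparison only runs on a handful of patches per commit.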
The results for different kernel directories vary fairly widely. Some, like mm, kernel, tools, and virt can match 60-90% of the commits. Those are likely to be subtrees where the email subject winds up in the commit, he said. On the other end of the scale, the net directory has less than 30% matches; it turns out that Patchwork does not track the relevant mailing list. Other subtrees fall somewhere in between.
There are a number of limitations of the current system. For one thing, it only tracks the mainline; there may be other trees of interest. It is dependent on Patchwork, which is a great resource but only goes back to 2009, so some data is missing. The mbox format of the data can be inconsistent; there are patches with dates from 1970 and 2040, for example.
For the future, the plan is to make the match data available through other means, such as via a REST interface. In addition, running an instance of Patchwork in-house would allow extracting more data from other lists and perhaps going further back in time. Adding tracking for linux-next, improving the algorithm to do incremental processing, and handling patch series that are discussed in multiple threads are all on the radar. Courouble is also interested in getting feedback and ideas for other features from kernel developers.
A few of those were offered up in the Q&A. Someone suggested that it would be nice to get the "0/X" patch associated with the thread. Courouble seemed surprised to hear that Patchwork did not track those, but thought it should be added. Other suggestions were to provide guidelines on how patches move from the mailing list into the Git repositories, so that they can be more easily tracked, or perhaps to add Git patch IDs into the mix.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]
Building the kernel with Clang
Over the years, there has been a persistent effort to build the Linux kernel using the Clang C compiler that is part of the LLVM project. We last looked in on the effort in a report from the LLVM microconference at the 2015 Linux Plumbers Conference (LPC), but we have followed it before that as well. At this year's LPC, two Google kernel engineers, Greg Hackmann and Nick Desaulniers, came to the Android microconference to update the status; at this point, it is possible to build two long-term support kernels (4.4 and 4.9) with Clang.
Desaulniers began the presentation by answering the most commonly asked question: why build the kernel with Clang? To start with, the Android user space is all built with Clang these days, so Google would like to reduce the number of toolchains it needs to support. He acknowledged that it is really only a benefit to Google and is "not super useful" elsewhere. But there are other reasons that are beneficial to the wider community.
There are some common bugs that often pop up in kernel code, especially out-of-tree code like the third-party drivers that end up in Android devices. The developers are interested in using the static analysis available in Clang to spot those bugs, but the kernel needs to be built using Clang to do so. There are also a number of dynamic-analysis tools that can be used like the various sanitizers (e.g. AddressSanitizer or ASan) and their kernel equivalents (e.g. KernelAddressSanitizer or KASAN).
Clang provides a different set of warnings than GCC does; looking at those will result in higher quality code. It is clearly beneficial to all kernel users to have fewer bugs in it. There are some additional tools that are planned using Clang. One is a control-flow-analysis tool that could enumerate valid stack frames at compile time; those could be checked at run time to eliminate return-oriented programming (ROP) attacks. There is also work going on for link-time optimization (LTO) and profile-guided optimization (PGO) for Clang, which could provide better execution speed, especially for hot paths.
Building code with another compiler is a good way to shake out code that relies on undefined behaviors. Since the language specification does not define certain behaviors, compiler developers can choose whatever is convenient. That choice could change, so even a GCC upgrade might cause misbehavior if some kernel code is relying on undefined behavior. The hope, Desaulniers said, is that both the kernel and LLVM/Clang can improve their code bases from this effort. The kernel is a big project with a lot of code that can find bugs in the compiler; in fact, it already has.
Greg Kroah-Hartman said that "competition is good"; he was strongly in favor of the effort. Desaulniers was glad to hear that as he and others were worried that the tight coupling with GCC was being protected by the kernel developers. Kroah-Hartman said that there have been other compilers building the kernel along the way. Behan Webster also pointed to all of the new features that have come about in GCC over the past five years as a result of the competition with LLVM. Kroah-Hartman said that he wished there was a competitor to the Linux kernel.
Hackmann related the state of the upstream kernel: "we are very close to having a kernel that can be built with Clang". It does require using a recent Clang that has some fixes, but the x86_64 and ARM64 kernels can be built, though each architecture has one out-of-tree patch that needs to be applied to do so. There is also one Android-specific Kbuild change that is needed, but only if the Android open-source project (AOSP) pre-built toolchain is being used.
As announced on the kernel mailing list, there are patches available for the 4.4 and 4.9 kernels. There are also experimental branches of the Android kernels for 4.4 and 4.9 available from AOSP. More details can be found in the slides [PDF]. Those branches had just been pushed a few days earlier, Hackmann said, and the HiKey boards were able to build and boot that code shortly thereafter.
There have been LLVM bugs found in the process, though most of them have been fixed at this point, Desaulniers said. The initial work was done with LLVM 4.0, but they have since updated to 5.0 and are also building with the current LLVM development tree (which will become 6.0). You can probably build the kernel with 4.0, he said, but it will be much slower than building with 5.0 or later.
There are still some outstanding issues. Variable-length arrays as non-terminal fields in structures are not supported by Clang, there is a GNU C extension for inline functions that is not supported, and the LLVM assembler cannot be used to build the kernel. Hackmann noted that the GNU assembler is too liberal in what it accepts.
This work has shown that the FUD surrounding using a new toolchain for the kernel is unfounded, Desaulniers said. It is working now, but there are a few asterisks. Clang, the front end, can compile the kernel, but the assembler and the linker from GNU Binutils are needed to complete the build process.
Next up is figuring out how to do automated testing of LLVM and the kernel. Currently, the team is working with two specific LTS kernel branches and using specific LLVM versions. So he can't quite say that Clang will build any kernel, since there are so many different configuration options. A bot to check whether kernel patches will fail to build under Clang is in the works as well. An audience member noted that kernelci.org is looking at adding other compilers to its build-and-boot testing.
Hackmann and Desaulniers encouraged others to try building using Clang. All it takes is a simple "make CC=clang" on a properly equipped system. We are, it seems, quite close to having a two-compiler world for the Linux kernel.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]
Building an ARM64 laptop
Processors based on the 64-bit ARM architecture have been finding their way into various types of systems, including mobile handsets and servers. There is a distinct gap in the middle of the range, though: there are no ARM64 laptops. Bernhard Rosenkränzer and a group of colleagues set out to change that situation by building such a laptop from available components. He showed up at the 2017 Open Source Summit North America to present the result.
He started by addressing the question of why one would want to build an ARM64 laptop in the first place. The ARM architecture is known for low power use (a useful feature in a laptop in its own right) but there is more to the ARM story than that; the ARM64 chips are fast and can beat single-core Intel Core-M processors on some benchmarks. An ARM64 laptop may not be good for fast kernel builds, but it can do what most people need, and it can do the kernel builds too in the end. ARM processors need no fans, meaning that the resulting laptop is lighter and will not burn the user's legs. There is little or no malware targeting ARM64 systems, for now at least.
Some of the motivations for this device may not resonate for people outside of the ARM ecosystem; Rosenkränzer talked about the ability to do native ARM64 builds rather than cross-compiling on an Intel system, for example. This system is also good for testing ARM64 hardware in the desktop setting — something that isn't really happening. An advantage that many users will appreciate, instead, is the potential to build a more open system. ARM64-based systems don't need any "secret BIOS" or mysterious management engines. That potential isn't fully realized yet, though, because a binary-only bootloader is still needed.
Given that there is value in an ARM64 laptop, he asked, why is nobody doing it? He didn't really answer the question, but did note that there are a couple of Chromebooks based on ARM64 chips. They are a good start, but Chromebooks are not a good substitute for a general-purpose desktop system. Their storage is limited, there is no SATA port, the keyboard is minimal, and the 12-inch screen is small. The pi-top is an interesting system, great for embedded use, but is also not well suited to desktop use.
It would be nice to avoid building an entire laptop from scratch, so the idea of replacing the insides of an x86-based system with an ARM64 board has some appeal. It can be done, he said, but this method doesn't scale. There is also the problem that no two laptops are the same inside, so this solution would be tied to a specific laptop model.
In the end, they went back to the pi-top, which does have some advantages. It's not bolted shut, so it's easy to modify. The display is connected via HDMI, and the touchpad is a USB device, meaning that they can be plugged into a different board with relatively little trouble. So could the Raspberry Pi processor in the pi-top be replaced with something more powerful? The answer is "yes", with the proviso that the power supply must be modified to be able to drive a more power-hungry processor. They used a Mediatek X20 board to build a proof-of-concept laptop. It works, but it still lacks adequate storage and has too little RAM.
To do this right, he said, requires a board with a fast ARM64 processor, at least an A72. Naturally, it should have a GPU that does not require binary-only drivers, so no Mali or PowerVR graphics. It needs a SATA or M.2 port so it can provide a decent amount of storage, and enough USB ports that some of them can be made available to the user. Lots of RAM is needed, and HDMI output would be nice. And, of course, the board providing all of this needs to be affordable.
Such a board is not really available at the moment. The 96Boards systems are short on storage options. Only the DragonBoard 410c has open graphics, but it lacks RAM and has an underpowered (for this task) CPU. The Tegra boards are more capable, but they are expensive and he does not trust NVIDIA to keep them open. The Banana Pi and ODROID boards lack performance. The i.MX8 might be interesting, he said; it has a Vivante GPU that can be driven by the free etnaviv driver, but it's not yet available.
An alternative might be to use a networking board; the MacchiatoBIN has some appeal, for example. It has a quad A72 processor, DDR4 DIMM memory, three SATA connectors, and a PCIe port that can be used for a graphics card. Unfortunately, that PCIe port lacks an aperture large enough to run a proper graphics adapter, so it won't work either.
Finally, though, they found the OpenQ 820. It has a four-core CPU, an Adreno 530 GPU that is "mostly supported" by the freedreno driver, 3GB of RAM, 32GB of UFS storage to hold the operating system, and two PCIe slots. Naturally, there is a snag: it only runs Android and doesn't support "real" Linux. But, Rosenkränzer said, they don't plan to use any of the Android binary-only drivers, so perhaps it could be made to work.
Getting there took a bit of effort. The bootloader for this system wants Android sparse filesystem images; fortunately, the Android Open Source Project has the tools needed to create those. In particular, its make_ext4fs tool can be made to do the job after a few patches are applied; the modified version can be found on GitHub. With that work done, they have an ARM64-based laptop that can boot and run. After fixing a few freedreno bugs, they have KDE running on it, and all of the operating-system packages have been ported. Battery support is still missing, so the laptop must stay plugged in, but that support is almost ready.
As always, there is more to be done. They would like to get away from using the pi-top HDMI interface; if they switched to MIPI the system would use less power, provide higher resolution, and have the HDMI port available for external connections. The problem is that MIPI displays only come in small sizes for now. They would also like to have a better case; the board doesn't quite fit in the pi-top case. Manufacturers are not cooperating, though; they want million-unit volume before they put any effort into it.
The cost of the system is about $600; there are hopes that it can be lowered in the future. Rosenkränzer also hopes that future boards will work better for this application, though he couldn't talk about any specific possibilities on the horizon. He concluded by saying that it's always likely to be some work to make a system. Nobody is trying to build commercial ARM64-based laptops, so it is unlikely that any boards oriented toward that use will appear in the market.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the Open Source Summit.]
The rest of the 4.14 merge window
As is sometimes his way, Linus Torvalds released 4.14-rc1 and closed the merge window one day earlier than some might have expected. By that time, though, 11,556 non-merge changesets had found their way into the mainline repository, so there is no shortage of material for this release. Around 3,500 of those changes were pulled after the previous 4.14 merge-window summary; read on for an overview of what was in that last set.
User-visible changes include:
- The disk quota subsystem has received a fair amount of performance work, increasing file-creation performance in ext4 (with quotas enabled) by a factor of about two.
- The CIFS filesystem can now read and write extended attributes when using the SMB3 protocol.
- The heterogeneous memory management patch set has finally been merged. HMM exists to enable the use of devices with their own memory-management units that can access main memory. See this changelog for an overview of the current state of this work.
- The namespaced file capabilities patch has been merged. This allows file capabilities to be used within user namespaces. The scheme has been simplified somewhat, though, so that only a single security.capability extended attribute can exist for any given file.
- The kernel has gained support for the zstd compression algorithm; it claims better performance and a better compression ratio. See this changelog for an overview.
- The new IOCB_NOWAIT flag can be used to specify that asynchronous buffered block I/O operations should be as close to non-blocking as possible. Without this flag (or in current kernels) these operations can block on memory-management operations, for example.
- The entire firmware subtree has been removed from the kernel repository. The firmware that everybody actually uses has been maintained in its own repository for some years now, so these files were unused and unneeded.
- New hardware support includes: Rockchip RK805 power-management chips, STMicroelectronics STM32 low-power timers, ROHM BD9571MWV power-management chips, TI TPS68470 power-management chips, Altera/Intel mSGDMA engines, Mediatek MT2712 and MT7622 PCI host controllers, Spreadtrum SC9860 I2C controllers, NXP i.MX6/7 remote processors, Maxim MAX17211/MAX17215 fuel gauges, Intel PCH/PCU SPI controllers, Qualcomm "B" family I/O memory-management units, Synopsys HSDK reset and clock controllers, ZTE ZX pulse-width modulators, Socionext UniPhier thermal monitoring units, Realtek RTD129x realtime clocks, Synopsys AXS10X SDP Generic PLL clocks, Renesas R-Car USB 2.0 clocks, Allwinner R40, A10, and A20 clock controllers, STMicroelectronics STM32H7 reset and clock controllers, Altera Soft IP I2C interfaces, STMicroelectronics STM32F7 I2C controllers, Lantiq XWAY SoC RCU reset controllers, Lantiq XWAY SoC RCU USB 1.1/2.0 PHYs, and PWM-controlled vibrators.
Changes visible to kernel developers include:
- The structure-layout randomization plugin for GCC will now automatically rearrange the members of structures that consist entirely of function pointers.
- The build system has long used gperf to generate hashes. As of 4.14, though, that is no longer true. The gperf 3.1 release broke that generation and, rather than fix it in a version-independent way, Linus simply ripped it out.
- The new memset16(), memset32(), and memset64() functions can be used to set a range of memory to an integer value.
- dma_alloc_noncoherent() has been removed; dma_alloc_attrs() should be used instead.
Now it is just a matter of finding and fixing the various bugs that were inevitably introduced with all of those changes. If that process goes well and follows the usual schedule, the final 4.14 release can be expected on November 5 or 12.
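The two candidate dates follow from simple calendar arithmetic. The sketch below assumes the usual cadence of one release candidate per week, published on Sundays, with either seven or eight release candidates before the final release; that cadence is an assumption about the schedule rather than something stated above, and `RC1_SUNDAY` and `final_release()` are names made up for the illustration.

```python
from datetime import date, timedelta

# 4.14-rc1 came out on Saturday, September 16; the "expected" date was the
# following Sunday, which anchors the weekly cadence assumed here.
RC1_SUNDAY = date(2017, 9, 17)


def final_release(last_rc):
    """Date of the final release if the cycle ends after -rc<last_rc>.

    -rcN lands N-1 weeks after -rc1, and the final release follows one
    week after the last release candidate.
    """
    return RC1_SUNDAY + timedelta(weeks=last_rc)


for last_rc in (7, 8):
    print(f"last candidate -rc{last_rc}: final release on {final_release(last_rc)}")
```

A cycle ending at -rc7 gives November 5, while one needing an -rc8 gives November 12.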
Notes from the LPC scheduler microconference
The scheduler workloads microconference at the 2017 Linux Plumbers Conference covered several aspects of the kernel's CPU scheduler. While workloads were on the agenda, so were a rework of the realtime scheduler's push/pull mechanism, a distinctly different approach to multi-core scheduling, and the use of tracing for workload simulation and analysis. As the following summary shows, CPU scheduling has not yet reached a point where all of the important questions have been answered.
Workload simulation
First up was Juri Lelli, who talked about the rt-app tool that tests the scheduler by simulating specific types of workloads. Work on rt-app has been progressing; recent additions include the ability to model mutexes, condition variables, memory-mapped I/O, periodic workloads, and more. There is a new JSON grammar that can be used to specify each task's behavior. Rt-app now generates a log file for each task, making it easy to extract statistics on what happened during a test run; for example, it can be used to watch how the CPU-frequency governor reacts when a "small" task suddenly starts requiring a lot more CPU time. Developers have been using it to see how scheduler patches change the behavior of the system; it is a sort of unit test for the scheduler.
In summary, he said, rt-app has become a flexible tool for the simulation of many types of workloads. The actual workloads are generated from traces taken when running the use cases of interest. It is possible (and done), he said, to compare the simulated runs with the original traces. The rt-app repository itself has a set of relatively simple workloads, modeling browser and audio playback workloads, for example. New workloads are usually created when somebody finds something that doesn't work well and wants a way to reproduce the problem.
Rafael Wysocki said that he often has trouble with kernel changes that affect CPU-frequency behavior indirectly. He would like to have a way to evaluate patches for this kind of unwanted side effect, preferably while patches are sitting in linux-next at the latest and are relatively easy to fix. Might it be possible to use rt-app for this kind of testing? Josef Bacik added that his group (at Facebook) is constantly fixing issues that come up with each new kernel; it gets tiresome, so he, too, would like to find and fix these problems earlier.
He went on to state that everybody seems to agree that rt-app is the tool for this job. There would be value in having all developers pool their workloads into a comprehensive battery of tests, but where is the right place to put them? The rt-app project itself isn't necessarily the right place for all these workloads, so it would seem that a different repository is indicated. The LISA project was suggested as one possible home. Steve Rostedt, however, suggested that these workloads could just go into the kernel tree directly. Lelli asked whether having rt-app as an external dependency would be acceptable; Rostedt said that it would.
Wysocki wondered about how the community could ensure that these workloads get run on a wide variety of hardware; it's not possible to emulate everything. Bacik replied that it's not possible to test everything either, and that attempts to do so would be a case of the perfect being the enemy of the good. If each interested developer runs the tests on their own system, the overall coverage will be good enough. It's an approach that works out well for the xfstests filesystem-testing suite, he said.
Peter Zijlstra complained that rt-app doesn't produce any direct feedback — there is no result saying whether the right thing happened or not. As a result, interpreting rt-app output "requires having a brain". Rostedt suggested adding a feature to compare runs against previous output and look for regressions. Patrick Bellasi noted, though, that to work well in this mode, the tests need to be fully reproducible; that requires care in setting up the tests. At this point, he said, rt-app is not a continuous-integration tool, but it could maybe be evolved in that direction.
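The run-comparison idea could look something like the following sketch. It is not an rt-app feature; it assumes that some per-task metric (here, a hypothetical average wakeup latency in microseconds, where higher is worse) has already been extracted from the per-task logs of two runs, and the function name and 10% tolerance are invented for the example.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Compare per-task metrics from two runs; higher values are worse.

    Returns {task: (old, new)} for every task whose metric grew by more
    than `tolerance` (a fraction) relative to the baseline run.
    """
    regressions = {}
    for task, old in baseline.items():
        new = current.get(task)
        if new is None:
            continue  # task missing from the new run; nothing to compare
        if new > old * (1 + tolerance):
            regressions[task] = (old, new)
    return regressions


# Hypothetical numbers: average wakeup latency in microseconds per task.
before = {"audio": 120.0, "browser": 800.0, "ui": 300.0}
after = {"audio": 125.0, "browser": 1100.0, "ui": 290.0}

for task, (old, new) in find_regressions(before, after).items():
    print(f"{task}: {old} -> {new}")
```

The tolerance is where Bellasi's point shows up: unless the tests are fully reproducible, run-to-run noise forces the threshold so high that real regressions slip through.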
Reworking push/pull
Rostedt gave a brief presentation on what he called a "first-world problem". When running realtime workloads on systems with a large number (over 60) of CPUs — something he said people actually want to do — realtime tasks do not always migrate between CPUs in an optimal manner. That is due to shortcomings in how that migration is handled.
When the last running realtime task on any given CPU goes idle, he said, the CPU will examine the system to see if any other CPUs are overloaded with realtime tasks. If it finds a CPU running more than one realtime process, it will pull one of those processes over and run it locally. Once upon a time, this examination was done by grabbing locks on the remote CPUs, but that does not scale well. Now, instead, the idle CPU will send an inter-processor interrupt (IPI) to tell other CPUs that they can push an extra realtime task in its direction.
That is an improvement, but imagine a situation with many CPUs, one of which is overloaded. All of the other CPUs go idle more-or-less simultaneously, which does happen at times. They will all send IPIs to the busy CPU. One of them will get the extra process from that CPU but, having pushed that process away, the CPU still has to process a pile of useless IPIs. That adds extra load to the one CPU in the system that was already known to be overloaded, leading to observed high latencies — the one thing a realtime system is supposed to avoid at all costs.
The proposed solution is to have only the first (for some definition of "first") idle CPU send the IPI. That IPI can then be forwarded on to other overloaded CPUs if needed. The result would be the elimination of the IPI storms. Nobody seemed to think that this was a bad solution. It was noted that it would be nice to have statistics indicating how well this mechanism is working, and the conversation devolved quickly into a discussion of tracepoints (or the lack thereof) in the scheduler code. Zijlstra said that he broke user space once as a result of tracepoints, so he will not allow the addition of any more. This is a topic that was to return toward the end of the session.
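A toy model makes the difference between the two schemes concrete. The sketch below only counts IPIs received by overloaded CPUs; it is a deliberately simplified illustration rather than the scheduler's logic, and both function names are made up.

```python
def ipis_every_idle_cpu(idle_cpus, overloaded_cpus):
    """Current scheme: each newly idle CPU interrupts every overloaded CPU."""
    received = {cpu: 0 for cpu in overloaded_cpus}
    for _ in idle_cpus:
        for target in overloaded_cpus:
            received[target] += 1
    return received


def ipis_first_idle_only(idle_cpus, overloaded_cpus):
    """Proposed scheme: only the first idle CPU sends an IPI, which is then
    forwarded from one overloaded CPU to the next."""
    received = {cpu: 0 for cpu in overloaded_cpus}
    if idle_cpus:
        for target in overloaded_cpus:
            received[target] += 1
    return received


# 64 CPUs: CPU 0 is overloaded and the other 63 go idle at about the same time.
idle = list(range(1, 64))
print(ipis_every_idle_cpu(idle, [0]))   # the busy CPU fields 63 interrupts
print(ipis_first_idle_only(idle, [0]))  # the busy CPU fields a single interrupt
```

In the first scheme the interrupt load on the already-overloaded CPU grows with the number of idle CPUs; in the second it stays constant.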
Multi-core scheduling
Things took a different turn when Jean-Pierre Lozi stood up to talk about multi-core scheduling issues. Lozi is the lead author of the paper called The Linux Scheduler: a Decade of Wasted Cores [PDF], which described a number of issues with the CPU scheduler. An attempt to go through a list of those bugs drew a strong response from Zijlstra, who claimed that most of them were fixed some time ago.
The biggest potential exception is "work conservation" — ensuring that no task languishes in a CPU run queue if there are idle CPUs available elsewhere in the system. On larger systems, CPUs will indeed sit idle while tasks wait, and Zijlstra said that things will stay that way. When there are a lot of cores, he said, ensuring perfect work conservation is unacceptably expensive; it requires the maintenance of a global state and simply does not scale. Any cure would be worse than the disease.
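The work-conservation property itself is easy to state in code, which also shows why it is expensive to enforce. The following sketch uses a hypothetical function and data layout; it checks a snapshot of per-CPU run-queue lengths for idle CPUs that coexist with waiting tasks.

```python
def work_conservation_violations(run_queues):
    """Check a snapshot of {cpu: number_of_runnable_tasks}.

    Returns (idle_cpus, waiting), where `waiting` maps each busy CPU to the
    number of tasks queued behind the one it is running. Both are empty
    when the snapshot is work-conserving.
    """
    idle_cpus = [cpu for cpu, n in run_queues.items() if n == 0]
    waiting = {cpu: n - 1 for cpu, n in run_queues.items() if n > 1}
    if idle_cpus and waiting:
        return idle_cpus, waiting
    return [], {}


# CPU 2 sits idle while two tasks wait behind the one running on CPU 0.
print(work_conservation_violations({0: 3, 1: 1, 2: 0, 3: 1}))
```

Note that the check needs a consistent view of every run queue at once. That is the global state that Zijlstra described as too costly to maintain on systems with many cores.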
Lozi's proposed solution is partitioning the system, essentially turning it into a set of smaller systems that can be scheduled independently. Zijlstra expressed skepticism that such an idea would be accepted, leading to a suggestion from Lozi that this work may not be intended for the mainline kernel. There was a quick and contentious discussion on the wisdom of the idea and whether it could already be done using the existing CPU-isolation mechanisms. Mathieu Desnoyers eventually intervened to pull the discussion back to its original focus: a proposal to allow the creation of custom schedulers in the kernel. This work is based on Bossa, which was originally developed some ten years ago. It uses a domain-specific language to describe a scheduler and how its decisions will be made; different partitions in the system could then run different scheduling algorithms adapted to their specific tasks.
It was pointed out that the idea of enabling loadable schedulers was firmly shot down quite a few years ago. Even so, there was a brief discussion on how they could be enabled. The likely approach would be to create a new scheduler class that would sit below the realtime classes on the priority scale, but above the SCHED_OTHER class where most work runs. The discussion ran out of time, but it seems likely that this idea will eventually return; stocking up on popcorn for that event might be advisable. Zijlstra, meanwhile, insists that the incident with the throwable microphone was entirely accidental.
Workload analysis with tracing
Desnoyers started the final topic of the session by stating that there is a real need for better instrumentation of the scheduler so that its decisions can be understood and improved. He would like exact information on process switches, wakeups, priority changes, etc. It is important to find a way to add this information without creating ABI issues that could impede future scheduler development. That means that the events have to be created in a way that allows them to evolve without breaking user-space tools.
Zijlstra said that it would not be possible to add all of the desired information even to the existing tracepoints; that would bloat them too much. Desnoyers suggested adding a new version file to each tracepoint; an application could write "2" to it to get a new, more complete output format. Rostedt complained about the use of version numbers, though, and suggested writing a descriptive string to the existing format file instead. Zijlstra said that tracepoints should default to the newest format, but Rostedt said that would break existing tools. The only workable way is to default to the old format and allow aware tools to request a change.
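The opt-in scheme described above can be modeled in a few lines of Python. This is only a sketch of the idea, not kernel code: the `Tracepoint` class, the format names "legacy" and "extended", and the field lists are all invented for illustration. The session settled only on the principle that the old format stays the default and that aware tools ask for something else by name.

```python
from dataclasses import dataclass

# Hypothetical format names and field lists; nothing here was specified
# in the session beyond "a descriptive string" selecting a newer format.
FORMATS = {
    "legacy": ("prev_pid", "next_pid"),
    "extended": ("prev_pid", "next_pid", "prev_prio", "next_prio"),
}


@dataclass
class Tracepoint:
    name: str
    fmt: str = "legacy"  # unaware tools keep getting the old format

    def request_format(self, wanted: str) -> bool:
        """Switch formats if the requested name is known; otherwise
        leave the current format alone and report failure."""
        if wanted not in FORMATS:
            return False
        self.fmt = wanted
        return True

    def emit(self, event: dict) -> dict:
        """Return only the fields that the selected format exposes."""
        return {key: event[key] for key in FORMATS[self.fmt]}
```

A tool that never calls `request_format()` sees exactly what it saw before, while a newer tool can ask for "extended" and fall back gracefully if the request is refused.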
That was about the point where Zijlstra (semi-jokingly) declared his intent to remove all of the tracepoints from the scheduler. "I didn't want this pony", he said.
There was a wandering discussion on how it might be possible to design a mechanism that resembles tracepoints, but which is not subject to the kernel's normal ABI guarantees. Bacik said that a lot of the existing scripts in this area use kprobes; they are simply updated whenever they break. Perhaps a new sort of "tracehook" could be defined that resembles a kprobe: it is a place to hook into a given spot in the code, but without any sort of predefined output. A function could be attached to that hook to create tracepoint-style output; it could be located in a separate kernel module or be loaded as an eBPF script. Either way, that glue code would be treated like an ordinary kernel module; it is using an internal API and, as a result, must adapt if that API changes.
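As a rough illustration of the "tracehook" idea, the Python model below separates the hook (a bare call site with no output format of its own) from the glue code that turns internal state into tracepoint-style text. Every name in it (`TraceHook`, `attach()`, `fire()`, `make_switch_formatter()`, and the `prev_task`/`next_task` arguments) is hypothetical; no such interface was actually defined in the session.

```python
from typing import Callable, List


class TraceHook:
    """A named spot in the code with no predefined output format."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._handlers: List[Callable[..., None]] = []

    def attach(self, handler: Callable[..., None]) -> None:
        """Register glue code (a module or script in the real proposal)."""
        self._handlers.append(handler)

    def fire(self, **internal_state) -> None:
        """Pass whatever internal state exists today to each handler."""
        for handler in self._handlers:
            handler(**internal_state)


def make_switch_formatter(sink: List[str]) -> Callable[..., None]:
    """Glue code: it alone decides what the output looks like, and it
    must be updated if the hook's internal arguments ever change."""

    def handler(prev_task, next_task, **_unused) -> None:
        sink.append(f"switch prev={prev_task} next={next_task}")

    return handler
```

The point of the split is that the hook itself promises nothing about output; if the arguments passed to `fire()` change, only the attached glue code breaks, much as an out-of-date kernel module would.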
The developers in the room seemed to like this idea, suggesting even that it might be a way to finally get some tracepoints into the virtual filesystem layer — a part of the kernel that does not allow tracepoints at all. Your editor was not convinced, though, that the ABI issues could be so easily dodged. As the session ended, it was resolved that the idea would be presented to Linus Torvalds for a definitive opinion, most likely during the Maintainers Summit in October.
[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to LPC 2017.]
Testing kernels
New kernels are released regularly, but it is not entirely clear how much in-depth testing they are actually getting. Even the mainline kernel may not be getting enough of the right kind of testing. That was the topic for a "birds of a feather" (BoF) meeting at this year's Linux Plumbers Conference (LPC) held in mid-September in Los Angeles, CA. Dhaval Giani and Sasha Levin organized the BoF as a prelude to the Testing and Fuzzing microconference they were leading the next day.
There were representatives from most of the major Linux distributors present in the room. Giani started things off by asking how much testing is being done on the stable kernels by distributors. Are they simply testing their own kernels and the backports of security and other fixes that come from the stable kernels? Beyond the semi-joking suggestion that testing is left to users, most present thought that there was little or no testing (beyond simple build-and-boot testing) of the stable kernels.
Part of the problem is that it is difficult to know what to choose in order to test "the upstream kernel". The linux-next tree is a moving target, as are the stable trees (and the mainline itself). But there is value in finding bugs before they make it into a release, so some distributors are starting to test the ‑rc1 kernels. Bugs found that way can be fixed before the release, but it takes a lot of hours and machines to do that well. There is also a question of which kernel configurations to test.
The upstream testing that is done for the mainline and stable kernels is fairly limited. There is a lot of it being done, but it doesn't go all that deep into kernel functionality. It takes Red Hat a year to stabilize the kernel chosen for a RHEL release; roughly 300 engineers work on that task, meaning it takes the company 300 person-years to test and harden a kernel.
Boot testing is well covered by various upstream testing efforts, so newly released kernels will boot. The majority of bugs that are found in those (or any other) kernels are in the drivers; the only people testing the drivers are those that have the hardware. The core itself is pretty safe, it is believed, and things like Ftrace, the scheduler, and memory management are used widely, so they get a fair amount of testing. Other, non-core or less popular functionality may not see much functional testing.
Red Hat has a large testing lab with something like 6000 machines of various sorts all over the world. It uses Beaker and tests lots of different kernel configurations. It currently runs tests on three RHEL kernels and two Fedora kernels, though there are plans to add the mainline releases. Different teams focus on drivers specific to their area of interest, so the storage teams test various storage devices, while the RDMA team tests that type of hardware.
The main problem is that it takes a lot of effort to analyze the bugs found with the tests. Any crashes that happen could simply be posted to the kernel mailing list as regressions, as was suggested, but even that takes some amount of triage and requires reducing the code to a reproducible test case.
Others who might want to test the drivers may be stymied by the lack of availability (or the cost) of the hardware. It also requires a good understanding of exactly what the device is supposed to do. Ideally, the driver writers would be testing the devices—generally they do—but even that is not a complete solution. Driver writers try to make sure the driver works for their use cases, but they have differing motivations depending on whether they are being paid to write it or simply doing so to support hardware they have, often with little or no documentation.
There are also performance regressions that need to be found in new kernels. That is a difficult problem to solve since "random performance testing" does not really help. There is a need to put together some guidelines for performance tests, so that an apples-to-apples comparison can be done.
The biggest problem for all of these testing efforts is resources. More people and more machines are needed in order to find and fix the bugs sooner.
Stable
The conversation then turned toward the stable kernels. There is a need to stop bugs from entering the kernel; if the mainline were perfect, there would be no need for the stable trees. Perfection is not possible, of course, but do the distributions even use the stable trees?
It turns out that Red Hat only cherry-picks fixes from the stable trees. Each minor release of RHEL has 8-10 thousand patches on top of its kernel, all of which have come from upstream. The RHEL kernel team looks at the stable trees and the latest mainline kernel to find fixes that should be applied. The amount of testing done varies based on which subsystem the patch applies to; some subsystems have a good track record on providing working patches, others less so. Generally, Red Hat only builds the stable kernels to test them against the RHEL kernel to see if a bug is from upstream or was introduced in RHEL.
SUSE does build the stable kernels, but also cherry-picks patches for inclusion. Stable kernel testing could be added to its testing grid, but it is not clear what value that would have for the company. Ubuntu is similar; other than building the stable kernels, there is no formal testing of them.
So the distributions generally care that the fixes that go into stable are correct, but they are testing them in their own kernels. It was suggested that perhaps a collaborative project could be put together by the Linux Foundation, in cooperation with Canonical, SUSE, Red Hat, and others, to put together a set of machines with a test suite to do testing for the stable series.
Linaro is currently working on a project for Google to test the stable kernels using the kernel self-tests (kselftest) and tests from the Linux Test Project (LTP). Those tests are run for every stable release. The self-tests do find bugs, but those who are writing self-tests are probably not the ones introducing the majority of the bugs. The self-tests are just the starting point, however; Linaro intends to add more tests.
One of the difficulties is the huge number of configurations. When a stable kernel is released, it might have 100 patches, but any test suite may not exercise many, or even any, of those fixes. There is a real question of what it means to test a stable release.
The 0-Day kernel test service is also doing more than just build-and-boot tests, including performance testing. The kernelci.org project is doing build-and-boot testing on lots of different hardware, which is quite valuable, but it doesn't do any real functional testing. Things are certainly getting better, overall, and what is there now is "surprisingly better than nothing", one attendee joked.
The self-tests are typically written by kernel developers, but it takes time and effort to turn personal tests into something that can be used more widely. Drivers generally do not have self-tests, because the driver writer didn't have any time to add one. In many cases, the code is of low quality because of that, as well. So the existing self-tests are likely to be in subsystems that are already well tested, but they have found bugs on architectures that are different from what the developer normally runs. Various ARM bugs have been found that way.
LTP tests many things, but there is also plenty that it does not test. It is used by some distributions and has definitely found bugs, but there is a need for more (and better) test suites.
Benchmarks and fuzzing
Beyond that, there is also a need for more benchmarking to detect performance regressions. Mel Gorman's MMTests were mentioned as something that could be used as the basis of a "kbench" benchmarking suite. Some in the room seemed unfamiliar with that test suite, which helped point out the need for better documentation. A test suites file for the kernel documentation directory might be a start, but any benchmark is going to need in-depth documentation.
There was some thought that it would be nice to have a benchmark that boiled down to a single number that could be compared between systems (like the idea behind BogoMips). There was also a fair amount of skepticism about how possible that might turn out to be, however.
Fuzzing for stable kernels was also discussed. Fuzzing the upstream kernels is the best option, since fixes must be made there, but it can find problems in backports for distribution kernels. It turns out that the syzkaller fuzzer generates small test cases to reproduce the problems that it finds. It was agreed that those should be added to the self-tests. Some of the bugs only manifest themselves under the Kernel Address Sanitizer (KASAN), but those tests can simply be skipped as "unsupported" for kernels that are not configured for KASAN.
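A minimal sketch of that skip logic follows, in Python. It assumes that the kernel configuration is available as text (from /boot/config-* or /proc/config.gz, for example) and that the reproducer can be wrapped in a callable returning True on success; the helper names `config_enabled()` and `run_kasan_reproducer()` are invented here. The exit codes follow the kselftest convention, where 4 means "skipped".

```python
# Exit codes in the kselftest convention (see kselftest.h in the kernel tree).
KSFT_PASS = 0
KSFT_FAIL = 1
KSFT_SKIP = 4


def config_enabled(config_text: str, option: str) -> bool:
    """Return True if the option is built in ("OPTION=y") in the config.

    Lines such as "# CONFIG_KASAN is not set" do not match.
    """
    return any(line.strip() == f"{option}=y"
               for line in config_text.splitlines())


def run_kasan_reproducer(config_text: str, reproducer) -> int:
    """Run a KASAN-dependent reproducer, or skip it when KASAN is off.

    `reproducer` is a hypothetical callable standing in for a
    syzkaller-generated test case; it returns True when no bug is seen.
    """
    if not config_enabled(config_text, "CONFIG_KASAN"):
        return KSFT_SKIP
    return KSFT_PASS if reproducer() else KSFT_FAIL
```

Reporting a skip rather than a pass keeps results honest: on a kernel without KASAN, the bug could be present and simply go undetected.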
More and more self-tests are being added to the mainline, but the stable kernels don't benefit from those. Some are running the latest self-tests with older kernels, but there was some thought that perhaps the self-tests themselves should be backported into the stable trees.
As the BoF wound down, Levin asked that distribution maintainers push the patches they are using to the stable trees. It is not uncommon to find a fix in a distribution kernel that should be in stable as well. He has been working on training a neural net to recognize stable-eligible patches, which elicited some laughter, but he said it is actually "going surprisingly well".
One person who was not at the BoF, but has a vested interest in what was being discussed, is stable maintainer Greg Kroah-Hartman. He got a chance to offer some of his opinions in the microconference, which opened with a short session where Levin replayed what was discussed in the BoF.
As Levin said, there were a number of points raised in the BoF without much, if any, resolution of those problems. Someone spoke up to suggest that more hardware be given to the kernelci.org project, but Kroah-Hartman would also like to see more functional testing. It may make sense for the Linaro and kernelci.org efforts to join forces, though, someone said.
Kroah-Hartman has no objection to the idea of backporting self-tests as long as they will run on the kernel in question. He agreed that it would be nice for distributions to be diligent about getting their fixes to the stable tree, but noted that Fedora and Debian are already doing a good job in that area. Distributions often try to get a fix to their users quickly, then do the work to get it fixed upstream, another participant said. Kroah-Hartman said that he will often leave a bug in stable if it is not fixed in the upstream kernel, both to be "bug compatible" and to provide some pressure for it to get fixed.
It is clear that more kernel testing could be done, but it is less clear exactly what form it should take or who will actually do it. With luck, some progress on that will be made in the near future, which is likely to lead to more bugs found sooner. Perfection is impossible, of course, but an overall reduction in kernel bugs is something we can all hope for.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]
Page editor: Jonathan Corbet
