LWN.net Weekly Edition for January 4, 2018
Welcome to the LWN.net Weekly Edition for January 4, 2018
This edition contains the following feature content:
- Welcome to 2018: our unlikely and unreliable predictions for 2018, some of which have already come true.
- Notes from the Intelpocalypse: a first look at the recently disclosed CPU vulnerabilities.
- Future directions for PGP: a survey of GnuPG and its alternatives.
- Statistics for the 4.15 kernel: where the code in the 4.15 kernel came from.
- An introduction to the BPF Compiler Collection: a first look at the tools that make BPF programming easier.
- A Modularity rethink for Fedora: a new direction for this initiative from the Fedora project.
- Varlink: a protocol for IPC: a new proposed inter-process communication mechanism for Linux.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Welcome to 2018
Welcome to the first LWN.net Weekly Edition for 2018. The holidays are over and it's time to get back to work. One of the first orders of business here at LWN is keeping up with our ill-advised tradition of making unlikely predictions for the coming year. There can be no doubt that 2018 will be an eventful and interesting year; here's our attempt at guessing how it will play out.
The image of the technology industry as a whole suffered in 2017, and that process is likely to continue this year as well. That should lead to an increased level of introspection that will certainly affect the free-software community. Many of us got into free software to, among other things, make the world a better place. It is not at all clear that all of our activities are doing that, or what we should do to change that situation. Expect a lively conversation on how our projects should be run and what they should be trying to achieve.
Some of that introspection will certainly carry into projects related to machine learning and similar topics. There will be more interesting AI-related free software in 2018, but it may not all be beneficial. How well will the world be served, for example, by a highly capable, free facial-recognition system and associated global database? Our community will be no more effective than anybody else at limiting progress of potentially freedom-reducing technologies, but we should try harder to ensure that our technologies promote and support freedom to the greatest extent possible.
Our 2017 predictions missed the fact that an increasing number of security problems are being found at the hardware level. We'll not make the same mistake in 2018. Much of what we think of as "hardware" has a great deal of software built into it — highly proprietary software that runs at the highest privilege levels and which is not subject to third-party review. Of course that software has bugs and security issues of its own; it couldn't really be any other way. We will see more of those issues in 2018, and many of them are likely to prove difficult to fix.
As a result of these problems and more, interest in open hardware will grow in 2018. For many of us, creating our own hardware looks like an impossibly challenging task; we should remember that creating our own operating system looked just as difficult in 1983 when the GNU project was announced. If we truly want control over our computers, we need control over the hardware too.
At a higher level, we'll see more companies working to produce hardware that at least has a chance of being under its owner's control. The production of systems with the Intel management engine disabled and the Purism Librem 5 handset are examples of what is coming. These products will only continue to exist, though, if they find customers, and it is not yet clear that enough people are willing to pay a premium for this kind of hardware.
Linux virtualization and container technology is at the base of the entire cloud-computing ecosystem. Despite our community's steady stream of security problems, that base has generally held firm, and there have been few significant security issues at the large cloud providers — that we know of, anyway. With luck, that will still be true at the end of the year, but it seems inevitable that things will go wrong at some point. A serious compromise of a cloud provider could challenge faith in cloud computing in general and adversely affect a number of Linux-related businesses.
The battle to control various aspects of the cloud-computing ecosystem will continue on both the commercial and community levels. The early prominence of Docker seems to have provoked an immune response at both levels, with the result that alternative runtime systems will likely see increasing adoption over the year. One sure prediction is that we'll still be tired of hearing about containers at the end of the year.
Blockchain hype may continue, but much of the world will have realized by the end of the year that a blockchain is just another useful data structure. It's easy to predict that there will be no end of weirdness around cryptocurrencies, but we're not silly enough to try to predict what that weirdness will look like.
Mainline (or near-mainline) kernels will become more widely deployed this year. It would appear that mobile-device chip vendors are trying to close the gap between the kernels they ship and the mainline, a most welcome development. If the kernel page-table isolation patches turn out to be as important as it seems they might, there is going to be a lot of painful backporting work to do — work that may well bring distributors around to the idea that shipping old, custom kernels isn't necessarily the best practice.
Alternatives to the Linux kernel will grow in prominence this year, with Google's Fuchsia perhaps leading the pack. The Linux community has become somewhat complacent in its dominance of much of the computing landscape, but that complacency should have been taken down a notch by the revelation that billions of Intel processors include an internal CPU running Minix. There are an increasing number of situations that call for a small, lightweight kernel, and the Linux community has not worked that hard to make the kernel fit into such settings. It is also natural for companies to want to have an alternative to fall back on just in case, and if that alternative is under their control and permissively licensed, so much the better. The dominance of Linux will not be seriously challenged in 2018, but there will be more nibbling around the edges.
Some longstanding projects will come into their own this year. The Wayland compositor has been successfully deployed in a few distributions for a while now, but the underlying applications still think they are talking to an X server. Native Wayland support should move into production in the desktop environments, and the long reign of the X Window System will approach its end. Meanwhile, Python 3 adoption appears to have reached the turning point; by the end of 2018, Python 2 applications will start to look seriously old. Perl 6 will have to wait a little longer, though.
People will still try to pick fights around systemd. Meanwhile, most of the rest of us will simply carry on using it and focus on more important things.
Finally, LWN will complete its 20th year of publication at the end of January — not bad for an enterprise that started as a side project for a would-be consulting and training company. It has been a long and strange trip indeed, but sometimes it still feels like we're just getting started. We'll be here through 2018 and beyond, keeping an eye on the Linux and free-software development communities. Thanks to all of you for supporting us for so long; best wishes for 2018 from the entire LWN crew.
Notes from the Intelpocalypse
Rumors of an undisclosed CPU security issue have been circulating since before LWN first covered the kernel page-table isolation patch set in November 2017. Now, finally, the information is out — and the problem is even worse than had been expected. Read on for a summary of these issues and what has to be done to respond to them in the kernel.
All three disclosed vulnerabilities take advantage of the CPU's speculative execution mechanism. In a simple view, a CPU is a deterministic machine executing a set of instructions in sequence in a predictable manner. Real-world CPUs are more complex, and that complexity has opened the door to some unpleasant attacks.
A CPU is typically working on the execution of multiple instructions at once, for performance reasons. Executing instructions in parallel allows the processor to keep more of its subunits busy at once, which speeds things up. But parallel execution is also driven by the slowness of access to main memory. A cache miss requiring a fetch from RAM can stall the execution of an instruction for hundreds of processor cycles, with a clear impact on performance. To minimize the amount of time it spends waiting for data, the CPU will, to the extent it can, execute instructions after the stalled one, essentially reordering the code in the program. That reordering is often invisible, but it occasionally leads to the sort of fun that caused Documentation/memory-barriers.txt to be written.
Out-of-order execution runs into a challenge whenever the code branches, though. The processor may not yet be able to tell which branch will be taken, so it doesn't know where to go to execute ahead of the stalled instruction(s). The answer here is "branch prediction". The processor will make a guess based on past experience with the branch in question and, possibly, explicit guidance from the code (the unlikely() directive used in kernel code, for example). Once the actual branch condition can be evaluated, the processor will determine whether it guessed right. If not, the "speculatively" executed instructions after the branch will be unwound, and everything will proceed as if they had never been run.
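In the kernel, the unlikely() directive mentioned above is a thin wrapper around GCC's __builtin_expect(); a minimal sketch of how such hints are defined and used follows (the parse_len() function is a made-up example, not kernel code):

```c
/* Branch-prediction hints in the style of the kernel's
 * include/linux/compiler.h: tell the compiler which way a branch
 * is expected to go, so it can lay out the hot path without a
 * taken jump and guide static prediction. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int parse_len(int len)
{
    if (unlikely(len < 0))   /* error path: rarely taken */
        return -1;
    return len * 2;          /* hot path */
}
```

The hint changes only code layout and prediction, never the result; mispredicting an unlikely() branch costs cycles, not correctness.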
A branch-prediction failure should really only lead to slower execution, with no visible side effects. That turns out to not be the case, though, leading to a set of severe information-disclosure vulnerabilities. In particular, speculative instruction execution can cause data to be loaded into the CPU memory cache; timing attacks can then be used to learn which instructions were executed. If speculative execution of kernel code can be controlled by an attacker, the contents of the cache can be used as a covert channel to get data out of the kernel.
Getting around boundary checks
Perhaps the nastiest of the vulnerabilities, in terms of the cost of defending against them, allows the circumvention of normal boundary checks in the kernel. Imagine kernel code that looks like this:
    if (offset < array1->length) {
        unsigned char value = array1->data[offset];
        unsigned long index = ((value&1)*0x100)+0x200;

        if (index < array2->length)  // length is < 0x300
            unsigned char value2 = array2->data[index];
    }
If offset is greater than the length of array1, the reference into array1->data should never happen. But if array1->length is not cached, the processor will stall on the test. It may, while waiting, predict that offset is within bounds (since it almost always is) and execute forward far enough to at least begin the fetch of the value from array2. Once it's clear that offset is too large, all of that speculatively done work will be discarded.
Except that array2->data[index] will be present in the CPU cache. An exploit can fetch the data at both 0x200 and 0x300 and compare the timings. If one is far faster than the other, then the faster one was cached. That means that the inner branch was speculatively executed and that, in particular, the lowest bit of value was not set. That leaks one bit of kernel memory under attacker control; a more sophisticated approach could, of course, obtain more than a lowest-order bit.
If a code pattern like the above exists in the kernel and offset is under user-space control, this kind of attack can be used to leak arbitrary data from the kernel to a user-space attacker. It would seem that such patterns exist, and that they can be used to read out kernel data at a relatively high rate. It is also possible to create the needed pattern with a BPF program — some types of which can be loaded and run without privilege. The attack is tricky to carry out, requires careful preparation of the CPU cache, and is processor-dependent, but it can be done. Intel, AMD, and ARM processors are all vulnerable (in varying degrees) to this attack.
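The mechanics of the covert channel can be illustrated with a deterministic toy model (no real speculative execution involved): the victim's access pattern touches one of two probe locations depending on the secret bit, and the attacker recovers the bit by observing which location is "hot". In a real exploit the observation would be a cache-timing measurement; here a flag array stands in for the cache, and all names are illustrative:

```c
#include <string.h>

/* Toy model: cached[i] marks whether probe line i was "loaded". */
static unsigned char cached[0x400];

/* "Victim": mirrors the gadget in the article. The secret's low
 * bit selects which probe index (0x200 or 0x300) gets touched. */
static void victim(unsigned char secret)
{
    unsigned long index = ((secret & 1) * 0x100) + 0x200;
    cached[index] = 1;          /* stands in for the cache fill */
}

/* "Attacker": probes both candidate lines; the one that is
 * "fast" (here: flagged) reveals the secret bit. */
static int recover_bit(void)
{
    if (cached[0x300])
        return 1;
    if (cached[0x200])
        return 0;
    return -1;                  /* nothing observed */
}

static int leak_bit(unsigned char secret)
{
    memset(cached, 0, sizeof(cached));
    victim(secret);
    return recover_bit();
}
```

The real attack replaces the flag array with timed memory accesses, but the encoding of secret data into *which* address gets cached is exactly this.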
There is no straightforward defense against this attack, and nothing has been merged to date. The only known technique, it would seem, is to prevent speculative execution of code within branches whose condition is under an attacker's control. That requires placing a barrier after every test that is potentially vulnerable. Some preliminary patches have been posted to add a new API for sensitive pointer references:
value = nospec_load(pointer, lower, upper);
This macro will return the value pointed to by pointer, but only if it falls within the given lower and upper bounds; otherwise zero is returned. There are a number of variants on this macro; see the documentation for the full set. This approach is problematic on a couple of counts: it hurts performance, and somebody has to find the vulnerable code patterns in the first place. Current vulnerabilities may be fixed, but there can be no doubt that new vulnerabilities of this type will be introduced on a regular basis.
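The underlying technique can be sketched as a branchless clamp: instead of relying on a conditional branch (which can be mispredicted), the index is masked to zero whenever it is out of range, so even a speculating CPU cannot be steered to an out-of-bounds address. This is a hypothetical illustration of the general approach, not the implementation from the posted patches, and it uses an array-plus-length signature rather than nospec_load()'s pointer bounds:

```c
#include <stddef.h>

/* Sketch of a speculation-resistant bounded load (illustrative
 * only, not the actual kernel macro). The bounds check is turned
 * into an arithmetic mask rather than a branch. */
static unsigned char nospec_load_sketch(const unsigned char *array,
                                        size_t index, size_t size)
{
    /* mask is all-ones when index < size, all-zeroes otherwise. */
    size_t mask = (size_t)0 - (index < size);

    /* Out of bounds: the index is forced to 0 and the loaded
     * value is zeroed, matching the documented "returns zero"
     * behavior of the proposed API. */
    return array[index & mask] & (unsigned char)mask;
}
```

Because the clamp is data flow rather than control flow, there is no mispredictable branch between the check and the load.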
Messing with indirect jumps
The kernel uses indirect jumps (calling a function through a pointer, for example) frequently. Branch prediction for indirect jumps uses cached results in a separate buffer that only keys on 31 bits of the address of interest. The resulting aliasing can be exploited to poison this cache and cause speculative execution to jump to the wrong location. Once again, the CPU will figure out that it got things wrong and unwind the results of the bad jump, but that speculative execution will leave traces in the memory cache. This issue can be exploited to cause the speculative execution of arbitrary code that will, once again, allow the exfiltration of data from the kernel.
One rather frightening aspect of this vulnerability is that an attacker running inside a virtualized guest can use it to leak data accessible to the hypervisor — all the data in the host system, in other words. That has all kinds of highly unpleasant implications for cloud providers. One can only hope that those providers have taken advantage of whatever early disclosure they got to update their systems.
There are two possible defenses in this case. One would be a microcode update from Intel that fixes the issue, for some processors at least. In the absence of this update, indirect calls must be replaced by a two-stage trampoline that will block further speculative execution. The performance cost of the trampoline will be notable, which is why Linus Torvalds has complained that the current patches seem to assume that the CPUs will never be fixed. There is a set of GCC patches forthcoming to add a flag (-mindirect-branch=thunk-extern) to automatically generate the trampolines in cases where that's necessary. As of this writing, no defenses have actually been merged into the mainline kernel.
Forcing direct cache loads
The final vulnerability runs entirely in user space, without the involvement of the kernel at all. Imagine a variant of the above code:
    if (slow_condition) {
        unsigned char value = kernel_data[offset];
        unsigned long index = ((value&1)*0x100)+0x200;

        if (index < length)
            unsigned char value2 = array[index];
    }
Here, kernel_data is a kernel-space pointer that should be entirely inaccessible to a user-space program. The same speculative-execution issues, though, may cause the body of the outer if block (and possibly the inner block if the low bit of value is clear) to be executed on a speculative basis. By checking access timings, an attacker can determine the value of one bit of kernel_data[offset]. Of course, the attacker needs to find a useful kernel pointer in the first place, but a variant of this attack can be used to find the placement of the kernel in virtual memory.
The answer here is kernel page-table isolation, which makes kernel-space data completely invisible to user space so that it cannot be used in speculative execution. This is the only one of the three issues that is addressed by page-table isolation; that mitigation alone imposes a performance cost of roughly 5-30%. Intel and ARM processors seem to be vulnerable to this issue; AMD processors evidently are not.
The end result
What emerges is a picture of unintended processor functionality that can be exploited to leak arbitrary information from the kernel, and perhaps from other guests in a virtualized setting. If these vulnerabilities are already known to some attackers, they could have been using them to attack cloud providers for some time now. It seems fair to say that this is one of the most severe vulnerabilities to surface in some time.
The fact that it is based in hardware makes things significantly worse. We will all be paying the performance penalties associated with working around these problems for the indefinite future. For the owners of vast numbers of systems that cannot be updated, the consequences will be worse: they will remain vulnerable to a set of vulnerabilities with known exploits. This is not a happy time for the computing industry.
It is, to put it lightly, unlikely that this is the last vulnerability hiding within the processors at the heart of our systems. Like the Linux kernel, these processors are highly complex devices that are subject to constant change. And like the kernel, they probably have a number of unpleasant issues lurking within them. Given that, it's worthwhile to look at how these vulnerabilities were handled; there seems to be some unhappiness on that topic which might affect how future issues are disclosed. It's important to get this right, since we'll almost certainly be doing it again.
See also: the Meltdown and Spectre attacks page, which has a detailed and academic look at these vulnerabilities.
Future directions for PGP
Back in October, LWN reported on a talk about the state of the GNU Privacy Guard (GnuPG) project, an asymmetric public-key encryption and signing tool that had been almost abandoned by its lead developer due to lack of resources before receiving a significant infusion of funding and community attention. GnuPG 2 has brought about a number of changes and improvements but, at the same time, several efforts are underway to significantly change the way GnuPG and OpenPGP are used. This article will look at the current state of GnuPG and the OpenPGP web of trust, as compared to new implementations of the OpenPGP standard and other trust systems.
GnuPG produces encrypted files, signed messages, and other types of artifacts that comply to a common standard called OpenPGP, described in RFC 4880. OpenPGP is derived from the Pretty Good Privacy (PGP) commercial software project (since acquired by Symantec) and today is almost synonymous with the GnuPG implementation, but the possibility exists for independent implementations of the standard that interoperate with each other. Unfortunately, RFC 4880 was released in 2007 and a new standard has not been published since then. In the meantime, several extensions have been added to GnuPG without broader standardization, and a 2017 IETF working group formed to update RFC 4880 ultimately shut down due to lack of interest.
GnuPG 2 is a significantly heavier-weight software package than previous GnuPG versions. A major example of this change in architecture is GnuPG 2's complete reliance on the use of the separate gpg-agent daemon for private-key operations. While isolating private-key access within its own process enables improvements to security and functionality, it also adds complexity.
In the wake of the Heartbleed vulnerability in OpenSSL, a great deal of scrutiny has been directed toward the maintainability of complex and long-lived open-source projects. GnuPG does not rely on OpenSSL for its cryptographic implementation; instead, it uses its own independent library, Libgcrypt. This raises the question of whether GnuPG's cryptographic implementation is susceptible to the same kinds of problems that OpenSSL has had; indeed, the concern may be larger in the case of GnuPG.
Despite the release of Libgcrypt as an independent library, it has not seen substantial use outside of GnuPG itself, which has prevented it from benefiting from the kind of thorough security review that OpenSSL has received. This concern is not purely theoretical: multiple vulnerabilities in GnuPG and Libgcrypt were published in 2017, including side-channel leaks in the RSA and ECC implementations that were each previously known issues in other software projects.
NetPGP
In response to concerns about the maintainability of GnuPG, two projects have been launched to create independent, interoperable implementations. The first is NetPGP, produced by the NetBSD project. NetPGP was first developed [PDF] by Alistair Crooks in 2009, and has since reached a fairly stable state. It is available under the BSD license and is intended for use both as a user-facing command-line tool (netpgp) and as a library (libnetpgp). Its main downside is a limited feature set compared to the larger GnuPG.
NetPGP is based on the OpenPGP SDK implementation by Ben Laurie and Rachel Willmer, which is in turn based on OpenSSL. NetPGP's command-line tools are significantly simplified compared to GnuPG's. For example, the netpgp manual page describes fewer than a dozen major options and only 24 in total, as compared to gpg with well over one hundred, some of which accept many subcommands. Major GnuPG features missing from netpgp include interacting with key servers, key signing and trust management, and subkey management. NetPGP's library interface is similarly simplified, with only one header file and about a dozen key function prototypes.
NetPGP's capabilities include the basic functions of a PGP implementation: encryption, decryption, signing, and verifying. It also includes a basic key-management tool, netpgpkeys, which operates on a keyring stored in the same format as used by GnuPG. This shared keyring format makes it easy to use NetPGP alongside GnuPG, such as when NetPGP is embedded in other tools.
NetPGP was developed primarily with an eye toward having a BSD-licensed PGP implementation for embedding in BSD tools, since GnuPG is licensed under the GPLv3. NetPGP's simple feature set and library interface reflect this purpose, particularly in the inclusion of the separate netpgpverify command that only performs signature verification and is perfect for use in shell scripts. NetPGP is useful but does not make up a full competitor to GnuPG.
NeoPG
To replicate the features of GnuPG, it's easiest to start from GnuPG, and that's exactly what the NeoPG project has recently done. Developer Marcus Brinkmann launched NeoPG based on a fork of the GnuPG codebase with the goals of improved maintainability and ease of use.
While NeoPG started with the GnuPG codebase, it is quickly diverging. Two primary goals of the project are a switch to C++ and a replacement of the GnuPG cryptography implementation with the Botan C++ encryption library. NeoPG further introduces a number of engineering measures to improve code quality, including unit testing, continuous integration, and the use of a fuzzer for stability testing.
NeoPG also varies from GnuPG in its architecture. NeoPG will continue to use a single-process design as GnuPG 1.x did, rather than depending on separate daemons for some functionality. This is not a simple or uncontroversial decision, as isolation of sensitive operations into separate processes does have security advantages. Brinkmann explained the reasoning behind that decision in a message to the oss-security mailing list.
NeoPG also aims to improve usability of both the command-line tools and the library interface. The library interface will be significantly simplified, while the command-line interface will use a Git-style subcommand structure. As for the underlying cryptography, Botan is a fairly popular C++ library which has been audited using funding from the German government and is under active development.
NeoPG is still in an early development stage and many intended features have not yet been implemented. Brinkmann has already removed over one hundred command-line options as part of the interface reorganization and implemented significant additional testing. NeoPG appears promising as an alternative OpenPGP implementation for both direct use and inclusion as a library.
Key distribution
The most difficult problem in encrypted communications is key distribution—establishing that the keys used to contact another person actually belong to that person. Historically, OpenPGP implementations have solved this problem using a model called the "web of trust" (WoT). In that model, users verify the keys of other individuals and then cryptographically sign those keys. Eventually, this forms a graph of keys verified by multiple users that can be used to transitively trust keys that cannot be verified in person. For example, if Alice needs to securely contact Charlie but cannot personally verify his key (for example, by meeting in person), a third user that Alice has personally verified may have verified Charlie, giving Alice some second-hand confidence that Charlie's key is correct.
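The transitive step in that model amounts to a bounded reachability query over the signature graph. A toy sketch of that idea, assuming a simple adjacency matrix and ignoring GnuPG's real distinctions between full and marginal trust (all names here are invented for illustration):

```c
#define NKEYS 8

/* sigs[i][j] != 0 means key i has signed key j. A key is
 * "trusted" if it is reachable from our own key within a limited
 * number of signature hops; the hop limit both models the
 * decaying confidence of second-hand verification and guarantees
 * termination even when the graph has cycles. */
static int trusted(unsigned char sigs[NKEYS][NKEYS],
                   int own, int target, int max_hops)
{
    if (own == target)
        return 1;               /* our own key is always trusted */
    if (max_hops == 0)
        return 0;               /* too far removed to trust */
    for (int k = 0; k < NKEYS; k++)
        if (sigs[own][k] && trusted(sigs, k, target, max_hops - 1))
            return 1;
    return 0;
}
```

In the Alice/Bob/Charlie example from the text, Alice trusts Charlie's key within two hops because of the Alice→Bob and Bob→Charlie signatures.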
While this works in theory, in practice there are many challenges. Verifying and becoming verified by enough people to become well-connected to the web of trust requires a significant investment of time and effort that discourages new users; those same new users often find both the web-of-trust model itself and GnuPG's implementation details confusing and difficult to use.
One substitute for the web of trust which is increasingly adopted in the PGP ecosystem is "trust on first use" (TOFU). Implementations of this technique, of which the OpenSSH client is likely the best known, focus on detecting a man-in-the-middle attack based on the attacker having a different key pair than the legitimate destination of a message—once you have communicated with someone once, their key should never change. This is a powerful concept and is effective in preventing a man-in-the-middle attack once you have first communicated with someone. However, if you are contacting someone for the first time, you don't have any previous key to rely on and the communication is still subject to a man in the middle, which may require a stronger key validation method.
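The core of TOFU is a pinning check: accept and record a peer's key fingerprint on first contact, and thereafter insist that it never change. A minimal in-memory sketch follows; real implementations such as OpenSSH's known_hosts persist the store to disk, and these names are illustrative rather than any project's actual API:

```c
#include <stdio.h>
#include <string.h>

#define MAX_PEERS 64
#define FP_LEN    40   /* length of a hex key fingerprint */

struct tofu_entry {
    char peer[64];
    char fingerprint[FP_LEN + 1];
};

static struct tofu_entry store[MAX_PEERS];
static int store_len;

enum tofu_result { TOFU_FIRST_USE, TOFU_MATCH, TOFU_MISMATCH };

static enum tofu_result tofu_check(const char *peer, const char *fp)
{
    for (int i = 0; i < store_len; i++) {
        if (strcmp(store[i].peer, peer) != 0)
            continue;
        /* Seen before: the key must not have changed. */
        return strcmp(store[i].fingerprint, fp) == 0
            ? TOFU_MATCH : TOFU_MISMATCH;
    }
    /* First contact: pin the key and accept it. */
    snprintf(store[store_len].peer, sizeof(store[0].peer), "%s", peer);
    snprintf(store[store_len].fingerprint, FP_LEN + 1, "%s", fp);
    store_len++;
    return TOFU_FIRST_USE;
}
```

A TOFU_MISMATCH result is the man-in-the-middle signal; the gap, as the text notes, is that TOFU_FIRST_USE offers no assurance at all about the initial key.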
There have been many efforts to replace or augment the web of trust with a simpler and easier system that still reliably identifies a key's owner on the first use. These approaches are often based on authoritatively tying keys to other personal information like email addresses. For example, the PGP Corporation (now Symantec) operates a keyserver that verifies the email address associated with a submitted key and then signs the key. By trusting Symantec's verification key for this service, a GnuPG user can have some confidence that each signed key really belongs to the person with the indicated email address.
A more interesting approach to this problem comes from the commercial service Keybase, which produces an open-source tool to verify ownership of a key based on assertions published to web sites and social media services. For example, the keybase tool can automatically check a signed statement uploaded to GitHub's Gist pastebin service to prove that a PGP key belongs to the owner of the GitHub account. This has the major advantage that the actual verification process is decentralized: the Keybase tool validates these posted identity proofs independently, without relying on or trusting the Keybase service.
Keybase is not a complete solution, as it suffers from the obvious problem that many users of powerful cryptographic software intentionally avoid using social networks. The concept of tying cryptographic keys to other types of online identity is a powerful one, though, and NeoPG aims to both integrate support for Keybase and implement a keyserver that provides email verification similar to that done by Symantec.
While the GnuPG project has gone through difficult times, an uptick in development effort on GnuPG itself, as well as development on competitors, is promising for the future of open-source public-key cryptography systems. There is still a long distance to cover, though, particularly in the space of user-friendly key distribution systems, which are currently dominated by centralized, commercial offerings.
Statistics for the 4.15 kernel
The 4.15 kernel is likely to require a relatively long development cycle as a result of the post-rc5 merge of the kernel page-table isolation patches. That said, it should be in something close to its final form, modulo some inevitable bug fixes. The development statistics for this kernel release look fairly normal, but they do reveal an unexpectedly busy cycle overall.
This development cycle was supposed to be relatively calm after the anticipated rush to get work into the 4.14 long-term-support release. But, while 4.14 ended up with 13,452 non-merge changesets at release, 4.15-rc6 already has 14,226, making it one of the busiest releases in the kernel project's history. Only 4.9 (16,214 changesets) and 4.12 (14,570) brought in more work, and 4.15 may exceed 4.12 by the time it is finished. So far, 1,707 developers have contributed to this kernel; they added 725,000 lines of code while removing 407,000, for a net growth of 318,000 lines of code.
The most active developers this time around were:
Most active 4.15 developers
By changesets
  Kees Cook                349   2.5%
  Colin Ian King           237   1.7%
  Harry Wentland           170   1.2%
  Ben Skeggs               156   1.1%
  Gustavo A. R. Silva      138   1.0%
  Christoph Hellwig        137   1.0%
  Geert Uytterhoeven       136   1.0%
  Arnd Bergmann            134   0.9%
  Chris Wilson             129   0.9%
  Dmytro Laktyushkin       125   0.9%
  Allen Pais               112   0.8%
  Masahiro Yamada          108   0.8%
  Thomas Gleixner          105   0.7%
  Dave Airlie              103   0.7%
  Eric Dumazet              99   0.7%
  Ville Syrjälä             97   0.7%
  Arvind Yadav              95   0.7%
  Jakub Kicinski            94   0.7%
  Markus Elfring            92   0.6%
  Mauro Carvalho Chehab     89   0.6%
By changed lines
  Harry Wentland        152262  16.8%
  Dave Airlie            47651   5.2%
  Takashi Iwai           41943   4.6%
  Dmytro Laktyushkin     28306   3.1%
  Rex Zhu                24008   2.6%
  Andy Shevchenko        18204   2.0%
  Paul E. McKenney       14629   1.6%
  Ben Skeggs             12684   1.4%
  Palmer Dabbelt         10433   1.1%
  David Howells          10210   1.1%
  Darrick J. Wong         8792   1.0%
  Yue Hin Lau             8483   0.9%
  Greg Kroah-Hartman      8298   0.9%
  Kees Cook               7091   0.8%
  Christoph Hellwig       7076   0.8%
  Linus Walleij           6757   0.7%
  Jakub Kicinski          6402   0.7%
  Wei Hu                  5967   0.7%
  Mauro Carvalho Chehab   5692   0.6%
  Alex Deucher            5406   0.6%
Kees Cook was this cycle's most prolific contributor of changesets; he did security-related work throughout the kernel, but the bulk of the patches implemented the internal kernel-timer API change. Colin Ian King contributed cleanup patches all over the kernel, Harry Wentland added another massive pile of AMD graphics driver code, Ben Skeggs worked on the Nouveau driver as usual, and Gustavo Silva focused on marking fall-through cases in switch statements (as in this patch).
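The fall-through annotations in question are comments (or, on newer compilers, attributes) that GCC's -Wimplicit-fallthrough option recognizes as declarations of intent, silencing the warning for deliberate cases. A standalone example of the pattern (the function itself is invented for illustration):

```c
/* Marking intentional switch fall-through, in the style of the
 * 4.15 cleanup patches: each unmarked case that falls into the
 * next one draws a -Wimplicit-fallthrough warning; the comment
 * documents that the omission of "break" is deliberate. */
static int flags_for_mode(int mode)
{
    int flags = 0;

    switch (mode) {
    case 2:
        flags |= 4;
        /* fall through */
    case 1:
        flags |= 2;
        /* fall through */
    case 0:
        flags |= 1;
        break;
    default:
        return -1;
    }
    return flags;
}
```

The value of the exercise is less the comments themselves than the audit: every unmarked fall-through is either annotated as intentional or exposed as a bug.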
In the lines-changed column, Wentland's AMD graphics driver additions topped the list. Dave Airlie brought the AMD display core code into the graphics subsystem, but also did a bunch of cleanup work resulting in the removal of over 21,000 lines of code. Takashi Iwai worked all over the audio subsystem; in particular, he removed the ancient Open Sound System code, shrinking the kernel by over 40,000 lines. Dmytro Laktyushkin and Rex Zhu also added more AMD graphics code. The AMD graphics drivers thus dominated the changes in this cycle in terms of lines of code, as has been the case for a number of recent development cycles.
It is worth noting once again that staging-tree work hardly figures in these numbers at all; the days when staging was the biggest driver of kernel changes appear to be done. The page-table isolation work doesn't show up much here either, showing that important changes often come in relatively small packages.
The work in 4.15 was supported by 231 companies (that we can identify), more than worked on 4.14 but still a relatively small number by recent standards; 4.10 remains the record holder with 271 companies participating. The most active companies this time around were:
Most active 4.15 employers
By changesets

    Intel                  1609  11.3%
    AMD                    1526  10.7%
    Red Hat                 955   6.7%
    (None)                  813   5.7%
    (?)                     739   5.2%
    (Unknown)               703   4.9%
    Linaro                  489   3.4%
    IBM                     450   3.2%
    Oracle                  390   2.7%
    Renesas Electronics     343   2.4%
    Mellanox                340   2.4%
    Linux Foundation        307   2.2%
    ARM                     306   2.2%
    SUSE                    294   2.1%
    Broadcom                260   1.8%
    Huawei Technologies     257   1.8%
    Canonical               254   1.8%
    (Consultant)            251   1.8%
    Samsung                 221   1.6%
    Netronome Systems       157   1.1%

By lines changed

    AMD                    266230  29.3%
    Red Hat                 97177  10.7%
    Intel                   82791   9.1%
    SUSE                    46479   5.1%
    (Unknown)               33739   3.7%
    IBM                     33105   3.6%
    (None)                  24842   2.7%
    Linaro                  23291   2.6%
    (?)                     17760   2.0%
    Broadcom                15482   1.7%
    Mellanox                14923   1.6%
    Samsung                 13841   1.5%
    Oracle                  13755   1.5%
    Huawei Technologies     13655   1.5%
    ARM                     13118   1.4%
    Renesas Electronics     10762   1.2%
    Netronome Systems       10366   1.1%
    Linux Foundation         9855   1.1%
    ST Microelectronics      8803   1.0%
    Chelsio                  8695   1.0%
The AMD graphics work shows clearly in these numbers; otherwise, the results are typical for recent development cycles.
The Signed-off-by tags attached to patches give clues as to who took responsibility for their development. In particular, if one looks at the signoffs attached by developers other than the author of the patch, the result is a picture of who accepted the patches for merging into the mainline — the most active maintainers, in other words. For the 4.15 kernel, the results look like this:
Non-author signoffs in 4.15
By developer

    David S. Miller        1942  14.1%
    Alex Deucher           1551  11.3%
    Greg Kroah-Hartman      749   5.5%
    Ingo Molnar             397   2.9%
    Mark Brown              329   2.4%
    Doug Ledford            300   2.2%
    Mauro Carvalho Chehab   287   2.1%
    Andrew Morton           271   2.0%
    Jens Axboe              240   1.7%
    Martin K. Petersen      226   1.6%
    Thomas Gleixner         218   1.6%
    Simon Horman            177   1.3%
    Herbert Xu              174   1.3%
    Jeff Kirsher            156   1.1%
    Kalle Valo              152   1.1%
    Michael Ellerman        151   1.1%
    Jiri Pirko              126   0.9%
    David Sterba            114   0.8%
    Martin Schwidefsky      113   0.8%
    Linus Walleij           110   0.8%

By company

    Red Hat                3334  24.3%
    AMD                    1681  12.2%
    Intel                  1088   7.9%
    Linaro                  904   6.6%
    Linux Foundation        769   5.6%
    (?)                     479   3.5%
    Samsung                 440   3.2%
    Oracle                  395   2.9%
    IBM                     372   2.7%
    (?)                     334   2.4%
    Huawei Technologies     328   2.4%
    (None)                  320   2.3%
    Mellanox                283   2.1%
    SUSE                    270   2.0%
    Renesas Electronics     219   1.6%
    Free Electrons          218   1.6%
    Linutronix              218   1.6%
    Code Aurora Forum       214   1.6%
    (Consultant)            181   1.3%
    ARM                     176   1.3%
Kernel subsystem maintainers have long been concentrated in a relatively small set of companies. That situation is slowly changing, but it's still true that, in 4.15, half of the changes merged were accepted by developers working for just four companies.
Finally, the most active bug reporters and patch testers, according to the Reported-by and Tested-by tags attached to patches, were:
Bug reporters and testers in 4.15
Reported-by credits

    kernel test robot      36   5.5%
    Dan Carpenter          25   3.8%
    syzbot                 25   3.8%
    Dmitry Vyukov          12   1.8%
    Andrey Konovalov       11   1.7%
    Geert Uytterhoeven      9   1.4%
    Arnd Bergmann           7   1.1%
    Michael Ellerman        7   1.1%
    Randy Dunlap            7   1.1%
    Brian Foster            7   1.1%
    Stephen Rothwell        7   1.1%
    Jianlin Shi             7   1.1%
    Jakub Kicinski          6   0.9%

Tested-by credits

    Andrew Bowers             114  12.7%
    Juergen Gross              52   5.8%
    Yu Chen                    51   5.7%
    Krishneil Singh            22   2.4%
    Borislav Petkov            20   2.2%
    Oleksandr Natalenko        16   1.8%
    Arnaldo Carvalho de Melo   15   1.7%
    Aaron Brown                13   1.4%
    Sean Wang                  12   1.3%
    Chris Brandt               12   1.3%
    Xin Long                   11   1.2%
    Geert Uytterhoeven          9   1.0%
    Lee Tibbert                 9   1.0%
A relatively new entry here is "syzbot", which is an operation run by Dmitry Vyukov at Google. Syzbot runs the syzkaller fuzz tester in an automated mode and reports the (numerous) crashes that result. As can be seen in the tags, those reports are leading to a steady stream of bug fixes, which can only be a good thing.
The story told by that final table is incomplete, though, in that most bug reporting and (especially) most testing goes untracked. The kernel community counts on many people beyond those who directly contribute code; it will never be possible to credit them all. As a whole, this community remains large, active, and growing, and the first kernel to be released in 2018 will reflect that.
An introduction to the BPF Compiler Collection
In the previous article of this series, I discussed how to use eBPF to safely run code supplied by user space inside of the kernel. Yet one of eBPF's biggest challenges for newcomers is that writing programs requires compiling and linking to the eBPF library from the kernel source. Kernel developers might always have a copy of the kernel source within reach, but that's not so for engineers working on production or customer machines. Addressing this limitation is one of the reasons that the BPF Compiler Collection was created. The project consists of a toolchain for writing, compiling, and loading eBPF programs, along with example programs and battle-hardened tools for debugging and diagnosing performance issues.
Since its release in April 2015, many developers have worked on BCC, and the 113 contributors have produced an impressive collection of over 100 examples and ready-to-use tracing tools. For example, scripts that use User Statically-Defined Tracing (USDT) probes (a mechanism from DTrace to place tracepoints in user-space code) are provided for tracing garbage collection events, method calls and system calls, and thread creation and destruction in high-level languages. Many popular applications, particularly databases, also have USDT probes that can be enabled with configuration switches like --enable-dtrace. These probes are inserted into user applications, as the name implies, statically at compile-time. I'll be dedicating an entire LWN article to covering USDT probes in the near future.
The project documentation shows how to use the existing scripts and tools to conduct a thorough performance investigation without writing a line of code, and a handy tutorial is provided in the BCC repository. Another useful guide to some of the BCC tools was written by Brendan Gregg, who has the second highest number of patches to bcc/tools (Sasha Goldshtein holds the number one spot as of this writing).
Front-ends for the Python and Lua programming languages are available in BCC. Using these high-level languages, it's possible to write short but expressive programs with all the data-manipulation advantages that are missing with C. For example, developers can treat eBPF maps as Python dictionaries and access map contents directly, which is implemented internally by using the BPF helper functions. This helps to lower the bar for would-be developers using eBPF because they can use the standard patterns that they're used to for processing data.
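As a sketch of that pattern, a BCC table object supports the same iteration idioms as an ordinary Python dictionary. The data below is made up (an irq-number-to-count mapping, standing in for a kernel-side BPF map); in a real script the dictionary would be a table obtained from a BPF object, e.g. with an expression along the lines of b["counts"]:

```python
def summarize(counts):
    """Return one report line per map entry, busiest key first.

    In a real BCC script, 'counts' would be the table object for a
    kernel-side BPF map; here a plain dict stands in for it.
    """
    return ["irq %d fired %d times" % (irq, n)
            for irq, n in sorted(counts.items(),
                                 key=lambda kv: kv[1], reverse=True)]

# Made-up data: irq number -> number of events observed.
counts = {48: 52, 53: 16, 45: 12}
for line in summarize(counts):
    print(line)
```

The point is that no special map-access API needs to be learned: the usual sorted()/items() idioms apply directly.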
BCC invokes the LLVM Clang compiler, which has a BPF back end, to translate C into eBPF bytecode. BCC then takes care of loading the eBPF bytecode into the kernel with the bpf() system call. If loading fails, for example if the in-kernel verifier checks fail, then BCC provides hints as to why loading failed, e.g. "HINT: The 'map_value_or_null' error can happen if you dereference a pointer value from a map lookup without first checking if that pointer is NULL." This is another motivation for creating BCC — it's difficult to write obviously correct BPF programs; BCC tells you when you've made a mistake.
A really quick "Hello, World!" example
To demonstrate how quickly you can start working with BCC, here's the "Hello, World!" program example from the BCC repository. It prints into the trace buffer every time the clone() system call runs. I've reformatted it slightly to make it easier to read.
    #!/usr/bin/env python

    from bcc import BPF

    program = '''
    int kprobe__sys_clone(void *ctx) {
        bpf_trace_printk("Hello, World!\\n");
        return 0;
    }
    '''
The entire eBPF program is contained in the program variable; this is the code that runs inside the kernel on the eBPF virtual machine. The format of the function name, "kprobe__sys_clone()", is important: the kprobe__ prefix directs the BCC toolchain to attach a kprobe to the kernel symbol that follows it. In this case, that's sys_clone(). When sys_clone() is called and this kprobe fires, the eBPF program runs and bpf_trace_printk() prints "Hello, World!" into the kernel's trace buffer.
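The convention is pure string matching on the function name; a rough sketch of it (this is illustrative Python, not BCC's actual implementation) looks like:

```python
# Sketch of BCC's autoload convention: a C function named
# kprobe__<symbol> is automatically attached to a kprobe on <symbol>,
# with no explicit attach call needed in the Python script.
PREFIX = "kprobe__"

def implied_kprobe_symbol(fn_name):
    """Return the kernel symbol a BCC-style function name implies,
    or None if the name carries no kprobe__ prefix."""
    if fn_name.startswith(PREFIX):
        return fn_name[len(PREFIX):]
    return None

print(implied_kprobe_symbol("kprobe__sys_clone"))  # sys_clone
```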
The remainder of the Python program causes the eBPF code to be loaded into the kernel and run:
    b = BPF(text=program)
    b.trace_print()
The previously cumbersome task of compiling the program to eBPF bytecode and loading it into the kernel is handled entirely by instantiating a new BPF object; all the low-level work is done behind the scenes, inside the Python bindings and BCC's libbpf.
BPF.trace_print() performs a blocking read on the kernel's trace buffer file (/sys/kernel/debug/tracing/trace_pipe) and prints the contents to the standard output. Here's the output:
    gnome-terminal--3210 [003] d..2 19252.369014: 0x00000001: Hello, World!
    gnome-terminal--3210 [003] d..2 19252.369080: 0x00000001: Hello, World!
    pool-21543           [001] d..2 19252.382317: 0x00000001: Hello, World!
    bash-21545           [002] d..2 19252.385535: 0x00000001: Hello, World!
    bash-21546           [003] d..2 19252.385752: 0x00000001: Hello, World!
    bash-21545           [002] d..2 19252.386883: 0x00000001: Hello, World!
The output shows:
- The name of the application running when the kprobe fired
- Its PID
- The CPU it was running on (in [brackets])
- Various process context flags
- A timestamp
The final field is our "Hello, World!" string that we passed to bpf_trace_printk(). The penultimate field contains the address 0x00000001. Normally, when kernel code writes to the trace buffer, the instruction pointer address following the call to trace_printk() is printed in that field. Unfortunately, this isn't implemented for bpf_trace_printk(), so the hard-coded address 0x00000001 is always used.
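Those fields can be picked apart programmatically; here is a rough, hypothetical sketch (not part of BCC) that pulls them out of one line of the sample output above with a regular expression. Real trace lines can vary between kernel versions, so treat the pattern as an illustration of the format rather than a robust parser:

```python
import re

# Matches the trace_pipe lines shown above:
#   <task>-<pid> [<cpu>] <flags> <timestamp>: <addr>: <message>
LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+"   # task name and PID
    r"\[(?P<cpu>\d+)\]\s+"                # CPU the task ran on
    r"(?P<flags>\S+)\s+"                  # context flags, e.g. d..2
    r"(?P<ts>\d+\.\d+):\s+"               # timestamp
    r"(?P<addr>0x[0-9a-f]+):\s+"          # the hard-coded 0x00000001
    r"(?P<msg>.*)$")                      # the printed message

def parse(line):
    """Return the trace-line fields as a dict, or None on no match."""
    m = LINE.match(line)
    return m.groupdict() if m else None

sample = "bash-21545 [002] d..2 19252.385535: 0x00000001: Hello, World!"
fields = parse(sample)
print(fields["task"], fields["pid"], fields["cpu"], fields["msg"])
```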
More examples
argdist.py inserts a probe (uprobe, kprobe, tracepoint, or USDT) into a given function, which can be in the kernel or in user-space code. When the probe fires, argdist.py prints the function's parameter values, either as a count or as a histogram. It runs until interrupted by the user. For example, the following command prints the number of times irq_handler_entry() is called, along with which interrupt was raised:
    $ tools/argdist.py -C 't:irq:irq_handler_entry():int:args->irq'
    [14:14:24]
    t:irq:irq_handler_entry():int:args->irq
        COUNT      EVENT
        12         args->irq = 45
        16         args->irq = 53
        52         args->irq = 48
    [14:14:25]
    t:irq:irq_handler_entry():int:args->irq
        COUNT      EVENT
        1          args->irq = 49
        5          args->irq = 53
        24         args->irq = 45
Because the histogram option (-H) uses buckets to group multiple interrupts together, it's less useful than the count option (-C) in this case. One scenario where histogram output is helpful, however, is for the btrfsdist.py tool, which summarizes the latency of Btrfs reads, writes, opens, and fsync operations into power-of-two buckets:
    $ tools/btrfsdist.py
    Tracing btrfs operation latency... Hit Ctrl-C to end.
    ^C

    operation = 'read'
         usecs           : count    distribution
           0 -> 1        : 775     |****************************************|
           2 -> 3        : 60      |***                                     |
           4 -> 7        : 20      |*                                       |
           8 -> 15       : 3       |                                        |
          16 -> 31       : 3       |                                        |
          32 -> 63       : 0       |                                        |
          64 -> 127      : 0       |                                        |
         128 -> 255      : 1       |                                        |
         256 -> 511      : 19      |                                        |
         512 -> 1023     : 12      |                                        |

    operation = 'write'
         usecs           : count    distribution
           0 -> 1        : 0       |                                        |
           2 -> 3        : 2       |**********                              |
           4 -> 7        : 8       |****************************************|
           8 -> 15       : 1       |*****                                   |
          16 -> 31       : 4       |********************                    |
          32 -> 63       : 4       |********************                    |

    operation = 'open'
         usecs           : count    distribution
           0 -> 1        : 636     |****************************************|
           2 -> 3        : 22      |*                                       |
           4 -> 7        : 16      |*                                       |
           8 -> 15       : 2       |                                        |
          16 -> 31       : 1       |                                        |

    operation = 'fsync'
         usecs           : count    distribution
           0 -> 1        : 0       |                                        |
           2 -> 3        : 0       |                                        |
           4 -> 7        : 0       |                                        |
           8 -> 15       : 0       |                                        |
          16 -> 31       : 0       |                                        |
          32 -> 63       : 0       |                                        |
          64 -> 127      : 0       |                                        |
         128 -> 255      : 0       |                                        |
         256 -> 511      : 0       |                                        |
         512 -> 1023     : 0       |                                        |
        1024 -> 2047     : 0       |                                        |
        2048 -> 4095     : 0       |                                        |
        4096 -> 8191     : 1       |****************************************|
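The power-of-two grouping shown in that output can be sketched in plain Python. This mirrors the log2 bucketing that BCC's histogram tools perform on the kernel side (via a log2 helper in the BPF program); the code below is only an illustration of the bucket boundaries, not BCC code:

```python
def bucket(usecs):
    """Return the (low, high) bounds of the power-of-two bucket that a
    latency of 'usecs' microseconds falls into, matching the ranges
    printed by tools like btrfsdist.py."""
    if usecs <= 1:
        return (0, 1)
    # bit_length() - 1 is floor(log2(usecs)) for positive integers.
    k = usecs.bit_length() - 1
    return (2 ** k, 2 ** (k + 1) - 1)

for latency in (1, 5, 300, 6000):
    lo, hi = bucket(latency)
    print("%d usecs -> bucket %d -> %d" % (latency, lo, hi))
```

So a 6000-microsecond fsync lands in the 4096 -> 8191 bucket, exactly as in the output above.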
There's more to come
That was just a quick introduction to BCC. In the next article in this series, we'll explore some of the more complicated topics, like how to access eBPF data structures, how to configure the way your eBPF program is compiled, and how to debug your programs, all using the Python front end.
A Modularity rethink for Fedora
We have covered the Fedora Modularity initiative a time or two over the years but, just as the modular "product" started rolling out, Fedora went back to the drawing board. There were a number of fundamental problems with Modularity as it was to be delivered in the Fedora 27 server edition, so a classic version of the distribution was released instead. But Modularity is far from dead; there is a new plan afoot to deliver it for Fedora 28, which is due in May.
The problem that Modularity seeks to solve is that different users of the distribution have differing needs for stability versus tracking the bleeding edge. The pain is most often felt in the fast-moving web development world, where frameworks and applications move far more quickly than Fedora as a whole can—even if it could, moving that quickly would be problematic for other types of users. So Modularity was meant to be a way for Fedora users to pick and choose which "modules" (a cohesive set of packages supporting a particular version of, say, Node.js, Django, a web server, or a database management system) are included in their tailored instance of Fedora. The Tumbleweed snapshots feature of the openSUSE rolling distribution is targeted at solving much the same problem.
Modularity would also facilitate installing multiple different versions of modules so that different applications could each use the versions of the web framework, database, and web server that the application supports. It is, in some ways, an attempt to give users the best of both worlds: the stability of a Fedora release with the availability of modules of older and newer packages, some of which would be supported beyond the typical 13-month lifecycle of a Fedora release. The trick is in how to get there.
The main problem that arose with the modular server edition was, in effect, a lack of modules. It turned out to be far more painful for packagers to build modules than expected, so few did. That left it up to the Modularity team to build the modules that would ship with Fedora 27. As Stephen Gallagher, who has been one of the driving forces behind the initiative, put it:
In addition, the first mechanism chosen to build modules relied on a "bootstrap" module that, among other things, made it difficult for existing Fedora users to upgrade into a modular server release. Third-party software was also problematic in this first approach, since it would need to be built into a module—something that was difficult for anyone but the Modularity team to accomplish.
New approach
The original plan was to define a build environment (buildroot) specifically for the modular server, but that seems to have caused more problems than it solved. The new plan is to use the "everything" repository for the Fedora 28 release as the underlying "platform module", which makes things more straightforward. Importantly, it makes things easier for module packagers:
That change will also make it easy for users to simply upgrade into a modular release. Modules and traditional packages can coexist on a system as well. So far, the plan has been for only the server edition to support modules, but with an easier upgrade path and the ability to support both packages and modules, the idea could be adopted by other editions (e.g. workstation) of Fedora.
In fact, the module-creation process will become so straightforward that automated tools will be provided to create the configuration to build single-source-package modules. "Even for more complex multi-package modules, the automatically-created module definitions provide an easy and obvious starting point." This will make it easy to support multiple versions, as Gallagher notes:
For future Fedora releases, there will be two sets of repositories to support both the traditional RPM-based distribution and the modular approach. Those who have no interest in modules can disable the modular repositories and continue on as they always have. For others who are looking for the modular approach, though, it will be as easy as simply using the DNF package manager with some new target-specification syntax to pick up modules.
It is not surprising that a change of this nature might run into some turbulence as it gets integrated into a well-established distribution packaging ecosystem like Fedora's. It is a pretty fundamental change to the distribution, so problems are to be expected during the upheaval. As Fedora project leader Matthew Miller put it in the announcement of the rethinking: "Sometimes experiments produce negative results. That's okay — the project learns even when trying a path that doesn't work out, and it iterates to something better." For his part, Gallagher expressed optimism that the Modularity project is now on a better track:
As the number of available modules grows, users of Fedora will have a much easier access to the exact version of software they want to accomplish their tasks. People doing rapid-prototyping can more easily access newer versions of packages and at the same time people running older applications can continue to access the older streams that they need.
As Miller pointed out, progress is not made without some missteps along the way. It remains to be seen if Modularity represents progress, but the problem it addresses is certainly real—the approach Fedora is taking seemingly has the potential to solve it. One of the major benefits of development in the open is that these kinds of missteps are not hidden behind delayed releases, vaporware, press releases, and other obfuscation techniques as they often are in the proprietary software world. In the free-software world, we get to see the sausage being made (and remade), so projects can learn from each other. With luck, that makes for better software throughout our ecosystem.
Varlink: a protocol for IPC
One of the motivations behind projects like kdbus and bus1, both of which have fallen short of mainline inclusion, is to have an interprocess communication (IPC) mechanism available early in the boot process. The D-Bus IPC mechanism has a daemon that cannot be started until filesystems are mounted and the like, but what if the early boot process wants to perform IPC? A new project, varlink, was recently announced; it aims to provide IPC from early boot onward, though it does not really address the longtime D-Bus performance complaints that also served as motivation for kdbus and bus1.
The announcement came from Harald Hoyer, but he credited Kay Sievers and Lars Karlitski with much of the work. At its core, varlink is simply a JSON-based protocol that can be used to exchange messages over any connection-oriented transport. No kernel "special sauce" (such as kdbus or bus1) is needed to support it as TCP or Unix-domain sockets will provide the necessary functionality. The messages can be used as a kind of remote procedure call (RPC) using an API defined in an interface file.
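At the wire level, each varlink message is a JSON object terminated by a single NUL byte. The sketch below illustrates that framing in plain Python (it does not use the varlink bindings, and the method and parameter names are taken from the accounts example later in this article):

```python
import json

def encode_call(method, parameters):
    """Encode one varlink call as a NUL-terminated JSON message."""
    msg = {"method": method, "parameters": parameters}
    return json.dumps(msg).encode("utf-8") + b"\0"

def decode_message(data):
    """Decode a single NUL-terminated varlink message."""
    return json.loads(data[:-1].decode("utf-8"))

# What a GetAccountByName call would look like on the socket:
wire = encode_call("com.redhat.system.accounts.GetAccountByName",
                   {"name": "root"})
print(decode_message(wire)["method"])
```

In a real client, those bytes would simply be written to a TCP or Unix-domain socket connected to the service; no broker or kernel support is involved.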
One of the foundations of varlink is simplicity. As outlined on the "ideals" page, the protocol is "not specifically optimized for anything else but ease-of-use and maintainability". To that end, interface definitions are text files, readable by both machines and humans, that describe the services a varlink endpoint will provide. The interface files are meant to be self-documenting and can be retrieved using the GetInterfaceDescription() method of the varlink service interface (org.varlink.service). As Hoyer describes, they are human-readable so that the interfaces can be discussed widely:
Hoyer shows a simple example that gets information from the /etc/passwd file:
    interface com.redhat.system.accounts

    type Account (
        name: string,
        uid: int,
        gid: int,
        full_name: string,
        home: string,
        shell: string
    )

    method GetAccounts() -> (accounts: Account[])
    method GetAccountByUid(uid: int) -> (account: Account)
    method GetAccountByName(name: string) -> (account: Account)
    method AddAccount(account: Account) -> (account: Account)

    error AccountNotFound ()
    error AccountCreationFailed (field: string)
All it takes is four lines of Python to retrieve and print the information for the "root" user (for example). There is also a varlink command-line tool (written in C) that can be used to make varlink calls. Bindings for other languages (C, JavaScript, Go, Java, and Rust) are also available, though some are just a proof of concept at this point.
As described so far, there is still a missing piece. Some service must provide a way to resolve names like "com.redhat.system.accounts" to a Uniform Resource Identifier (URI) corresponding to the running service. If the service is known, but is not running, something needs to start it. Both of those tasks can be handled by the varlink resolver.
Unlike other protocols, such as D-Bus, varlink makes no provision for sending things like file descriptors. It is simply for sending simple data types (numbers, strings, arrays, etc.). That means the messages can be transparently proxied or redirected elsewhere for servicing. As the ideals statement notes: "Varlink should be free of any side-effects of local APIs. All interactions need to be simple messages on a network, not carrying things like file descriptors or references to locally stored files."
Varlink is available in a GitHub repository under the Apache 2.0 license.
As part of the announcement, Hoyer makes a sweeping claim about the current API to a Linux system: it could all be replaced with varlink-based interfaces. In that statement, he includes kernel interfaces, such as ioctl() and other system calls, procfs, and sysfs; the Linux command-line interface; and various IPC mechanisms including D-Bus and Protobuf. There is a kernel module that allows varlink interfaces to be added to the kernel, but it is a little hard to see the kernel API being replaced, even if it was deemed desirable. It would be decades (if not longer) before the existing kernel interfaces could be removed, which would make for a maintenance headache at minimum.
Hoyer does wryly note the classic xkcd standards proliferation comic: "Of course varlink is the 15th xkcd standard here".
As nice as it might be to have a single, standard interface mechanism throughout the Linux system, that's not a likely outcome. However, varlink does seem like it may have its uses. One would guess that, rather than have each early boot daemon have "fallback IPC via unix domain sockets with its own homegrown protocol", it may make sense for (some) distributions to move to varlink. Given that the developers are from Red Hat, Fedora would seem like a plausible starting place.
Varlink is a fairly simple way to gather needed information or request that certain services be performed, though it doesn't provide the kinds of guarantees that D-Bus is supposed to require—or the increased performance that folks have been clamoring for. The amount of churn throughout the Linux ecosystem to support it "everywhere" would be enormous and the benefits to doing so are not obvious. As they say, however, the future is unwritten.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Major CPU security holes; KPTI merged; OpenWrt/LEDE; Linux Journal returns; Quotes ...
- Announcements: Newsletters; events; security updates; kernel patches; ...