Leading items
Welcome to the LWN.net Weekly Edition for November 29, 2018
This edition contains the following feature content:
- event-stream, npm, and trust: a malware attack highlights problems in our maintainership and distribution models.
- Filesystems and case-insensitivity: case-insensitive filesystems may yet be supported by Linux.
- Bringing the Android kernel back to the mainline: much work has been done to address the "Android problem".
- A panel discussion on the kernel's code of conduct: the first public event where kernel developers could discuss their new code of conduct.
- The kernel developer panel at LPC: a unique panel of kernel developers discusses a wide range of topics.
- Toward a kernel maintainer's guide: documenting the process quirks found in many kernel subsystems.
- Updates on the KernelCI project: improving the testing of the kernel on a wide range of hardware.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
event-stream, npm, and trust
Malware inserted into a popular npm package has put some users at risk of losing Bitcoin, which is certainly worrisome. More concerning, though, are the implications of how the malware got into the package—and how the package got distributed. This is not the first time we have seen package-distribution channels exploited, nor will it be the last, but the underlying problem requires more than a technical solution. It is, fundamentally, a social problem: trust.
Npm is a registry of JavaScript packages, most of which target the Node.js event-driven JavaScript framework. As with many package repositories, npm helps manage dependencies so that picking up a new version of a package will also pick up new versions of its dependencies. Unlike, say, distribution package repositories, however, npm is not curated—anyone can put a module into npm. Normally, a module that wasn't useful would not become popular and would not get included as a dependency of other npm modules. But once a module is popular, it provides a ready path to deliver malware if the maintainer, or someone they delegate to, wants to go that route.
That is just what happened with the event-stream package, as was recently discovered. The package allows creating streams that can be used both for I/O and for event handling. Its maintainer, Dominic Tarr, had stopped using the package some time ago, so his interest in maintaining it was low. As he noted in a comment on the bug report filed in the event-stream GitHub repository, someone volunteered to take it over.
As detailed in a blog post by Zach Schneider, who plucked various pieces out of the voluminous GitHub bug report thread, the attack that was inserted by the new maintainer, "right9ctrl", was clever. The commit log of changes right9ctrl made to event-stream was fairly innocuous; even the commit that added the malware was simply adding a new dependency on another npm module: flatmap-stream.
Had anyone looked, flatmap-stream might have seemed a bit of an odd dependency: it had one contributor and no downloads prior to its inclusion. Its contents might seem reasonable at first glance, but there is a tangled chain of malware contained there.
The flatmap-stream npm package had an extra file added into it that was not in the GitHub repository. It also had "minified" code that read the AES256-encrypted data stored in the file, using the parent package's npm_package_description as the decryption key. For all but one npm package, that decryption would fail (and be silently ignored); for the victim package, it produced JavaScript code that was then executed. That code, in turn, decrypted a different chunk of the "extra" file to yield the final payload, which, naturally, was also executed.
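To make the mechanism concrete, here is a minimal TypeScript sketch of the general technique; it is illustrative only (the real payload's key derivation, cipher parameters, and structure differ), but it shows how keying decryption on npm's npm_package_description environment variable lets code lie dormant everywhere except one target package:

```ts
import { createDecipheriv, createHash } from "crypto";

// Illustrative sketch, not the actual malware: try to decrypt an embedded
// blob using the consuming package's description (npm exposes it as the
// npm_package_description environment variable while running scripts).
// On every package except the intended victim, decryption fails and the
// error is silently swallowed.
function tryDecrypt(encryptedHex: string, ivHex: string): string | null {
  const description = process.env.npm_package_description ?? "";
  const key = createHash("sha256").update(description).digest(); // 32 bytes for AES-256
  try {
    const decipher = createDecipheriv("aes-256-cbc", key, Buffer.from(ivHex, "hex"));
    const plain = Buffer.concat([
      decipher.update(Buffer.from(encryptedHex, "hex")),
      decipher.final(), // throws when the key is wrong (padding check fails)
    ]);
    return plain.toString("utf8");
  } catch {
    return null; // not the victim: do nothing
  }
}
```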
As determined by brute-forcing the key from a list of all the npm package descriptions, the victim package was copay-dash, which is a "secure bitcoin wallet platform" from a company called Copay. Given the presence of the word "bitcoin", one can probably guess what the malware ultimately targeted. It would send account information to the attacker, who would, presumably, use it to abscond with the Bitcoin.
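The identification step can be sketched the same way: reuse the hypothetical tryDecrypt() from the sketch above and walk a list of every published package description until one of them makes the decryption succeed.

```ts
// Illustrative only: find which package description serves as the key.
function findVictimDescription(
  descriptions: string[],
  encryptedHex: string,
  ivHex: string,
): string | null {
  for (const description of descriptions) {
    process.env.npm_package_description = description; // try this candidate as the key
    if (tryDecrypt(encryptedHex, ivHex) !== null) {
      return description; // decryption succeeded; this is the target package
    }
  }
  return null;
}
```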
The dependency on flatmap-stream only lasted a little over ten days before it was replaced with a non-malicious implementation of a "flat map" in event-stream itself. The npm blog post about the incident says that it was the Copay build process that was being subverted.
Copay's initial response was that no builds containing this malicious code were released to the public, but we now have confirmation from Copay that "the malicious code was deployed on versions 5.0.2 through 5.1.0."
As Schneider noted, the JavaScript-development community is particularly vulnerable to this kind of problem.
He goes on to note that JavaScript applications tend to be fast-moving: "its users install a lot of packages and updates, and are thus vulnerable to malicious updates". On the other hand, problems can also occur from not updating frequently enough, he said, pointing to the Equifax breach. He suggested two ways to avoid this kind of thing in the future: locking the version numbers of dependencies to "known good" versions and paying attention to the dependencies a project is adding.
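As a minimal illustration of the first suggestion (the version number here is only an example), a project's package.json can name an exact, known-good version instead of a caret range like "^3.3.4", which floats forward to whatever compatible release the maintainer publishes next:

```json
{
  "dependencies": {
    "event-stream": "3.3.4"
  }
}
```

A committed lockfile (package-lock.json) achieves much the same effect while still recording the full dependency tree.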
We have seen other related mayhem in the npm world before. Back in 2016, a developer's deletion of the simple left-pad npm module "broke the internet" because so much of the rest of the npm ecosystem relied on it to pad strings.
But the problem is not at all restricted to npm or JavaScript. Other languages have similar problems with their non-curated package repositories. Typosquatting is a related problem that has occurred with some frequency as well. Beyond that, it is not even just a problem for languages; as Dirk Hohndel pointed out in a talk back in May, today's containers are built up from many constituent parts gathered from all over the internet. Most of the container creators have no idea what is actually in them, what versions of code are being used, and so on. Docker and similar technologies are also part of the "move fast" school of development.
Certainly there have been some failures even in curated repositories—humans are not infallible. But curation and "move fast" tend not to play all that well together, which is why there is always such tension between the language-specific installation methods (e.g. npm, pip) and a distribution's package-management system. Users often just want the latest and greatest; they are not willing to wait for a distribution to get around to packaging it. That may be reasonable for a personal desktop or laptop—there are obvious risks (e.g. Bitcoin wallets) but they may be considered manageable—but the public release or deployment of a web application or component seems like it warrants a higher level of scrutiny.
Beyond more scrutiny, which is surely something that development teams should be doing regardless of whether it slows things down, package maintenance is an area that clearly needs to be addressed. Tarr created a package that was useful to some, but apparently got no help in maintaining it; as the left-pad fiasco showed, once a package has been shared there is no real way to "unshare" it, even after its author loses interest. In his statement about the event-stream malware, Tarr noted that the problem is widespread.
He continued by noting that sharing commit and publish rights was a longstanding npm-community practice. "Open source is driven by sharing! It's great! it worked really well before bitcoin got popular." He suggested that people either pay the maintainers of the packages they use or step up to help maintain the packages they depend on.
Once again, this is not in any way an "npm problem". The explosion of availability of open-source software has not really been met with a concomitant increase in the number of maintainers. There are, it seems, a lot of companies and others that are using open source without truly considering what that means. Even large projects like the Linux kernel suffer from a dearth of maintainers in some areas and events like Heartbleed exposed the maintenance problem for critical internet infrastructure like OpenSSL. Heartbleed led to the founding of the Core Infrastructure Initiative, but it is hard to see that kind of effort being extended down to the "leaves"—fixing it really requires users to step up.
Filesystems and case-insensitivity
A recurring topic in filesystem-developer circles is the handling of case-insensitive file names. Filesystems for other operating systems handle them but, by and large, Linux filesystems do not. In the Kernel Summit track of the 2018 Linux Plumbers Conference (LPC), Gabriel Krisman Bertazi described his plans for making Linux filesystems encoding-aware as part of an effort to make ext4, and possibly other filesystems, interoperable with case-insensitivity in Android, Windows, and macOS.
Case-insensitive file names for Linux have been discussed for a long time. The oldest reference he could find was from 2002, but it has come up at several Linux Storage, Filesystem, and Memory-Management Summits (LSFMM), including in 2016 and in Krisman's presentation this year. It has languished so long without a real solution because the problem has many corner cases and it is "tricky to get it right".
An attendee asked about XFS and its handling of case-insensitive file names. Krisman said that when an XFS filesystem is created, it can be configured to handle them. It is ASCII-only, though a proposal from SGI in 2014 would have added full UTF-8 support for XFS and extended the case-handling to Unicode file names.
The traditional Unix approach is that file names are opaque byte sequences that cannot contain "/" characters. He is proposing to add encoding awareness to filesystems, but, he asked, what are the advantages of doing so? For one thing, Windows and macOS have encoding-aware filesystems; it is a feature that Linux lacks. There are "real world use cases" as well: porting from the Windows world, dealing with the case-insensitive tree that Android exposes, and, in general, providing better support for exported filesystems. Android has a user-space hack for case handling, but it is slow and has many race conditions. An encoding-aware filesystem is a better way to expose this functionality to users, he said.
Unicode can represent the "same" string in multiple different ways (a character can be stored as a single precomposed code point, for example, or as a base character followed by a combining mark), which is confusing. Multiple files with the same-appearing name in a directory, as he showed in his slides [PDF], will be difficult to deal with. That means some kind of normalization will need to be applied. Beyond that, "case" is really only defined in terms of an encoding—it is meaningless for a byte sequence. That is why he implemented encoding awareness before tackling case insensitivity.
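A minimal illustration of the problem, using TypeScript's built-in normalize() simply because it shows the effect compactly (the kernel work implements the equivalent with its own NLS tables):

```ts
const precomposed = "caf\u00e9";   // "café" with a single precomposed code point (U+00E9)
const decomposed  = "cafe\u0301";  // "café" as "e" followed by a combining acute accent

console.log(precomposed === decomposed);             // false: different byte sequences
console.log(precomposed.normalize("NFD") ===
            decomposed.normalize("NFD"));            // true: equal once normalized
```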
The kernel has a Native Language Support (NLS) subsystem but it has multiple limitations. It has trouble dealing with invalid character sequences—in some situations it returns zero, in others something else. It can't deal with multi-byte sequences or code points; for example, to_upper() and to_lower() return a single byte. There is no support for dealing with the evolution of encodings, which is not really a problem for UTF-8 except for unmapped code points—case folding for unmapped points is not stable, he said. In addition, NLS is missing support for normalization and has only partly implemented case folding; the latter is "almost ASCII only".
Start with NLS
So he has been proposing improvements to NLS as part of his encoding and case-insensitivity support patch set that has been posted to the ext4 mailing list. It provides a new load_nls_version() function that allows the caller to define the encoding and version that it wants to use. It has a flags argument that allows filesystems to specify the normalization type, case-fold type, and permissiveness mode they want. That version and behavior information would be stored in the superblock of the filesystem.
Krisman's changes would add support for multi-byte characters by adding a new API for comparisons, normalization, and case folding. It will support UTF-8 NFKD normalization that is based on code from the 2014 SGI patch set. It uses a decoding trie and the mechanism is extendable to other normalization types. For example, if support for the Apple filesystem was needed, NFD normalization could be added. The changes he is making are all backward compatible with existing NLS tables and users, Krisman said.
He currently has patches for the kernel, e2fsprogs, and xfstests out for review. This effort is quite different from what he presented at LSFMM back in May.
There was some discussion among attendees about the changes. The original file name will be preserved when a file is created, Krisman said, which makes the filesystem "case preserving" like NTFS. Concern was expressed about containers sharing a filesystem with encoded file names while having different user-space encodings. That is not a use case that is envisioned, he said; root filesystems will not normally be encoding-aware. The most common use cases, Ted Ts'o said, are USB sticks with a FAT filesystem that does case folding, or users of other operating systems accessing the filesystem through Samba. A storage appliance will be able to create a case-folding filesystem, and Samba can then turn off its expensive user-space case-handling solution.
Another use case that Krisman brought up was for SteamOS, which would have a separate partition for game data that would be encoding-aware. Ts'o said that there are some inherent assumptions in this work: that the primary users will look like the SteamOS or Samba-appliance examples, and that "all the world is UTF-8". It would be hugely complicated to support different directories with different encodings, he said. He invited those present to point out any problems they see with those assumptions.
James Bottomley asked if the user-space side had been consulted on these choices. He noted that European distributions typically use single-byte encodings and that the Chinese hate UTF-8 because all characters become four bytes in size. Ts'o said that the problem is essentially being handed off to the distributions. POSIX does not have a way for filesystems to communicate the encoding of their file names; if that existed, glibc could handle the differences.
There is no good solution for that problem, Ts'o continued. There will be information in the superblock, which should be exposed via statfs(). That will take some time to happen, so perhaps a sysfs field could be used in the interim.
Krisman said that his implementation tries to make good use of the directory entry (dentry) cache. Equivalent names do not create multiple dentries, there is just one per file. The d_hash() and d_compare() routines needed to be made encoding aware. For now, negative dentries (asserting the absence of a given file name) are not cached; it will require some work to carefully invalidate negative dentries during file creation.
On to case-insensitivity
Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character. A per-directory inode attribute can be set to turn on case-insensitivity, but that is only allowed on empty directories to avoid name collisions. Case-insensitivity is trivial to implement once the encoding support is available, he said; it is effectively just a special case of encoding.
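Conceptually, once normalization is available, a case-insensitive name comparison is just one more folding step on top of it; here is a rough TypeScript sketch of that idea (toLowerCase() only approximates the full Unicode case folding that the NLS patches provide):

```ts
function namesMatch(a: string, b: string): boolean {
  // Normalize first, then fold case; names that differ only in
  // composition or in case compare as equal.
  return a.normalize("NFD").toLowerCase() === b.normalize("NFD").toLowerCase();
}

console.log(namesMatch("R\u00c9SUM\u00c9.txt", "re\u0301sume\u0301.TXT")); // true
```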
There are some limitations of the current implementation, starting with the lack of negative dentries in the cache. Directory encryption is not supported, since the lookup is based on the hash of the name and the same hash cannot be generated from two different names that normalize to the same thing. He proposed storing the file using the hash of the normalized name, but was not sure if that would solve the problem.
Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding. There may be some user-space breakage due to normalization or case folding of file names that will need to be handled as well.
The current implementation is for the ext4 filesystem, but the main part is the NLS changes. The ext4-specific changes give other filesystems a roadmap to adding encoding-awareness and case-insensitivity, Krisman said. Ts'o noted that there is no active NLS maintainer currently, so he will take Krisman's changes through the ext4 tree. He will try to test other users of NLS, but is explicitly not volunteering to take on NLS maintenance going forward.
Boaz Harrosh pointed out that Linus Torvalds called negative dentries important for performance reasons. He wondered if there were plans to add them for encoding-aware filesystems. Krisman said that invalidating negative dentries needs careful thought and code but that it should be doable. The path for file renames is particularly tricky. Bottomley asked why negative dentries needed to be handled differently than positive ones. The problem is that many people want case-preserving filesystems, so looking up FOO when foo exists should generate a negative dentry for FOO but that will interfere with case-insensitive lookups for Foo or even foo.
The reaction to this proposal was much more positive than to Krisman's earlier attempt. It would seem that we will soon have the ability to handle case-insensitive ext4 filesystems and the potential is there to add it for others.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Vancouver for LPC.]
Bringing the Android kernel back to the mainline
Android devices are based on the Linux kernel but, since the beginning, those devices have not run mainline kernels. The amount of out-of-tree code shipped on those devices has been seen as a problem for most of this time, and significant resources have been dedicated to reducing it. At the 2018 Linux Plumbers Conference, Sandeep Patil talked about this problem and what is being done to address it. The dream of running mainline kernels on Android devices has not yet been achieved, but it may be closer than many people think.
Android kernels, he said, start their life as a long-term stable (LTS) release from the mainline; those releases are combined with core Android-specific code to make the Android Common Kernel releases. Vendors will pick a common kernel and add a bunch more out-of-tree code to create a kernel specific to a system-on-chip (SoC) and ship that to device manufacturers. Eventually one of those SoC kernels is frozen, perhaps with yet another pile of out-of-tree code tossed in, and used as the kernel for a specific device model. It now only takes a few weeks to merge an LTS release into the Android Common Kernel, but it's still a couple of years before that kernel shows up as a device kernel. That is why Android devices are always running ancient kernels.
There are a lot of problems associated with this process. The Android core has to be prepared to run on a range of older kernels, a constraint that makes it hard to use newer kernel features. Kernel updates are slow or, more often, nonexistent. The use of large amounts of out-of-tree code (as in millions of lines of it) makes it hard to merge in new stable updates, and even when that's possible, shipping the result to users is frightening to vendors and not often done. There is no continuous-integration process for Android kernels, and it's not possible to run Android systems on mainline kernels. All told, the way Android kernels are developed and managed takes away a lot of the advantages of using Linux in the first place, but work is being done to address many of these issues.
With regard to older kernels: the Oreo release required the use of one of the 3.18, 4.4, or 4.9 kernels — an improvement over previous releases, which had no kernel-version requirements at all. The Pie release narrowed the requirements further, saying that devices must ship with 4.4.107, 4.9.84, or 4.14.42 (or a later stable release, in each case). The Android developers are trying to "push things up a notch" by mandating the incorporation of stable updates. This has improved the situation, but the base kernel remains two years old (or more), and the Android core still has to work on kernels back to 3.18.
Patil noted that some people worry about regressions from the stable updates, but in two years of incorporating those stable updates, the Android project has only encountered one regression. In particular, 4.4.108 broke things, which is why nothing later than 4.4.107 is required at the moment. Otherwise, he said, the stable updates have proved to be highly reliable for Android systems.
One reason for that may be that the situation with continuous-integration testing is improving; the LKFT effort is now running functional testing on the LTS, ‑rc, and Android Common kernels, for example. More testing is happening through KernelCI, and Android developers are contributing to the Linux Test Project as well. Kernel patches go through pre-submission testing on an emulated device called Cuttlefish, which can run both Android and mainline kernels. More testing is being done by SoC vendors, none of whom have reported problems from LTS kernel updates so far. They do see merge conflicts with their out-of-tree code, but that is unsurprising.
Even so, kernel upgrades remain a huge issue for Android vendors, who worry about shipping large numbers of changes to deployed devices. So devices generally don't get upgraded kernels after they ship — a bad situation, but it's better than the recent past, when kernels could not be upgraded for a given SoC after its launch, he said. Google plans to continue to push vendors to ship updates, though, eventually mandating updates to newer LTS releases even after a device is launched. At some point, LTS releases will be included in Android security bulletins, because there really is value in getting all of the bug fixes. Patil echoed Greg Kroah-Hartman's statement that there are no "security bugs" as such; "there are just bugs" and they should all be fixed.
The problem of devices being unable to run mainline kernels remains; the problem, of course, is all of that out-of-tree code. The amount of that code in the Android Common Kernel has been reduced considerably, though, with a focused effort at getting the changes upstream. There are now only about 30 patches in the Android Common Kernel, adding about 6,500 lines of code, that are needed to boot Android. The eventual plan is to push that to zero, but there are a number of issues to deal with still, including solving problems with priority inheritance in binder, getting energy-aware scheduling into the mainline, and upstreaming the SDCardFS filesystem bridge.
Project Treble, he said, introduced a new "vendor interface" API that implements a sort of hardware abstraction layer. Along with this interface came the concept of a generic system image (GSI), being a build of the Android Open Source Project that can be booted on any Android device. If the GSI can be booted on a specific device, then the manufacturer has implemented the vendor interface correctly.
For now, the kernel is considered to be part of the vendor interface — the vendor must provide it as part of the low-level implementation. The plan, though, is for Android to provide a generic kernel image based on the mainline. Devices will be expected to run this kernel; to make that happen, vendors will provide a set of kernel modules to add the necessary hardware support. Getting there will require the upstreaming of kernel symbol namespaces among other things.
This design will clearly not eliminate the out-of-tree code problem, since those modules will, in many or most cases, not come from the mainline. But there is still a significant change here: vendor-specific code will be relegated to loadable modules and, thus, be unable to change the core kernel. The days of vendors shipping their own CPU schedulers should come to an end, for example; all out-of-tree code will have to work with the generic kernel image using the normal module interface. That will force that code into a more upstream-ready state, which is a step in the right direction.
In conclusion, Patil said, the Android kernel team is now aggressively trying to upstream code before shipping it. There is a renewed effort to proactively report vulnerabilities and other problems and to work with upstream to resolve them. Beyond the above, the project has a number of goals, including getting the ashmem and ion modules out of the staging tree, improving Android's use of device trees, and more. But things are progressing; someday, the "Android problem" may be far behind us.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to LPC.]
A panel discussion on the kernel's code of conduct
There has been a great deal of discussion around the kernel project's recently adopted code of conduct (CoC), but little of that has happened in an open setting. That changed to an extent when a panel discussion was held during the Kernel Summit track at the 2018 Linux Plumbers Conference. Panelists Mishi Choudhary, Olof Johansson, Greg Kroah-Hartman, and Chris Mason took on a number of issues surrounding the CoC in a generally calm and informative session.
Kroah-Hartman began by apologizing for the process by which the code was adopted. Linus Torvalds wanted something quickly, Kroah-Hartman said, so the process was rushed and a lot of political capital was burned to get the code into the kernel. He has since been trying to make up for things by talking to a lot of people; while he apologized for how things happened, he also insisted that it was necessary to take that path. The "code of conflict" that preceded the current code was also pushed into the kernel over a period of about three weeks; "we have been here before", he said.
The current status is that we have a code of conduct and an interpretation document that describes how the code will be viewed in the kernel community. There is an email alias for the reporting of problems, and work is underway to set up a group to handle problems that is separate from the Linux Foundation Technical Advisory Board (TAB), which is currently charged with that task. He noted that putting the TAB in that role is nothing new, though; the TAB was also responsible for handling complaints under the old code of conflict.
Mishi Choudhary has been appointed as a mediator for difficult code-of-conduct issues. She introduced herself by congratulating the kernel community as a whole, saying that everybody celebrates our way of creating software. The kernel's success will lead to the continued growth of the kernel community, meaning that we will have a lot of new people to welcome. She is there to help in any way she can, she said.
Though originally from India, Choudhary is now based in New York, where she has been working with the open-source community for twelve years as a lawyer for groups like the Free Software Foundation, the Apache Software Foundation, and the OpenSSL project. In her current role at the Software Freedom Law Center, she has to deal with conduct-related issues, so she has the experience to help. The important thing, she said, is to be excellent and kind to each other; fairness and due process are not in conflict with kindness.
Kroah-Hartman jumped in with one other thing he wanted to mention: a lot of people have ideas for how they might like to change the code of conduct. One sentence that didn't work with our community (making maintainers responsible for enforcing the code) has been removed but, in general, he suggested that it would be better to work with the upstream project to propose changes. Kroah-Hartman noted that many people have complaints about the author of the code of conduct; he responded that he, personally, has a number of disagreements with Richard Stallman, but still uses Stallman's license. We don't have to agree with the author of the code, he said, and we can expect it to improve over time.
Objectives
Frank Rowand said that any objections to the author are beside the point, it is the political objectives behind the document that many find objectionable. Kroah-Hartman answered that he doesn't agree with much of Stallman's intent either, but that is not a problem. We are using the code for the value it provides to us, and not to further somebody else's political objectives. Rowand continued by saying that the code has an intent that is antithetical to many in the community, and that needs to be acknowledged. In particular, its anti-meritocracy objective does not sit well with many. Rik van Riel responded that we need to rethink what we mean by "meritocracy"; true merit includes working well with others.
Mel Gorman said that we should treat the code as a specification, not an implementation. The kernel community is creating its own implementation of that specification, and we often deviate from specifications when it makes sense to us; this case is no different. Treating merit as a black-and-white quantity is dangerous, he added; it can be exclusionary and restrict junior developers who have the capacity to learn. We should not get too hung up on meritocracy, and we should not be too worried about the code of conduct. We just need to focus on our own implementation. He finished by noting that, since he is Irish, he has a way of dealing with people that's incompatible with almost everybody else. He has to mediate his responses; others should be able to do the same.
Shuah Khan said that "meritocracy" means that we will not accept a reduction in the quality of our code. But we can reject code with polite language; indeed, we have always done that, with just a few exceptions. Kroah-Hartman said the biggest change is that there will be no more exceptions.
Mauro Carvalho Chehab said that the changes that have been made — the removal of maintainer responsibility, the interpretation document, and the appointment of a mediator — are all important. But he is still unsure about how the code will interact with the law in different countries. He mentioned the right to be forgotten as a particular problem in this regard. Kroah-Hartman replied that lawyers at the Linux Foundation and elsewhere have long since concluded that the work done in the kernel community is public and that we cannot be expected to edit it out, so the right to be forgotten does not apply. Mason added that he can sympathize with people who have legal concerns; the TAB can try to connect people with lawyers to get answers to specific questions when the need arises.
Rowand said that the real goal is to create a better and more inclusive community. What else can we do to solve the real problems that we have? Kroah-Hartman said that the Linux Foundation would be announcing some initiatives soon. He also mentioned the work that is being done to create a maintainer's handbook as a step in that direction. Laura Abbott said that, while supporting new developers is important, experienced developers also need to treat each other well. Kroah-Hartman said that he "got to play Linus for a month" and that the experience was not much fun. Every maintainer does things differently. There is a lot that we all can do to make life easier for other maintainers.
Johansson said that, as a community, we have long optimized for keeping the people we care about around; we want them to be there in the future to support their code. Some of that pressure is being relieved by the new set of development tools and testing programs, though; it is easier to be sure that code is in good shape when it goes into the kernel, which should help us to broaden the contributor base.
Maintainer responsibility
Gorman warned that the community should be wary of placing responsibility for upholding the code of conduct on any specific individuals. The load on maintainers is high now; it will become worse if maintainers are made responsible for more tasks. There will always be members of our community, he said, who are not suited to mentoring contributors. When somebody is being "crucified" in a review, he said, we all have the responsibility to intervene. It is possible to tear a patch apart without ever explaining what would make the patch acceptable; such a review can be "very friendly" but also quite unhelpful. Complaining to the TAB will not help in such situations; instead, the best thing to do is to enter the conversation and help the contributor figure out what they need to do. It is not fair to put that responsibility on any one maintainer, or on the TAB.
Dan Williams said that part of the problem is that he doesn't always have the time or ability to mentor contributors, or he doesn't know what to suggest to them without trying to solve the problem himself. It's not always easy to see how you want the problem solved in the end. Johansson said that, if you're the only one who can review patches in your subsystem, you have a growth opportunity for somebody else. Try to find that person and bring them up to speed; that is how we develop new maintainers. Ted Ts'o added that the only power in a volunteer organization is the ability to say "no"; it's nearly impossible to compel people to do things. So telling maintainers to make something happen will not work. The only way is for people to improve things themselves and to recruit others to help.
Mason said that almost all patches receive one of four possible answers: "yes", "no", "no with reasons", and "no with reasons and a long story about how terrible the patch is". He would like to get rid of the last type of answer, which is not helpful. If the code of conduct can do that, he said, we will be better off in the end.
Abbott said that most code-of-conduct issues are things that maintainers and developers can handle themselves. Escalation is not needed, good judgment is enough. The code is mostly a way to establish what we value and a plan in case we have to deal with a more serious issue someday; she hopes that we will never need it. Choudhary also hoped that she would never be needed. Each community figures out how it wants to resolve these problems, she said; the kernel community will be able to do the same. Johansson said that the TAB isn't just there as a sort of police force; anybody is encouraged to reach out with questions or for help.
Jes Sorensen said that, while there has been a lot of talk about how we need to be helpful, one of our bigger problems is contributors who refuse to listen. Some of them are just stubborn; others are overtly trying to provoke a reaction. How do we cope with such people? Kroah-Hartman said that this problem is outside of the code of conduct. We have dealt with such problems in the past, with the worst offenders being banned from the mailing lists. Sorensen was unconvinced, though, citing problems with contributors who have worked on a body of code for years and keep pushing it even after it has been rejected. Such problems will happen again, he said.
A solution to that problem will have to wait, though, as the session ran out of time and came to an end. The overall feeling was one of wary acceptance; kernel developers generally agree with the goals of the code of conduct (as expressed within the community) and hope that the actual outcome will be consistent with those goals.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to LPC.]
The kernel developer panel at LPC
The closing event at the 2018 Linux Plumbers Conference (LPC) was a panel of kernel developers. The participants were Laura Abbott, Anna-Maria Gleixner, Shuah Khan, Julia Lawall, and Anna Schumaker; moderation was provided by Kate Stewart. This fast-moving discussion covered the challenges of kernel development, hardware vulnerabilities, scaling the kernel, and more.
The initial topic was entry into kernel development, and the panelists' experience in particular. Khan, who got started around seven years ago, said that her early experience was quite positive; she named Tim Bird as a developer who gave her a lot of good advice at the beginning. Abbott started by tracking down a bug that was causing trouble internally; after getting some feedback, she was able to get that work merged into the mainline — an exciting event. Schumaker started with a relatively easy project at work. Lawall, instead, started by creating the Coccinelle project back around 2004. Her experience was initially somewhat painful, since the patches she was creating had to go through a lot of different maintainers.
It had been a busy week at LPC, Stewart said, asking the panelists what stood out for them. Khan called out the networking track as a place where she learned a lot, but also said that the conference helped her to catch up with what is going on with the kernel as a whole, which is not an easy thing to do. She mentioned the sessions on the kernel's code of conduct and the creation of a maintainer's handbook.
Gleixner said that, as a relative newcomer to the kernel, she found it helpful to hop around between tracks. She did follow the realtime microconference in particular, though, and described it as fun. Abbott enjoyed the hallway track (discussions outside of any organized session), along with the WireGuard and GitLab talks and the Android microconference. Schumaker mostly stuck with the Kernel Summit track, and agreed with Khan about code of conduct and maintainer's handbook sessions. Lawall called out the testing and fuzzing microconference and the realtime microconference.
Safety-critical, realtime, and more
Switching subjects, Stewart said that interest in using Linux in safety-critical applications is growing; she asked the panelists where they thought the biggest gaps were in this area. Abbott replied that it mostly comes down to testing. The old claim that "many eyes" can find bugs is true, but it's better if those eyes are supplemented with testing; the good news is that the kernel is finally reaching a point where it has a good set of automated tests. Lawall agreed with that assessment, and suggested that we could improve in the area of static-analysis tools as well. We have some of these tools and they overlap coverage in various ways; it would be good to bring them together in a more coherent way. Khan also agreed, noting that the fuzz testing being done by the Syzbot project has been especially helpful, and that the unit testing framework in the kernel has now grown to test 46 different subsystems.
A related subject is the realtime patch set which, we have been informed, should be fully merged within a year. Stewart asked: what will change when that happens? Gleixner responded that realtime will not be different from any other kernel feature; if you break it, you'll have to fix it. A likely problem spot is code that disables preemption, creating unwanted latencies; we need better test coverage, she said, to find places where that is done, especially since it's not possible to test all drivers currently.
Since testing seemed to be on everybody's mind, Stewart asked what the plans were to improve testing in the future. Khan said that the kselftest framework remained weak, especially when it comes to driver tests, and that contributions would be welcome. The PowerPC and x86 architectures have reasonable test coverage, but Arm could use some help.
Gleixner said that there are some tests for realtime behavior, but the coverage is still small. The primary focus has been on detecting changes in response times. Abbott said that she is focused on the subsystems that tend to break when users get their hands on them; that means graphics and input drivers, for example. Third-party modules can also be a problem and could benefit from better test coverage.
Lawall, unsurprisingly, is working on adding more features to Coccinelle. Ensuring that initialization and exit sections are properly annotated is one area of interest. There is also work being done on a tool that can examine a handful of example patches and create a semantic patch that will effect similar changes elsewhere in the tree — an idea that drew enthusiastic applause from the audience. She warned that the result may never be perfect, but it can hopefully serve as a good starting point. A related project is a tool to detect code that could create problems when being backported to older kernels.
What changes will the new classes of hardware vulnerabilities bring to our processes? Khan said that she learned a lot from having to make a number of fixes to one of her drivers; it is good to think about where things could go wrong and to look for potential issues when reviewing code. Some of the proof-of-concept exploit code for these vulnerabilities is finding its way into the self-test framework to help protect against regressions in the future.
Abbott noted that the embargoes around these vulnerabilities have been "touchy", but they are needed for coordination between groups. She is one of the developers in the front line behind the security@kernel.org alias, but tries to handle things in the open whenever possible. The closed nature of the response to Meltdown and Spectre was a big problem, she said.
Tools and documentation
There was a brief discussion on whether the kernel community has the right tools to support long-term kernels. As Abbott noted, this work requires a good view of what is in any given kernel tree to be able to tell if any given patch should be backported to it. Khan admitted that she doesn't always think to send relevant patches to the stable team, so any tools that can help in that regard are useful. She mentioned cregit as a valuable tool. Schumaker added that the machine-learning work being done to identify candidates for stable backports has also been helpful.
Maintainers are a limited resource in almost every project; what can the kernel do to improve the situation there? And, Stewart asked, does the kernel's email-based process still work? Khan is looking forward to the upcoming maintainer's handbook as one helpful development in that area. Abbott said that working with email does present a bit of a learning curve; anybody can learn how to be effective with it, but it's worth asking whether email is really still the best way to collaborate. It works well for maintainers, she said, but perhaps less well for contributors.
A member of the audience noted, to applause, that perhaps the process is not broken even if today's kids prefer web pages. Stewart noted that there are over 60,000 files in the kernel and a lot of developers working with them; that represents a great diversity of opinions that can be mined for better ways to work. Khan said that the community has been continuously evolving and will continue to do so. Lawall added that documenting the details of how subsystems work, as the maintainer's handbook is expected to do, will help a lot; the differences between subsystems can be frustrating for developers now.
James Bottomley asked the panel for one thing that they would change at the Linux Plumbers Conference to make it more welcoming for new attendees; Abbott responded that Bottomley should be required to personally introduce himself to every one of them. More seriously, she said that it would be good to have a way to indicate that she is truly enthusiastic about speaking to new developers. Khan said that the green-dot stickers provided at the conference — applied to an attendee's badge to indicate a willingness to be approached — do work, and that the hallway track had been awesome.
Documentation returned to the fore as a member of the audience asked who should be writing kernel documentation. Gleixner said there is no easy answer to that question; if she documents something that is clear to her, she is likely to miss points that others need. Lawall said that the code itself should be the documentation. That includes good commit messages to explain why changes have been made; that is something the kernel community generally does well now. Schumaker said that reviewers should be asking about documentation.
Which is worse: out-of-date documentation, or documentation that's missing entirely? Gleixner voted for the former, since it creates confusion. Abbott said it depends on the nature of the documentation; some of it exists, for example, to explain the design decisions that were made and will remain useful even as the code evolves.
The final questions had to do with scaling the kernel community. The kernel is one of the fastest-moving projects now, but what would it take to get to a project that is ten times bigger? Khan said that it would simply be impossible to keep up with such a project. Lawall noted that, as the code base gets bigger, making progress becomes harder. Making changes gets more painful, so people just don't bother. The solution, of course, is better tools; Abbott, too, said that more automation will be required for the community to scale successfully.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to LPC.]
Toward a kernel maintainer's guide
"Who's on Team Xmas Tree?" asked Dan Williams at the beginning of his talk in the Kernel Summit track of the 2018 Linux Plumbers Conference. He was referring to a rule for the ordering of local variable declarations within functions that is enforced by a minority of kernel subsystem maintainers — one of many examples of "local customs" that can surprise developers when they submit patches to subsystems where they are not accustomed to working. Documenting these varying practices is a small part of Williams's project to create a kernel maintainer's manual, but it seems to be where the effort is likely to start.In theory, Williams said, kernel maintenance is a straightforward task. All it takes is accumulating patches and sending a pull request or two to Linus Torvalds during the merge window. In this ideal world, subsystems are the same and there is plenty of backup to provide continuity when a maintainer takes a vacation. In the real world, though, the merge window is a stressful time for maintainers. It involves a lot of work juggling topic branches, a lot of talking to people (which is an annoying distraction), and the fact that Torvalds can instinctively smell a patch that is not yet fully cooked. Maintenance practices vary between subsystems, and there is no backup for the maintainers in many of them. It is hard for a maintainer to take a break.
Kernel maintainers, he said, are a gang of opinionated people. They don't always agree on things, but the good news is they don't have to. So why would we want a maintainer's handbook? The idea is to create a reference manual for both maintainers and contributors, a collection of "tribal knowledge" and best practices rather than a set of rules. There is a lot of good advice for maintainers to be found in email discussions, but nobody has, yet, gone to the effort to capture that information and present it in a useful form.
Another way of putting it, he said, is that there is a fair amount of pain in the community, and he would like to try to alleviate it. Contributors feel the pain of trying to get a maintainer to do something; maintainers, instead, feel the pain of simply trying to hold everything together. He noted that he, too, is guilty of doing things as a maintainer that have caused him stress as a contributor; it is easy to unintentionally make the process harder for others. By addressing some of those pain points, Williams hopes he can help to create a better experience for all involved.
For example, one painful experience for contributors is getting silence in response to patches sent to a maintainer. Different maintainers exhibit different latencies, so it is hard to know when to press further. One way to address this problem might be for maintainers to advertise an equivalent to a service-level agreement (SLA) documenting the response time they agree to provide. Associated with the SLA could be information like a set of trusted reviewers who could stand in for the maintainer for many review tasks, the location of the subsystem's test suite, and more. By setting the contributor's expectations, the guide should make their life easier; they will know when to resend a patch.
Another part of the guide would concern itself with preventable maintainer mistakes. There have been a lot of lectures on the proper use of Git posted by Torvalds (and others) over the years; it should be collected and put into a place where maintainers can find it before they make a mistake. Torvalds, Williams said, provides great explanations of how things should be done "after the storm passes"; he does so patiently, repeatedly, as the same mistakes are discovered anew. Why, Williams asked, isn't this information written down anywhere?
As Williams looked into the creation of a kernel maintainer's guide, he discovered that one already exists; it was created by Tobin Harding in 2017 and hasn't been changed since. His first objective is to add subsystem profiles to this guide; the profile is meant to tell contributors how to work with the subsystem. It would include information like the following (a hypothetical sketch of such a profile appears after the list):
- Whether the subsystem accepts pull requests or, instead, requires that all submissions be posted as patches to a mailing list.
- The last day before the merge window that new features can be posted and the last day that any new features could actually be merged. This "last day" is likely to be expressed in terms like "when -rc5 comes out".
- What the requirements are for Reviewed-by or Acked-by tags on patches and whether the maintainer is allowed to merge unreviewed patches.
- Whether the subsystem has a test suite and where it can be found.
- A list of trusted reviewers for the subsystem.
- The "resend cadence" for the subsystem — how long should a contributor wait before resending a patch?
- The time zone(s) in which the maintainers operate, which would be a hint for when contributors could expect a response to an email.
- The maintainer's opinion on trivial cleanup patches.
- Whether the maintainer trusts off-list patch reviews. These often take the form of a Reviewed-by tag from somebody who works for the same company as the submitter; not all maintainers put much faith in such tags.
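To make the idea concrete, here is a purely hypothetical sketch of what one subsystem's profile might contain; every name, address, path, and policy below is invented for illustration rather than taken from any real subsystem:

```
FOO SUBSYSTEM PROFILE (hypothetical)
  Submissions:            patches to foo-devel@example.org; no pull requests
  Last day for features:  when -rc5 comes out
  Review requirements:    at least one Reviewed-by tag; no unreviewed merges
  Test suite:             tools/testing/selftests/foo/
  Trusted reviewers:      A. Developer, B. Hacker
  Resend cadence:         one week without a response
  Maintainer time zone:   UTC-8
  Trivial cleanups:       accepted, but only early in the development cycle
  Off-list reviews:       Reviewed-by tags from co-workers carry less weight
```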
If nothing else, this list is an interesting overview of how different kernel maintainers approach their job. It drew the obvious question ("why do we have these differences in the first place?") from the audience, but there isn't really an answer beyond "it has always been that way".
The session wound down with some unfocused discussion on the details of the subsystem profiles. Should they be listed in the MAINTAINERS file? Should they include the maintainer's expectations on the documentation of new features? Answers to those questions will have to await the conclusion of the wider discussion, which is ongoing as of this writing. But, as Mel Gorman noted at the end of the talk, the work that has been done so far is a useful enumeration of the problem space, which is a good start.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]
Updates on the KernelCI project
The kernelci.org project develops and operates a distributed testing infrastructure for the kernel. It continuously builds, boots, and tests multiple kernel trees on various types of boards. Kevin Hilman and Gustavo Padovan led a session in the Testing & Fuzzing microconference at the 2018 Linux Plumbers Conference (LPC) to describe the project, its goals, and its future.
KernelCI is a testing framework that is focused on actual hardware. Hilman is one of the developers of the project and he showed a picture of his shed where he has 80 different embedded boards all wired up as part of the framework. KernelCI came out of the embedded space and the Arm community; there are so many different hardware platforms, it became clear there was a need to ensure that the code being merged would actually work on all of them. Since then, it has expanded to more architectures.
Another goal of the project is to be distributed. No one lab is going to have all of the hardware that needs to be tested. There is a centralized build facility that tracks multiple trees, including the mainline, linux-next, stable trees, and maintainer trees, and builds kernels for the various labs. Hilman's is just one of around ten different labs currently, he said; all of the reporting is centralized at kernelci.org.
Right now, most of the testing is just building and booting the kernels, which actually breaks "quite often". There are more than 250 unique boards and systems that cover 37 unique system-on-chips (SoCs). Over the last few years, KernelCI has done more than four million boots.
If a kernel boots to a shell, that is considered a test "pass". For stable trees and the mainline outside of the merge window, roughly 98% of the kernels pass, but it is "much worse" for linux-next. In particular, linux-next for non-Intel hardware is not all that stable. Arm is getting a lot better, but there are still problems; generally, problems in linux-next are caused by some dependency that has been missed, Hilman said.
The current system will send mail to the architecture (or sub-architecture) maintainers when things break. There is work being done to bisect problems so that a particular commit (and thus its developer) can be identified and notified.
The kinds of problems that KernelCI finds are "all over the place", Hilman said. Many are dependency related, where the driver has changed but the device-tree changes did not make it into the tree, for example. Those are the kinds of problems that are expected to be caught in linux-next. Beyond that, the kernel size is increasing, so the change in memory layout that results can sometimes cause the boot to fail. There is a mix of lab infrastructure problems and kernel problems as might be expected.
Beyond build and boot
When the project got started, it was meant to help find problems where the default configuration (defconfig) for a particular SoC or board would not build. Once that part was mostly handled, KernelCI moved on to testing whether those kernels would boot. Now that is working well, so the project is starting to add testing after the kernel has booted. Basically, the developers wanted to handle a breadth of hardware first and now they are getting to the depth part by running things like kselftest and the Linux Test Project on a subset of the hardware.
Padovan said that his employer, Collabora, has been helping with KernelCI development recently. One of the areas that he and others have been working on is to add more test suites, including tests for video and display along with some tests that look at basic functionality of some subsystems (e.g. USB, suspend/resume). There has also been work on better reporting of the errors, both via email and on the web site. Hilman noted that getting a useful report to the right developer is a more difficult problem that is still being worked on.
An attendee asked about getting a custom kernel tree tested as part of KernelCI. Hilman said that can be done with a request to the project. KernelCI is not interested in testing vendor trees, but any upstream-focused tree can be added. In answer to another question, Hilman said that patches posted to the mailing lists are not being tested currently, but it is something he would like to see added—though it may still be a ways off.
A standardized Debian-based root filesystem for all architectures is also in progress, Padovan said. An attendee asked if any of the tests involved systemd, which tends to break more readily when the kernel does unexpected things. The root filesystem is fairly minimal, but there are some basic tests that involve systemd, Padovan replied. A lot of the build infrastructure for KernelCI is handled by Jenkins; that has recently moved to using Jenkins Pipelines. There has been a lot of work on documenting the project on its wiki as well.
Auto-bisection is under development too. The email report used to just say that the testing failed, but now auto-bisection tries to find the commit that caused the problem. It is similar to what the 0day testing infrastructure does, Padovan said, just on more hardware. Auto-bisection was in beta at the time of the microconference, but has since been announced on the kernel mailing list.
The reliability of the auto-bisection was the subject of an attendee query. Padovan said that it can certainly fail, for example by never ending or by pointing to a commit for a different architecture, so there is a manual step required at times. In addition, a lab infrastructure failure looks like a boot failure, which can lead to a bad bisection.
That led to a question about the reliability of the lab infrastructure. Hilman said that the reliability is not really dependent on whether it is a home lab versus a corporate lab; it has more to do with how closely the lab is monitored. His lab is well monitored because he sits right next to it most of the time; other labs have to get reports of problems before they get fixed. He wishes they had kept statistics on all of that. He did also note that the problem is sometimes the hardware under test itself: it might be flaky, need a firmware update, or the like.
There is a "decent mix" of new and old hardware being tested. When board companies come out with new boards, they often send them to one of the labs. If someone wants to start a new lab, instructions for setting that up have recently been added to the wiki, Hilman said. He suggested that those who are interested in the project ask questions on the mailing list or on the Freenode #kernelci IRC channel.
New Linux Foundation project
Hilman said that the project gets a lot of requests for new features, but does not have the ability to handle them all—more developers are needed. To that end, KernelCI is becoming a Linux Foundation project soon. Founding members are being recruited now. Once the project and its funding are established, there are plans to update the user interface as it is "getting a bit dated". It also does not provide ways to mine the data that is being collected. "We have a lot of data that we are not doing much with", such as boot time, he said.
Adding more architectures and toolchains is planned, as are more test suites. There is a lot of testing on real hardware that KernelCI is doing, but there is clearly room for more.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Vancouver for LPC.]
Page editor: Jonathan Corbet