
Leading items

Welcome to the LWN.net Weekly Edition for September 19, 2019

This edition contains the following feature content:

  • Deep argument inspection for seccomp: the search for a way to let seccomp filters look inside system-call argument structures.
  • Comparing GCC and Clang security features: where each compiler stands on the newer hardening options.
  • The properties of secure IoT devices: Microsoft's seven properties for highly secured, internet-connected devices.
  • The 2019 Linux Kernel Maintainers Summit: coverage of the gathering where the kernel's development process was discussed.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Deep argument inspection for seccomp

By Jake Edge
September 18, 2019

LPC

In the Kernel Summit track at the 2019 Linux Plumbers Conference, Christian Brauner and Kees Cook led a discussion on finding a way to do deep argument inspection for seccomp filtering. Currently, seccomp filters can only look at the top-level arguments to a system call, which means that there are use cases that cannot be supported. There was a lively discussion in the session, but no definitive conclusion was reached; various ideas were considered, but none seemed to quite fit the bill.

Cook said that the current seccomp filters can only inspect the system-call argument values; if one of those values is a pointer, dereferencing it will not work. Even if it were possible to do so, another thread could change the values after the check is done. That is a classic time-of-check-to-time-of-use (TOCTTOU) race. Programs that are using the filters would like to be able to filter based on file name arguments to restrict which files the programs can access, for example, but that is currently not possible.
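
As a rough sketch (not from the session, and using the libseccomp helper library rather than a raw BPF program; build with -lseccomp), a filter can match on the integer flags argument of openat(), but the pathname, being a pointer, is out of reach:

    #include <errno.h>
    #include <fcntl.h>
    #include <seccomp.h>

    /* Allow everything by default, but make openat() calls that request
     * write access fail with EACCES.  The flags in argument 2 are a plain
     * integer, so they can be tested; the pathname pointer in argument 1
     * cannot be dereferenced by the filter at all. */
    int install_filter(void)
    {
            int ret;
            scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

            if (ctx == NULL)
                    return -1;
            ret = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EACCES),
                                   SCMP_SYS(openat), 1,
                                   SCMP_A2(SCMP_CMP_MASKED_EQ,
                                           O_ACCMODE, O_WRONLY));
            if (ret == 0)
                    ret = seccomp_load(ctx);
            seccomp_release(ctx);
            return ret;
    }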

A more pressing use case is that new system calls are using an API pattern that puts various parameters (flags, in particular) into a structure, as with clone3(), Brauner said. The address of that structure gets passed to the call along with its size, but the parameters in the structure are off-limits to filters. The idea behind the pattern is to allow the API to be extended over time: the structure can grow, and the size argument lets the system call recognize when it is being called with extra parameters that it does not understand.
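
For reference, the calling pattern looks roughly like the sketch below (it assumes Linux 5.3-era headers that provide struct clone_args and __NR_clone3); the flags that a filter might want to check live inside the structure:

    #define _GNU_SOURCE
    #include <linux/sched.h>        /* struct clone_args (Linux 5.3+) */
    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* A fork()-like clone3() call.  There was no glibc wrapper when the
     * system call was new, so it is invoked via syscall().  Passing
     * sizeof(args) is what lets the structure grow in later kernels. */
    static pid_t fork_via_clone3(void)
    {
            struct clone_args args;

            memset(&args, 0, sizeof(args));
            args.exit_signal = SIGCHLD;     /* flags stay zero here */

            return syscall(__NR_clone3, &args, sizeof(args));
    }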

Both passive and active filtering of, say, open() calls are also affected, Cook said, so even simply logging file names as part of a passive filtering effort is not reliable. The value for the file name that the filter sees may not be the value that actually reaches the system call. The user-space seccomp decisions feature makes it possible for programs like container managers to reliably handle system calls but, since they cannot filter down to just the invocations they are interested in, they have to emulate the system call for every invocation; there is no way to tell the kernel to simply continue handling the system call once it has been deferred to user space.

System-call flow

[Kees Cook]

The slides [PDF] for the session had what Cook called an "eye chart" for system-call flow that was part of the background information for attendees. After the system call enters the kernel, various ptrace() entry hooks are called, which may block the program while another process, such as a debugger, examines, and possibly changes, the call. Anything about the call can be changed, including the system-call number or its arguments; the hook can also request that the system call be skipped.

After that, the seccomp hooks are called, which can result in a wide variety of outcomes, Cook said. They can kill the thread or process, skip the system call, log the call, send a signal to the calling process, defer the decision to user space, or generate ptrace() events. In the latter case, the ptrace() hooks may, once again, change anything, including the system-call number, which means the seccomp filter code needs to be run again. As a kind of a hack, further recursion is disallowed after one iteration of that, he said.

Next, the actual system-call code is reached; it copies the user-space memory into kernel memory for parsing into kernel objects. At that point, the Linux security module (LSM) hooks are called, which can only make a simple accept or reject decision. If it is accepted, the system-call code then operates on the kernel objects to perform its function. Then the ptrace() exit hooks are called and, finally, the call returns to user space.
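
Schematically, the ordering he described looks something like the following sketch; the frob names and the security_frob() hook are invented for illustration and are not actual kernel code:

    /* Hypothetical system-call skeleton illustrating the ordering above. */
    SYSCALL_DEFINE1(frob, struct frob_args __user *, uargs)
    {
            struct frob_args kargs;
            int ret;

            /* Snapshot the arguments into kernel memory; later changes by
             * other user-space threads no longer matter. */
            if (copy_from_user(&kargs, uargs, sizeof(kargs)))
                    return -EFAULT;

            ret = security_frob(&kargs);    /* LSM hook: accept or reject */
            if (ret)
                    return ret;

            return do_frob(&kargs);         /* operate on the kernel copy */
    }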

Both the ptrace() and seccomp hooks are in the wrong place to do deterministic checks of system calls, he said. Until the arguments are copied in the system-call function itself, they can be changed by other threads, either mistakenly or as part of an attack.

On the other hand, the LSM hooks are in the right place, but the LSM interface is not meant for system-call filtering. The LSM hooks are higher-level abstractions that are shared between system calls—the same hook can be called from multiple system calls. However, the recent addition of the SafeSetID LSM has changed the situation somewhat; it can distinguish between system calls that seek to change user ID versus those that change the group ID. But, as yet and perhaps forever, there are no allowances for unprivileged LSMs, so the Chrome browser, for example, could not load an LSM to filter its system calls.

Cook wondered if it makes sense to do deep inspection of system-call arguments via seccomp at all. If you really want to inspect file names or IP addresses, the LSM layer makes more logical sense since the kernel objects of interest are available there. Doing filtering based on system calls makes it possible for user space to get sloppy and only filter open() and ignore rename() for example.

Other possibilities

Cook said that he explored finding a way to make an association between seccomp and an LSM of some sort. It was looking like a "really scary" solution that was overly complex with a lot of layering violations. Another way forward might be to move the seccomp hooks; the ABI says they have to be done after the ptrace() hooks so that means they could be pushed deeper into the system-call path. But there is a problem with that too: adding a hook to every system-call function feels completely wrong to him, for one thing.

Another thought was to move the copying of the arguments to earlier in the system-call path; the actual system call could use that cached copy rather than doing the copying itself. The problem is that things like path-name resolution may result in different kernel objects at the two different points, which still leaves a race condition.

Yet another idea would be to have system calls declare their argument types more completely so that the parsing of the arguments and, if needed, conversion to kernel objects could be done early in the system call path. For many system calls, this would be fairly straightforward, but path-name resolution is significantly more complex, for example. Beyond that, some system calls do things like walk lists of structures in user space, which is messy to handle; "the logic for that is really terrible".

[Christian Brauner]

Brauner said that they had come to the realization that allowing deep argument inspection in a generic way, for every system call, is probably not the right way forward. There is a set of system calls that are of most interest to user space for filtering for security purposes. He thought those could probably be handled separately unless someone has a great idea for how to solve the problem generically.

He asked the assembled kernel developers if a piecemeal approach made sense by adding support for individual system calls. Aleksa Sarai said that it did make sense but he wanted to know how user space would be able to detect which system calls are supported. Cook said, "yet another problem", with a chuckle. Some way for the kernel to mark system calls as having that ability will be needed, Brauner said; that was suggested on the mailing list by Andy Lutomirski, he said.

An attendee said that the session made it clear that the filtering is being done in the wrong place, at least for filtering based on kernel objects, such as file handles rather than path names. The LSM hooks are in the right place and have access to the right objects, so some things should be done with seccomp and others with an LSM, they said. If you want to work with file objects, for example, that would have to be done in an LSM.

Cook agreed that filtering is happening in the wrong place, but the alternatives don't seem all that palatable either. The Landlock LSM project is working on ways to have unprivileged code be able to configure a sandbox by attaching eBPF programs to the LSM hooks. But that approach exposes LSM internals that the LSM developers are not comfortable exposing. In addition, doing system-call filtering would mean that the LSM hooks need to have a way to determine what system call invoked them, which is not something they have now (except in the limited SafeSetID case) and runs counter to the idea behind the LSM hooks.

H. Peter Anvin said that he disagreed that seccomp filtering was being done in the wrong place. Moving the filtering deeper in the system-call path will simply expose more attack surface. He acknowledged that means that seccomp filtering cannot do all of what people would like it to do, but that isn't necessarily a problem.

From his perspective, Brauner said that path-based filtering is not really required and that grafting it onto seccomp filtering seems wrong at some level. But there remains a need to be able to filter based on things like flags arguments that are inside structures. Others seemed to agree that handling kernel objects should be left to the LSMs, while arguments such as flags make sense for seccomp.

An attendee wondered why LSM plus eBPF was not "just the answer". Cook said that part of the problem is that there is no unprivileged way to do either right now. It was his hope that attaching eBPF programs to LSM hooks would make its way into the kernel, thus saving seccomp from having to solve the deep argument inspection problem. There are still a lot of discussions about how that would work and both unprivileged eBPF and unprivileged LSMs have met resistance from their maintainers.

Beyond that, attaching eBPF to LSMs gives user space a way to create "gadgets" that can be used in timing attacks of various kinds, an attendee said. That makes it hard to get the security parameters of the feature right. The LSM developers are also concerned about leaking their internal state via the eBPF programs that could be attached.

Right now, just figuring out where the inspection would be done would be a start, Cook said. Then there are more questions of how the filtering would be hooked up to it and so on. In addition, there are the upstreaming issues, Brauner said. Toward the end, Cook said that he was hoping that someone would have a great idea that they had not thought of to solve the problem neatly, but it would appear that is not the case. It is a problem that arises frequently, though, especially in its simplest form (e.g. filtering on flag values), so it seems likely that we have not heard the last of it.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Lisbon for LPC.]

Comments (12 posted)

Comparing GCC and Clang security features

By Jonathan Corbet
September 12, 2019

LPC
Hardening must be performed at all levels of a system, including in the compiler that is used to build that system. There are two viable compilers in the free-software community now, each of which offers a different set of security features. Kees Cook ran a session during the Toolchains microconference at the 2019 Linux Plumbers Conference that examined the security-feature support provided by both GCC and LLVM Clang, noting the places where each one could stand to improve.

Cook started by noting that most of the "old-school" security features have long since been supported by both compilers. These include stack canaries, warnings on unsafe format-string use, and more. Rather than look at those, he chose to focus on relatively new security-oriented features.

[Kees Cook]

The first of these is per-function sections — putting each function into its own ELF section. This behavior is requested with the -ffunction-sections switch and is well supported by both compilers. The value of per-function sections is that they enable fine-grained address-space layout randomization, where the location of each function can be randomized independently of the others. It is a "bizarre and wonderful" feature, he said.
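
As a quick illustration (not from the talk), compiling the file below with gcc -ffunction-sections -c sections.c (without optimization, so the static functions are not inlined away) produces separate .text.read_sensor and .text.log_value sections, visible with objdump -h, that can be placed independently:

    /* sections.c: with -ffunction-sections, each function gets its own
     * ELF section rather than sharing a single .text section. */
    static int read_sensor(void)
    {
            return 42;              /* placeholder value */
    }

    static int log_value(int v)
    {
            return v * 2;
    }

    int main(void)
    {
            return log_value(read_sensor());
    }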

Implicit fall-through behavior in switch statements is a common source of bugs, so many projects are trying to eliminate it. To that end, both compilers support the -Wimplicit-fallthrough option. GCC has supported a special attribute making fall-through behavior explicit for some time; Clang has just gained that support as well. There are evidently no plans in the Clang community to support fall-through markers in comments, though, as GCC does. The kernel is now free of implicit fall-throughs; of the roughly 500 patches fixing fall-through warnings in the last year, Cook said, about 10% turned out to be addressing real bugs in the code.
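
A small sketch of what the explicit marker looks like, assuming a compiler new enough to accept the statement attribute (the kernel wraps it in a fallthrough macro):

    #include <stdio.h>

    /* -Wimplicit-fallthrough warns on an unmarked fall-through; the
     * attribute below documents that it is intentional.  GCC also accepts
     * specially formatted comments for this purpose; Clang does not. */
    void describe(int n)
    {
            switch (n) {
            case 0:
                    printf("zero and ");
                    __attribute__((fallthrough));   /* deliberate */
            case 1:
                    printf("small\n");
                    break;
            default:
                    printf("big\n");
                    break;
            }
    }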

Link-time optimization (LTO) works with both compilers now. It's not primarily a security feature, but it turns out to be necessary to implement control-flow integrity, which requires a view of all of the functions in a program. Both compilers support LTO, but updating the build tooling to make use of it is still painful. There are also, he said, concerns that LTO can expose differences between the C memory model and the model that the kernel uses, but nobody has provided any specifics about where things could go wrong. It is theoretically a problem, but "practicality matters" and these concerns shouldn't hold up adoption of LTO unless somebody can demonstrate a real-world problem.

Stack probing is the practice of touching a newly expanded stack in relatively small increments to defeat any attempt to jump over guard pages. GCC can build in this behavior now, controlled by the -fstack-clash-protection flag; Clang still lacks this capability. This feature is more useful in user space than in the kernel, Cook said, since the kernel has fully eliminated the use of variable-length arrays.
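
The pattern being defended against looks roughly like this sketch; with -fstack-clash-protection, GCC probes the newly allocated region page by page, so that a huge n cannot silently skip past the guard page:

    #include <stddef.h>
    #include <string.h>

    /* A variable-length array whose size is influenced by the caller is
     * the risky case: without probing, a large allocation can jump over
     * the guard page entirely. */
    void fill_buffer(size_t n)
    {
            char buf[n];

            memset(buf, 0, n);
    }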

Clang provides a -mspeculative-load-hardening flag to turn on mitigations for Spectre v1; GCC does not have this support. Details about this feature can be found in this LLVM documentation. Enabling this feature has a notable performance impact, but it is still less costly than inserting lfence barriers everywhere. An attribute can be used to restrict hardening to specific functions, avoiding the need to slow down the entire program.
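
A sketch of the per-function form, assuming a Clang recent enough to provide the speculative_load_hardening function attribute:

    /* Harden only the function that indexes with untrusted input, rather
     * than building the whole program with -mspeculative-load-hardening. */
    __attribute__((speculative_load_hardening))
    int lookup(const int *table, unsigned int idx, unsigned int size)
    {
            if (idx < size)             /* classic Spectre-v1 shape: a     */
                    return table[idx];  /* bounds check followed by a      */
            return -1;                  /* dependent load                  */
    }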

Functions do not need to preserve the contents of caller-saved registers, so they normally return with random data in those registers. Clearing those registers at return time, instead, may be useful to block any number of speculative attacks or side channels. The performance impact, Cook said, is tiny. Peter Zijlstra objected, saying that he would like to see a proper description of just what is being mitigated by this technique; the impact may be small, but the accumulation of such measures adds up to "death by a thousand cuts". Cook responded that there is value in bringing the architecture to a known state at function return; it may not block a specific attack right now, but "we don't know what is coming next". There is a patch for GCC implementing register clearing, but not for Clang.

Another relatively controversial measure is automatically initializing stack variables on function entry. GCC can do that now via a plugin; work is being done to add it to Clang, though the specific behavior is not what the kernel community would like. Clang will initialize variables to a poison pattern, but Linus Torvalds would rather be able to count on them being initialized to zero.
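
The kind of leak this addresses looks like the following sketch; the Clang flag involved is -ftrivial-auto-var-init= (the pattern mode, with zero being what the kernel developers would prefer):

    #include <string.h>

    struct reply {
            int  status;
            char pad[60];           /* never written explicitly */
    };

    /* Without automatic initialization, the bytes of "pad" (and any
     * padding) copied to the caller contain whatever was left on the
     * stack; with auto-init, the whole structure starts out as a known
     * pattern or as zero. */
    void build_reply(struct reply *out, int status)
    {
            struct reply r;

            r.status = status;
            memcpy(out, &r, sizeof(r));     /* can leak stale stack data */
    }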

There are a couple of concerns about automatic initialization of stack variables, though. One is that it might mask warnings about the use of uninitialized variables; those warnings are still wanted. The tricks used by the GCC plugin can evidently confuse tools like KASAN. And, more importantly, this behavior is seen as a fundamental change to the semantics of C code, essentially creating a fork of the language. That is a big step that not everybody wants to take.

The next technique is structure layout randomization; GCC has been able to do this for the kernel via a plugin for a couple of years. There is a port of this support for Clang, but it seems to be stalled at the moment. Cook said that this feature is for "really paranoid builds" but is not really needed for most.

Signed integer overflow is technically undefined behavior in C — though Zijlstra quickly interjected that, in the kernel, it is well defined as two's-complement wrapping. Most of the time, the overflow of a signed int is unexpected, Cook said. Both compilers support the -fsanitize=signed-integer-overflow flag, but its behavior is not ideal. If warnings are enabled, the build size grows by about 6%; if they are not enabled, the program just dies instead — not desirable behavior for the kernel. The warning also allows the overflow to happen; Cook would rather see the value saturate and stay there. Best, he said, would be to support a user-defined handler that can decide what to do about signed overflows.

Unsigned integer overflow, instead, is often done intentionally in the kernel. That behavior is well defined in C, but overflows can still lead to exploits. Clang can trap unsigned overflows now, while GCC cannot. Once again, though, he would rather see a mode where the value saturates rather than being allowed to wrap.
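
A sketch of the two cases: the first is caught by -fsanitize=signed-integer-overflow in either compiler, while only Clang's -fsanitize=unsigned-integer-overflow will flag the second:

    /* The signed case is undefined behavior; the unsigned case is defined
     * wrapping that can still lead to undersized allocations and similar
     * bugs.  The sanitizers catch both at run time, where Cook would
     * prefer saturation. */
    int total_signed(int a, int b)
    {
            return a + b;           /* can overflow: undefined behavior */
    }

    unsigned int grow(unsigned int len)
    {
            return len + 64;        /* wraps silently near UINT_MAX */
    }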

Control-flow integrity (CFI) is, to put it briefly, ensuring that code always jumps to a location that was intended to be jumped to. One aspect of that problem is returns from functions, which should go only to the place the function was called from. X86 processors can support this "backward-edge" checking in hardware, so no compiler support is needed. Arm64 processors have pointer-authentication (PAC) instructions, but those must be inserted by the compiler; both compilers can do so. For processors without backward-edge CFI support, software needs to implement a shadow stack to preserve the integrity of function returns. Clang had support for shadow stacks, but problems resulted and the support has been removed; GCC has never had this support.

"Forward-edge" CFI, instead, ensures that indirect jumps go to the intended location; it's a matter of validating the destination as an appropriate target for the jump. Hardware support is limited to verifying that a given location is, indeed, the entry point of a function; that gives a big reduction in the attack surface, Cook said, but still does not provide a lot of real-world protection since attackers can just chain function calls together. X86 implements this feature with the ENDBR instruction, while Arm has BTI; both compilers support this feature. Clang can tighten things further in software by also checking that the called function has the correct prototype. But what we really need, Cook said, is truly fine-grained forward-edge CFI.

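As a sketched example of that prototype checking: built with Clang's -fsanitize=cfi (which requires -flto and -fvisibility=hidden), the indirect call below is rejected because the call site's type does not match the target, even though real_handler() is a legitimate function entry point that hardware-only checking would accept:

    static int real_handler(int v)
    {
            return v + 1;
    }

    int main(void)
    {
            /* The cast gives the call site a different prototype from the
             * target function; CFI's type check aborts the jump at run
             * time. */
            void (*fp)(void) = (void (*)(void))real_handler;

            fp();
            return 0;
    }
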
With that last item, Cook's talk concluded. The conversation returned briefly to integer overflow before things wound down; H. Peter Anvin suggested that, if the desire was to change the semantics of the integer type, a better approach might be to switch to a language like C++ where such changes are more readily supported. It is fair to say, though, that this suggestion was not widely accepted by the audience.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]

Comments (43 posted)

The properties of secure IoT devices

By Jake Edge
September 17, 2019

OSS NA

At Open Source Summit North America 2019, David Tarditi from Microsoft gave a talk on seven different properties for highly secure Internet of Things (IoT) devices. The properties are based on a Microsoft Research white paper [PDF] from 2017. His high-level summary of the talk was that if you are creating a device that will be connecting to the internet and you don't want it to get "owned", you should pay attention to the properties he would be describing. Overall, it was an interesting talk, with good analysis of the areas where effort needs to be focused to produce secure IoT devices, but it was somewhat marred by an advertisement for a proprietary product (which, naturally, checked all the boxes) at the end of the talk.

Internet connected

He set the stage by noting that there are nine billion microcontroller-based devices deployed every year—many of which will be connected to the internet. He showed a microcontroller from 2014 ("forever ago") that had radios on the die with the CPU. He believes that most devices will be connected to the internet in the coming years and that it will lead to a "profoundly better experience" for users.

[David Tarditi]

He gave the example of an internet-connected refrigerator, which would allow better prediction and diagnosis of compressor problems. Instead of a customer detecting that the compressor has broken by finding spoiled food or melted ice cream, the manufacturer could proactively alert the owner that a problem was imminent. Because the performance of a large number of compressors in the field can be monitored, the manufacturer can observe patterns that indicate an upcoming failure in the field.

But, the internet is a "cauldron of evil", as Dr. James Mickens put it, so there are real dangers when devices get connected to the internet. He noted a bunch of different headlines regarding IoT security but, "to make it more concrete", spent some time describing the Mirai botnet.

In October 2016, the Mirai malware infected a bunch of web cameras, baby monitors, and other devices running Linux. It turned them into a botnet that more or less shut down the internet on the east coast of the US for a day. That attack only involved around 100K devices and exploited a well-known weakness (default administrative passwords). But there was no early detection of the attack and, worse yet, there was no remote update capability for the devices, he said.

Mirai highlighted some of the risks of internet-connected devices, but it also showed that the effects are more than just technical. Tarditi noted that Mirai appeared in the New York Times, on the "Technology" page for the first day, but on the "Politics" page on day two. Device security is a socioeconomic concern; governments will be paying more attention to these issues going forward.

In addition, the Mirai attack was just a small taste; future attacks could be much larger (imagine 100-million devices, he said) or could have much worse effects (e.g. "bricking" an entire product line or attacking critical infrastructure). Connecting a device to the internet is a "serious and challenging issue". There is a moral issue there too, he said, because people could get hurt; this isn't "turn off your PC and reboot, this is more serious".

Building a secured device

Microsoft has been on the front line for more than 25 years in terms of protecting customers and their devices, Tarditi said. In his slides [PDF], he had a timeline of some developments in internet attacks against computers, along with Microsoft projects and initiatives to combat them. He said that Windows has long been a target for attackers. He believes that attackers find Windows 10 to be a pretty difficult target these days.

Over that time period, the company has learned some lessons. The first is that "your code is going to have bugs". It is "really easy to get code wrong", Tarditi said. Also, "your device will be hacked"; attackers are smart, creative, and persistent so they will get through the defenses. It is important to recognize that "security is foundational"; it must be built in from the start, not grafted on later as an afterthought.

There are some concrete things you can do to ensure that a device is highly secured, rather than simply having some security features. There are seven properties that the white paper identifies as being crucial to that process. If anyone has another property that should be added to the list, he is interested in hearing about it. The properties are:

  • A hardware root of trust: Using cryptographic keys that cannot be forged will protect the identity of the device. Storing private keys in the hardware means that an attacker needs to mount a physical attack in order to subvert the root of trust. That can provide a secure boot process to ensure that only trusted software is being run on the device and can be used to attest to the integrity of the software running on the system to remote servers.
  • A small trusted computing base: The trusted computing base is the software that ensures the security of the system, so it should be as small as possible to reduce its attack surface. A complex kernel (e.g. Linux) should run atop a much smaller security monitor or hypervisor.
  • Dynamic compartments: Providing hardware-enforced boundaries between components in the system in order to isolate them and limit the damage a successful attack can do.
  • Defense in depth: Tarditi showed a photo of a castle as the prototypical illustration of defense in depth. The castle has many layers to its defense: multiple walls, moats, gates, and so on. For IoT devices this means having multiple mitigations in the system: firewalls on internal buses and on the network, no-execute protection of memory, address-space-layout randomization (ASLR), etc.
  • Certificate-based authentication: Passwords have multiple problems: they can be stolen, and they require someone to handle their administration (e.g. set a good password for the device). With a hardware root of trust and remote attestation, the device identity and integrity is known. Certificates can be issued to the device that will allow it to access services; those certificates can be short-lived, so that a compromise only has a limited time to operate.
  • Failure reporting: Gathering information from many devices in the field can help detect when an attack is starting or something else is going wrong. Unknown attacks using zero-day flaws can be detected early because they cause crashes in some devices. Those crashes are an indication that something is wrong, so if that information is gathered centrally, it can be analyzed to help determine where the problem lies.
  • Renewable security: There is a need to be able to provide software updates from the cloud and for the device to be able to apply those updates to protect itself against vulnerabilities. It is also important to have a hardware mechanism to prevent rollback attacks, where a device is tricked into installing a previous signed update with known vulnerabilities.

Achieving those seven properties is challenging, he said, and the security of the system is only as good as its weakest link. Threats evolve over time, so it is important to be able to recognize and react to those threats when they occur. And, in order to reduce the effectiveness of those attacks, device makers need infrastructure that can get updates to the devices in short order.

Ad time

At that point, Tarditi launched into a thinly veiled advertisement for the project he works on at Microsoft: Azure Sphere. While Azure Sphere is meant to provide an end-to-end solution for device makers that embodies the seven principles, it is a proprietary product that simply uses the Linux kernel as part of its Azure Sphere OS. Certainly "open source" was not a major component of the sales pitch, which was rather surprising at a conference called "Open Source Summit".

Beyond that, Tarditi did not address an elephant in the room with regard to devices that operate under the seven principles. Those devices may well be highly secured, but they are also completely controlled by the device vendor, not the putative owner of the device. Software updates can (and presumably will) fix security flaws, but they can also take away features that users want, add anti-features that users don't want, and, effectively, cause the device to act in ways that are directly contrary to the best interests of the person who paid for it, installed it, and runs it. Balancing between security and ownership needs is a difficult, largely unsolved problem, but it does seem like something that should at least be mentioned in a talk of this type at an open-source conference.

This kind of semi-infomercial talk is all too common at many conferences in our industry, but it is decidedly uncommon at open-source technical conferences. It may be that companies chafe at that restriction, but it seems like ground we don't really want to cede, at least for the technical side of our conferences. There are other outlets at some events, keynotes for example, where sponsors can pay to present their commercial solutions to a, seemingly, receptive audience. Keeping that kind of talk out of the main tracks of a technical conference is something worth pursuing. Or so it seems to me.

Interested readers can check out the video on Vimeo.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend Open Source Summit in San Diego.]

Comments (29 posted)

The 2019 Linux Kernel Maintainers Summit

By Jonathan Corbet
September 14, 2019

Maintainers Summit
The 2019 version of the invitation-only Linux Kernel Maintainers Summit was held on September 12, 2019, in Lisbon, Portugal. There, 31 kernel developers discussed a number of issues relating to the kernel development process and how it can be improved.

LWN once again had the privilege of attending the summit. The topics discussed there were:

  • Defragmenting the kernel development process
  • Dealing with automated kernel bug reports
  • The stable-kernel process
  • Linus Torvalds on the kernel development community
  • Maintainers Summit topics: pull depth, hardware vulnerabilities, etc.

Coverage of the 2019 Maintainers Summit is now complete. Or, at least, it can be deemed complete with the inevitable group photo:

[Group photo]

Comments (none posted)

Defragmenting the kernel development process

By Jonathan Corbet
September 14, 2019

Maintainers Summit
The first session at the 2019 Linux Kernel Maintainers Summit was a last-minute addition to the schedule. Dmitry Vyukov's Linux Plumbers Conference session on the kernel development process (slides [PDF]) had inspired a number of discussions that, it was agreed, should carry over into the summit. The result was a wide-ranging conversation about the kernel's development tools and what could be done to improve them.

Ted Ts'o introduced the topic by noting that his employer, Google, has a group dedicated to the creation of development tools, and that a lot of good things have come from that. The kernel community also has a lot of tools aimed at making developers more productive, but rather than having a single group creating those tools, we have many competing groups. While competition is good, he said, it also diffuses the available development time and may not be, in the end, the best way to go. He then turned the session over to Vyukov.

Lots of bugs

[Dmitry Vyukov]

The kernel community has a lot of bugs, he began; various subsystems are often broken for several releases in a row. The community adds new vulnerabilities to the stable releases far too often. The 4.9 kernel, to take one example, has had many thousands of fixes backported to it. There are a lot of kernel forks out there, each of which replicates each bug, so keeping up with these fixes adds up to a great deal of work for the industry as a whole. The security of our kernels is "not perfect"; as we fix five holes, ten more are introduced — on the order of 20,000 bugs per release. We need to reduce the inflow of bugs into the kernel, he said, in order to get on top of this problem.

These bugs hurt developer productivity and reduce satisfaction all around. Many of them can be attributed to the tools and the processes we use. We have many of the necessary pieces to do better, but there is also a lot of fragmentation. Every kernel subsystem does things differently; there is a distinct lack of identity for the kernel as a whole.

More testing is good, but testing is hard, he said; it's not just a matter of running a directory full of tests. There are a lot of issues that come up. People run subsystem-specific tests, but often fail to detect the failures that happen. Many groups only do build testing. About 15 engineer-years of effort are needed to get to a functional testing setup; the kernel community is spending more than that, but that effort is not being directed effectively. There are at least seven major testing systems out there (he listed 0day, kernelci.org, CKI, LKFT, ktest, syzbot, and kerneltests) when we should just have one good system.

When the testing systems work, a single problem can result in seven bug reports, each of which must be understood and answered separately. So developers have to learn how to interact with each testing system. Christoph Hellwig interjected that he has never gotten a duplicate bug report from an automated testing system, but others have had different experiences. And, to the extent that duplicate reports do not happen, it indicates that the testing systems are not functional — they are not detecting the problems.

Laura Abbott pointed out that many of the testing systems are not doing the same thing; their coverage is different, but they still have to reimplement much of the same infrastructure. Thomas Gleixner replied that the problem is at the other end: there is no centralized view of the testing that is happening. There are far more than seven systems, he said; many companies have their own internal testing operations. There is no way to figure out whether the same problem has been observed by multiple systems. Some years ago, the kerneloops site provided a clear view of where the problem hotspots were; now a developer might get a random email and can't correlate reports even when they refer to the same problem.

Ts'o said that these systems will have to learn to talk to each other, which will require a lot of engineering work. But beyond that, even the systems we have now appear to be overwhelmed. Once upon a time, the 0day robot was effective, testing patches and sending results within hours. Now he will get a report five days later for a bug in a patch that has already been superseded by two new versions. Since there is no unique ID attached to patches, there is no way for the testing system to recognize updated versions. Gerrit has solved this problem, but "everybody hates it", and there is little acceptance of change IDs for patches. It all works nicely within Google, he said, but it requires a lot of internal infrastructure. He wondered whether the kernel community could ever have the same thing.

Dave Miller said that many companies and individuals are replicating this kind of testing infrastructure; that is the source of the scalability problem. Now, he said, he has to merge patches in batches before doing a test build; if something is bad, he will lose a bunch of time unwinding that work. He would much rather get pull requests with build reports attached so that he can act on them without running into trivial problems. The 0day robot used to help in that regard, but it has lost its effectiveness. Abbott wondered if all this effort could be centralized somewhere; given that the kernelci.org effort is moving into the Linux Foundation, perhaps efforts could be focused there.

Buy-in needed

Vyukov continued by agreeing that, when a maintainer receives a change for merging, they should know that it has passed the tests. Applying should be a simple matter of saying that it is ready to go in during the next merge window. But when individual subsystems try to improve their processes by switching to centralized hosting sites, they just make things worse for the community as a whole by increasing fragmentation. He doesn't know how to fix all of this, but he does know that it has to be an explicit effort with buy-in across the community. There should be some sort of working group, he said; the proposal from Konstantin Ryabitsev could be a good foundation to build on.

Alexei Starovoitov was quick to say that no sort of community-wide buy-in to a new system is going to happen; people have too many strong opinions for that. But, if there is a better tool out there, he will try it. The discussion so far has been all "doom and gloom" he said, but the truth is that the kernel has been getting better and development is getting easier; we are rolling out kernels quickly and each is better than the one that came before. It quickly became clear that this view was not universally shared across the room. Steve Rostedt did acknowledge, though, that the -rc1 releases have become more stable than they once were; he credited Vyukov's syzbot work for helping in that regard.

Dave Airlie pointed out that, after all these years, we still don't have universal adoption of basic tools like Git. Linus Torvalds said that email is still a wonderful tool for stuff that is in development and in need of discussion; work at that stage can't really be put into Git. Miller agreed that email is "fantastic" for early-stage code, but pointed out that the usability of email as a whole is no longer under our control.

Starting with patchwork

Torvalds said that the discussion made it sound like the sky is falling. Our current automated testing infrastructure generates a lot of "crap", but 1% of it is "gold". He encouraged the room to concentrate on concrete solutions to the problems; he liked Ryabitsev's suggestion, which starts by improving the patchwork system. That, he said, should be something that everybody can agree on. Airlie said that the freedesktop.org community has done this, though, with fully funded improvements to patchwork, but it is still "an unmanageable mess" that loses patches and has a number of other problems. Miller said that he was one of the earliest users of patchwork; back then, its developer was enthusiastic about improving the system to make developers' lives better. But the situation has long since changed, and it is hard to get patchwork improvements now.

Torvalds said that, if patchwork were to get smarter, more people might use it. There are ways that developers could help it work better. His life got easier, he said, when Andrew Morton started telling him the base for the patches he was sending for merging; the same could be done for patchwork. But patchwork is focused on email, and Miller argued that "email's days are numbered". A full 90% of linux-kernel subscribers are using Gmail, he said, and Google is turning email into a social network with a web site. That gives Google a lot of control over what the community can do.

Ts'o said that, rather than focusing on a specific tool, it would be better to talk about what works. Gerrit tracks the base of patches now and can easily show the latest version of any given patch series. Patchwork requires a lot more manual work, with the result that he has thousands of messages there that he is unlikely to ever get to. Olof Johansson pointed out that Gerrit only understands individual patches and cannot track a series, which is a problem for the kernel community. Peter Zijlstra said that, instead, its biggest problem is that it is web-based. Miller replied that he wants new developers to have a web form they can use to write commit messages; he spends a lot of time now correcting email formatting. Gleixner said that the content of the messages is the real problem, but Miller insisted that developers, especially drive-by contributors, do have trouble with the mechanics.

Git as the transport layer

If patchwork could put multiple versions of a patch series into a Git repository, Ts'o said, it would enable a lot of interesting functionality, such as showing the differences between versions. That is something Gerrit can do now, and it makes life easier. Torvalds said that about half of kernel maintainers are using patchwork; there is no need to enforce its use, but it is a good starting point for future work that people can live with. But, he repeated, there needs to be a concrete goal rather than the vague complaining about the process that has been going on for years. Ryabitsev's proposal might be a good starting point, he said.

Greg Kroah-Hartman agreed that patchwork would be a good foundation to build on. But it's not the whole solution. For continuous integration, he said, the focus should be on kernelci.org; it's the only system out there that is not closed. Johansson, though, said that he does not want to have to go into both patchwork and kernelci.org to see whether something works or not.

One problem with systems like patchwork is the inability to work with them offline. Miller said that, if patchwork stored its data in a Git repository, developers could pull the latest version before getting onto a plane and everybody would be happy. Hellwig said that he has never understood why people like patchwork. It would be better, he continued, to agree on a data format rather than a specific tool.

Ryabitsev worried that centralized tools would make "a tasty target" for attackers and should perhaps be avoided. He also said that, with regard to data formats, the public-inbox system used to implement lore.kernel.org can provide a mailing-list archive as a Git repository. Torvalds said that lore.kernel.org works so well for him that he is considering unsubscribing from linux-kernel entirely. Ts'o said that a number of interesting possibilities open up if Git is used as the transport layer for some future tool. Among other things, Ryabitsev has already done a lot of work at kernel.org to provide control over who can push to a specific repository. Ryabitsev remains leery of creating a centralized site for kernel development, though.

As the discussion wound down, Abbott suggested that what is needed is a kernel DevOps team populated with developers who are good at creating that sort of infrastructure. Hellwig put in a good word for the Debian bug-tracking system, which allows most things to be done using email. Ts'o summarized the requirements he had heard so far: tools must be compatible with email-based patch review and must work offline. If the requirements can be set down, he said, perhaps developers will come along to implement them, and perhaps funding can be found as well.

The session closed with the creation of a new "workflows" mailing list on vger.kernel.org where developers can discuss how they work and share their scripts. That seems likely to be the place where this conversation will continue going forward.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Comments (35 posted)

Dealing with automated kernel bug reports

By Jonathan Corbet
September 15, 2019

Maintainers Summit
There is value in automatic testing systems, but they also present a problem of their own: how can one keep up with the high volume of bug reports that they generate? At the 2019 Linux Kernel Maintainers Summit, Shuah Khan ran a session dedicated to this issue. There was general agreement that the reports are hard to deal with, but not a lot of progress toward a solution.

Khan began by noting that one pervasive problem with these systems is classification: who should be responsible for a problem, what priority should it have, and is anybody working on it now? Turning to syzbot in particular, she said that getting the reproducer — the program that causes the reported problem to manifest itself — for any given report is a manual task, and that kernel developers tend to lose track of reproducers once the problem is fixed. It would be better, she said, to hang onto these reproducers and use them as regression tests going forward. She is looking into adding them to the kernel self-test infrastructure.

[Shuah Khan]

Thomas Gleixner agreed that the reproducers can be useful and said that he often keeps them around. He and Linus Torvalds both said, though, that they should be kept separately from the kernel tree; there are far too many of them to be put into the standard tests. Ted Ts'o said that syzbot reproducers tend not to be good generic tests; they consist of bare system calls, are not portable across architectures, and can contain unrelated code. Filesystem-bug reproducers usually contain a filesystem image embedded as a constant, for example, that he has to extract. These tests should be cleaned up by a human, and that tends to be more trouble than it is worth much of the time.

Gleixner replied that he doesn't bother cleaning the tests up; he just fires up a virtual machine and runs them as they are. All he really needs to see is where things go wrong; there is rarely a need to look at the code of the test itself. It is sufficient to stash them in a place where they can be easily run against new kernels.

Alexei Starovoitov asked who would look at failures of these tests. If nobody will, why are they being kept? He prefers to reverse-engineer syzbot reproducers and create his own test. Ts'o pointed out that the kernelci.org automated testing system runs mostly on Arm processors, so many of the syzbot reproducers will not run correctly in that environment. Despite the problems, though, there was general agreement that reproducers should be kept; they can always be improved later if it makes sense.

Ts'o put in a good word for the work that Eric Biggers has been doing with syzbot reports; he categorizes them and sends summaries (example) to the appropriate subsystem mailing lists. This is a spare-time activity, though, and thus fragile; it would be good to formalize some of that work somewhere. Kees Cook said that he is trying to get some staff at Google to do that work.

Khan moved on to bug tracking, which the kernel community tends not to do well. Rafael Wysocki said that, if the community is able to develop a better collaboration platform, it should be used for bug tracking too. Ts'o said that he uses bugzilla to track problems, and that it helps a lot. There is, however, a need for more high-quality bug reports that can be acted upon. Dave Miller, instead, would like to get more "drive-by" reports from people who don't normally participate in the development community. That requires that users be able to report bugs without having to register with the bug-tracking system first.

Dave Airlie asked how developers should balance their time between syzbot reports and bugs reported by real users in the field. Christoph Hellwig said he spends 95% of his time on user-reported bugs. In general, there was a feeling that bugs that are currently affecting users should take priority over "crazy bugs" turned up by an automatic fuzzing system. But Miller said that, at some point, the bulk of the crazy bugs will have been addressed and the flow from syzbot should slow; things will not always be as overwhelming as they are now. Hellwig said that the syzbot reports he has seen have all been real bugs in need of fixing. Gleixner said that the provenance of bugs doesn't matter; either way, they are bugs that need to be fixed.

As the session closed, Jan Kara requested a standardized way to find a specific subsystem's test suites. As it turns out, that is a part of the subsystem profile work that Dan Williams is doing.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Comments (1 posted)

The stable-kernel process

By Jonathan Corbet
September 16, 2019

Maintainers Summit
The stable kernel process is a perennial topic of discussion at gatherings of kernel developers; the 2019 Linux Kernel Maintainers Summit was no exception. Sasha Levin ran a session there where developers could talk about the problems they have with stable kernels and ponder solutions.

Levin began by saying that he has been working on the complaints he got the year before. One of those was that the automatic patch-selection system "goes nuts" and picks the wrong things. It has been retrained twice in the last year and has gotten better at only selecting fixes. About 50% of recent stable releases have been made up of patches explicitly tagged for stable updates; the other half has come from the automated system.
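
For reference, that explicit tagging is done with trailers in the patch itself; an illustrative (made-up) example:

    Fixes: 123456789abc ("subsys: add the feature being fixed")
    Cc: stable@vger.kernel.org # 4.19 and later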

[Sasha Levin]

One ongoing problem, he said, is that a lot of patches tagged for stable are not being backported properly. If a simple backport effort fails, Greg Kroah-Hartman sends an email to the people involved, who then have an opportunity to do the backport. But, by the time that happens, developers have moved on and are often unwilling to revisit that old work. Peter Zijlstra said that he tends to ignore email about backport failures; he's not sure what else he should do with them. The answer, Levin said, is to send a working backport.

Dave Miller said that he does all the backports himself for the last two stable releases. But then people come back asking for backports to old kernels like 4.4. He just doesn't have the time to try to backport changes that far. As a result, a lot of poor work gets into those older kernels. Thomas Gleixner said that he had to give up on backporting many of the Spectre fixes to the 4.9 kernel. Even some of the more recent fixes for speculative-execution problems are nearly impossible to backport despite being much cleaner code. Kroah-Hartman said that there are people who are paid to do that sort of work; it's not something that kernel developers should have to worry about.

Levin said that he is trying to improve the backport process in general. He now gets alerts for patches that fix other patches that have been shipped in a stable update; those are earmarked for fast processing. He is also putting together a "stable-next" tree containing patches from linux-next that have been tagged for stable. It is intended to be an early-warning system for changes that will be headed toward the stable kernels in the near future.

Jan Kara complained that he recently applied a fix to the mainline that had a high possibility of creating user-space regressions. He had explicitly marked it as not being suitable for the stable updates, but it was included anyway. Levin replied that it is easy to miss those notes, along with other types of information like prerequisite patches for a given fix. There needs to be a better structure for that kind of information; he will be proposing some sort of tag to encapsulate it.

That said, Levin made it clear that he would rather include even the patches that have been explicitly marked as being unsuitable for stable updates. If there are bugs in those patches, users will encounter them anyway once they upgrade. Holding the scarier patches in this way just trains users to fear version upgrades, which is counter to what the community would like to see.

Ted Ts'o asked about the test coverage for stable releases; Kroah-Hartman answered that it is probably more comprehensive than the testing that is applied to the mainline. There are a lot of companies running tests on stable release candidates and reporting any problems they find. This testing goes well beyond basic build-and-boot tests, he said.

The final topic covered was running subsystem tests on backports. The BPF subsystem, for example, has a lot of tests that are known not to work on older kernels, so nobody should be trying to do that. But fixes to tests are backported, so the tests shipped with a given kernel version should always run well with that kernel.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Comments (3 posted)

Linus Torvalds on the kernel development community

By Jonathan Corbet
September 16, 2019

Maintainers Summit
The Linux Kernel Maintainers Summit is all about the development process, so it is natural to spend some time on how that process is working at the top of the maintainer hierarchy. The "is Linus happy?" session during the 2019 summit revealed that things are working fairly well at that level, but that, as always, there are a few things that could be improved.

Torvalds initially turned the question around, saying that it should be about whether everybody else is happy about his work. But then he turned it back to his current pet peeve: developers who do not put changes into linux-next before pushing them into the mainline. There was one specific tree that he was unhappy about in the 5.3 merge window; others have been problematic in the past but have improved somewhat. But, it seems, there is always somebody. In general, about 10% of the patches that show up during the merge window did not first show up in linux-next.

Dan Williams asked whether there should be a rule requiring any changes pushed upstream to be in linux-next for at least 24 hours. Torvalds responded that he doesn't even check for presence in linux-next early in the merge window; an early pull request makes him happy enough that nothing more is required. As the merge window approaches its end, though, he does start checking, and absence from linux-next (earlier in the merge window) can result in pull requests not being acted upon.

[Linus Torvalds]

Sasha Levin asked whether the same sort of checking happens after -rc1 comes out; the answer was "generally not". Code entering the mainline after the merge window is supposed to be limited to important fixes, and linux-next is less useful for those. As far as Torvalds is concerned, fixes that do not appear in linux-next are not an issue at all. Levin protested that fixes are often broken; putting them in linux-next at least gives continuous-integration systems a chance to find the problems. Linux-next maintainer Stephen Rothwell noted that he keeps "fixes" trees separately now, and they can be tested independently and more frequently.

Torvalds pointed out, though, that developers tend not to realize that the creation of linux-next is a manual process; there is no automatic testing that happens just as a result of pushing something in that direction. It can take a few days before patches in linux-next are tested; instead, patches that go into the mainline are tested immediately. Beyond that, the rule is that mainline patches should not be picked for a stable release until one week after they have been merged. What actually happens is that they have to be there for at least one -rc release, but could be selected for stable before a full week has passed.

Ted Ts'o said that he used to be able to push changes into his ext4 tree and get results from the 0day tester within hours. That hasn't been the case for some time, though; he would like to have a URL where he can see if a particular tree has been tested or not. Thomas Gleixner said that it is possible to get a success email from the 0day system now, but those are not particularly reliable either.

Overall, Torvalds summarized, the process is working smoothly. There are still too many bugs, though, "no question about that". The rate of change has not increased much for a couple of years, which he described as a good thing. It's because there really isn't more work to do, rather than being a sign of a bottleneck in the system somewhere. It was quickly pointed out that the rate of change is a local phenomenon; the BPF subsystem is growing quickly, while the Arm architecture code is changing more slowly than it was a few years ago.

Greg Kroah-Hartman said that, of all the patches submitted to the mailing list, only 2% are not receiving a response, which is an improvement over past results. Christoph Hellwig jumped in to say that more patches should be ignored. Torvalds is too happy now, Hellwig said; he needs to be angrier and say "no" more often. Peter Zijlstra complained that he misses the "old Linus".

Dave Airlie asked about the review backlog in general, and whether there was any overall picture of how far behind reviewers are. Kroah-Hartman replied that some subsystems are known to be bad; nobody seems to have a full picture of the state of review across the tree, though.

Torvalds said that he varies quite a bit in how closely he looks at reviews for incoming patches. It depends a lot on the subsystem maintainer; some trees he feels he can just pull without needing to look inside at all, while others require an examination of every patch. He noted that he dislikes getting trees where patches contain only reviews from developers working for the same company as the author; he suggested that, while internal review is an important part of the process, those reviewers should not put Reviewed-by tags in the patches.

There was some discussion about the value of internal reviews, which are seen as being helpful at best and harmless at worst. Gleixner complained about getting low-quality patches with five Reviewed-by tags, though. Torvalds said that patches with only internal reviews show up mostly in the driver subsystems. Often, those patches are so hardware-specific that nobody else will be able to look at them anyway. In some cases — he mentioned the Intel graphics drivers — the subsystem as a whole is solid and internal reviewers deserve a lot of the credit.

Olof Johansson said that the best way for reviewers to build trust is to do their reviews publicly, on the mailing lists, and be seen to point out real problems. As the discussion closed, it was suggested that, if company management is putting pressure on internal reviewers to let substandard code through, subsystem maintainers should be informed discreetly so that those patches can receive more scrutiny once they hit the lists.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Comments (8 posted)

Maintainers Summit topics: pull depth, hardware vulnerabilities, etc.

By Jonathan Corbet
September 17, 2019

Maintainers Summit
The final sessions at the 2019 Linux Kernel Maintainers Summit covered a number of relatively quick topics, including the "pull depth" for code going into the mainline, the handling of hardware vulnerabilities, the ABI status of tracepoints, and more.

Pull depth

In the discussion prior to the summit, James Bottomley noted that a lot of subsystem trees are pulled directly into the mainline by Linus Torvalds. He wondered whether that is a good thing, or whether it might be better to have mid-level maintainers aggregating more pull requests to increase the "pull depth" of lower-level trees and decrease the load at the top. Bottomley was not at the summit itself, but his topic was discussed there; the answer was that things are mostly OK as they are. (For the curious, a graphic showing the pull paths for the 5.1 kernel can be found on this page.)

Torvalds responded to the question by saying that he loves to get large pull requests from maintainers he trusts implicitly; that way, he can get a lot of work into the mainline with little effort. There is just one little problem: there are few maintainers that he trusts to that degree. In the absence of that trust, he would prefer to get more, smaller requests that are easier to review and easier to refuse if there is something wrong. He mentioned some subsystems in particular that have been problematic in the past; bypassing the maintainer and getting more focused pull requests from lower-level maintainers has improved the situation. He is not happy about having to do that, but it is better than the alternative, he said.

He does not, however, feel overworked by the number of trees he is pulling now. He aims to act on about 25 pull requests per day during the merge window, normally spending about ten minutes on each of those.

There are advantages to having a maintainer hierarchy, though, that go beyond reducing the number of pull requests at the top. Dave Airlie pointed out that it is a good way to train new maintainers and to gain confidence that others would be able to handle the subsystem if needed. Torvalds said that, with many subsystems, he is not competent to review the patches himself; it is good to have a mid-level maintainer who understands the area looking at things.

One thing that can reduce his workload, Torvalds said, is getting conflict-resolution trees from maintainers with their pull requests. He still generally wants to resolve merge conflicts himself, but it can be helpful to see how the maintainer would do it. He also gave some advice for anybody wanting to sneak a feature in via a pull request: just delete a lot of lines of code. That makes Torvalds happy enough that he doesn't look closely at the rest.

Olof Johansson said that the problems with the Arm architecture subsystem back in 2011 would not have been solved without Torvalds "roaring" about them. Perhaps something similar is needed to address other problematic subsystems, he said.

The session closed with linux-next maintainer Stephen Rothwell letting it be known that he has been having a harder time than usual, with sixteen-hour days being required to put linux-next together. Why it is taking so long is not entirely clear, though; he was not able to put his finger on a specific cause. He also advised the group that he would be out and unable to work on linux-next during the 5.4 merge window.

Hardware vulnerabilities

The handling of hardware vulnerabilities like Meltdown and Spectre was a big topic at the 2018 Maintainers Summit. One year later, Thomas Gleixner gave a quick update on how the situation has changed.

In the past year, he said, the community has, one hopes, put together a working set of processes for dealing with hardware-related problems. He said that "Intel is still trying to control things" in a number of ways, but that the community has been pushing back. There is a new document in the kernel describing how the process should work; some vendors are still quibbling about the details but, for the most part, this process has been accepted by the industry.

The document includes a list of "ambassadors" who serve as liaisons between specific companies and the community for dealing with hardware bugs. That list is slowly filling up. Many company lawyers evidently see the document as being a good thing; having the ground rules set down lets everybody know how the whole process is expected to work.

So overall it seems that the community's processes for dealing with these problems are in relatively good shape. That said, Gleixner hopes that we will never have to use those processes again — at least not until after he retires. There is "not much in the pipeline", which is a welcome bit of news.

Tracepoints as ABI

There are some topics that the attendees of the Maintainers Summit are happy to discuss; others provoke groans as soon as they are proposed. For an example of the latter variety, consider the question of whether kernel tracepoints can be seen as part of the user-space ABI, as raised by Steve Rostedt. This is a topic that has come around many times at these events, but it has seemingly never been resolved to everybody's satisfaction.

Torvalds repeated his viewpoint, which seems clear enough: if changing an aspect of kernel behavior will cause user space to break, then that behavior should be seen as part of the ABI. Alexei Starovoitov pointed out that tracepoints have changed in the past and the users of those tracepoints have simply adapted to the change. Torvalds answered that, if somebody complains to him about a change like that, the change will be reverted.

As a (post-summit) example of how seriously he takes this policy, consider the final commit before the release of the 5.3 kernel. An ext4 change made that filesystem's I/O rather more efficient; that meant fewer I/O interrupts during the early boot process, generating less entropy for the random-number generator. A certain user-space setup then failed to boot properly because it blocked waiting for entropy. The ext4 change was reverted to avoid this problem while a real solution is being worked out.
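
The mechanism behind that failure is easy to sketch in user-space C. The program below is purely illustrative, not the code that actually hung; it assumes a kernel of that era, where a getrandom() call made without the GRND_NONBLOCK flag sleeps until the random pool has been initialized. If the early-boot interrupt activity that feeds the pool goes away, a boot-time process making such a call can stall indefinitely:

    /* Minimal sketch: block until the kernel's entropy pool is ready. */
    #include <stdio.h>
    #include <sys/random.h>

    int main(void)
    {
        unsigned char seed[16];

        /* Without GRND_NONBLOCK, getrandom() sleeps until the CRNG has
         * been initialized -- possibly forever on a quiet boot. */
        if (getrandom(seed, sizeof(seed), 0) != sizeof(seed)) {
            perror("getrandom");
            return 1;
        }
        printf("entropy available, continuing\n");
        return 0;
    }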

Dave Miller pointed out that it is now easy to build BPF-based tools that make use of kprobes, which are probes attached at run time to arbitrary points in the kernel. Some of these tools are widely used. Torvalds responded that, in that case, an ABI exists. That led to a fair amount of concern: since the placement of kprobes is entirely outside the kernel community's control, they could end up freezing almost any aspect of the kernel's internal behavior.
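
To see why, it is worth looking at how little it takes to turn an arbitrary kernel function into an observation point. BPF-based tools normally attach their kprobes through libbpf or perf_event_open(), but the kernel's tracefs interface illustrates the underlying idea. The sketch below is illustrative only: the probe name "lwn_probe" and the target function vfs_read() are arbitrary choices, and the tracefs path may be /sys/kernel/debug/tracing on older systems. It needs root to run:

    /* Sketch: attach a kprobe to an arbitrary kernel function via tracefs. */
    #include <stdio.h>

    #define TRACEFS "/sys/kernel/tracing"

    int main(void)
    {
        FILE *f;

        /* Define a probe named "lwn_probe" on the kernel's vfs_read(). */
        f = fopen(TRACEFS "/kprobe_events", "a");
        if (!f) {
            perror("kprobe_events");
            return 1;
        }
        fprintf(f, "p:lwn_probe vfs_read\n");
        fclose(f);

        /* Enable the probe; hits now appear in the tracefs "trace" file. */
        f = fopen(TRACEFS "/events/kprobes/lwn_probe/enable", "w");
        if (!f) {
            perror("enable");
            return 1;
        }
        fputs("1\n", f);
        fclose(f);

        return 0;
    }

Once a widely deployed tool depends on a probe point like this, the function's existence and behavior are, by the policy described above, effectively frozen, even though nobody ever declared them stable.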

Ted Ts'o pointed out that many subsystems have avoided adding tracepoints out of the fear of creating ABI issues. That drives tool developers to resort to kprobes instead, perhaps making the problem even worse. It might, he suggested, be better to just add a few carefully thought-out tracepoints where they are really needed. Miller called the situation ironic: the community is afraid that it has created software so useful that people are actually using it.

Torvalds closed down the topic by reiterating that these discussions have come around for years without any real change. Over those years, there have been almost no real problems resulting from tracepoints. Why, he asked, are we still discussing this issue? He suggested that it should be blacklisted as a topic at future summits.

Subsystem profiles

As the 2019 Maintainers Summit approached its end, Dan Williams briefly raised the topic of the subsystem maintainer profile patches he posted recently. The purpose of these patches is to document how specific subsystems work, making it easier for developers to submit patches that follow the local rules. Torvalds said that the patches seem OK, but he would like to see some real users — subsystems that actually document their workings using the proposed mechanism.

I felt the need to point out that some developers oppose these patches, claiming that they provide justification for subsystem maintainers who want to create their own special rules. That could take the kernel further away from the more uniform processes that most people think would be a good thing. Torvalds replied that these patches aren't creating that kind of diversity; instead, they are simply "documenting reality". Kees Cook, who has done a lot of cross-tree work and has run into various local customs, said that having this documentation in place would make his job easier.

And with that note, the summit adjourned for another year.

[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting travel to this event.]

Comments (9 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds