Leading items
Welcome to the LWN.net Weekly Edition for September 26, 2019
This edition contains the following feature content:
- Better guidance for database developers: a Linux Plumbers Conference session on how to improve communications between kernel and database developers.
- 5.4 Merge window, part 1: the first set of patches merged for the next major kernel release.
- Many uses for Core scheduling: what was once just a defense against covert channels is finding a number of other uses.
- System-call wrappers for glibc: the long period where the GNU C Library would not add system-call wrappers is coming to an end.
- Monitoring the internal kernel ABI: tools for preventing unwanted kernel ABI breaks.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Better guidance for database developers
At the inaugural Databases microconference at the 2019 Linux Plumbers Conference (LPC), two developers who work on rather different database systems had similar complaints about developing for Linux. Richard Hipp, creator of the SQLite database, and Andres Freund from the PostgreSQL project both lamented the lack of definitive documentation on how to best use the kernel's I/O interfaces, especially for corner cases. Both of the sessions, along with others in the microconference, pointed to a strong need for more interaction between user-space and kernel developers.
SQLite
Hipp went first, with a brief introduction to SQLite (which he pronounced "ess-kyew-ell lite"), its ubiquity, and how it is different from other database systems, the gist of which can be seen in his slides. He also created a briefing paper since he had far more material to discuss than there was time in the session. In many ways, his title, "What SQLite Devs Wish Linux Filesystem Devs Knew About SQLite", summed up the presentation nicely.
![Richard Hipp [Richard Hipp]](https://static.lwn.net/images/2019/lpc-hipp-sm.jpg)
He noted that SQLite is "in everything and it is everywhere"; he pointed to my Sony camera (see photo) and said that he didn't know if it had SQLite in it, but that it probably did. It is in cars, phones, televisions, and more. There are more than 200 SQLite databases in each of the 2.5 billion Android phones, and each of those phones does more than 5GB of SQLite I/O per day. "It's a lot of I/O."
Unlike other database systems, SQLite is effectively a library that gets embedded into applications; there is no separate server thread or process. Most databases are designed to run in data centers, but SQLite is designed to run at the edge of the network. It uses a single file to hold the entire database, though there can be journal files to support atomic commits. There is no configuration file for SQLite, so it must discover the capabilities of the underlying system at runtime.
Multiple processes can all be accessing the database at the same time in an uncoordinated fashion, he said. There are three mechanisms to provide atomicity, consistency, isolation, durability (ACID) guarantees for the database. The most universal is a rollback journal, which is also the slowest. A write-ahead log, which is faster, can be used if it is known that the database file does not reside on a network filesystem. There is no good way to determine if that is true, however.
An attendee asked about using statfs() to determine the type of the filesystem. Hipp said that could be done, but then SQLite would have to maintain a list of which types are network filesystems or not. Since SQLite is often statically linked with applications, there would be no way to update that list, he said.
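For illustration, a minimal sketch of what that statfs() approach would look like; the list of network-filesystem magic numbers here is far from complete, which is exactly the maintenance problem Hipp described.

```c
#include <sys/vfs.h>
#include <linux/magic.h>

/* Guess whether the file at path lives on a network filesystem.
 * The switch below would have to enumerate every network
 * filesystem's magic number -- and be kept up to date forever. */
static int is_network_fs(const char *path)
{
    struct statfs buf;

    if (statfs(path, &buf) != 0)
        return -1;                  /* caller must pick a safe default */

    switch (buf.f_type) {
    case NFS_SUPER_MAGIC:
    case SMB_SUPER_MAGIC:
        return 1;                   /* known network filesystems */
    default:
        return 0;                   /* unknown: may still be remote */
    }
}
```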
SQLite can also use the atomic write capabilities of F2FS. It is a Linux-only solution, but it is "really fast", he said. He has heard reports that reformatting the filesystem on an old Android phone from ext4 to F2FS will make the handset "seem like a perky new phone". There is a clunky, ioctl()-based interface to detect the feature and to use it; it would be nice to have a more generic way to query for this kind of information.
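A sketch of the detection half of that interface follows; since there was no UAPI header for these ioctl() numbers at the time, applications such as SQLite define them locally, as is done here.

```c
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Locally defined ioctl numbers, matching the kernel's F2FS driver. */
#define F2FS_IOCTL_MAGIC              0xf5
#define F2FS_IOC_START_ATOMIC_WRITE   _IO(F2FS_IOCTL_MAGIC, 1)
#define F2FS_IOC_COMMIT_ATOMIC_WRITE  _IO(F2FS_IOCTL_MAGIC, 2)
#define F2FS_IOC_GET_FEATURES         _IOR(F2FS_IOCTL_MAGIC, 12, uint32_t)
#define F2FS_FEATURE_ATOMIC_WRITE     0x0004

static int atomic_write_supported(int fd)
{
    uint32_t features = 0;

    /* Fails with ENOTTY on anything that is not F2FS. */
    if (ioctl(fd, F2FS_IOC_GET_FEATURES, &features) != 0)
        return 0;
    return (features & F2FS_FEATURE_ATOMIC_WRITE) != 0;
}
```

If the feature is present, a transaction's writes are bracketed between F2FS_IOC_START_ATOMIC_WRITE and F2FS_IOC_COMMIT_ATOMIC_WRITE on the database file descriptor.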
There were a few specific items he raised in the session, but said there were many more in the paper. The first was a reliable way to query for filesystem attributes. For example, if you create a file, write to it, and then call fsync() on it, do you also have to open its directory and fsync() that in order to be sure that the file is persistent in the directory? Is that even filesystem-specific?
Kernel filesystem developer Jan Kara said that POSIX mandates the directory fsync() for persistence. Generally, filesystem developers are not willing to guarantee anything more relaxed than that because it ties their hands. As it turns out, ext4 effectively does the directory fsync() under the covers, so it is not truly necessary, at least for now. Doing the directory fsync() anyway, as SQLite does, should not be expensive if there is no concurrent activity, Kara said. That is exactly the kind of information he needed, Hipp said: authoritative information from people who know.
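In code form, the conservative sequence under discussion looks something like this minimal sketch:

```c
#include <fcntl.h>
#include <unistd.h>

/* Create a file and make both its contents and its directory entry
 * durable, per the POSIX rule Kara described. */
static int create_durably(const char *dirpath, const char *filepath,
                          const void *buf, size_t len)
{
    int ret, dfd;
    int fd = open(filepath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    /* Persist the directory entry too; cheap on ext4, which already
     * does the equivalent internally, but not guaranteed elsewhere. */
    dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    ret = fsync(dfd);
    close(dfd);
    return ret;
}
```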
He also wondered if it made sense for SQLite to tell the filesystem about unused chunks of the file. At some point in the future, they would be written and then matter, but they are effectively holes that are allocated but whose contents do not matter to SQLite. While there was some thought that filesystems could use that information to send TRIM commands to SSDs, overall the belief was that it probably was not worth it. Kara said that unless the holes were gigabytes in size, it did not make sense to bother with it.
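For context, the closest existing mechanism for conveying that kind of information is hole punching via fallocate(); a hypothetical sketch, not something SQLite actually does:

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Tell the filesystem that a range's contents no longer matter,
 * without changing the file size; the filesystem may then free the
 * blocks (and, in principle, TRIM them on an SSD). */
static int discard_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```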
PostgreSQL
Freund launched right into a complaint that durability handling for PostgreSQL is difficult; every Linux filesystem has different behavior. Most system calls do not document what happens when an error is returned. He specifically mentioned error returns from fsync(), which were only reliably reported starting with Linux 4.13; there is still no documentation on what those errors mean and whether the operation can be sensibly retried.
![Andres Freund [Andres Freund]](https://static.lwn.net/images/2019/lpc-freund-sm.jpg)
Kara essentially agreed. He noted that the standards define what should happen in the normal case; POSIX does not try to define any durability guarantees. "I share your pain", he said.
Freund continued by describing another documentation flaw: durability operations like sync_file_range() come with big warnings ("This system call is extremely dangerous and should not be used in portable programs ...") that tend to steer application developers away from them. But when the application developers run into performance problems in various cases, they get pointed to sync_file_range(). Kara said the warning is there because there are no durability guarantees provided by sync_file_range(); some filesystems will durably store the range, but others will not. Freund wondered how applications are supposed to actually use the function without having to read kernel code.
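For reference, this is roughly the pattern applications get pointed to for pacing writeback; a sketch only, and, as Kara said, it comes with no durability guarantee on any filesystem:

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off write-out of dirty pages in [offset, offset+nbytes)
 * without waiting for completion; useful for smoothing writeback,
 * useless as a durability operation. */
static int start_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}
```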
In addition, the error behavior is different depending on the filesystem, block device, and kernel version. Depending on the filesystem, you will either get the old or new contents of a page after an I/O error; for NFS you may not see the error at all until the file is closed. There needs to be some documentation of what applications need to check and what they can expect; you can't "complain about people writing crappy code if that's all the guidance that they have".
An attendee asked: "how deep do you want to go?" The filesystem developers are constrained by the block layer developers who are constrained by the devices themselves. The differences between various types of storage devices are going to make guarantees difficult, they said.
Freund said that he was looking for consistency of a different type. For example, right now on a thin-provisioned block device, you can get an ENOSPC error from random system calls that do not document that return. He would be fine with filesystem and block layer errors that were consistent, but does not think Linux needs to hide or try to paper over device errors of various sorts.
In the case of failures, the behavior needs to be documented, he said. If fsync() fails and gets retried, what happens? Does the original sync operation get tried again or does the new data get thrown away? The latter is kind of what happens now in some cases, he said. Application developers have to find and read threads on the Linux kernel mailing list to figure that out.
Beyond just documenting the failure semantics, there is a need for documentation of the right way to do certain things in a safe manner. Right now that is a guessing game or something that requires talking to kernel developers over beer. The latter is nice, he said, but does not work well remotely.
Continuing in that vein, he said that there is no documentation on how to achieve durability for data. Whatever application developers do, though, some kernel developer will complain about it. If there is a performance concern, a kernel developer will say that the application is doing too much, but if the concern is data loss, then someone will complain that it is not doing enough. "Opinions will contradict each other wildly."
Another example is renaming a file atomically; what is required to ensure that it is on disk? According to some filesystem developers, it requires an fsync() of the existing file and the directory containing it, followed by the rename(), and then an fsync() of the new file and of its containing directory. There was some back and forth about whether some of those steps were actually needed, but Tomas Vondra said that PostgreSQL had settled on that sequence after extensive testing; that is what finally made the data-loss problems disappear.
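A sketch of that full sequence, with an invented helper for brevity:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int fsync_path(const char *path, int flags)
{
    int fd = open(path, flags);
    int ret;

    if (fd < 0)
        return -1;
    ret = fsync(fd);
    close(fd);
    return ret;
}

/* The rename sequence PostgreSQL settled on after extensive testing. */
static int durable_rename(const char *oldpath, const char *olddir,
                          const char *newpath, const char *newdir)
{
    if (fsync_path(oldpath, O_RDONLY) != 0 ||
        fsync_path(olddir, O_RDONLY | O_DIRECTORY) != 0)
        return -1;
    if (rename(oldpath, newpath) != 0)
        return -1;
    if (fsync_path(newpath, O_RDONLY) != 0)
        return -1;
    return fsync_path(newdir, O_RDONLY | O_DIRECTORY);
}
```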
Kara agreed that documentation of the sort Freund is looking for is lacking. He suggested coming up with concrete questions to post to the linux-fsdevel mailing list. The responses can be summarized into a file for the kernel documentation directory. Kara said that the atomic rename() situation is "kind of sad" and suggested that might be a good question to bring to the list.
An attendee asked if Freund was looking for the lowest common denominator because the filesystems are going to have different answers for some things. Freund said that would be fine; if there are major performance implications, it might make sense to have some filesystem-specific code. In answer to another question, Freund said that he was looking for information on what errors it would make sense to retry—which have a chance of actually succeeding if you do so?
From these two sessions and some others in the microconference, it is clear that database developers (and likely other user-space application developers) need to find ways to collaborate more with the kernel developers—and vice versa. The microconference is a great start, but more discussion on the mailing lists and over beer is needed, as is the creation of better documentation. Guidance on how to perform certain operations safely, especially with regard to file data and metadata consistency, seems to be a great starting place.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Lisbon for LPC.]
5.4 Merge window, part 1
As of this writing, 9,632 non-merge changesets have been merged for the 5.4 kernel; this merge window is thus off to a strong start. There has been a wide range of changes merged across the kernel tree, including vast numbers of cleanups and fixes. Some of the highlights from the first half of the 5.4 merge window include:
Architecture-specific
- The Arm64 architecture can now use 52-bit addresses on hardware that supports them.
- It is now possible to pass tagged pointers (pointers with user data in the most significant byte) as system-call arguments on the Arm64 architecture. There is a new prctl() option to enable or disable the use of tagged pointers; a sketch of its use appears after this list.
- Support for the SGI SN2 (IA64-based) architecture has been removed.
- The PA-RISC architecture has gained support for the kexec_file_load() system call and kprobes.
- Support for Intel's MPX feature is being removed, seemingly as a result of the lack of the necessary support in the compiler toolchain.
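Returning to the tagged-pointers item above: a minimal sketch of how a task opts in, with fallback definitions for the new constants (their values here match the 5.4 UAPI headers):

```c
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#endif

/* Tagged addresses are rejected by default; each task that wants to
 * pass them to system calls must enable the relaxed ABI first. */
int enable_pointer_tags(void)
{
    return prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
                 0, 0, 0);
}
```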
Core kernel
- The waitid() system call has a new P_PIDFD wait type; specifying that type will cause a wait for a pidfd rather than a normal process ID (a sketch appears after this list).
- The "haltpoll" CPU idle governor has been merged. This governor will poll for a while before halting an otherwise idle CPU; it is intended for virtualized guest applications where it can improve performance by avoiding exits to the hypervisor. See this commit for some more information.
Filesystems and block I/O
- The iocost I/O controller (formerly called io.weight) has been merged. It should provide better I/O performance, for some workloads at least. See this commit for more information.
- Despite some controversy, the EROFS read-only filesystem has been moved into the main kernel from the staging tree.
- Despite even more controversy, the exFAT filesystem has been added to the staging tree. There is, evidently, a different version of this module at Samsung that might eventually replace the one that has been merged; stay tuned.
- The fscrypt filesystem encryption mechanism has gained a number of new ioctl() calls to improve key management and more; see this commit for details.
- The fs-verity file integrity mechanism has been merged at last. This documentation file describes the feature in detail.
- The kernel will no longer allow user space to write to active swap files.
- A warning will now be issued whenever somebody mounts a filesystem that is unable to represent dates at least 30 years in the future.
- It is now possible to boot a system using a CIFS filesystem as the root; see this commit for details.
Hardware support
- Graphics: LG LB035Q024573 RGB panels, NEC NL8048HL11 RGB panels, Sharp LS037V7DW01 VGA LCD panels, Sony ACX565AKM panels, and Toppoly (TPO) TD028TTEC1 and TD043MTEA1 panels.
- Industrial I/O: Analog Devices ADIS16460 inertial sensors, Maxim Integrated MAX5432-MAX5435 potentiometers, and ON Semiconductor NOA1305 ambient light sensors.
- Input: FlySky FS-iA6B RC receivers.
- Media: OmniVision ov5675 sensors, Allwinner A10 CMOS sensor interfaces, and NXP i.MX IPUv3 IC PP image processors.
- Miscellaneous: firmware trusted platform modules running inside an Arm trusted execution environment, Inspur power-supply controllers, Silergy SY8824C regulators, MediaTek MT6358 power-management ICs, Nuvoton NPCM SPI controllers, devices connected to the Turris Mox "Moxtet" bus, Freescale linflexuart serial ports, Qualcomm QCS404 interconnect buses, Lantiq VRX200/ARX300 PCIe PHYs, SGI ASIC 1-Wire interfaces, and HiSilicon ZIP accelerators.
- Network: Fintek F81601 PCIE to CAN controllers, Kvaser PCIe FD CAN controllers, TI TCAN4X5X M_CAN controllers, Microchip KSZ8795 series switches, ASPEED MDIO bus controllers, NXP ENETC central MDIO controllers, Analog Devices Industrial Ethernet PHYs, and Pensando Ethernet IONIC adapters.
- Pin control: Aspeed G6 SoC pin controllers and Qualcomm SC7180 pin controllers.
- Sound: Cirrus Logic CS47L15 and CS47L92 codecs, NXP UDA1334 codecs, and NXP i.MX audio digital signal processors.
- USB: Cadence USBSS dual-role device controllers.
Networking
- It is now possible to load a BPF program to generate SYN cookies; this hook can run either in the traffic control or XDP modes. See this commit for some more information.
- There is now support for the SAE J1939 protocol used in car and truck networks; see this commit for details. This work has the unique distinction of carrying a Signed-off-by tag from the "kbuild test robot" at Intel; which parts of the patch were authored by the robot is not entirely clear.
Security-related
- The Lenovo ThinkPad "PrivacyGuard" feature, which can restrict the usable viewing angles of the screen from software, is now supported. See this commit for information on how to control this feature.
Miscellaneous
- Much of the "compile once, run everywhere" work for BPF has been merged. These patches (ending in this commit) enhance the user-space libbpf code to be able to read structure-field offsets from the kernel BTF data and relocate BPF code to match the configuration of the currently running kernel.
The 5.4 merge window can be expected to stay open until September 29, assuming the usual schedule holds (and there is no reason to assume it won't). The second half of the merge window is certain to be slower than the first, but there are still some significant trees to be pulled; LWN will post a followup article once the 5.4-rc1 kernel is out. If all goes well, the final 5.4 release will happen in the second half of November.
Many uses for Core scheduling
Some new kernel features are welcomed by the kernel development community, while others are a rather harder sell. It is fair to say that core scheduling, which makes CPU scheduling harder by placing constraints on which processes may run simultaneously in a core, is of the latter variety. Core scheduling was the topic of (at least) three different sessions at the 2019 Linux Plumbers Conference. One of the most interesting outcomes, perhaps, is that there are use cases for this feature beyond protection from side-channel attacks.
Current status
The discussion started in a refereed-track talk by Julien Desfossez and Vineeth Remanan Pillai. Desfossez began by noting that core scheduling has been under development for about a year; its primary purpose is to make simultaneous multi-threading (SMT, or "hyperthreading") secure in the face of hardware vulnerabilities such as speculative-execution attacks. An SMT core contains two or more CPUs (sometimes called "hardware threads") that share a great deal of low-level hardware. That sharing, which includes a number of caches, makes SMT particularly vulnerable to cache-based side-channel attacks. For sites that are worried about such attacks, the only practical alternative now is to turn SMT off, which can have a significant performance impact for some workloads. Core scheduling was developed as a less drastic way to keep tasks that don't trust each other from executing in the same core at the same time.
Pillai took over at this point, saying that core scheduling groups trusted tasks on a core. It treats the CPUs in an SMT core as a unit, finding the highest-priority task on all sibling CPUs; that task will drive the scheduling decisions. If another task can be found that is compatible with the high-priority task, it will be able to run on a sibling CPU; otherwise, that sibling will have to be forced idle.
The first implementation of core scheduling was specific to KVM; it would only allow virtual CPU threads from the same virtual machine to share a core. It was then generalized, with the idea that administrators should be able to define the policy that sets the trust boundaries. The initial prototype uses control groups; an administrator can mark tasks as compatible by putting them in the same group, or setting the same value in the cpu.tag variable in multiple groups. The third version of the patch set is under discussion now; it is focused mostly on bug fixes and performance issues.
There are indeed a few of these issues. The scheduler's vruntime value is used to compare tasks when making scheduling decisions; that works on a single CPU, but it was never intended to be used for cross-CPU comparisons. That can lead to starvation issues for some tasks. Current thinking is to treat these comparisons as more of a load-balancing issue, perhaps along with the creation of a normalized or core-wide vruntime value that will support more accurate comparisons.
The forced idling of sibling CPUs is a necessary evil in core scheduling, but it's apparently even more evil than it really needs to be. In particular, a CPU-intensive process can cause a sibling to stay idle for a long time, starving the process executing there of CPU time. Somehow, the scheduler must learn to account for the forced-idle time to be able to trigger decisions (and switch to a starved task) at the right time. An alternative might be to create a special version of the idle task to run on a forced-idle CPU that can poke the scheduler when it is time to make a change.
Desfossez returned to talk a bit about testing, which has been done extensively for this patch set. Performance-oriented patches always need to demonstrate a clear benefit; in this case, it's far from clear that core scheduling is always better than just turning off SMT. The testing infrastructure also uses tracing to verify correctness, ensuring that no incompatible tasks ever run together.
Testing has revealed the fairness issues described above, and has shown that turning off SMT is indeed better for some workloads. In particular, I/O-intensive workloads do not benefit from core scheduling; he mentioned a MySQL benchmark that performs worse with it turned on.
Future work includes a rethinking of how process selection works. There is also the little problem that, while core scheduling protects processes from each other, it does not protect the kernel against user space. Fixing that would require adding synchronization points on system calls and exit from virtual machines, which is likely to be expensive. There is probably no other way to protect the kernel from MDS attacks, though. Finally, he said, the interface for identifying processes needs to be rethought.
After the talk, Len Brown asked about how well the code would work on systems that have more than two siblings in an SMT core. Such systems do not exist now, but one can imagine CPU designers are thinking about such things. The answer was that the code is generic and should be able to handle that case, but it is hard to know for sure since there's no hardware available to test it on.
Other uses
During the Scheduling Microconference, core scheduling was the topic of another set of sessions that were less focused on implementation details and more on other ways in which the feature might be used. Subhra Mazumdar started by describing a database use case from Oracle, which has its own virtualization setup and would like to use core scheduling to spread tasks optimally. But using core scheduling now leads to a significant (17-30%) performance decline, mostly as a result of the forced idling of CPUs. Often, a CPU goes idle when it could be running a task from another core elsewhere in the system. Mazumdar suggested that the scheduler's wakeup path needs to be changed to allow it to find a task with a matching tag anywhere in the system.
In the discussion, it was repeated that core scheduling is unlikely to ever be better for all workloads. There were references, in particular, to these benchmarks run by Mel Gorman, showing that enabling SMT can result in worse performance even in the absence of core scheduling.
Aubrey Li got up to discuss a different sort of use case for core scheduling: deep-learning training workloads. This kind of workload tends to use a lot of AVX-512 instructions, which can give significant performance benefits. But these instructions, as it turns out, can reduce the maximum CPU frequency for the entire core; if an unrelated task is running elsewhere on the same core, it may be adversely affected by the AVX-512 use. Having two AVX-512-using processes on the same core, though, is no worse than having one there.
It thus makes sense to keep processes making heavy use of those instructions together on the same core. Core scheduling can do this; his workload gets a 10% improvement in throughput and a 30% reduction in latency with it enabled. He thus believes that there would be value in merging core scheduling into the mainline.
Jan Schönherr, instead, has a different sort of use case: isolating some processes, while forcing others to run together. His patch set, confusingly (in this context) named coscheduling, allows an administrator to set policies that will force related processes to run on the same core while excluding others. The result should be the security benefits of core scheduling, but also some performance benefits from having related tasks share CPU resources.
He was asked whether the existing cpuset functionality could handle this use case. The answer was that it works well, but only until the system is overcommitted. Once the load gets too heavy everything breaks down, and the simultaneous-execution property is lost.
Realtime
One day later, during the Realtime Microconference, Peter Zijlstra led yet another session on core scheduling which, he said, is "all the rage these days". So many people want it that he's afraid there will be no alternative to merging it, even though it is not a complete solution to the side-channel problem.
It turns out that realtime developers have come to find the idea attractive as well. The problem, from the realtime point of view, is that SMT is not deterministic, or as Zijlstra put it, it's "deterministically awful". Realtime users tend to disable it to avoid the latency problems that it creates.
But core scheduling can force sibling CPUs to go idle when a realtime process is running on the core, thus preventing this kind of interference. That opens the door to enabling SMT whenever a core has no realtime work, but effectively disabling it when realtime constraints apply, getting the best of both worlds.
Using core scheduling this way raises some interesting questions, though; the one that was discussed during this session was the impact on the admission control enforced by the deadline scheduler. Admission control prevents the scheduler from accepting a deadline task if the CPU resources are not available to let that task meet its deadlines. Forcing CPUs idle affects the total amount of CPU time available; if admission control does not take that into account, the system may take on more work than it can handle.
One possible solution that was raised in the session is to multiply a deadline process's worst-case execution time (essentially the amount of time it is allowed to run) by the number of CPUs in a core, since that process will, in effect, occupy all of those CPUs while it runs. There are a number of details to deal with, though, such as how to set the tag on such a task; allowing it to be done in a control group or with prctl() will be too late, potentially after the admission-control decision has been made. Perhaps sched_setattr() could be enhanced for this purpose, but that would create two different ways to tag tasks for core scheduling. Zijlstra said that the developers would have to find an interface that works for all of the use cases.
Getting it merged
Back in the Scheduler Microconference, Pillai wrapped up the session by stating that core scheduling is a big win for some use cases and that it should be in the mainline kernel for those who can benefit from it. The feature will have to be turned off by default, though, since it is not beneficial to everybody. There is still the little problem that core scheduling does not protect the kernel; Pillai asserted that adding a security boundary on exit from virtual machines would be sufficient there, and that providing isolation at system calls and interrupts is not as important. Thomas Gleixner disagreed strongly with that claim, though, saying that entry into the kernel is the same regardless of the mechanism used.
Paul Turner said that protection against hardware vulnerabilities is not just a scheduling problem, and that core scheduling is insufficient regardless. Coscheduling will also prove necessary, he said, and probably something like the ill-fated (so far) address-space isolation patches as well. All of the pieces have to be looked at, and developers need to find a way to assemble them all. Gleixner agreed, but said that there also needs to be an understanding of the picture as a whole or the pieces will never fit.
[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]
System-call wrappers for glibc
The GNU C Library has long had a reputation for being hostile to the addition of wrappers for new Linux system calls; that has resulted in many system calls being unsupported by the library for years. That situation is changing, though. During the Toolchain microconference at the 2019 Linux Plumbers Conference, Maciej Rozycki talked about glibc's new attitude toward system-call wrappers, but also served notice that there is still significant work to do for the addition of any new system call.
Rozycki, who put together the talk with Dmitry Levin, is not the person doing most of this work. He was, instead, "delivering a message from Florian Weimer", who was unable to attend the event.
For those who might appreciate a bit of background: applications running in user space do not call directly into the kernel; instead, they will call a wrapper function that knows how to invoke the system call of interest. If nothing else, the wrapper will place the system-call arguments in the right locations and do whatever is necessary to invoke a trap into kernel mode. In some cases, the interface implemented by the wrapper can be significantly different from what the kernel provides.
The provision of a specialized wrapper is not strictly necessary; an application can always gain access to an unsupported system call by way of the syscall() function. But, as Rozycki began, there are good reasons not to require applications to do that. There is no type-checking of arguments with syscall(), for example. System-call numbers vary from one architecture to the next, even if a Linux kernel is running in both cases, and there can be other ABI differences as well; that makes writing portable code with syscall() difficult. Then, there are the difficulties that come with POSIX threads, and thread cancellation in particular. It is just better to have proper C-library support for the system calls that applications need to use.
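As a concrete example of the status quo, here is a sketch of invoking pidfd_open() (new in Linux 5.3 and, as of this writing, unwrapped by glibc); note that nothing checks the argument types, and __NR_pidfd_open must exist in the installed kernel headers:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* A hand-rolled wrapper of the kind applications must write today;
 * syscall() returns long and sets errno on failure. */
static int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(__NR_pidfd_open, pid, flags);
}
```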
So the intention in the glibc project has shifted from blocking system-call wrappers to accepting them. They can't all come in at once, though; each must clear some obstacles first. These include proper documentation in the glibc manual and, since glibc is a Free Software Foundation project, copyright-assignment paperwork in place. That last requirement led to a discussion on whether requiring copyright on system-call wrappers amounts to a recognition of ABI copyrights in general — surprisingly, no useful conclusions came from that part of the conversation. The final problem with getting wrappers merged, Rozycki said, is common to all free-software projects: a lack of reviewers to look them over.
The project has adopted a policy of not emulating system calls in almost all circumstances. If a given system call is not available, the library will return ENOSYS and be done with it. System-call emulation has proved to be error-prone, so it will only be done in the most trivial of cases. Glibc also requires that wrapper names be architecture-independent, the alternative being a "maintenance nightmare". If possible, the glibc developers want to add support for all architectures in a single release; otherwise keeping track of things gets difficult.
Glibc developers have also learned a lesson that has been felt in kernel circles, even if that lesson is still not always taken to heart: multiplexing system calls are painful to support. They make the checking of argument types difficult or impossible, and the situation is even worse for variadic calls (those which take different numbers of arguments for different operations). One result of this aversion to multiplexing system calls may well be that calls like futex() and bpf() will probably be implemented in glibc as a set of independent wrappers, one for each operation.
There are some specific ABI rules that have been adopted for system-call wrappers. For example, ssize_t or size_t should be used for all buffer sizes, regardless of the type the kernel uses; that helps to make the purpose of the argument clear. Flags should not have a long type, since it's often unclear how the upper 32 bits should be handled on 32-bit machines. Errors should always be returned via errno, except for the POSIX threads calls, which are a perennial exception. The glibc developers also feel that their lives would be easier if each new system call had a separate kernel UAPI header file for its associated types and constants. That allows them to include the required information without bringing in any unrelated declarations.
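glibc's existing getrandom() wrapper happens to illustrate most of these rules: size_t for the buffer length, a plain unsigned int for flags, and errors reported through errno. A sketch of the shape of such a wrapper (not glibc's actual implementation) follows:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* size_t for the buffer length, unsigned int (not long) for flags;
 * syscall() converts a negative kernel return into -1 plus errno,
 * which is exactly the error convention the wrapper must expose. */
ssize_t my_getrandom(void *buf, size_t buflen, unsigned int flags)
{
    return syscall(SYS_getrandom, buf, buflen, flags);
}
```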
Rozycki concluded by asking for better cooperation with the C-library projects in general. They should be copied on patches containing ABI changes, for example. I noted that there are often times where C-library developers wish the kernel community had done things differently; how could those be avoided in the future? Members of the audience suggested that more glibc developers should perhaps join the linux-api list. The other suggestion was to "copy Florian on everything".
Levin added that sometimes documentation can be a limiting issue; ironically, it can be worse if a new system call comes out of the kernel community with documentation already written. If the author will not assign copyright for that documentation to the FSF for inclusion in the manual, progress can be blocked; the glibc developers will have to find a way to recreate it without copying the original. This struck some in the audience as a self-imposed problem.
At the end of session, Christian Brauner said that getting changes into glibc used to be a painful process, but that has changed in recent years. Even the infamous gettid() system call, the subject of a years-long, acrimonious enhancement request, has now been merged, to the amazement of many. The glibc community is now interacting much more with the kernel community, a change that, hopefully, will continue over time and be echoed on the kernel side.
[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]
Monitoring the internal kernel ABI
As part of the Distribution Kernels microconference at Linux Plumbers Conference 2019, Matthias Männich described how the Android project monitors changes to the internal kernel ABI. As Android kernels evolve, typically by adding features and bug fixes from more recent kernel versions, the project wants to ensure that the ABI remains the same so that out-of-tree modules will still function. While the talk was somewhat Android-specific, the techniques and tools used could be applied to other distributions with similar needs (e.g. enterprise distributions).
Männich is on the Google Android kernel team, but is relatively new to the kernel; his background is in build systems and the like. He stressed that he is not talking about the user-space ABI of the kernel, but the ABI and API that the kernel exposes to modules. The idea is to have a stable ABI over the life of an Android kernel. He knows that other distributions have been doing this "for ages", but the Android kernel and build system are different so it made sense to look at other approaches to this problem.
Out of tree
It is sometimes impossible to have everything in-tree, he said, which is part of the motivation for this work. Stabilizing the ABI will also decouple the development of the kernel and modules for it. The hope is to reduce the fragmentation in the Android space by reducing the number of different kernel versions out there while providing a single ABI/API for the module ecosystem.
![Matthias Männich [Matthias Männich]](https://static.lwn.net/images/2019/lpc-mannich-sm.jpg)
Starting as part of Android 8, Project Treble decoupled the vendor-specific parts of Android from the rest of the stack. But that resulted in a big conglomeration of vendor drivers and kernel common code, so it did not really fully decouple the two. Since then, the kernel piece has been separated into the generic kernel image (GKI), along with GKI modules that are common, and the hardware-specific drivers that access the GKI via a stable ABI/API.
The stable interfaces are not something that is wanted upstream, he said. Maintaining stable interfaces will not be done for the mainline; the intent is to do it for trees based on the stable long-term support (LTS) branches. Dhaval Giani asked if the plan was to have the same interface for, say, both 4.9.x and 4.14.x, but Männich said that the intent was only for it to apply to a single LTS branch. So, for example, all Android kernels in the 4.19.x series would be compatible, but not with those in a 5.x kernel.
K. Y. Srinivasan asked if Google had given up on the idea of forcing everyone to put their code into the tree. Männich said that the company encouraged that. Greg Kroah-Hartman said that he "would love for everybody to be in the tree"; "talk to Qualcomm, please". The Android project is unfortunately working with vendors that are not in the tree, he said; he is working on that problem independently, "but we also have to deal with the real world". Tim Bird wondered if this plan would impose any constraints on the kinds of changes that would be accepted into the LTS branches, but Kroah-Hartman said that it would not.
In order to make this work, Android will need to find a kernel configuration that works for all of the vendors, Männich said. Android is one step shy of having reproducible builds: it is still working on hermetic kernel builds, where all of the toolchains and dependencies, including utilities like uname, are packaged and used separately from the underlying system where the kernel is built.
To reduce the scope of the problem, it is important to have ways to define what is and is not part of the ABI, he said. There will be whitelists and blacklists to facilitate that. There may be other mechanisms as well.
Currently, Android is only targeting the android-4.19 and android-5.x series—x has not yet been decided—for the stable ABI. There will be one GKI configuration for the kernel, though it may be somewhat different for each architecture that is supported. It only targets Clang builds in a hermetic environment, so the compiler and other tools cannot change over the life of the Android kernel.
In terms of the scope, the stable ABI only applies to the observable ABI. Instead of looking at the code, the project looks at the binary of the kernel to determine what the ABI is. The developers are working on whitelists and he is hopeful that symbol namespaces get merged so that parts of the stable ABI can be defined in terms of which namespaces are supported.
An attendee wondered if other distributions actually cared that much about the stable ABI problem. Several attendees answered that some did because they had customers who care. In some cases, like a popular desktop graphics driver, the source is not available to just rebuild the module for a new kernel, Laura Abbott said. Developers of those out-of-tree drivers can and do update the drivers, but if a distribution wants to ensure the drivers simply keep on working, enforcing a stable ABI would do that, she said.
libabigail
Android uses libabigail to analyze the kernel ABI. Libabigail is both a library and a set of tools; Android mostly just uses the tools to extract and serialize/deserialize the ABI description from the kernel and module binaries. Originally, libabigail only used the ELF and DWARF information, but more recently has added support for the kernel by looking at ksymtab rather than the ELF symbol table. It will generate an in-memory data structure that describes the ABI that it finds; that data structure can be serialized to XML, which can then be compared to previous or future versions of the ABI.
Bird wondered if this tool should be added to the upstream kernel Makefile. Kroah-Hartman and Männich agreed that it would add a kernel dependency on an external tool, which is probably not desirable. It is easy to simply invoke the tool on the kernel build tree after it is built, Männich said.
Giani asked whether the entire observable ABI needs to match between versions. That is where suppression and whitelists come into play, Männich said. Giani suggested that a full whitelist approach might be the way to go since the Android project knows all of the drivers and hardware-specific pieces that it wants to support. Otherwise, it risks growing the supported ABI to an unnecessarily large size.
Männich said that the configuration of the Android kernel is not terribly large. It is much smaller than a standard distribution configuration. In addition, he is hoping to see symbol namespaces, which will make it even easier to pick and choose pieces to use. The problem with a whitelist-only approach is that certain parts of the ABI may obviously not be of interest, for example the filesystem interfaces, but they may define structures and types that are used elsewhere, another Android team member said. So the process has been to try to remove pieces to bring the size of the stable ABI down.
Ben Hutchings asked about ABI changes that are still backward compatible; how are those handled? Männich said that some of that is still a work in progress. Libabigail maintainer Dodji Seketeli said that there are suppressions available that he likened to those for Valgrind. The suppressions can indicate changes that are known not to be problematic from an ABI standpoint.
Sasha Levin asked about kernel changes that do not manifest as ABI changes, such as locking semantics; can those be represented and tracked? Männich said that there are some things that cannot currently be handled, but that they are being worked on; he pointed to an example of an untagged enum value being returned as an integer from a function. If the enum values are rearranged, it changes the ABI but is not flagged by libabigail. Seketeli said that all types could be added to the ABI that the tool is tracking, not just those that appear in function signatures, but that they are not right now to keep memory usage down.
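A minimal, hypothetical illustration of that enum problem:

```c
/* A kernel function is declared as returning int, but actually
 * returns one of these enumerators. */
enum fw_status { FW_OK, FW_BUSY, FW_ERROR };   /* FW_BUSY == 1 */

int get_fw_status(void);

/* If a later kernel rearranges the enumerators:
 *
 *     enum fw_status { FW_OK, FW_ERROR, FW_BUSY };   now FW_BUSY == 2
 *
 * modules built against the old header still compare the result
 * against 1, yet no function signature changed, so libabigail sees
 * nothing to flag. */
```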
In general, things like locking semantics don't change in an LTS branch, the other Android project member said. If you care about locking semantics, you have to care about all of the ABI semantics; there will likely sometimes be problems in that area, but the project will have to find them on its own as the tooling is not going to help, he said.
As time for the session ran out, Männich quickly went over how the ABI tooling is integrated into the Android process. Basically everyone who builds an Android kernel will get the toolchain and tools, including libabigail, as part of the "repo sync" command to update their tree. The ABI generation and a diff against the baseline ABI will be run as part of the overall build process; any changes to the ABI will then bubble up to the Gerrit code review tool that Android uses. The tools are pretty generic, so they should be easily integrated into other workflows.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Lisbon for LPC.]
Page editor: Jonathan Corbet
Next page:
Brief items>>