LWN.net Weekly Edition for July 26, 2018

Welcome to the LWN.net Weekly Edition for July 26, 2018

This edition contains the following feature content:

  • PostgreSQL and patents: the community decides against accepting patent-encumbered code.
  • Initializing the entropy pool using RDRAND and friends: should the kernel trust the CPU's hardware random-number generator?
  • The problem with the asynchronous bsg interface: why part of a SCSI-generic API is headed for removal.
  • Statistics from the 4.18 development cycle: where the code in 4.18 came from.
  • A kernel event notification mechanism: a proposed general-purpose API for delivering events to user space.
  • Replacing AWK with Python in GCC?: a possible switch of languages for GCC's option-handling machinery.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


PostgreSQL and patents

By Jake Edge
July 25, 2018

Patents and open-source projects are, it seems, always a messy combination. A recent discussion on the pgsql-hackers mailing list highlights some of the problems that can result even when a patent holder wants to make its patents available to a project like PostgreSQL. Software patents are a minefield in many ways; often, projects simply avoid the problem by staying away from code known to be covered by patents.

It started with a post from Takayuki Tsunakawa, who wondered how his employer, Fujitsu, could submit patches to PostgreSQL that implement various patented (and patent applied-for) techniques: "we'd like to offer our company's patents and patent applications license to the PostgreSQL community free of charge". He suggested three possibilities for how that might be accomplished: by way of a patent pool such as the one run by the Open Invention Network, doing an explicit patent grant (such as that in section 3 in the Apache v2 license), or by a patent promise like Red Hat's, but only for PostgreSQL.

Derivative and other works based on PostgreSQL would be problematic if the grant is only made to the PostgreSQL project itself. As Craig Ringer put it:

My big hesitation with all those approaches is that they seem to exclude derivative and transformative works.

PostgreSQL is BSD-licensed. Knowingly implementing patented work with a patent grant scoped to PostgreSQL would effectively change that license, require that derivatives identify and remove the patented parts, or require that derivatives license them.

I'm assuming you don't want to offer a grant that lets anyone use them for anything. But if you have a really broad grant to PostgreSQL, all someone would have to do to inherit the grant is re-use some part of PostgreSQL.

There are also questions about exactly what the Apache patent grant or a patent promise actually provides. There are worries that neither has ever been litigated, so they may not really give projects much protection; patent promises could also change with a company takeover, a change of direction, or bankruptcy. Throughout the discussion, though, Tsunakawa seems to be earnestly looking for a reasonable solution so that this patented code can be added to PostgreSQL to make it better.

He appears to be acting as a go-between for the Fujitsu legal department, but Nico Williams was a bit skeptical of that approach. He suggested that the company needs to make a decision about what it is trying to do, then work with the community to determine what would work; only after that should the legal department get involved to make that a reality. Beyond that, though, he is not convinced the patents can be granted in a way that still makes them useful to the granter for anything other than a defensive use:

Also, I would advise you that patents can be the kiss of death for software technologies. For example, in the world of cryptography, we always look for patent-free alternatives and build them from scratch if need be, leading to the non-use of patented algorithms/protocols in many cases. Your best bet is to make a grant so liberal that the only remaining business use of your patents is defensive against legal attacks on the holder.

Williams also brought up another area of concern: tainting developers by exposing them to these patents, which can lead to triple damages if they end up using the patented techniques in unrelated projects. Tsunakawa pointed out that PostgreSQL does not currently screen contributions for possible patent infringement, but this situation is a bit different from that, Williams said:

You're proposing to include code that implements patented ideas with a suitable patent grant. I would be free to not read the patent, but what if the code or documents mention the relevant patented algorithms?

[...]

I could choose not to read it. But what if I have to touch that code in order to implement an unrelated feature due to some required refactoring of internal interfaces used by your code? Now I have to read it, and now I'm tainted, or I must choose to abandon my project.

That is a heavy burden on the community. The community may want to extract a suitable patent grant to make that burden lighter.

Much of the discussion became moot, however, when Tom Lane announced that the core team had decided against accepting patented code into PostgreSQL:

[...] we will not accept any code that is known to be patent-encumbered. The long-term legal risks and complications involved in doing that seem insurmountable, given the community's amorphous legal nature and the existing Postgres license wording (neither of which are open for negotiation here). Furthermore, Postgres has always been very friendly to creation of closed-source derivatives, but it's hard to see how inclusion of patented code would not cause serious problems for those. The potential benefits of accepting patented code just don't seem to justify trying to navigate these hazards.

Bruce Momjian, who is another member of the core team, added: "any patent assignment to Postgres would have to allow free patent use for all code, under _any_ license. This effectively makes the patent useless, except for defensive use, even for the patent owner." He noted that the core team (which is part of the PostgreSQL Global Development Group or PGDG) might consider accepting patents for defensive use, but there are some problems with that as well. For one thing, there is no PGDG legal entity; it would entail some "operational complexity" as well, he said.

The core team's feeling is that it [is] not worth it, but that discussion can be re-litigated on this email list if desired. The discussion would have to relate to such patents in general, not to the specific Fujitsu proposal. If it was determined that such defensive patents were desired, we can then consider the Fujitsu proposal.

Tsunakawa was somewhat disappointed, as might be expected. Williams strongly agreed with the decision. There has been no other real reaction to it on the mailing list.

It is a little hard to see how any other choice could have been made, really. While the Fujitsu offer was generous, and might have provided some nice technical benefits, it was never going to fit well with an open-source project with a liberal license. Patent grants like the one IBM made for read-copy update (RCU), which was only allowed for GPL-covered software, do not work in an environment where proprietary forks are possible. Any grant that is wide enough to cover the PostgreSQL community and its participants is likely to be too wide for the patent to be of much use to its owner.

This is far from the first problem we have seen at the intersection of patents and free software, nor will it be the last. The patent system, at least in the software world, is not working well, but that has been clear for many years now. Patent reform is certainly due, but seems unlikely to happen, perhaps ever. Meanwhile, free software will keep rolling along, dodging the patent mess as best it can.


Initializing the entropy pool using RDRAND and friends

By Jake Edge
July 24, 2018

Random number generation in the kernel has garnered a lot of attention over the years. The tensions between the need for cryptographic-strength random numbers and the desire to get strong random numbers quickly, along with the need to avoid regressions, have led to something of a patchwork of APIs. While it is widely agreed that waiting for a properly initialized random number generator (RNG) before producing random numbers is the proper course, opinions differ on exactly what "properly" means. Beyond that, waiting, especially early in the boot process, can be problematic as well. One solution would be to trust the RNG instructions provided by most modern processors, but that comes with worries of its own.

Theodore Ts'o posted a patch on July 17 to add a kernel configuration option that would explicitly trust the CPU vendor's hardware RNG (e.g. the RDRAND instruction for x86). Kernels built with RANDOM_TRUST_CPU will immediately initialize the random number pool using the architecture's facility, without waiting for enough entropy to accumulate from other sources; this means that the getrandom() system call will not block. Waiting for systems to gather enough entropy has been a problem in the past, especially for virtual machines and embedded systems where such entropy can be difficult to find.
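
The user-space side of this behavior is the getrandom() call itself; here is a minimal sketch (assuming glibc 2.25 or later, which provides the <sys/random.h> wrapper):

    #include <stdio.h>
    #include <sys/random.h>

    int main(void)
    {
        unsigned char buf[16];

        /* With no flags, this blocks until the kernel considers its
         * entropy pool initialized; RANDOM_TRUST_CPU would make that
         * wait effectively disappear on hardware with RDRAND. */
        ssize_t n = getrandom(buf, sizeof(buf), 0);
        if (n < 0) {
            perror("getrandom");
            return 1;
        }
        printf("got %zd random bytes\n", n);
        return 0;
    }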

Currently, the kernel uses CPU-provided random number instructions as part of the process of mixing data into the entropy pool, without crediting any entropy to it; the patch would go further than that. Instead of waiting for enough entropy to be gathered at boot time, it would simply initialize the pool from the output of RDRAND (or other similar instructions). As Ts'o put it in the patch: "I'm not sure Linux distro's will thank us for this. The problem is trusting the CPU [manufacturer] can be an emotional / political issue."

There are some who would rather these hardware RNG instructions not be used at all for kernel random numbers. But blocking in getrandom() can lead to user space being unable to get random numbers as quickly as needed. By providing a configuration option, the kernel developers neatly avoid making a decision on whether to trust these hardware RNG instructions, but that leaves the decision to the distributions:

Even if I were convinced that Intel hadn't backdoored RDRAND (or an NSA agent backdoored RDRAND for them) such that the NSA had a NOBUS (nobody but us) capability to crack RDRAND generated numbers, if we made a change to unconditionally trust RDRAND, then I didn't want the upstream kernel developers to have to answer the question, "why are you willing to trust Intel, but you aren't willing to trust a company owned and controlled by a PLA [People's Liberation Army] general?" (Or a company owned and controlled by one of Putin's Oligarchs, if that makes you feel better.)

With this patch, we don't put ourselves in this position --- but we do put the Linux distro's in this position [instead]. The upside is it gives the choice to each person building their own Linux kernel to decide whether trusting RDRAND is worth it to avoid hangs due to userspace trying to get cryptographic-grade entropy early in the boot process.

Sandy Harris wondered if there should be a value associated with the option, which would say how many bits of entropy the user thinks should be credited per 32-bit value from the hardware RNG. But Ts'o pointed out that his patch (which has a code diff roughly the same size as the text for its configuration option) does not mess with the entropy estimation. In any case, he is skeptical that kind of knob would truly be useful:

In practice I doubt most people would be able to deal with a numerical dial, and I'm trying to [encourage] people to use getrandom(2). I view /dev/random as a legacy interface, and for most people a CRNG [Cryptographic-strength RNG] is quite sufficient. For those people who are super paranoid and want a "true random number generator" (and the meaning of that is hazy) because a CRNG is Not Enough, my recommendation these days is that they get something like an open hardware RNG solution, such as ChaosKey from Altus Metrum.

Other external hardware RNG devices were also mentioned as possibilities in the thread.

In case distributors choose to enable this in their kernels, Yann Droneaud would like to see a kernel command-line option to disable it. That would likely make it easier for distributions to build kernels with it enabled since their users would have a way to override it without building their own kernel. Ts'o did not reply to that particular request directly, but his patch is meant as an RFC, so perhaps we will see that in the next version. In response to Droneaud, he did elaborate, including some thoughts on threat models:

The trust model that we're using is this. The presumption is that (at least for US-based CPU [manufacturers]) the amount of effort needed to add a [blatant] backdoor to, say, the instruction scheduler and register management file is such that it couldn't be done by a single engineer, or even a very small set of engineers. Enough people would need to know about it, or would be able to figure out something untowards was happening, or it would be obvious through various regression tests, that it would be obvious if there was a generic back door in the CPU itself. This is a good thing, because ultimately we *have* to trust the general purpose CPU. If the CPU is actively conspiring against you, there really is no hope.

However, the RDRAND unit is a small, self-contained thing, which is *documented* to use an AES whitener (e.g., it does an AES encryption as its last step). So presumably, a change to make the RDRAND unit effectively be:

	AES_ENCRYPT(NSA_KEY, COUNTER++)
is much easier to hide or introduce.

Laura Abbott was set to test the patch on a Fedora Rawhide kernel, but found that the hang at boot time waiting for random numbers had been fixed some other way. But she did think the option was a good idea:

That said, I do think this is a beneficial option to have available because it actually fixes the underlying problem instead of hoping nobody else decides to block early in the bootup process.

It is a simple change, though likely somewhat controversial—for distributions anyway. Another approach might be to simply change the kernel to always allow boot-time selection of whether to trust the CPU's hardware RNG. Defaulting that to "do not trust" would effectively leave things as they are today, but give users a way to decide for themselves—and take distributions off the hot seat if one truly exists.


The problem with the asynchronous bsg interface

By Jonathan Corbet
July 19, 2018
The kernel supports two different "SCSI generic" pseudo-devices, each of which allows user space to send arbitrary commands to a SCSI-attached device. Both SCSI-generic implementations have proved to have security issues in the past as a result of the way their API was designed. In the case of one of those drivers, these problems seem almost certain to lead to the removal of a significant chunk of functionality in the 4.19 development cycle.

The SCSI standard is generally thought of as a way to control storage devices, such as disk and tape drives (younger readers, ask a coworker what the latter were). But SCSI can be thought of as a sort of network protocol with more general capabilities, as demonstrated by its use to control tape-changing robots, scanners, optical-disk writers, and more. Drivers for such devices tend to run in user space; to support those drivers, the SCSI generic (SG) interface was created. This interface provides direct access to the SCSI protocol, allowing user-space code to control devices in ways not supported by the in-kernel disk and tape drivers.

The original SG interface was simply called "sg"; like the "sd" driver for SCSI disks and "st" driver for tape drives, its name highlights the SCSI developers' focus on efficiency, in that no letters were wasted. The sg driver implements a low-level device that interfaces directly with the SCSI midlayer. Back in 2004, Jens Axboe posted a new implementation that he called "bsg"; unlike sg, it worked at the level of the block layer, taking advantage of its request-queue infrastructure to manage SCSI operations. It took a while, but bsg was finally merged for the 2.6.23 release in 2007. Since then, both interfaces have coexisted in the kernel. The sg interface retains a number of users; some of them are simply older code, but others have found that sg works better for their needs (as will be seen below). The bsg interface, instead, is the only way to gain access to some newer SCSI protocol features.

Both devices implement two different APIs to accomplish the same task. The synchronous interface uses ioctl() commands; results of operations are returned when ioctl() returns. There is also an asynchronous interface based on simple read() and write() calls, where one uses write() to issue a command, followed by a later read() to obtain the results. The system calls involved are simple, but the data that is transferred is not: SCSI commands are executed by writing an sg_io_hdr structure to the device. The structure is complex in its own right, but it can also contain pointers to other ranges of user-space memory. Normally, a write() call will not access memory outside of the provided buffer; with these interfaces, instead, a write() call can cause accesses to memory almost anywhere in the address space.
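
To make the hazard concrete, the asynchronous pattern looks roughly like this for the sg driver; this is a sketch, and the INQUIRY command, device name, and buffer sizes are merely illustrative:

    #include <fcntl.h>
    #include <scsi/sg.h>
    #include <string.h>
    #include <unistd.h>

    int issue_inquiry(const char *dev /* e.g. "/dev/sg0" */)
    {
        unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY */
        unsigned char data[96], sense[32];
        struct sg_io_hdr hdr;
        int fd = open(dev, O_RDWR);

        if (fd < 0)
            return -1;
        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.cmd_len = sizeof(cdb);
        hdr.cmdp = cdb;                  /* a user-space pointer */
        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        hdr.dxfer_len = sizeof(data);
        hdr.dxferp = data;               /* another user-space pointer */
        hdr.mx_sb_len = sizeof(sense);
        hdr.sbp = sense;                 /* and a third */
        hdr.timeout = 5000;              /* milliseconds */
        hdr.pack_id = 42;                /* lets results be matched up */

        /* The "buffer" handed to write() is the structure itself; the
         * kernel must chase the pointers above to do the actual I/O. */
        if (write(fd, &hdr, sizeof(hdr)) < 0)
            return -1;
        /* ... some time later, collect the result: */
        if (read(fd, &hdr, sizeof(hdr)) < 0)
            return -1;
        close(fd);
        return hdr.status;
    }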

The dangers of this kind of interface have become increasingly clear in recent years. In this case, there have been a few security issues related to indirect memory access through the SG devices. There is also the persistent concern that an attacker may succeed in convincing a setuid program to write the wrong thing to such a device, opening up another vulnerability. Worries about this kind of problem led to the recent rejection of the write-based filesystem mounting API. For SG, though, the interfaces have been established for a long time, so they cannot be withdrawn without breaking applications.

For bsg, though, that may not actually be the case.

In June, Jann Horn tried to harden these interfaces by adding more restrictions on the contexts in which they can be used. Almost as an aside, the changelog noted that, in the case of bsg, arbitrary access to memory can also happen in a release() call, when the file descriptor is being closed. That immediately set off a new round of alarms; even a legitimate user-space memory access can run into trouble at release time, when that memory may no longer be present. The results would be unpredictable — but they would be predictably bad.

There was some discussion about how this problem might be fixed, but it didn't take long for Christoph Hellwig to suggest that the asynchronous side of the bsg interface be removed outright. There are reasons to believe that it is not actually being used in the real world, some of which were described by Douglas Gilbert, the maintainer of the sg interface. Among other things, if two processes are issuing commands to the same device, bsg is unable to keep the responses straight. "Once real world users (needing an async SCSI (or general storage) pass-through) find out about that bsg 'feature', they don't use it". Horn did some searching in the Debian Code Search database and concluded that there were no users that needed to be worried about.

The end result of the discussion is that Axboe has merged Hellwig's patch to remove the asynchronous bsg functionality. The synchronous ioctl()-based API, which does not have the same problems (and which is actually used by applications), will remain. Linus Torvalds has stated that this patch should be applied to the stable kernels as well. So, unless some users of the asynchronous API come forward in the near future, this particular feature will soon disappear.


Statistics from the 4.18 development cycle

By Jonathan Corbet
July 24, 2018
The 4.18-rc6 kernel prepatch came out on July 22, right on schedule. That is a sign that this development cycle is approaching its conclusion, so the time has come for a look at some statistics for how things went this time around. It was another fairly ordinary release cycle for the most part, but with a couple of distinctive features.

As of 4.18-rc6, 12,879 non-merge changesets had found their way into the mainline repository. This work was contributed by 1,668 developers who added 553,000 lines of code and removed 652,000 lines, for a net reduction of 99,000 lines. This will be the fourth time in the project's history that a release is smaller than its predecessor — and the first time that this has happened for two releases in a row. Of those 1,668 developers, 226 made their first contribution to the kernel this time around; that is the smallest number of first-time contributors since 4.5 was released in March 2016.

More generally, the number of first-time contributors to each release since 3.0 looks like this:

[First-time contributors]

While the number of new contributors varies a bit over time, it has remained consistently between 200 and 300 for each development cycle for a long time. New contributors are important to the health of any development project, so it is good that the kernel continues to attract developers over time.

The most active developers for 4.18 were:

Most active 4.18 developers

  By changesets
    Christoph Hellwig            218   1.7%
    Sergio Paracuellos           203   1.6%
    Ben Skeggs                   162   1.3%
    Mauro Carvalho Chehab        159   1.2%
    Colin Ian King               137   1.1%
    Geert Uytterhoeven           112   0.9%
    Chris Wilson                 111   0.9%
    Christian Lütke‑Stetzkamp    109   0.8%
    Arnaldo Carvalho de Melo     108   0.8%
    Arnd Bergmann                106   0.8%
    Ajay Singh                   106   0.8%
    Fabio Estevam                 94   0.7%
    David Ahern                   87   0.7%
    Neil Brown                    83   0.6%
    Masahiro Yamada               81   0.6%
    Darrick J. Wong               77   0.6%
    Hans de Goede                 75   0.6%
    Quytelda Kahja                75   0.6%
    Jakub Kicinski                69   0.5%
    Wolfram Sang                  68   0.5%

  By changed lines
    Greg Kroah-Hartman        207274  20.1%
    Sakari Ailus              168085  16.3%
    Eric Biggers               32062   3.1%
    Ben Skeggs                 17368   1.7%
    Ondrej Mosnacek            15787   1.5%
    Christoph Hellwig          10553   1.0%
    Srinivas Kandagatla         9984   1.0%
    Ian Kent                    7834   0.8%
    Alexandre Belloni           6801   0.7%
    Martin KaFai Lau            6518   0.6%
    John Fastabend              6479   0.6%
    Oleksandr Andrushchenko     6203   0.6%
    Steven Eckhoff              5993   0.6%
    Felix Kuehling              5886   0.6%
    Mathieu Desnoyers           5626   0.5%
    Dave Chinner                5588   0.5%
    Kai Chieh Chuang            5584   0.5%
    Manivannan Sadhasivam       5311   0.5%
    Christian Lütke‑Stetzkamp   5272   0.5%
    Niklas Söderlund            5112   0.5%

Christoph Hellwig ended up at the top of the per-changesets list with work throughout the block, virtual filesystem, and driver subsystems, including the since-reverted AIO polling interface. Sergio Paracuellos made a number of improvements to a couple of staging drivers, Ben Skeggs did a great deal of work on the Nouveau driver as usual, Mauro Carvalho Chehab's work was mostly in the media subsystem (of which he is the maintainer), and Colin Ian King continued his work fixing spelling errors and similar issues throughout the tree.

In the lines-changed column, Greg Kroah-Hartman's top spot resulted from the removal of the Lustre and ncpfs filesystems. Sakari Ailus removed the atomisp driver (which, like Lustre, was in staging), Eric Biggers did a bunch of cryptography-related work (removing a bunch of code in the process), and Ondrej Mosnacek added some new optimized crypto algorithm implementations.

The developers working on 4.18 were supported by 233 companies that we were able to identify. The most active employers this time around were:

Most active 4.18 employers

  By changesets
    Intel                       1218   9.5%
    (None)                      1008   7.8%
    Red Hat                      965   7.5%
    (Unknown)                    718   5.6%
    AMD                          587   4.6%
    IBM                          553   4.3%
    Linaro                       485   3.8%
    Renesas Electronics          443   3.4%
    Google                       380   3.0%
    SUSE                         371   2.9%
    Samsung                      356   2.8%
    (Consultant)                 335   2.6%
    Mellanox                     281   2.2%
    Huawei Technologies          266   2.1%
    Oracle                       255   2.0%
    Facebook                     226   1.8%
    Orbital Critical Systems     203   1.6%
    Bootlin                      184   1.4%
    Code Aurora Forum            183   1.4%
    Canonical                    176   1.4%

  By lines changed
    Intel                     229121  22.2%
    Linux Foundation          208382  20.2%
    Red Hat                    58057   5.6%
    Google                     49540   4.8%
    AMD                        35006   3.4%
    (None)                     31371   3.0%
    Linaro                     29845   2.9%
    (Unknown)                  26953   2.6%
    IBM                        24816   2.4%
    Renesas Electronics        23568   2.3%
    Bootlin                    20972   2.0%
    Code Aurora Forum          19634   1.9%
    Facebook                   17391   1.7%
    Samsung                    17185   1.7%
    (Academia)                 16786   1.6%
    (Consultant)               13790   1.3%
    Mellanox                   13353   1.3%
    MediaTek                   12135   1.2%
    SUSE                       10309   1.0%
    Oracle                      9105   0.9%

If a developer applies a Signed-off-by tag to a patch that they are not the author of, it usually means that said developer was the maintainer who applied the patch and set it on the path to mainline inclusion. Looking at non-author signoffs (and associated employers) for 4.18 yields a table that looks like this:

Non-author signoffs in 4.18

  By developer
    David S. Miller             1304  10.7%
    Greg Kroah-Hartman          1117   9.2%
    Alex Deucher                 477   3.9%
    Mark Brown                   362   3.0%
    Mauro Carvalho Chehab        346   2.9%
    Martin K. Petersen           291   2.4%
    Daniel Borkmann              261   2.2%
    Kalle Valo                   261   2.2%
    Michael Ellerman             235   1.9%
    Simon Horman                 183   1.5%
    Andrew Morton                173   1.4%
    Jens Axboe                   171   1.4%
    Jonathan Cameron             169   1.4%
    Ingo Molnar                  162   1.3%
    David Sterba                 159   1.3%
    Rafael J. Wysocki            141   1.2%
    Thomas Gleixner              139   1.1%
    Alexei Starovoitov           127   1.0%
    Linus Walleij                125   1.0%
    Hans Verkuil                 121   1.0%

  By employer
    Red Hat                     2242  18.5%
    Linux Foundation            1135   9.4%
    Linaro                       959   7.9%
    Intel                        928   7.6%
    AMD                          572   4.7%
    Samsung                      489   4.0%
    Google                       441   3.6%
    IBM                          439   3.6%
    SUSE                         402   3.3%
    Oracle                       391   3.2%
    Huawei Technologies          380   3.1%
    Facebook                     340   2.8%
    Code Aurora Forum            316   2.6%
    (None)                       305   2.5%
    Mellanox                     271   2.2%
    Renesas Electronics          270   2.2%
    ARM                          204   1.7%
    Bootlin                      169   1.4%
    (Unknown)                    158   1.3%
    linutronix                   153   1.3%

It can be instructive to compare these numbers to those that were published for 2.6.24 in 2008. Many of the names in the left column were the same, though the ordering has changed; Andrew Morton, for example, had 1,679 non-author signoffs in 2.6.24. Many of the employer names are the same as well. But, in 2008, just over half of the non-author signoffs were made by developers working for two companies: Red Hat and the Linux Foundation. In 2018, those two organizations retain the top positions in the table, but one has to add up the top six companies to reach 50% of the total. The process has been slow, but the concentration of maintainers in a small number of companies has been decreasing over time.

Finally, with regard to test and review credits, the numbers are:

Test and review credits in 4.18

  Tested-by
    Andrew Bowers                 57   7.7%
    Nicholas Piggin               43   5.8%
    Marek Szyprowski              34   4.6%
    Arnaldo Carvalho de Melo      21   2.8%
    Aaron Brown                   15   2.0%
    Angelo Dureghello             14   1.9%
    Mathieu Malaterre             14   1.9%
    Randy Dunlap                  13   1.7%
    Ard Biesheuvel                13   1.7%
    Dmitry Osipenko               12   1.6%
    Vijaya Kumar K                12   1.6%
    Xiongfeng Wang                12   1.6%
    Tomasz Nowicki                12   1.6%
    Nguyen Viet Dung              11   1.5%
    Jarkko Sakkinen                8   1.1%
    Song Liu                       8   1.1%
    Geert Uytterhoeven             7   0.9%

  Reviewed-by
    Alex Deucher                 158   3.2%
    Rob Herring                  153   3.1%
    Geert Uytterhoeven           115   2.3%
    Christian König              104   2.1%
    Darrick J. Wong              103   2.1%
    Christoph Hellwig             99   2.0%
    David Sterba                  95   1.9%
    Neil Brown                    90   1.8%
    Laurent Pinchart              87   1.7%
    Simon Horman                  84   1.7%
    Tony Cheng                    70   1.4%
    Andrew Morton                 61   1.2%
    Hawking Zhang                 55   1.1%
    Hannes Reinecke               51   1.0%
    Brian Foster                  51   1.0%
    Chris Wilson                  46   0.9%
    Mika Kuoppala                 46   0.9%

The 4.18 kernel appears to be on track for an August 5 release, assuming no severe last-minute problems turn up. Once again, it will be the work of a huge community of developers, all of whom have managed to come together to improve this common resource. For all its faults, the kernel development community continues to function like a well-tuned machine, producing and integrating code at a pace that few other projects can match.


A kernel event notification mechanism

By Jonathan Corbet
July 25, 2018
The kernel has a range of mechanisms for notifying user space when something of interest happens. These include dnotify and inotify for filesystem events, signals, poll(), tracepoints, uevents, and more. One might think that there would be little need for yet another, but there are still events of interest that user space can only learn about by polling. In an attempt to fix this problem, David Howells, not content with his recent attempt to add seven new system calls for filesystem mounting, has put forward a proposal for a general-purpose event notification mechanism for Linux.

The immediate use case for this mechanism is to provide user space with notifications whenever the filesystem layout changes as the result of mount and unmount operations. That information can be had now by repeatedly reading /proc/mounts, but doing so evidently can impair the performance of the system. The patch set also provides for superblock-level events, such as I/O errors, filesystems running out of space, or processes running into disk quotas. Finally, the ability to watch for changes to in-kernel keys or keyrings is also included.

The BSD world has long had the kqueue() and kevent() system calls for this purpose. Naturally, the mechanism proposed by Howells looks nothing like that API. It is, instead, seemingly designed for performance even with high event rates; to get there, user space must set up and manage a circular buffer that is used to transfer events from the kernel. (As an aside, the kernel already has a whole set of circular-buffer mechanisms for perf events, ftrace events, network packets, and more. This patch set adds yet another. It would have been nice, years ago, to create a single abstraction for these buffers so that a set of library functions could be provided to work with all of them, but that ship sailed some time ago.)

Setting up the event buffer

There is no system call dedicated to setting up the event buffer; instead, the first step is to open a special device (/dev/watch_queue) for that purpose. User space then uses ioctl() to configure this buffer, starting with the IOC_WATCH_QUEUE_SET_SIZE command to set its size (in pages). The application will need to call mmap() on the device file descriptor to map the event buffer into its address space.
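
Putting those steps together, the setup might look something like the following sketch; it follows the API as described in the patch set, so the device name and ioctl() command could change before any eventual merge, and error checking is omitted:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* struct watch_queue_buffer (shown below) and the ioctl() commands
     * would come from a UAPI header provided by the patch set. */
    static struct watch_queue_buffer *setup_watch_buffer(int *fdp, int pages)
    {
        int fd = open("/dev/watch_queue", O_RDWR);

        ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, pages);
        *fdp = fd;
        return mmap(NULL, pages * getpagesize(),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }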

Then, the application needs to arrange for events of interest to be delivered into this buffer. There are actually two separate tasks that must be done here: asking for events to be delivered, and configuring a filter to control which events actually make it into the ring buffer. Requesting delivery is dependent on the event type. For events related to keys, there is a new command for the keyctl() system call:

    int keyctl(KEYCTL_WATCH_KEY, key_serial_t id, int buffer,
               unsigned char watch_id);

where id identifies the key of interest, buffer is the file descriptor for the event buffer, and watch_id is an eight-bit identifier that will appear in any generated events. For filesystem topology events, a new system call is used:

    int mount_notify(int dfd, const char *path, unsigned int flags,
    		     int buffer, unsigned char watch_id);

Here, dfd and path identify the mount point, flags is one of the AT_* flags controlling how path is followed, buffer is the file descriptor for the event buffer, and watch_id is the user-supplied identifier. For superblock events, a similar system call has been added:

    int sb_notify(int dfd, const char *path, unsigned int flags,
    		  int buffer, unsigned char watch_id);

No doubt there will be other types of notifications added in the future if this mechanism makes it into the mainline kernel.
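
To give a flavor of how one of these calls would be wired up, here is a hedged sketch: a brand-new system call has no glibc wrapper, so user space would go through syscall(), and the __NR_sb_notify number used here is purely a placeholder:

    #include <fcntl.h>          /* AT_FDCWD */
    #include <sys/syscall.h>
    #include <unistd.h>

    static int watch_root_sb(int buffer_fd)
    {
        /* Watch the superblock behind "/", tagging its events with
         * watch_id 1; buffer_fd is the /dev/watch_queue descriptor
         * set up earlier. */
        return syscall(__NR_sb_notify, AT_FDCWD, "/", 0, buffer_fd, 1);
    }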

Each of the calls above will generate notifications for a number of different event types. For example, superblock events in the current patch set include "filesystem was toggled between read/write and read-only", "I/O error", "disk quota exceeded", and "network status change". The requesting application may not be interested in all of these event types. Getting the right ones requires setting up a filter, which is done by filling in a watch_notification_filter structure:

    struct watch_notification_type_filter {
	__u32	type;			/* Type to apply filter to */
	__u32	info_filter;		/* Filter on watch_notification::info */
	__u32	info_mask;		/* Mask of relevant bits in info_filter */
	__u32	subtype_filter[8];	/* Bitmask of subtypes to filter on */
    };

    struct watch_notification_filter {
	__u32	nr_filters;		/* Number of filters */
	__u32	__reserved;		/* Must be 0 */
	struct watch_notification_type_filter filters[];
    };

For each entry in the filters array, type identifies the subsystem source of the event (WATCH_TYPE_MOUNT_NOTIFY, WATCH_TYPE_KEY_NOTIFY, or WATCH_TYPE_SB_NOTIFY in the current patch set), while subtype_filter is a bitmask indicating the specific events that the application is interested in — notify_key_instantiated, notify_mount_unmount, or notify_superblock_error, for example. The info_filter field can be used to further filter on event-specific information; it can be used to catch mount-point transitions to read/write, for example, while ignoring transitions to read-only.

The IOC_WATCH_QUEUE_SET_FILTER ioctl() command must be used to set the filter once the description is ready. At that point, events can be delivered into the circular buffer.
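
Constructing such a filter might look like the sketch below; the subtype constant notify_superblock_error and its use as a bit number are assumptions based on the names quoted above:

    #include <stdlib.h>
    #include <sys/ioctl.h>

    /* Ask for superblock events only, and only the "error" subtype. */
    static int set_sb_error_filter(int fd)
    {
        size_t sz = sizeof(struct watch_notification_filter)
                  + sizeof(struct watch_notification_type_filter);
        struct watch_notification_filter *f = calloc(1, sz);
        int ret;

        f->nr_filters = 1;
        f->filters[0].type = WATCH_TYPE_SB_NOTIFY;
        f->filters[0].subtype_filter[0] = 1 << notify_superblock_error;
        ret = ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, f);
        free(f);
        return ret;
    }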

Receiving events

The buffer itself is defined with this structure:

    struct watch_queue_buffer {
	union {
	    /* The first few entries are special, containing the
	     * ring management variables.
	     */
	    struct {
		struct watch_notification watch; /* WATCH_TYPE_SKIP */
		volatile __u32	head;		/* Ring head index */
		volatile __u32	tail;		/* Ring tail index */
		__u32		mask;		/* Ring index mask */
	    } meta;
	    struct watch_notification slots[0];
	};
    };

The union setup may look a bit strange; it is designed so that the meta information looks like a special type of event entry that will be automatically skipped over by code reading through the buffer. The head index points to the first free slot (where the kernel will write the next event), while tail points to the first available event. User space can adjust the tail pointer only. If head and tail are equal, the buffer is empty.

The actual events look like:

    struct watch_notification {
	__u32			type:24;
	__u32			subtype:8;
	__u32			info;
    };

The type and subtype fields describe the specific event; info is rather more complicated, though, being made up of several fields that must be masked to be used. For example, events can take up more than one slot in the buffer; masking with WATCH_INFO_LENGTH yields the number of slots used. Use WATCH_INFO_ID to get the watch_id value provided when the event was requested. Also crammed into info are flags to indicate buffer overruns or lost events, and a bunch of event-specific flags. The info_filter in the filter set up by user space can filter on most of the fields within info.

Once all that is set up, it's just a matter of watching head and tail (using appropriate barrier operations) to detect when there are events in the structure to be consumed. It is also possible to call poll() on the buffer file descriptor to wait for new events to arrive.
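
A consumer loop following those rules might look like this sketch; slot_count() and handle_event() are hypothetical helpers, and the extraction of the slot count from info is schematic (the real code masks and shifts with WATCH_INFO_LENGTH):

    #include <linux/types.h>

    void drain(struct watch_queue_buffer *buf)
    {
        for (;;) {
            __u32 head = __atomic_load_n(&buf->meta.head, __ATOMIC_ACQUIRE);
            __u32 tail = buf->meta.tail;

            if (head == tail)
                break;          /* empty; one could poll() on the fd here */
            struct watch_notification *n =
                    &buf->slots[tail & buf->meta.mask];
            if (n->type != WATCH_TYPE_SKIP)
                handle_event(n);
            /* Publish the new tail only after the event is consumed. */
            __atomic_store_n(&buf->meta.tail, tail + slot_count(n->info),
                             __ATOMIC_RELEASE);
        }
    }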

This is the first posting of this patch set, and the work is clearly still changing quickly; that can be seen in the fact that the API descriptions in the changelogs appear to be from a previous version and do not match what the code actually implements. Anybody interested in how this API looks from user space can look at the example program included with the patch set. About the only comment so far has come from Casey Schaufler, who is concerned about how the mechanism interacts with security modules, and about keeping users from receiving events that they should not see.

These patches are clearly intended to create a general-purpose mechanism that could be used throughout the kernel, so they will need a fair amount of review before they can be accepted. Changes seem likely. If the inevitable concerns can be addressed, Linux may yet have a general event-notification mechanism, even if we'll never get kevent() and kqueue().


Replacing AWK with Python in GCC?

By Jake Edge
July 25, 2018

GCC has a lot of command-line options—so many, in fact, that its build process does a fair amount of processing using AWK to generate the option-parsing code for the compiler. But some find the AWK code to be difficult to work with. A recent post to the GCC mailing list proposes replacing AWK with Python in the hopes of more maintainable option-parsing generation in the future.

Martin Liška raised the idea on July 17 to gauge the reaction of the GCC development community to a switch away from AWK; there are a number of cleanups that he would like to make, but he doesn't want to make them in AWK. One problem that he noted is that the .opt file format used for specifying the options is not well specified, so part of what he would like to do is to clean that up; that should make those files easier to parse, both for the option-processing scripts and for the targets (e.g. ARM) that generate them. There are other problems with the AWK code, he said, including too few sanity checks on the options and the code generally being "quite unpleasant to make any adjustments" to.

His post was accompanied by a Python script that can build code from an optionlist file. That file is created as part of the process of building GCC by combining all of the options in various .opt files (using an AWK script that would presumably be replaced as well). Right now, the optionlist file is processed by other AWK scripts to generate .c and .h files that will do the option parsing. The .opt files specify the names, types, and arguments for the GCC command-line options.

While there was not a huge amount of opposition to Python per se, there were a number of questions about requiring it to build GCC. There were questions about which version of Python to require (Liška's script uses Python 3) as well as how difficult it will be to bring up Python on targets that do not have it available. For example, Karsten Merker was concerned that a Python dependency would make it harder to bring up GCC on new target architectures, but Matthias Klose said that building Python should not be any more of a burden than building AWK. In addition, Python can be cross-built easily, unlike two other languages that were mentioned along the way: "you can cross build python as well more easily than for example perl or guile".

A bigger problem would seem to be for Windows, where building Python using GCC is not supported, Vadim Konovalov said. That's all a bit circular—using GCC to build Python to build GCC—but using GCC to build itself is a time-honored tradition. There are various binary versions of Python for Windows available, however.

The Python version question is also something that would need to be resolved. There are lots of systems out there that only have Python 2 available, particularly those running enterprise Linux distributions. But creating Python programs that can run on either Python 2 or 3 is certainly possible—and perhaps desirable. Eric S. Raymond said: "It's not very difficult to write 'polyglot' Python that is indifferent to which version it runs under." He pointed to a FAQ/HOWTO that he co-authored, which documented the techniques he has used for projects like reposurgeon.

Those methods do not provide compatibility with Python 2.6, though, which is the version available in some of the older (but still supported) enterprise distributions. Raymond pointed out that Python 2.6 stopped being supported by the Python core developers in 2013; beyond that, the lack of 2.6 support has not been a problem for his projects:

The HOWTO introduction does say that its techniques won't guarantee 2.6 compatibility. That would have been a great deal more difficult - some 3.x syntax backported into 2.7.2 makes a large difference here.

In practice, no deployment of reposurgeon or src or doclifter or any of the other polyglot Python code I maintain has tripped over this, or at least I'm not seeing issue reports about it.

Perl is already required for building certain targets, so some wondered if it would make more sense to replace AWK with it instead of Python. There was also mention of using the GNU Project's preferred extension language, Guile. But there are a number of advantages to choosing Python, not least that Liška is offering to do the work for that switch. He also elaborated on why he wants to move away from AWK:

Yes, using Python is mainly because of object-oriented programming paradigm. It's handy to have encapsulation of functionality in methods, one can do unit-testing of parts of the script. Currently AWK scripts are mix of input/output transformation and various emission of printf('#error..') sanity checks. In general the script is not easily readable and contains multiple global arrays that simulate encapsulation in classes.

Others agreed with the idea of switching away from AWK and thought that Python was a good choice. Several in the thread said that they found AWK hard to read. Paul Koning noted that he supports switching to Python:

In roughly 40 years, and roughly 40 programming languages, I've only twice encountered a language where I could go from knowing nothing at all to writing a substantial real world program in just one week: Pascal (in college) and Python (about 15 years ago). This is why Python became my language of choice whenever I don't need the speed or small memory footprint of C/C++.

Joseph Myers also supported the switch. He had some suggestions for features that are not part of the current AWK scripts, as well.

More generally, I don't think there are any checks that flags specified for options are known flags at all; I expect a typo in a flag to result in it being silently ignored.

Common code that reads .opt files into some logical datastructure, complete with validation including that all flags specified are in the list of valid flags, followed by converting those structures to whatever output is required, seems appropriate to me.

As noted in his first message, Liška thinks the decision will ultimately need to be made by the GCC steering committee (which Myers is a member of). Given the reaction to Liška's RFC of sorts, it would seem likely that a decision would be favorable. He said that he is targeting GCC 10 for these changes, which is still two years or so out at this point. A more full-featured scripting language (and someone actively working on making the option handling better) will serve the project well down the road.


Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: NetBSD 8.0; libinput; oomd; Python in The Economist; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds