
LWN.net Weekly Edition for January 6, 2022

Welcome to the LWN.net Weekly Edition for January 6, 2022

This edition contains the following feature content:

  • LWN's unreliable predictions for 2022
  • Restricting SSH agent keys
  • Another Fedora integrity-management proposal
  • User-managed concurrency groups

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

LWN's unreliable predictions for 2022

By Jonathan Corbet
January 3, 2022
It is 2022 already, and that can only mean one thing: it's time for your editor to make a fool of himself by posting a set of predictions for what may come in the new year. One should never pass up an opportunity for a humbling experience, after all. There can be no doubt that interesting things will happen this year; let's see how many random darts thrown in that direction can hit close to the mark.

Starting with something that is, hopefully, fairly obvious: 2022 will see a wider awareness that maintainers need support for free-software projects to be healthy. It has been a while since companies working with free software realized that they needed to support the developers of that software; that is the path toward stronger projects and better influence over how those projects evolve. But even the projects with the most economic support struggle to support their maintainers, and the effects can be felt across the entire community. The ongoing Log4j debacle is just the latest symptom of this problem.

Supporting maintainers can be a hard sell for a corporate manager. Developers can focus most of their time directly on their employers' needs, but maintainers have to make the project work for all participants, including their employers' competitors. The value of their contribution is harder to quantify. But the cost of neglected maintenance is high and growing, and the smarter companies will start to figure this out.

This support will also take the form of a greater willingness to pay for supported free-software products in areas where that has not generally happened. The recent announcement that support for GnuPG is selling well is a case in point. This critical project has languished for years, depending on donations from individuals; maintainer Werner Koch is now telling donors that their support is no longer needed.

The browser wars will return with a vengeance as the Chrome browser builds on its dominance and increasingly serves its owner's agenda. The Firefox browser saved us from an oppressive browser monopoly once; it now seems that only Firefox is in a position to do that again. A single-browser world is not a good result, even if that browser were not owned by a large advertising company. Awareness of this problem, and efforts to fight it, will grow in 2022; whether Mozilla can overcome its own problems and rise to the challenge again remains to be seen, though.

Use of centralized, proprietary services will be a bone of contention in 2022, much like it was in 2021. Whether they are Git forges, fallback DNS servers, or content-delivery networks, free-software projects will find it hard to work efficiently without these services, but they will also be uncomfortable depending on them. In the absence of freer alternatives, though, the trend toward proprietary services is likely to continue. Keeping a project going is hard enough as it is; requiring projects to maintain unrelated services will not make it easier.

The 6.0 kernel will be released in 2022, with December 4 being the most likely release date (though it could happen in early October if Linus Torvalds decides to stop 5.x at 5.19 rather than 5.20). As always, there will be nothing special about 6.0; it will be just another kernel release, but the dot-zero release number will look like a milestone anyway.

Support for kernel modules written in Rust will be merged in 2022, but not before we have to endure at least one more long discussion on whether adding a new language makes sense. Some developers will resist the burden of learning a new language, while others will repeat strange theories about how adding Rust is a sign of some sort of corporate takeover of the project. But the kernel project needs to be looking at safer technologies, and Rust seems well positioned to address that need.

Python will lose its infamous global interpreter lock (GIL) in 2022. This change will certainly not be in the 3.11 release, due in October, but it will be on a clear track for merging into a later release. Just like the big kernel lock, Python's global lock seems entrenched and impossible to remove, but dedicated developers are getting there.

GNU projects will continue to push toward independence from the Free Software Foundation. This could be seen in 2021 when the GCC and GNU C Library projects chose to drop the FSF's longstanding copyright-assignment requirement. Maintainers in the Emacs project, arguably the one that remains most firmly under Richard Stallman's control, are getting grumpier about that requirement as well, and the project as a whole is under pressure to change its processes, some of which were established in the 1980s. It is a stretch, but 2022 may be the year that Emacs, too, starts making more of its own decisions.

Machine learning will play a bigger role in free-software development. Much of the commercial use of machine learning is built on free software, of course, but our community makes relatively little use of that technology. There are opportunities in many areas, including patch review and code generation, that should be pursued. Investments in our tools tend to pay off in a big way, and there is reason to believe that can happen with machine learning as well.

More importantly (but more of a stretch): machine learning will become more widely available outside of large proprietary services. Not that long ago, building and maintaining a comprehensive, global map database looked like something only large companies could do; now many of those companies depend on OpenStreetMap instead. Machine-learning applications can require massive amounts of CPU power and can raise interesting intellectual-property questions, but they should not be limited to corporate data centers. Free software is going to require free models to work with; the alternative is likely to be that whole problem domains will lack free solutions.

Linux may lose some embedded market share this year to competitors like Fuchsia. Alternative systems will continue to find it hard to compete with Linux's massive development community, but they offer advantages like relative simplicity, permissive licensing, and greater amenability to corporate control. These systems will not displace Linux in a big way in 2022, but they may well make some inroads around the edges.

Some other notes

It seems strange to have a set of predictions for the year that doesn't mention COVID, but it has become increasingly clear that nobody has a clue what will happen in that regard. The free-software community has been lucky in that COVID has not hit us that hard, so far. Let us hope that continues as we, with luck, start to put this pandemic behind us.

Finally, LWN will begin its 25th year of publication in late January. A quarter of a century ago, we were convinced that Linux had a bright future, but we could have never imagined where Linux would end up in 2022 — or that LWN would still be a part of it. It has been a great thing to participate in, and the outstanding community of readers that has come together here is a big part of that.

Expect some changes at LWN over the coming year as we work out how to respond to Rebecca Sobol's retirement and, more generally, how to position LWN for the next 25 years. The free-software community has a lot of work to do still, and we don't plan to miss it. Meanwhile, please accept our best wishes for a 2022 that, with any luck at all, will be far better than its immediate predecessors.

Comments (80 posted)

Restricting SSH agent keys

By Jake Edge
January 5, 2022

The OpenSSH suite of tools for secure remote logins is used widely within our communities; it also underlies things like remote Git repository access. A recent experimental feature for the upcoming OpenSSH 8.9 release will help close a security hole that can be exploited by attacker-controlled SSH servers (e.g. sshd) when the user is forwarding authentication to a local ssh-agent. Instead of allowing the keys held in the agent to be used for authenticating to any host where they might work, SSH agent restriction will allow users to specify where and how those keys can be used.

The ssh-agent is used to simplify making repeated connections to the same host; it stores and manages SSH keys so that the passphrases protecting them do not need to be entered each time a connection is made. Normally, the passphrase is used once to unlock a key, which then gets stored by the agent when ssh-add is used; alternatively, the ssh_config option AddKeysToAgent can be used for the same purpose. ssh-agent is a "deliberately simple program" since it holds private keys. Damien Miller's description of the new feature (linked above) described it this way:

It speaks a simple, client-initiated protocol with a small number of operations including adding or deleting keys, retrieving a list of public halves for loaded keys and, critically, making a signature using a private key. Most interactions with the agent are through the ssh-add tool for adding, deleting and listing keys and ssh, which can use keys held in an agent for user authentication, but other tools can also be used if they speak the protocol.

Since the agent contains high-value keys, it is "a desirable and frequently-exploited target for attackers". The agent is only accessible from the local system, which greatly limits the attack surface, unless access to the agent has been forwarded to a remote system. Using the -A option to ssh (or the ForwardAgent configuration directive) will arrange for the remote host to be able to communicate with the local agent. That remote host can then perform all of the agent operations in the same way that local programs can.

This kind of agent forwarding is generally used so that multi-hop SSH connections can be made without needing to re-enter the passphrase to unlock the key on the remote host—possibly many times. It also means that the private keys do not need to be stored on the remote hosts. A user who remotely connects to HostA, and from there to one or more other hosts using the same SSH key, will likely find it convenient to make the initial SSH connection to HostA with agent-forwarding enabled; SSH connections from HostA may extend the agent-forwarding path, as well.
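For reference, both forwarding and automatic key loading can be set per host in the user's SSH configuration; a hypothetical ~/.ssh/config fragment (the host name is illustrative, and the directives shown are ForwardAgent and AddKeysToAgent as mentioned above) might look like:

```
# ~/.ssh/config -- "HostA" is an illustrative name
Host HostA
    ForwardAgent yes        # equivalent to ssh -A for this host
    AddKeysToAgent yes      # load the key into the agent on first use
```

Limiting ForwardAgent to specific Host stanzas, rather than enabling it globally, keeps the agent from being exposed to hosts that do not need it.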

The problem occurs when agent access is forwarded to an attacker-controlled system, which can then use the keys stored in the agent to authenticate to any other host that the user's keys give access to. So the user may only be using forwarding for HostA, HostB, and HostC, but their key will also grant access to HostV or HostZ, which an attacker may want to target for some reason. Currently, SSH has no way to restrict how the keys held by the agent can be used; that is the problem that the new feature is meant to address.

Part of the solution is separating local access from remote access to the agent, so that some keys can only be used from the local system even if agent access is forwarded. Arguably, conflating those two types of agent access was a mistake made long ago, so being able to add keys with restrictions on how they can be used will help to rectify that. A new -h option has been created for ssh-add to describe the legal uses of a key, as an example from the feature description shows:

These extensions allow the user to add destination constraints to keys they add to a ssh-agent and have ssh enforce them. For example, this command:
    $ ssh-add -h "perseus@cetus.example.org" \
              -h "scylla.example.org" \
              -h "scylla.example.org>medea@charybdis.example.org" \
              ~/.ssh/id_ed25519
Adds a key that can only be used for authentication in the following circumstances:
  1. From the origin host to scylla.example.org as any user.
  2. From the origin host to cetus.example.org as user perseus.
  3. Through scylla.example.org to host charybdis.example.org as user medea.
Attempts to use this key to authenticate to other hosts will be refused by the agent because they weren't explicitly listed, as will an attempt to authenticate through scylla.example.org to cetus.example.org because the path was not permitted. Likewise, trying to authenticate as any other user than perseus to cetus.example.org or medea to charybdis.example.org would fail because the destination users are not permitted.

More complicated paths can be specified, but each hop needs to be listed in its own -h option. So a multi-hop path might look something like:

    $ ssh-add -h "HostA" \
              -h "HostA>HostB" \
              -h "HostB>HostC" \
              key-file
It should be noted that an agent configuration like the above would not allow the agent's key to be used to go directly from the local system to HostB or HostC; they could only be reached via the appropriate hops. The user may still be able to bypass the agent and fall back to typing in the passphrase for the key when prompted by ssh in order to go directly to HostB or HostC, however.

The new feature requires updates to the client-side tools, but also needs updated SSH servers on the remote systems. The agent protocol needed to change to incorporate the server host key in the authentication requests, so older SSH servers will not be able to participate in the new scheme. The feature will "fail safe" if ssh-add or the SSH server does not support the agent restrictions; ssh-add will fail if it does not understand the destination constraints and the agent will decline authentication requests that are not sent with the host key.

There are some caveats. The biggest is that attackers can still hijack the agent connection so they can request authentication to hosts (and users) that have been authorized, but from different hosts than expected:

Less obviously, they will also be able to forward use of the agent to other hosts, e.g. by using an SSH implementation that doesn’t cooperate with ssh-agent, or another tool entirely, such as socat. Note that the attacker isn’t gaining any new access to keys here, they are still forced to act via the compromised host and their access is still restricted to the keys that were permitted for use through the intended host only.

[...] Because of these subtleties, it’s better to think of key constraints as permitting use of a key through a given host rather than as from a particular host, and, more generally, that any forwarding path is only as strong as its weakest link. Another helpful way to think about key constraints is that each one represents a delegation of a key to a host, that is only slightly more trustworthy than the delegate is.

Overall, this seems like a welcome addition to the SSH toolbox; the restrictions provided will be useful. It is nice to know that agent forwarding will no longer provide carte blanche access to any host where the key will work. The document describing the feature is admirably detailed, with a look at the implementation details, plans for the future, and more. Interested readers are encouraged to take a look.

Comments (19 posted)

Another Fedora integrity-management proposal

By Jake Edge
January 4, 2022

File-integrity management for the Fedora distribution has been the overarching theme of a number of different feature proposals over the last year or so. In general, they have been met with skepticism, particularly with regard to how well the features mesh with Fedora's goals, but also in how they will change the process of building RPM packages. A new proposal that would allow systems to (optionally) perform remote attestation is likewise encountering headwinds; there are several different concerns being raised in the discussion of it.

Background

As is usual for feature proposals, Fedora program manager Ben Cotton posted it to the Fedora devel mailing list on behalf of the feature owner: Roberto Sassu. The change proposal is also on the Fedora wiki. The new feature would use the Digest Lists Integrity Module (DIGLIM) feature, which has been proposed by Sassu as an addition to the kernel's Integrity Measurement Architecture (IMA). Ensuring that file contents and metadata do not change in unexpected ways is IMA's job; DIGLIM is an optimization of sorts to IMA.

IMA has a number of different functions, but at its core it maintains "digests" of file contents and metadata; these digests are cryptographic hashes that can be used to reliably detect file changes. IMA can also use the digests, in combination with the system's Trusted Platform Module (TPM), to calculate a value that proves that the system is running a known set of software. That value can be used to ensure the system has been securely booted or it can be sent elsewhere to remotely attest to the state of the system.

Each file being protected by IMA needs its digest stored with the file, which is normally done using extended attributes in the filesystem. IMA can be configured to check each file before it is accessed to see if its digest still matches the stored value; if not, access can be denied. As files are assessed, their digest can be submitted to the TPM to extend a Platform Configuration Register (PCR); the resulting value is a reflection of the files measured, but it is also affected by the order of the accesses.

According to the DIGLIM proposals (for Fedora and the patch set for the kernel), parallel execution during the assessment results in differing values from the TPM; even if the same code is used, it may result in a different attestation value. DIGLIM provides a mechanism to take a digest value of all of the files installed, instead, and use that for calculating the attestation value. Only files that have digests that were not included in the overall "installation digest" would be used to further extend the PCR in the TPM.

It does so by providing a mechanism to enroll digest values from the installed files into a kernel "digest list", which can then be consulted as files are accessed. If the digest of a file appears on the list, it can be considered to be unchanged and its digest value does not get submitted to the TPM; otherwise, the file has been modified or was not included in the digest list at all, so access could be denied and the file's digest added into the attestation value. The latter would likely mean that the system fails its attestation.

One might expect that adding digests for all of the files installed on a system would be expensive in terms of memory, but it turns out to have fairly modest impact. "A repository of executables and shared libraries digests, of a Fedora 33 minimal installation, occupies less than 800k of memory."

Instead of calculating the digest values for each file, DIGLIM for Fedora would use the values that are already calculated and stored in the RPM header for the packages that get installed. Those values are signed with the GPG key for the Fedora release, so they can be trusted. The kernel needs to be able to verify GPG signatures, which is part of the DIGLIM kernel feature; the Fedora release GPG public key could then be added to the kernel keyring and used for verification. The RPM headers would be processed early in the boot sequence: "A user space parser, executed by the kernel during early boot, parses RPM headers found in /etc/diglim in the initial ram disk (included with a custom dracut script) and uploads them to the kernel."

In part, the idea behind DIGLIM is to provide a way for systems to handle integrity management without the distributions having to be intimately involved in the process. For systems where remote attestation is wanted, there will be ways to achieve it without Fedora (or any other distribution) having to directly manage the process. DIGLIM would also allow the detection of tampering with the installed files, as with IMA, but it can do so with better performance than the standard IMA mechanism, according to the kernel patch description.

There is a fairly long list of kernel prerequisites that need to be maintained "and possibly have them accepted in the upstream kernel". In fact, as the discussion thread made clear, Fedora is unlikely to adopt the feature unless those patches do make their way upstream. There is some additional work that needs to be done for Fedora so that the RPM headers get processed properly for those systems that wish to enable the feature, which is the other piece of the proposal.

Reaction

The optional nature of the feature, and the generally fairly narrow use cases for it, mean that it will have limited impact for the vast majority of Fedora users. It also means that it will have limited—or no—utility for those users. Some of those with concerns, or objections, seemed to misunderstand that it would only be enabled by user choice, not distribution fiat. For example, Kevin Kofler said: "[...] I do not see how the 'feature' implemented by this Change provides any value at all that does not contradict the very definition of Free Software." He is concerned about being unable to install software built by a third party or by the user themselves.

But, as Mattia Verga put it, the intent is quite different: "It doesn't deny a user to install any software they want, it is about preventing unwanted/unsolicited/malevolent software from being installed without user (admin) approval." Kofler is worried about a slide toward iOS-style centralization of the control of software sources, however, which would obviously run counter to free-software ideals. But few seem to see this effort as headed down that path. Michel Alexandre Salim envisioned something more in keeping with Fedora's goals:

If/when something like this gets shipped, I hope Fedora limits itself to shipping a policy that is the equivalent of SELinux's 'targeted' policy: protect the RPMs that Fedora ships from being tampered with, let users do whatever on top. With an option to turn it off completely or to enforce more strictly.

In fact, Sassu said his work on DIGLIM provides a good example of how third-party code might be handled:

I'm using myself a COPR [Cool Other Package Repo] repository to build the kernel package with the DIGLIM patches. That kernel also includes my own GPG key generated for me by COPR.

This has not been decided yet, but likely the Fedora kernel will contain the official Fedora keys, and the user will decide to add new keys (including those from COPR).

In a longer message, Sassu sought to alleviate some of the concerns and fears about the feature: "[...] its primary goal is to aid the users to satisfy their security needs, and let them decide how this will be done". He noted that the feature is already in production use in the Fedora-based openEuler distribution, but he also recognized that getting the feature upstream was important. Neal Gompa agreed that Fedora generally "does not include non-upstream functionality in its Linux kernel builds", but noted that DIGLIM could be useful:

I also agree that this feature is unlikely to affect people, as this feature will not be enabled by default. It would be extremely useful for people building Fedora-based appliances which need tamper protection for various reasons. And Fedora derivatives (like RHEL/CentOS, Amazon Linux, openEuler, etc.) can benefit from us having the functionality integrated even if we don't enable it by default.

DRM

It is clearly true that the TPM and remote attestation can be used for various digital rights management (DRM)—digital restrictions management, to some—schemes. Nico Kadel-Garcia is convinced that the feature "has no use but digital rights management". But Sassu pointed out that there are other uses:

If you want to enforce an IMA measurement policy instead, access to the files will be always granted, regardless of whether the digest lists are signed or not. IMA, in this case, will simply record the execution of unknown files, in addition to the digest lists you generated.

The IMA measurement list remains in your system, unless you decide that your system should be remotely attested by a remote verifier.

Kadel-Garcia was adamant that it is all about DRM, but Gompa tried to clarify further:

The difference between IMA/verity and DRM is that the former is under the system owner's control (in this case, *you*), and the latter is *not*.

[...] There is a ton of value in user-controlled versions of this capability. And again, none of this is on by default, it's up to *you* to turn it on.

Implementation questions

Zbigniew Jędrzejewski-Szmek had some questions about the feature and its implementation. In particular, the need for a user-space parser that is run by the kernel was questioned, but Sassu said that the helper needs to be run before the init process, so that init (and all of its dependencies) can be integrity checked using the digest list. It is something of a chicken-and-egg problem, which also led Sassu to statically link the helper program; only the digest of that one file needs to be added before it can run to add all of the rest of the digests.

Lennart Poettering thought that the digest information should simply be extracted into a special initrd that can be passed to the kernel. That would avoid having to "upload" the digests to the kernel, but, as Sassu pointed out, that would tie the kernel to a particular digest-list format. The initial version of the feature more or less worked that way, Sassu said; the parser was in the kernel and it read the information from the RPM headers, but every time a new format (e.g. a Debian deb file) needs to be supported, the kernel would have to change.

Poettering was also unhappy with the statically linked helper ("Static linking is a mess."), but Sassu explained that it makes it simpler to bootstrap the integrity checking:

I consider this approach self-contained: everything [that] is needed to bootstrap DIGLIM is contained in the kernel-tools-diglim package. With dynamic linking, you also have to take care of all shared libraries. Since the parser is not yet functional (the kernel is in enforcing mode), you need to generate a digest list in the native format (in the spec file) for every shared library you want to load.

I liked the fact that, once you have the modified kernel with the appropriate GPG keys, and kernel-tools-diglim, you are able to run IMA appraisal without additional effort for its management.

The relationship between the DIGLIM feature and the recently discussed fs-verity change was raised in the January 3 meeting of the Fedora Engineering Steering Committee (FESCo). The main difference, Sassu said, is that DIGLIM works with the existing RPM packages and format, while the fs-verity support would need to add additional per-file information in the RPM header:

DIGLIM adopts the current scheme of RPMs, and verifies with one signature all the files contained in the RPM. Since this data format is not suitable for use by the Linux kernel, for enforcing the integrity policy, DIGLIM extracts the digests and adds them in a hash table stored in kernel memory. Enforcement (it would be better to say security decision) is achieved by doing a lookup in the hash table.

The main advantage is that DIGLIM can achieve its objective, providing reference values, without any change to existing RPMs.

RPM developer Panu Matilainen was happy to see that approach:

Besides not bloating up RPMs with seriously expensive per-file data, this side-steps the other issues associated with both IMA and fs-verity: both require separate signing steps for the file signatures which is a non-trivial cost and complexity, and unlike those the file hashes are covered (and thus protected) by normal rpm-level signatures too.

It seems clear that the DIGLIM feature will not be adopted by Fedora until and unless DIGLIM itself gets merged into the mainline kernel—and maybe not even then. The concerns about locked-down systems and DRM are reasonable to a certain extent, but that is not at all what DIGLIM is targeting. On the other hand, though, it seems like a niche feature, at best; even though it will have a negligible impact for most Fedora users, that is a bit of a double-edged sword. It will not impact most of them because it will not help them or their use cases either.

But there are use cases that will benefit, and the other contenders for integrity management in Fedora seem even more complicated. DIGLIM and fs-verity are both on FESCo's radar (here and here), so we should know the outcome soon. Given the intrusiveness of the fs-verity scheme, and the unclear status of DIGLIM in the mainline, one might guess that both will be pushed off until Fedora 37—at least.

Comments (13 posted)

User-managed concurrency groups

By Jonathan Corbet
December 28, 2021
The kernel's thread model is relatively straightforward and performs reasonably well, but that's not enough for all users. Specifically, there are use cases out there that benefit from a lightweight threading model that gives user space control over scheduling decisions. Back in May 2021, Peter Oskolkov posted a patch set implementing an abstraction known as user-managed concurrency groups, or UMCG. Several revisions later, many observers still lack a clear idea of what this patch is supposed to do, much less whether it is a good idea for the kernel. Things have taken a turn, though, with Peter Zijlstra's reimplementation of UMCG.

One developer reimplementing another's patch set is likely to raise eyebrows. Zijlstra's motivation for doing that work can perhaps be seen in this message, where he notes that the UMCG code looked little like the rest of the scheduler code. He also remarked that it required "reverse engineering" to figure out how UMCG was meant to be used. By the time that work was done, perhaps, it was just easier to recast the code in the form he thought it should take.

In truth, the documentation for UMCG is no better than before — a significant problem for a major proposed addition to the system-call API. But it is possible to dig through the code (and a "pretty rough" test application posted by Zijlstra) to get a sense for what is going on. In short, UMCG calls for a multi-threaded application to divide itself into "server" and "worker" threads, where there is likely to be one server thread for each CPU on the system. Server threads make scheduling decisions, while workers run according to those decisions and get the actual work done. The advantage of a system like UMCG is that scheduling can happen quickly and with little overhead from the kernel — assuming the server threads are properly implemented, of course.

Setting up

UMCG introduces three new system calls and one new structure that handles most of the communication with the kernel. Every thread participating in UMCG must have a umcg_task structure, which looks like this:

    struct umcg_task {
	__u32	state;
	__u32	next_tid;
	__u32	server_tid;
	__u64	runnable_workers_ptr;
	/* ... */
    };

Some fields have been omitted here. Note that this structure, as it will eventually be provided by the C libraries, is likely to look different. The specific fields will be discussed as they become relevant.

The first new system call is umcg_ctl(), which is used to register and unregister threads with the UMCG subsystem:

    int umcg_ctl(unsigned int flags, struct umcg_task *self, clockid_t which_clock);

The flags argument describes the operation to be performed, self is the umcg_task structure corresponding to the current thread, and which_clock controls the clock used for timestamps for this thread.

If flags contains UMCG_CTL_REGISTER, then this call is registering a new thread with the subsystem. There are two alternatives, depending on which type of thread is being registered:

  • If flags contains UMCG_CTL_WORKER, then this is a new worker task. In this case, self->state must be UMCG_TASK_BLOCKED, indicating that the worker is not initially running. The thread ID of the server that will handle this worker must be provided in server_tid.
  • Otherwise, this is a server task. Its initial state must be UMCG_TASK_RUNNING (since it is indeed running) and server_tid must be the calling thread's ID.

Workers and servers must be threads of the same process (more specifically, they must share the same address space). The system call returns zero if all goes well. For workers, though, that return will be delayed, as the calling thread will be blocked until the server schedules it to run. Registering a new worker will cause the indicated server to wake up.

The other thing that happens when a worker is registered is that its state is set to UMCG_TASK_RUNNABLE and it is added to the server's singly-linked list of available workers. The list is implemented using the runnable_workers_ptr field in each task's umcg_task structure. The kernel will push a new task onto the head of the list with a compare-and-exchange operation; the server will normally use a similar operation to take tasks off the list.

Scheduling

Most scheduling is done with calls to umcg_wait():

    int umcg_wait(unsigned int flags, unsigned long timeout);

The flags field must be zero in the current patches. The calling thread must be registered as a UMCG thread or the call will fail. If the caller is a worker thread, the timeout must also be zero; this call will suspend execution of the worker and wake the associated server thread for the next scheduling decision. If the worker's state is UMCG_TASK_RUNNING (as it should be if the task is running to make this call), it will be set back to UMCG_TASK_RUNNABLE and the task will be added to the server's runnable_workers_ptr list. Thus, for a worker task, a call to umcg_wait() is a way to yield to another thread while remaining runnable.

In the case of the server, the usual reason for calling umcg_wait() is to schedule a new worker to run; this is done by setting the worker's thread ID in the next_tid field of the server's umcg_task structure before the call. If this is done, and the indicated thread is a UMCG worker in the UMCG_TASK_RUNNABLE state, it will be queued to run. The server, instead, will be blocked until either some sort of wakeup event happens or the specified timeout (if it is not zero) expires.

One important detail is that the kernel, once it successfully wakes the new worker thread, will set the server's next_tid field to zero. That allows the server to quickly check, on return from umcg_wait(), whether the thread was actually scheduled or not.

There are a few events that will cause a server to wake. If the current running worker blocks in a system call, for example, its state will be changed to UMCG_TASK_BLOCKED; the server can detect this by looking at the (previously) running worker's umcg_task structure. As noted above, a new task becoming runnable will cause a wakeup. If your editor's reading of the code is correct, there does not currently appear to be a way to notify the server that a worker task has exited entirely.

Preemption

The timeout parameter to umcg_wait() can be used by server threads to implement forced preemption after a worker has run for a period of time. If umcg_wait() fails with ETIMEDOUT, the server knows that the current worker has been running for the full timeout period; the server may then choose to make it surrender the CPU. That is done in a two-step process, the first of which is to add the UMCG_TF_PREEMPT flag to the running worker's state field (again, using a compare-and-exchange operation). Then a call should be made to the third new system call:

    int umcg_kick(unsigned int flags, pid_t tid);

Where flags must be zero and tid is the thread ID of the worker to be preempted. This call will cause the worker to re-enter the scheduler, at which point the UMCG_TF_PREEMPT flag will be noticed, the worker will be suspended, and it will be placed back onto the server's runnable_workers_ptr list. Once that completes, the server will wake again to schedule a new thread.

That is pretty much the entirety of the new API at this point. This work is still clearly in an early state, though, and it would not be surprising to see a fair amount of evolution take place before it is considered for merging. UMCG arises out of Google's internal systems and reflects its use case, but there will almost certainly be other use cases for this sort of functionality, and those users have not yet made their needs known. As awareness of this work spreads, that situation can be expected to change.

Oskolkov, meanwhile, has, as one might expect, required some convincing that his work really needed to be rewritten by somebody else or that the new implementation is better. He expressed discomfort with some of the changes, most notably Zijlstra's switch from a single queue of runnable workers to per-server queues. In the end, though, he said "I'm OK with having it your way if all needed features are covered". So it seems fair to assume that Zijlstra's patch reflects the future of this work. Time will tell where it goes from here.

Comments (30 posted)

Zero-copy network transmission with io_uring

By Jonathan Corbet
December 30, 2021
When the goal is to push bits over the network as fast as the hardware can go, any overhead hurts. The cost of copying data to be transmitted from user space into the kernel can be especially painful; it adds latency, takes valuable CPU time, and can be hard on cache performance. So it is unsurprising that the developers working with io_uring, which is all about performance, have turned their attention to zero-copy network transmission. This patch set from Pavel Begunkov, now in its second revision, looks to be significantly faster than the MSG_ZEROCOPY option supported by current kernels.

As a reminder: io_uring is a relatively new API for asynchronous I/O (and related operations); it was first merged less than three years ago. User space sets up a pair of circular buffers shared with the kernel; the first buffer is used to submit operations to the kernel, while the second receives the results when operations complete. A suitably busy process that keeps the submission ring full can perform an indefinite number of operations without needing to make any system calls, which clearly improves performance. Io_uring also implements the concept of "fixed" buffers and files; these are held open, mapped, and ready for I/O within the kernel, saving the setup and teardown overhead that is otherwise incurred by every operation. It all adds up to a significantly faster way for I/O-intensive applications to work.

One thing that io_uring still does not have is zero-copy networking, even though the networking subsystem supports zero-copy operation via the MSG_ZEROCOPY socket option. In theory, adding that support is simply a matter of wiring up the integration between the two subsystems. In practice, naturally, there are a few more details to deal with.

A zero-copy networking implementation must have a way to inform applications when any given operation is truly complete; the application cannot reuse a buffer containing data to be transmitted if the kernel is still working on it. There is a subtle point that is relevant here: the completion of a send() call (for example) does not imply that the associated buffer is no longer in use. The operation "completes" when the data has been accepted into the networking subsystem for transmission; the higher layers may well be done with it, but the buffer itself may still be sitting in a network interface's transmission queue. A zero-copy operation is only truly done with its data buffers when the hardware has done its work — and, for many protocols, when the remote peer has acknowledged receipt of the data. That can happen long after the operation that initiated the transfer has completed.

So there needs to be a mechanism by which the kernel can tell applications that a given buffer can be reused. MSG_ZEROCOPY handles this by returning notifications via the error queue associated with the socket — a bit awkward, but it works. Io_uring, instead, already has a completion-notification mechanism in place, so the "really complete" notifications fit in naturally. But there are still a few complications resulting from the need to accurately tell an application which buffers can be reused.

An application doing zero-copy networking with io_uring will start by registering at least one completion context, using the IORING_REGISTER_TX_CTX registration operation. The context itself is a simple structure:

    struct io_uring_tx_ctx_register {
	__u64 tag;
    };

The tag is a caller-chosen value used to identify this particular context in future zero-copy operations on the associated ring. There can be a maximum of 1024 contexts associated with the ring; user space should register them all with a single IORING_REGISTER_TX_CTX operation, passing the structures as an array. An attempt to register a second set of contexts will fail unless an intervening IORING_UNREGISTER_TX_CTX operation has been done to remove the first set.

Zero-copy writes are initiated with the new IORING_OP_SENDZC operation. As usual, a set of buffers is passed to be written out to the socket (which must also be provided, obviously). Additionally, each zero-copy write must have a context associated with it, stored in the submission queue entry's user_data field. The context is specified as an index into the array of contexts that was registered previously (not as the tag associated with the context). These writes will use the kernel's zero-copy mechanism when possible and will "complete" in the usual way, with the usual result in the completion ring, perhaps while the supplied buffers are still in use.

To know that the kernel is done with the buffers, the application must wait for the second notification informing it of that fact. Those notifications are not (by default) sent for every zero-copy operation that is submitted; instead, they are batched into "generations". Each completion context has a sequence number that starts at zero. Multiple operations can be associated with each generation; the notification for that generation is sent once all of the associated operations have truly completed.

It is up to user space to tell the kernel when to move on to a new generation; that is done by setting the IORING_SENDZC_FLUSH flag in a zero-copy write request. The flag itself lives in the ioprio field of the submission queue entry. The presence of this flag indicates that the request being submitted is the last of the current generation; the next request will begin the new generation. Thus, if a separate done-with-the-buffers notification is needed for each write request, IORING_SENDZC_FLUSH should be set on every request.

When a given generation completes, the notification will show up in the completion ring. The user_data field will contain the context tag, while the res field will hold the generation number. Once the notification arrives, the application will be able to safely reuse the buffers associated with that generation.

The end result seems to be quite good; benchmarks included in the cover letter suggest that io_uring's zero-copy operations can perform more than 200% better than MSG_ZEROCOPY. Much of that improvement likely comes from the ability to use fixed buffers and files with io_uring, cutting out much of the per-operation overhead. Most applications won't see that kind of improvement, of course; they are not so heavily dominated by the cost of network transmission. If your business is providing the world with cat videos, though, zero-copy networking with io_uring is likely to be appealing.

For now, the new zero-copy operations are meticulously undocumented. Begunkov has posted a test application that can be read to see how the new interface is meant to be used. There have not been many comments on this version (the second) of this series. Perhaps that will change after the holidays, but it seems likely that this work is getting close to ready for inclusion.

Comments (48 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Fast kernel headers; Gentoo 2021; Darktable 3.8; GIMP 2021; GnuPG; Krita 5.0; systemd 250; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.
Next page: Brief items>>

Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds