Kernel topics on the radar

By Jonathan Corbet
August 2, 2021

The kernel-development community is a busy place, with thousands of emails flying by every day and many different projects under development at any given time. Much of that work ends up inspiring articles at LWN, but there is no way to ever cover all of it, or even all of the most interesting parts. What follows is a first attempt at what may become a semi-regular LWN feature: a quick look at some of the work that your editor is tracking that may or may not show up as the topic of a full article in the future. The first set of topics includes memory folios, task isolation, and a lightweight threading framework from Google.

Memory folios

The memory folios work was covered here in March; this patch set by Matthew Wilcox adds the concept of a "folio" as a page that is guaranteed not to be a tail page within a compound page. By guaranteeing that a folio is either a singleton page or the head of a compound page, this work enables the creation of an API that adds some useful structure to memory management, saves some memory, and slightly improves performance for some workloads.
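The head-versus-tail distinction can be made concrete with a toy user-space model. This is only a sketch under stated assumptions: the `toy_page` and `toy_folio` structures and the helper names below are invented for illustration and do not reproduce the kernel's actual definitions. It shows the one guarantee described above: converting any page to its folio never yields a tail page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page descriptor; all names here are hypothetical. */
struct toy_page {
    struct toy_page *head;  /* NULL for a singleton or head page, else the head */
    unsigned int order;     /* on a head page: the compound page spans 2^order pages */
};

/* A toy folio is just a page that is known not to be a tail page. */
struct toy_folio {
    struct toy_page page;
};

static bool toy_page_is_tail(const struct toy_page *p)
{
    return p->head != NULL;
}

/* Map any page to its folio; tail pages resolve to their head page. */
static struct toy_folio *toy_page_folio(struct toy_page *p)
{
    struct toy_page *head = toy_page_is_tail(p) ? p->head : p;
    return (struct toy_folio *)head;
}

/* Number of base pages covered by a folio; callers need no tail-page check. */
static size_t toy_folio_nr_pages(const struct toy_folio *f)
{
    assert(!toy_page_is_tail(&f->page));
    return (size_t)1 << f->page.order;
}
```

In this model, a function that accepts a `struct toy_folio *` can rely on the type itself to rule out tail pages, which is roughly the structural benefit that the patch set is after.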

While the memory-management community is still not fully sold on this concept (it looks like a lot of change for a small benefit to some developers), it looks increasingly likely that it will be merged in the near future. Or, at least, the merging process will start; one does not swallow a 138-part (at last count) memory-management patch series in a single step. In mid-July, Wilcox presented his plan, which involves getting the first 89 patches merged for 5.15; the rest of the series would be merged during the following two development cycles. Nobody seems to be contesting that schedule at this point.

Later in July, though, Wilcox stumbled across the inevitable Phoronix benchmarking article, which purported to show an 80% performance improvement for PostgreSQL with the folio patches applied to the kernel. He said that the result was "plausibly real" and suggested that, perhaps, the merging of folios should be accelerated. Other developers responded more skeptically. PostgreSQL developer Andres Freund looked at how the results were generated and concluded that the test "doesn't end up measuring something particularly interesting". His own test showed a 7% improvement, though, which is (as he noted) still a nice gain.

The end result is that the case for folios seems to be getting stronger, and the merging process still appears to be set to begin in 5.15.

Retrying task isolation

Last year, the development community discussed a task-isolation mode that would allow latency-sensitive applications to run on a CPU with no interruptions from the kernel. That work never ended up being merged, but the interest in this mode clearly still exists, as can be seen in this patch set from Marcelo Tosatti. It takes a simpler approach to the problem — initially, at least.

This patch set is focused, in particular, on kernel interruptions that can happen even when a CPU is running in the "nohz" mode without a clock tick. Specifically, Tosatti is looking at the "vmstat" code that performs housekeeping for the memory-management subsystem. Some of this work is done in a separate thread (via a workqueue) that is normally disabled while a CPU is running in the nohz mode. There are situations, though, that can cause this thread to be rescheduled on a nohz CPU, ending the application's exclusive use of that processor.

Tosatti's patch set adds a set of new prctl() commands to address this problem. The PR_ISOL_SET command sets the "isolation parameters", which can be either PR_ISOL_MODE_NONE or PR_ISOL_MODE_NORMAL; the latter asks the kernel to eliminate interruptions. Those parameters do not take effect, though, until the task actually enters the isolation mode, which can be done with the PR_ISOL_ENTER command. The kernel's response to entering the isolation mode will be to perform any deferred vmstat work immediately so that the kernel will not decide to do it at an inconvenient time later. The deferred-work cleanup will happen at the end of any system call made while isolation mode is active; since those system calls are the likely source of any deferred work in the first place, that should keep the decks clear while the application code is running.
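The set-then-enter sequence and the flush at system-call exit can be sketched as a small user-space simulation. Nothing below calls the real prctl() interface; the `toy_` names, the structure, and the enum values are assumptions made for illustration, since the proposed commands are not part of any released kernel header.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the proposed modes; values are not from the patch set. */
enum toy_isol_mode { TOY_ISOL_MODE_NONE, TOY_ISOL_MODE_NORMAL };

struct toy_task {
    enum toy_isol_mode mode;  /* parameters recorded by the "set" step */
    bool isolated;            /* true once the "enter" step has taken effect */
    int deferred_work;        /* pending housekeeping items, e.g. vmstat updates */
    int flushed_work;         /* items already executed synchronously */
};

/* Run all pending housekeeping now rather than at some later, inconvenient time. */
static void toy_flush_deferred(struct toy_task *t)
{
    t->flushed_work += t->deferred_work;
    t->deferred_work = 0;
}

/* Models PR_ISOL_SET: only records the parameters, nothing takes effect yet. */
static void toy_isol_set(struct toy_task *t, enum toy_isol_mode mode)
{
    t->mode = mode;
}

/* Models PR_ISOL_ENTER: activates isolation and flushes deferred work at once. */
static void toy_isol_enter(struct toy_task *t)
{
    if (t->mode != TOY_ISOL_MODE_NORMAL)
        return;
    t->isolated = true;
    toy_flush_deferred(t);
}

/* Models a system call that leaves some deferred work behind on its way out. */
static void toy_syscall(struct toy_task *t, int work_generated)
{
    assert(work_generated >= 0);
    t->deferred_work += work_generated;
    if (t->isolated)
        toy_flush_deferred(t);  /* cleanup at system-call exit while isolated */
}
```

In this model, an isolated task never returns to user space with deferred work outstanding, which is the property the patch set aims to provide for vmstat.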

The evident intent is to make this facility more general, guaranteeing that any deferred work would be executed right away. That led others (including Nicolás Sáenz) to question the use of a single mode to control what will eventually be a number of different kernel operations. Splitting out the various behaviors would, he said, be a way to move any policy decisions to user space. After some back-and-forth, Tosatti agreed to a modified interface that would give user space explicit control over each potential isolation feature. A patch set implementing this API was posted on July 30; it adds a new operation (PR_ISOL_FEAT) to query the set of actions that can be quiesced while the isolation mode is active.

Bonus fact: newer members of our community may not be aware that, 20 years ago, Tosatti was known as Marcelo the Wonder Penguin.

User-managed concurrency groups

In May of this year, Peter Oskolkov posted a patch set for a mechanism called "user-managed concurrency groups", or UMCG. This work is evidently a version of a scheduling framework known as "Google Fibers", which is naturally one of the most ungoogleable terms imaginable. This patch set has suffered from a desultory attempt to explain what it is actually supposed to implement, but the basic picture is becoming more clear over time.

UMCG is meant to be a lightweight, user-space-controlled, M:N threading mechanism; this document, posted after some prodding, describes its core concepts. A user-space process can set up one or more concurrency groups to manage its work. Within each group, there will be one or more "server" threads; the plan seems to be that applications would set up one server thread for each available CPU. There will also be any number of "worker" threads that carry out the jobs that the application needs done. At any given time, each server thread can be running one worker. User space will control which worker threads are running at any time by attaching them to servers; notifications for events like workers blocking on I/O allow the servers to be kept busy.
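The server/worker relationship can be illustrated with a toy model of the bookkeeping that user space would do. This is a sketch only: it uses no UMCG system calls, and every type and function name below is hypothetical rather than taken from the patch set. It assumes a simple "first idle worker" policy where a real application would supply its own.

```c
#include <assert.h>
#include <stddef.h>

enum toy_worker_state { TOY_IDLE, TOY_RUNNING, TOY_BLOCKED };

struct toy_worker {
    enum toy_worker_state state;
};

/* A server runs at most one worker at any given time. */
struct toy_server {
    struct toy_worker *current;
};

struct toy_group {
    struct toy_worker *workers;
    size_t nr_workers;
};

/*
 * User-space policy: attach the first idle worker to a free server.
 * Returns 1 if a worker was attached, 0 if none was runnable.
 */
static int toy_server_pick(struct toy_server *s, struct toy_group *g)
{
    assert(s->current == NULL);
    for (size_t i = 0; i < g->nr_workers; i++) {
        if (g->workers[i].state == TOY_IDLE) {
            g->workers[i].state = TOY_RUNNING;
            s->current = &g->workers[i];
            return 1;
        }
    }
    return 0;
}

/* Notification that the running worker blocked (on I/O, say); the server is freed. */
static void toy_worker_blocked(struct toy_server *s)
{
    assert(s->current != NULL);
    s->current->state = TOY_BLOCKED;
    s->current = NULL;
}

/* A blocked worker becomes eligible to be picked again. */
static void toy_worker_wake(struct toy_worker *w)
{
    if (w->state == TOY_BLOCKED)
        w->state = TOY_IDLE;
}
```

The point of the model is that the scheduling decision in `toy_server_pick()` lives entirely in the application; the blocking notification is what lets a server move on to another worker instead of sitting idle.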

In the August 1 version of the patch set, there are two system calls defined to manage this mechanism. A call to umcg_ctl() will register a thread as a UMCG task, in either the server or the worker mode; it can also perform unregistration. umcg_wait() is the main scheduling mechanism; a worker thread can use it to pause execution, for example. But a server thread can also use umcg_wait() to wake a specific worker thread or to force a context switch from one worker thread to another; the call will normally block for as long as the worker continues to run. Once umcg_wait() returns, the server thread can select a new worker to execute next.

Or so it seems; there is little documentation for how these system calls are really meant to be used and no sample code at all. The most recent version of the series did, finally, include a description of the system calls, something that had been entirely absent in previous versions. Perhaps as a result, this work has seen relatively little review activity so far. Oskolkov seems to be focused on how the in-kernel functionality is implemented, but reviewers are going to want to take a long and hard look at the user-space API, which would have to be supported indefinitely if this subsystem were to be merged. UMCG looks like interesting and potentially useful work, but this kind of core-kernel change is hard to merge in the best of conditions; the absence of information on what is being proposed has made that process harder so far.

Index entries for this article
Kernel	Memory management/Folios
Kernel	Scheduler
Kernel	User-managed concurrency groups

+1 to making this "a semi-regular LWN feature"

Posted Aug 2, 2021 18:19 UTC (Mon) by knurd (subscriber, #113424) [Link] (4 responses)

Yeah, please make this "a semi-regular LWN feature", otherwise that field is left to websites that don't get even close to the stellar quality reporting here at LWN.net – which is not in the interest of the Linux world, as those websites sometimes inadvertently get things wrong and thus spread misconceptions with their reporting.

+1 to making this "a semi-regular LWN feature"

Posted Aug 2, 2021 19:36 UTC (Mon) by mliuzzi (subscriber, #117685) [Link] (2 responses)

yes please!

+1 to making this "a semi-regular LWN feature"

Posted Aug 2, 2021 19:41 UTC (Mon) by linuxjacques (subscriber, #45768) [Link] (1 responses)

I too like this feature.

+1 to making this "a semi-regular LWN feature"

Posted Aug 3, 2021 9:46 UTC (Tue) by or (guest, #128060) [Link]

Yes, it would be very nice to get an overview like this more often or even regularly!

+1 to making this "a semi-regular LWN feature"

Posted Aug 4, 2021 18:32 UTC (Wed) by Nikratio (subscriber, #71966) [Link]

Yes, please do this regularly.

Kernel topics on the radar

Posted Aug 2, 2021 23:12 UTC (Mon) by itsmycpu (guest, #139639) [Link]

Another +1 for making this a (semi-)regular feature.

A +1 also for additional general and direct support for (more) complete task isolation.

Kernel topics on the radar

Posted Aug 3, 2021 5:40 UTC (Tue) by itsmycpu (guest, #139639) [Link]

So for task isolation with version 2 of the patch set, if I want to remove as much OS noise as possible, I would do the following on a nohz CPU:

assert( (prctl(PR_ISOL_FEAT, 0, 0, 0, 0) & ISOL_F_QUIESCE) != 0);
int max_silence = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, 0, 0, 0);
assert( prctl(PR_ISOL_SET, ISOL_F_QUIESCE, max_silence, 0, 0) >= 0);
assert( prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0) >= 0);

Correct? (Assuming I'm not afraid of regressions..what regressions?)

Kernel topics on the radar

Posted Aug 3, 2021 7:13 UTC (Tue) by unixbhaskar (guest, #44758) [Link] (3 responses)

Thanks, Jon. This will certainly be helpful for a quick view of what is churning in the oven (aka the kernel). I mean, those who do not follow the lkml rigorously can benefit hugely from this.

Kernel topics on the radar

Posted Aug 3, 2021 11:51 UTC (Tue) by cwhitecrowdstrike (guest, #153291) [Link]

Agreed! +1 for this as a regular feature.

Kernel topics on the radar

Posted Aug 3, 2021 21:39 UTC (Tue) by eacb (guest, #134663) [Link]

I definitely join this list of wishers, and note that it would line up beautifully with LWN's motto.

Kernel topics on the radar

Posted Aug 9, 2021 15:48 UTC (Mon) by ClumsyApe (guest, #134752) [Link]

Sticking my head out to give another shiny +1 :-) A kernel newbie like me sees such content as distilled gold!

Kernel topics on the radar

Posted Aug 3, 2021 13:32 UTC (Tue) by hak8or (subscriber, #120189) [Link] (1 responses)

And yet another person chiming in here to say they are a huge fan of "view of the lands" posts like this. It helps those of us who work with the kernel but do not lurk on the mailing lists keep a finger on the pulse of upstream.

While the Google Fibers patch seems interesting, I have serious qualms about anything from Google being subject to the kernel's requirement of never breaking userspace, and therefore requiring that the API be supported forever. Google is known to throw stuff at the wall and quickly abandon it if it falls out of grace, which is a perspective extremely incompatible with what the Linux kernel does. And submitting it without any example code just, well, seems to me like a bad-faith submission, and makes me feel it should be looked at with even more scrutiny than other feature additions of similar scope.

Kernel topics on the radar

Posted Aug 4, 2021 8:00 UTC (Wed) by khim (subscriber, #9252) [Link]

The whole server infrastructure of Google is built around fibers nowadays; thus one can be reasonably sure they are not going anywhere.

That being said, the chances are high that what was actually presented is what Kubernetes was in relation to Borg: a reimplementation of the same tried-and-tested idea, but with entirely different code and API.

Thus, obviously, API discussion is very much needed.

Kernel topics on the radar

Posted Aug 3, 2021 16:52 UTC (Tue) by jakogut (guest, #137318) [Link]

I resubscribed just for this post. Please make it a regular thing.

Kernel topics on the radar

Posted Aug 3, 2021 17:34 UTC (Tue) by null_byte (guest, #153086) [Link]

Very glad to see this. It's so hard to keep up with all of the emails, so I especially appreciate the overview.

Bring back the old kernel page ...

Posted Aug 4, 2021 19:12 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

Yes, I know you'll need a page editor and people to write the articles, but it would be nice to see it back.

Cheers,
Wol

Bring back the old kernel page ...

Posted Aug 4, 2021 21:12 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

We can consider that, but ... everything that was on the kernel page before is still to be found in the weekly edition; from the kernel point of view no content has been lost at all. So bringing back that page wouldn't necessarily help much.

Bring back the old kernel page ...

Posted Aug 5, 2021 19:46 UTC (Thu) by Wol (subscriber, #4433) [Link]

It might - hopefully would - encourage more people to write more articles, though :-)

Cheers,
Wol