Kernel topics on the radar
Memory folios
The memory folios work was covered here in March; this patch set by Matthew Wilcox adds the concept of a "folio" as a page that is guaranteed not to be a tail page within a compound page. By guaranteeing that a folio is either a singleton page or the head of a compound page, this work enables the creation of an API that adds some useful structure to memory management, saves some memory, and slightly improves performance for some workloads.
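The guarantee described above can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: the `toy_page` and `toy_folio` types and all function names are invented here and do not reflect the real kernel structures or the API in Wilcox's series. It shows how resolving a possible tail page once, at the boundary, lets folio-taking functions skip that check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page descriptor; NOT the kernel's struct page. */
struct toy_page {
    struct toy_page *head;  /* NULL for a singleton or head page, else the head */
    unsigned int order;     /* on a head page: the compound page spans 1 << order pages */
};

/* A folio wraps a page that is known not to be a tail page. */
struct toy_folio {
    struct toy_page page;
};

static bool toy_page_is_tail(const struct toy_page *p)
{
    return p->head != NULL;
}

/* Resolve any page to its folio; the tail-page lookup happens once, here. */
static struct toy_folio *toy_page_folio(struct toy_page *p)
{
    struct toy_page *head = toy_page_is_tail(p) ? p->head : p;

    assert(!toy_page_is_tail(head));
    /* Valid cast: the page is the first member of the folio. */
    return (struct toy_folio *)head;
}

/* A function that takes a folio never needs to ask "is this a tail page?". */
static size_t toy_folio_nr_pages(const struct toy_folio *f)
{
    return (size_t)1 << f->page.order;
}
```

In this model, a caller holding any page of a four-page compound page gets the same folio back from `toy_page_folio()`, and a singleton page yields a folio of one page.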
While the memory-management community is still not fully sold on this concept (it looks like a lot of change for a small benefit to some developers), it looks increasingly likely that it will be merged in the near future. Or, at least, the merging process will start; one does not swallow a 138-part (at last count) memory-management patch series in a single step. In mid-July, Wilcox presented his plan, which involves getting the first 89 patches merged for 5.15; the rest of the series would be merged during the following two development cycles. Nobody seems to be contesting that schedule at this point.
Later in July, though, Wilcox stumbled across the inevitable Phoronix benchmarking article which purported to show an 80% performance improvement for PostgreSQL with the folio patches applied to the kernel. He said that the result was "plausibly real" and suggested that, perhaps, the merging of folios should be accelerated. Other developers responded more skeptically, though. PostgreSQL developer Andres Freund looked at how the results were generated and concluded that the test "doesn't end up measuring something particularly interesting". His own test showed a 7% improvement, though, which is (as he noted) still a nice improvement.
The end result is that the case for folios seems to be getting stronger, and the merging process still appears to be set to begin in 5.15.
Retrying task isolation
Last year, the development community discussed a task-isolation mode that would allow latency-sensitive applications to run on a CPU with no interruptions from the kernel. That work never ended up being merged, but the interest in this mode clearly still exists, as can be seen in this patch set from Marcelo Tosatti. It takes a simpler approach to the problem — initially, at least.
This patch set is focused, in particular, on kernel interruptions that can happen even when a CPU is running in the "nohz" mode without a clock tick. Specifically, Tosatti is looking at the "vmstat" code that performs housekeeping for the memory-management subsystem. Some of this work is done in a separate thread (via a workqueue) that is normally disabled while a CPU is running in the nohz mode. There are situations, though, that can cause this thread to be rescheduled on a nohz CPU, ending the application's exclusive use of that processor.
Tosatti's patch set adds a set of new prctl() commands to address this problem. The PR_ISOL_SET command sets the "isolation parameters", which can be either PR_ISOL_MODE_NONE or PR_ISOL_MODE_NORMAL; the latter asks the kernel to eliminate interruptions. Those parameters do not take effect, though, until the task actually enters the isolation mode, which can be done with the PR_ISOL_ENTER command. The kernel's response to entering the isolation mode will be to perform any deferred vmstat work immediately so that the kernel will not decide to do it at an inconvenient time later. The deferred-work cleanup will happen at the end of any system call made while isolation mode is active; since those system calls are the likely source of any deferred work in the first place, that should keep the decks clear while the application code is running.
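The sequence described above (set the parameters, enter isolation, flush deferred work at the end of each system call) can be modeled in user space. What follows is a minimal sketch, not the patch set's implementation: the `toy_` names, the enum values, and the integer counter standing in for deferred vmstat work are all assumptions made for illustration, and nothing here calls prctl().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the proposed isolation modes; values are invented. */
enum toy_isol_mode { TOY_ISOL_MODE_NONE = 0, TOY_ISOL_MODE_NORMAL = 1 };

struct toy_task {
    enum toy_isol_mode mode;  /* what a PR_ISOL_SET-like call would record */
    bool isolated;            /* true once a PR_ISOL_ENTER-like call is made */
    int deferred_vmstat;      /* pending housekeeping items for this CPU */
};

/* Setting the parameters records the mode but has no other effect. */
static void toy_isol_set(struct toy_task *t, enum toy_isol_mode mode)
{
    t->mode = mode;
}

/* Perform all deferred work now so it cannot interrupt the task later. */
static void toy_flush_deferred(struct toy_task *t)
{
    t->deferred_vmstat = 0;
}

/* Entering isolation flushes whatever is already pending. */
static void toy_isol_enter(struct toy_task *t)
{
    t->isolated = true;
    if (t->mode == TOY_ISOL_MODE_NORMAL)
        toy_flush_deferred(t);
}

/* A system call may create deferred work; on exit it is flushed if isolated. */
static void toy_syscall(struct toy_task *t, int work_generated)
{
    t->deferred_vmstat += work_generated;
    if (t->isolated && t->mode == TOY_ISOL_MODE_NORMAL)
        toy_flush_deferred(t);
}
```

The point of the model is the ordering: a task that has only set its parameters still accumulates deferred work, while a task that has entered isolation returns from every system call with nothing pending.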
The evident intent is to make this facility more general, guaranteeing that any deferred work would be executed right away. That led others (including Nicolás Sáenz) to question the use of a single mode to control what will eventually be a number of different kernel operations. Splitting out the various behaviors would, he said, be a way to move any policy decisions to user space. After some back-and-forth, Tosatti agreed to a modified interface that would give user space explicit control over each potential isolation feature. A patch set implementing this API was posted on July 30; it adds a new operation (PR_ISOL_FEAT) to query the set of actions that can be quiesced while the isolation mode is active.
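One way to picture the per-feature interface is as a bitmask that user space queries and then selects from. The sketch below is a guess at the general shape only: the flag names, the `toy_isol` structure, and the error convention are hypothetical and are not taken from the posted patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits; only vmstat quiescing is named in the article. */
#define TOY_QUIESCE_VMSTAT (1u << 0)
#define TOY_QUIESCE_OTHER  (1u << 1)  /* invented placeholder for a future action */

struct toy_isol {
    uint32_t supported;  /* what this "kernel" is able to quiesce */
    uint32_t enabled;    /* what user space has asked for */
};

/* Query step, in the spirit of PR_ISOL_FEAT: report the supported actions. */
static uint32_t toy_isol_feat(const struct toy_isol *s)
{
    return s->supported;
}

/* Selection step: policy lives in user space, the kernel only validates. */
static int toy_isol_select(struct toy_isol *s, uint32_t mask)
{
    if (mask & ~s->supported)
        return -EINVAL;  /* refuse actions this kernel cannot quiesce */
    s->enabled = mask;
    return 0;
}
```

With this split, an application can enable exactly the actions it cares about and can detect an older kernel that lacks one of them, rather than trusting a single opaque mode.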
Bonus fact: newer members of our community may not be aware that, 20 years ago, Tosatti was known as Marcelo the Wonder Penguin.
User-managed concurrency groups
In May of this year, Peter Oskolkov posted a patch set for a mechanism called "user-managed concurrency groups", or UMCG. This work is evidently a version of a scheduling framework known as "Google Fibers", which is naturally one of the most ungoogleable terms imaginable. This patch set has suffered from a desultory attempt to explain what it is actually supposed to implement, but the basic picture is becoming more clear over time.
UMCG is meant to be a lightweight, user-space-controlled, M:N threading mechanism; this document, posted after some prodding, describes its core concepts. A user-space process can set up one or more concurrency groups to manage its work. Within each group, there will be one or more "server" threads; the plan seems to be that applications would set up one server thread for each available CPU. There will also be any number of "worker" threads that carry out the jobs that the application needs done. At any given time, each server thread can be running one worker. User space will control which worker threads are running at any time by attaching them to servers; notifications for events like workers blocking on I/O allow the servers to be kept busy.
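The server/worker relationship can be sketched as a small state model. This is an illustration under assumptions, not the UMCG API: the `toy_` types, the three worker states, and the first-runnable selection policy are invented here, and real worker threads are replaced by plain structures.

```c
#include <assert.h>
#include <stddef.h>

enum toy_worker_state { TOY_RUNNABLE, TOY_RUNNING, TOY_BLOCKED };

struct toy_worker {
    enum toy_worker_state state;
};

struct toy_server {
    struct toy_worker *current;  /* a server runs at most one worker at a time */
};

/*
 * Attach the first runnable worker to an idle server.
 * Returns the worker's index, or -1 if the server is busy or nothing is runnable.
 */
static int toy_dispatch(struct toy_server *srv, struct toy_worker *w, size_t n)
{
    if (srv->current != NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (w[i].state == TOY_RUNNABLE) {
            w[i].state = TOY_RUNNING;
            srv->current = &w[i];
            return (int)i;
        }
    }
    return -1;
}

/* Notification that the running worker blocked (on I/O, say): the server is idle again. */
static void toy_worker_blocked(struct toy_server *srv)
{
    assert(srv->current != NULL);
    srv->current->state = TOY_BLOCKED;
    srv->current = NULL;
}
```

The blocking notification is what keeps a server busy: as soon as its worker blocks, the server is free to be handed the next runnable worker.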
In the August 1 version of the patch set, there are two system calls defined to manage this mechanism. A call to umcg_ctl() will register a thread as a UMCG task, in either the server or the worker mode; it can also perform unregistration. umcg_wait() is the main scheduling mechanism; a worker thread can use it to pause execution, for example. But a server thread can also use umcg_wait() to wake a specific worker thread or to force a context switch from one worker thread to another; the call will normally block for as long as the worker continues to run. Once umcg_wait() returns, the server thread can select a new worker to execute next.
Or so it seems; there is little documentation for how these system calls are really meant to be used and no sample code at all. The most recent version of the series did, finally, include a description of the system calls, something that had been entirely absent in previous versions. Perhaps as a result of that earlier absence, this work has seen relatively little review activity so far. Oskolkov seems to be focused on how the in-kernel functionality is implemented, but reviewers are going to want to take a long, hard look at the user-space API, which would have to be supported indefinitely if this subsystem were to be merged. UMCG looks like interesting and potentially useful work, but this kind of core-kernel change is hard to merge in the best of conditions; the absence of information on what is being proposed has made that process harder so far.
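In the absence of sample code, the server-side pattern described earlier (hand the CPU to a worker, block until it stops, then pick the next one) can at least be simulated. The sketch below is a guess at that control flow and nothing more: `toy_run_worker()` merely stands in for a server's blocking umcg_wait() call, and the job structure, the time slice, and the round-robin policy are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Why control came back to the server. */
enum toy_stop { TOY_STOP_PAUSED, TOY_STOP_DONE };

struct toy_job {
    int remaining;  /* units of work this worker still has to do */
};

/*
 * Stand-in for a server's blocking wait: "runs" the worker for up to
 * one slice, then returns to the server with the reason it stopped.
 */
static enum toy_stop toy_run_worker(struct toy_job *job, int slice)
{
    int run = job->remaining < slice ? job->remaining : slice;

    job->remaining -= run;
    return job->remaining == 0 ? TOY_STOP_DONE : TOY_STOP_PAUSED;
}

/* One server's loop; returns how many times a worker was switched in. */
static int toy_server_loop(struct toy_job *jobs, size_t n, int slice)
{
    int switches = 0;
    size_t live = 0;

    assert(slice > 0);
    for (size_t i = 0; i < n; i++)
        if (jobs[i].remaining > 0)
            live++;

    while (live > 0) {
        for (size_t i = 0; i < n; i++) {
            if (jobs[i].remaining <= 0)
                continue;  /* nothing left to run for this worker */
            switches++;
            if (toy_run_worker(&jobs[i], slice) == TOY_STOP_DONE)
                live--;
        }
    }
    return switches;
}
```

The selection policy lives entirely in `toy_server_loop()`, which is the part UMCG would leave to user space; only the switch itself would need kernel help.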
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Folios |
| Kernel | Scheduler |
| Kernel | User-managed concurrency groups |
Posted Aug 2, 2021 18:19 UTC (Mon) by knurd (subscriber, #113424)
Posted Aug 2, 2021 19:36 UTC (Mon) by mliuzzi (subscriber, #117685)
Posted Aug 2, 2021 19:41 UTC (Mon) by linuxjacques (subscriber, #45768)
Posted Aug 3, 2021 9:46 UTC (Tue) by or (guest, #128060)
Posted Aug 4, 2021 18:32 UTC (Wed) by Nikratio (subscriber, #71966)
Posted Aug 2, 2021 23:12 UTC (Mon) by itsmycpu (guest, #139639)
A +1 also for additional general and direct support for (more) complete task isolation.
Posted Aug 3, 2021 5:40 UTC (Tue) by itsmycpu (guest, #139639)
assert( (prctl(PR_ISOL_FEAT, 0, 0, 0, 0) & ISOL_F_QUIESCE) != 0);
Correct? (Assuming I'm not afraid of regressions..what regressions?)
Posted Aug 3, 2021 7:13 UTC (Tue) by unixbhaskar (guest, #44758)
Posted Aug 3, 2021 11:51 UTC (Tue) by cwhitecrowdstrike (guest, #153291)
Posted Aug 3, 2021 21:39 UTC (Tue) by eacb (guest, #134663)
Posted Aug 9, 2021 15:48 UTC (Mon) by ClumsyApe (guest, #134752)
Posted Aug 3, 2021 13:32 UTC (Tue) by hak8or (subscriber, #120189)
While the Google Fibers patch seems interesting, I have serious qualms about anything from Google being subject to the kernel's requirement of never breaking userspace, and therefore requiring that the API be supported forever. Google is known to throw stuff at the wall and quickly abandon it if it falls out of grace, which is extremely incompatible with how the Linux kernel works. And to submit it without any example code just, well, seems to me like a bad-faith submission, and makes me feel it should be looked at with even more scrutiny than other feature additions of similar scope.
Posted Aug 4, 2021 8:00 UTC (Wed) by khim (subscriber, #9252)
The whole server infrastructure of Google is built around fibers nowadays, so one can be reasonably sure they are not going anywhere. That said, chances are high that what was actually presented is what Kubernetes was in relation to Borg: a reimplementation of the same tried-and-tested idea, but with entirely different code and API. So an API discussion is obviously very much needed.
Posted Aug 3, 2021 16:52 UTC (Tue) by jakogut (guest, #137318)
Posted Aug 3, 2021 17:34 UTC (Tue) by null_byte (guest, #153086)
Posted Aug 4, 2021 19:12 UTC (Wed) by Wol (subscriber, #4433)
Cheers,
Posted Aug 4, 2021 21:12 UTC (Wed) by corbet (editor, #1)
Posted Aug 5, 2021 19:46 UTC (Thu) by Wol (subscriber, #4433)
Cheers,
+1 to making this "a semi-regular LWN feature"
int max_silence = prctl(PR_ISOL_FEAT, ISOL_F_QUIESCE, 0, 0, 0);
assert( prctl(PR_ISOL_SET, ISOL_F_QUIESCE, max_silence, 0, 0) >= 0);
assert( prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0) >= 0);
Bring back the old kernel page ...
Wol
We can consider that, but ... everything that was on the kernel page before is still to be found in the weekly edition; from the kernel point of view no content has been lost at all. So bringing back that page wouldn't necessarily help much.
