Leading items
Welcome to the LWN.net Weekly Edition for June 3, 2021
This edition contains the following feature content:
- Growing pains for Fedora CoreOS: how well does CoreOS fit into the Fedora family?
- Making CPython faster: a new initiative to massively speed up the CPython interpreter.
- printk() indexing: creating a catalog of kernel messages so that monitoring utilities can notice changes.
- eBPF seccomp() filters: will seccomp() finally be able to use extended BPF features?
- Top-tier memory management: schemes for migrating data between different memory types.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Growing pains for Fedora CoreOS
When last we looked in on Fedora CoreOS back in December, it was under consideration to become an official Fedora edition. That has not happened, at least not yet, but it would seem that the CoreOS "emerging edition" is still having some difficulties trying to fit in with the rest of Fedora. There are differences between the needs of a container operating system and those of more general-purpose distributions, which still need to be worked out if Fedora CoreOS is going to "graduate".
Catching up
In mid-May, Dusty Mabe posted an announcement that the stable stream of Fedora CoreOS was being updated to Fedora 34. In it, he noted a few caveats (e.g. "systemd-resolved is still enabled but not used yet"), some recently added features, and some new features that are coming soon. All pretty normal stuff, except that Fedora 34 was released at the end of April and Mabe's post showed that Fedora CoreOS has not really kept up.
In fact, as Tomasz Torcz pointed out, the systemd-resolved change was made for Fedora 33, while an upcoming feature ("Move to cgroup v2 by default") was originally made for Fedora 31, which was released in October 2019. That seems to indicate that Fedora CoreOS is lagging the main distribution, which may cause confusion for users, he said. "Should Fedora CoreOS use the same version number while not containing all the changes from main Fedora Linux?"
But Fedora CoreOS does not have version numbers like those of the editions, Clément Verna said:
I think this is the fundamental difference here, Fedora CoreOS does not have a version number. It has 3 streams, stable, testing and next, these streams are based on a version of Fedora Linux but that should just be a detail that most end users should not have to care about.
In addition, Fedora CoreOS has automatic updates, which need to be "rock solid" so that users will trust (and enable) them. But, until recently, Docker did not have support for version 2 of control groups (cgroups), so a container distribution, which has many users dependent on Docker, could not roll out that change without major disruption. Verna suggested that user confusion might actually be "a good thing" if it leads them to investigate Fedora CoreOS and to learn more about how it works.
Neal Gompa said that Verna's response was "a cop-out and a bad answer". The problem, he said, is that the Fedora CoreOS (or FCOS, as he and others abbreviate it) working group has historically not participated in the development of Fedora, and in the Changes process in particular. Instead of adapting to the feature changes made for Fedora, FCOS generally just rolls them back, "which has frustrated pretty much everyone". Beyond that, it is not just FCOS that needs to have solid upgrades; breaking upgrades for Fedora are not acceptable either, Gompa said.
But Verna believes that the working group is actually participating in the process. He pointed to four GitHub issues tracking changes for Fedora 32-35 (e.g. for Fedora 32 and for Fedora 35) that were (or need to be) incorporated into FCOS. Vít Ondruch replied that most or all of that work is not visible within the rest of Fedora, though. Verna agreed and suggested that the working group should be more vocal on mailing lists and the like.
Verna was also concerned about changes that are not backward-compatible. Regular Fedora can make those kinds of changes when the major version of the distribution changes, but there is no such opportunity for FCOS:
Breaking or non backward compatible changes are acceptable in Fedora Linux tho between major version bump. Again here the cgroups v2 is a good example, folks using Docker had to perform some manual steps to switch back to cgroups v1 to keep using their workflow working. This is fine when you have a major version bump but this does not happen in FCOS.
One of Verna's questions remained unanswered, though: what should happen if a new Fedora feature conflicts with the needs of another edition (or emerging edition, for that matter)? How are those differing needs going to be resolved?
[...] what happens when a Change proposals breaks FCOS (like cgroups v2 for example) ? Should that just be rejected ? AFAIK not all changes are adopted by every Editions or Spins.
As Fedora evolves and adds more official editions, those kinds of situations are likely to become more frequent. It may be difficult to be on the forefront of new features—part of Fedora's mission is to be "First" with Linux innovations, after all—if some environments and communities are unable to move as quickly. It is something that the Fedora project will need to resolve moving forward.
What's in a name?
Joe Doss disagreed with Verna's initial reply as well. Since FCOS has the Fedora name in it, "it should have the same fundamental features and changes that ship with each Fedora release". He found Verna's arguments "pretty dismissive". Verna was apologetic, but acknowledged that he has a bias that may not be universally shared:
I am a developer and I don't have a strong interest in the OS, I just expect it to work and provide me the tools needed to do my job. To me that's the beauty of FCOS, I get a solid, tested OS that get automated updates and just works, I honestly don't care to know which version of Fedora Linux it is based on or which features it has. I want to spin-up an instance make sure that my application works and forget about it. I also understand that there are other type of users that will care much more about the base OS than me :-).
It is the inclusion of "Fedora" in the FCOS name that is causing much of the problem, Ron Olson said. "I was surprised when I learned Fedora CoreOS didn't support cgroups v2 and that confused me; it's Fedora, of course it would have the latest-n-greatest." He noted that he had used CoreOS before Red Hat bought the company and did not have those kinds of expectations in those days.
Though he recognized the likely futility of the idea, he suggested that a name change might help:
I'm guessing this is laughably not possible, but I'm going to suggest anyway that maybe it be renamed either back to simply "CoreOS" or something new like "Bowler" or whatever that indicates that it is its own special thing and expectations can be set accordingly.
Verna acknowledged that the Fedora name brings along some expectations, but he also noted that FCOS is less than two years old at this point, so it is to be expected that there will be some rough spots that need to be worked out:
FCOS has a different release model than Fedora Linux and I think it is fair to give it time to, on one hand continue to improve how features are making their way in FCOS, and on the other hand get people be more familiar with what FCOS is and what expectations to have about it.
The cgroups issue reared its head several times in the discussion, though Colin Walters thought that the issue had been beaten to death long before. In addition, as Mabe noted, FCOS does already support cgroups v2; it is just not the default. Over the next month, that will be changing so that v2 is the default going forward:
We're trying to make sure users have a good experience. Docker users are a big part of that. Changing the default before Docker supported cgroups v2 was really not an option for us at the time.
The proposal to make Fedora CoreOS into an edition was originally targeted for Fedora 34, but that was not to be. The Change entry has been pushed to Fedora 35 and the Fedora Engineering Steering Committee (FESCo) issue tracking the change proposal was closed at the end of February. So far, no change proposal has been submitted for Fedora 35, though there is still plenty of time to do so. This discussion might indicate that it is still a bit too early to make that change, but time will tell.
Making CPython faster
Over the last month or so, there has been a good bit of news surrounding the idea of increasing the performance of the CPython interpreter. At the 2021 Python Language Summit in mid-May, Guido van Rossum announced that he and a small team are being funded by Microsoft to work with the community on getting performance improvements upstream into the interpreter—crucially, without breaking the C API, so that the ecosystem of Python extensions (e.g. NumPy) continues to work. Another talk at the summit looked at Cinder, which is a performance-oriented CPython fork that is used in production at Instagram. Cinder was recently released as open-source software, as was another project to speed up CPython that originated at Dropbox: Pyston.
There have been discussions on and development of performance enhancements for CPython going back quite a ways; it is a perennial topic at the yearly language summit, for example. More recently, Mark Shannon proposed a plan that could, he thought, lead to a 5x speedup for the language by increasing its performance by 50% in each of four phases. It was an ambitious proposal, and one that required significant monetary resources, but it seemed to go nowhere after it was raised in October 2020. It now seems clear that there were some discussions and planning going on behind the scenes with regard to Shannon's proposal.
Faster CPython
After an abbreviated retirement, Van Rossum went to work at Microsoft toward the end of 2020 and he got to choose a project to work on there. He decided that project would be to make CPython faster; to that end, he has formed a small team that includes Shannon, Eric Snow, and, possibly, others eventually. The immediate goal is even more ambitious than what Shannon laid out for the first phase in his proposal: double the speed of CPython 3.11 (due October 2022) over that of 3.10 (now feature-frozen and due in October).
The plan, as described in the report on the talk by Joanna Jablonski (and outlined in Van Rossum's talk slides), is to work with the community in the open on GitHub. The faster-cpython repositories are being used to house the code, ideas, tools, issue tracking and feature discussions, and so on. The work is to be done in collaboration with the other core developers in the normal, incremental way that changes to CPython are made. There will be "no surprise 6,000 line PRs [pull requests]" and the team will be responsible for support and maintenance of any changes.
The main constraint is to preserve ABI and API compatibility so that the extensions continue to work. Keeping even extreme cases functional (the example given is pushing a million items onto the stack) and ensuring that the code remains maintainable are both important as well.
While the team can consider lots of different changes to the language implementation, the base object type and the semantics of reference counting for garbage collection will need to stay the same. But things like the bytecode, stack-frame layout, and the internals of non-public objects can all be altered for better performance. Beyond that, the compiler that turns source code into bytecode and the interpreter, which runs the bytecode on the Python virtual machine (VM), are fair game as well.
In a "meta"
issue in the GitHub tracker, Van Rossum outlined the three main pieces of the plan
for 3.11. They all revolve around the idea of speeding up the
bytecode interpreter through speculative
specialization, which adapts the VM to run faster on some code because
the object being operated on is of a known and expected type (or has some
other attribute that can be determined with a simple test).
Shannon further described what is being proposed in PEP 659 ("Specializing Adaptive Interpreter"), which he announced on the python-dev mailing list in mid-May. The "Motivation" section of the PEP explains the overarching idea:
Typical optimizations for virtual machines are expensive, so a long "warm up" time is required to gain confidence that the cost of optimization is justified. In order to get speed-ups rapidly, without [noticeable] warmup times, the VM should speculate that specialization is justified even after a few executions of a function. To do that effectively, the interpreter must be able to optimize and deoptimize continually and very cheaply. By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.
In order to do these optimizations, Python code objects will be modified in a process called "quickening" once they have been executed a few times. The code object will get a new, internal array to store bytecode that can be modified on-the-fly for a variety of optimization possibilities. In the GitHub issue tracking the quickening feature, Shannon lists several of these possibilities, including switching to "super instructions" that do much more (but more specialized) work than existing bytecode instructions. The instructions in this bytecode array can also be changed at run time in order to adapt to different patterns of use.
During the quickening process, adaptive versions of instructions that can benefit from specialization are placed in the array instead of the regular instructions; the array is not a Python object, but simply a C array containing the code in the usual bytecode format (8-bit opcode followed by 8-bit operand). The adaptive versions determine whether to use the specialization or not:
CPython bytecode contains many bytecodes that represent high-level operations, and would benefit from specialization. Examples include CALL_FUNCTION, LOAD_ATTR, LOAD_GLOBAL and BINARY_ADD. By introducing a "family" of specialized instructions for each of these instructions allows effective specialization, since each new instruction is specialized to a single task. Each family will include an "adaptive" instruction, that maintains a counter and periodically attempts to specialize itself. Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster provided their inputs are as expected. Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected. Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed. If the counter reaches the minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
The PEP goes on to describe two of these families (CALL_FUNCTION and LOAD_GLOBAL) and the kinds of specializations that could be created for them. For example, there could be specialized versions to call builtin functions with one argument or to load a global object from the builtin namespace. It is believed that 25-30% of Python instructions could benefit from specialization. The PEP only gives a few examples of the kinds of changes that could be made; the exact set of optimizations, and which instructions will be targeted, are still to be determined.
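To make the counter mechanics concrete, here is a small, self-contained C sketch of the specialize/deoptimize cycle described above; all of the names (execute_add(), ADD_INT_SPECIALIZED, and so on) are invented for this illustration and are not CPython's:

    /* Toy demonstration of the adaptive-specialization pattern from
     * PEP 659: an "instruction" speculates that its operand is an
     * integer, keeps a saturating counter, and deoptimizes back to
     * the adaptive version when the speculation keeps failing. */
    #include <stdio.h>

    enum { COUNTER_MIN = 0, COUNTER_MAX = 7 };
    enum opcode { ADD_ADAPTIVE, ADD_INT_SPECIALIZED };

    struct instr {
        enum opcode op;
        int counter;            /* saturating confidence counter */
    };

    static int add_generic(int is_int, int a, int b)
    {
        /* Stand-in for the slow, fully general BINARY_ADD path. */
        return is_int ? a + b : 0;
    }

    static int execute_add(struct instr *i, int operand_is_int, int a, int b)
    {
        if (i->op == ADD_INT_SPECIALIZED) {
            if (operand_is_int) {
                /* Fast path: speculation succeeded. */
                if (i->counter < COUNTER_MAX)
                    i->counter++;
                return a + b;
            }
            /* Speculation failed: decrement and maybe deoptimize. */
            if (--i->counter <= COUNTER_MIN) {
                i->op = ADD_ADAPTIVE;
                printf("deoptimized back to the adaptive version\n");
            }
            return add_generic(operand_is_int, a, b);
        }
        /* Adaptive version: re-specialize after seeing enough ints. */
        if (operand_is_int && ++i->counter >= COUNTER_MAX / 2) {
            i->op = ADD_INT_SPECIALIZED;
            printf("specialized for integer operands\n");
        }
        return add_generic(operand_is_int, a, b);
    }

    int main(void)
    {
        struct instr add = { ADD_ADAPTIVE, 0 };
        for (int n = 0; n < 10; n++)
            execute_add(&add, 1, n, n);     /* warm up with ints */
        for (int n = 0; n < 10; n++)
            execute_add(&add, 0, 0, 0);     /* misses force deoptimization */
        return 0;
    }

The real interpreter would do this per instruction in the quickened bytecode array, but the pattern is the same: a cheap expectation check, a saturating counter, and an opcode swap in either direction.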
Other CPythons
In Dino Viehland's summit talk about Cinder, he described a feature called "shadow bytecode" that is similar to what is being proposed in PEP 659. Cinder, though, is not being run as an open-source project; the code is used in production, however, and is being made available to potentially adapt parts of it for upstream CPython. Some parts of Cinder have already been added to CPython, including two enhancements (bpo-41756 and bpo-42085) that simplified coroutines to eliminate the use of the StopIteration exception: "On simple benchmarks, this was 1.6 times faster, but it was also a 5% win in production."
Pyston takes a somewhat different approach than what is being proposed, but there are overlaps (e.g. quickening). As described in its GitHub repository, one of the techniques Pyston uses is a just-in-time (JIT) compiler with "very low overhead", built using the dynamic assembler (DynASM) from the LuaJIT project. Using Pyston provides around 30% better performance on web applications.
Both Cinder and Pyston are based on Python 3.8, so any features that are destined for upstream will likely need updating. The PEP 659 effort, by contrast, is being done directly within the community, which is not something either of the other two projects was able to do; both started as internal closed-source "skunkworks" projects that have only recently seen the light of day. How much of that work will be useful in upstream CPython remains to be seen.
It will be interesting to watch the work of Van Rossum's team as it tries to reach a highly ambitious goal. Neither of the other two efforts achieved performance boosts anywhere close to the 100% increase targeted by the specializing adaptive interpreter team, though Shannon said that he had working code to fulfill the 50% goal for his first phase back in October. Building on top of that code makes 2x (or 100% increase) seem plausible at least and, if that target can be hit, the overall 5x speedup Shannon envisioned might be reached as well. Any increase is welcome, of course, but those kinds of numbers would be truly eye-opening—stay tuned ...
printk() indexing
When kernel developers want to communicate something about the state of a running kernel, they tend to use printk(); that results in a log entry that is intended — with varying success — to be human-readable. As it happens, though, the consumers of that information are often not human; the kernel's log output is also read by automated monitoring systems that are looking for problems. The result is an impedance mismatch that often ends with the monitoring system missing important messages. The printk() format indexing patch set is the latest of many attempts to improve this situation.
Monitoring systems are installed by administrators who want to know when there is a problem with the systems they manage. So, for example, if the CPU bursts into flames, and the administrator doesn't happen to be in the room to witness the event, they would at least like to receive an alert telling them to call their hardware vendor and the fire department, probably in that order. To produce this alert, the monitoring system will be watching the kernel log for the "CPU on fire" message printed by the relevant code in the kernel. If all goes well, the message will escape before the CPU melts and the replacement system can be ordered in a timely manner.
Then, one day, along comes a well-meaning contributor of trivial patches who decides that the message would be more aesthetically pleasing if it read "CPU in flames" instead. The resulting patch will be duly merged by the maintainer after a rigorous review process and shipped in a stable kernel update; the administrator, upon seeing the improved message, will be overcome by the beauty of the kernel's expression.
But that will only happen if the administrator sees this message. Unfortunately, the monitoring system is still looking for the old "CPU on fire" message; it is not only unmoved by the new phrasing, but it also misses the message entirely. So no alarm is sent, and the administrator continues the background project of testing the latency of a favorite social-networking site until the sprinklers go off and make a huge mess. By then it is too late in the day to order a replacement system and the service goes down.
System administrators hate it when that happens.
The problem is that, while messages emitted by way of printk() (and all the functions layered on top of it) are consumed by user-space tools, those messages have never really been considered to be a part of the kernel's ABI, so kernel developers feel free to change (or delete) them at any time. That creates trouble for anybody who has created a massive collection of regular expressions meant to detect the messages of interest; it also thwarts any effort at translating messages into other languages.
There have been many attempts to address this problem over the years. The 2011 Kernel Summit featured a combative session around a proposal to add a 128-bit binary tag to every kernel message. That idea, like all others before and after, failed to get anywhere near the mainline. The kernel community has, quite simply, never felt the need to impose any structure on the logging information it puts out via printk(); the additional effort that would be required does not seem to be worth the cost.
The latest proposal, posted by Chris Down, does not attempt to impose any such structure. Instead, it simply collects every printk() format string built into the kernel and makes them all available via a file in debugfs. Specifically, printk() becomes a wrapper macro that declares a new structure to hold the format information:
    struct pi_entry {
        const char *fmt;
        const char *func;
        const char *file;
        unsigned int line;
        const char *level;
        const char *pre_fmt;
        const char *post_fmt;
    };
This structure holds the format string passed to printk(), along with the log level, information on where in the source this call is to be found, and "pre" and "post" data that is added by various wrappers (such as the dev_printk() functions). This structure is placed into a special section (.printk_index) of the built kernel. The wrapper also calls the original printk() — now called _printk() — to output the message, of course.
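Conceptually, the wrapper looks something like the following simplified sketch; this is an illustration of the approach rather than the actual patch, and details such as the attribute syntax and the handling of the level and pre/post strings are glossed over:

    /* Simplified sketch of the indexing wrapper: each printk() call
     * site emits one static pi_entry into the .printk_index section
     * and then calls the renamed _printk() to do the real work. */
    #define printk(fmt, ...)                                          \
    ({                                                                \
            static struct pi_entry _entry                             \
                __attribute__((used, section(".printk_index"))) = {   \
                    .fmt  = fmt,                                      \
                    .func = __func__,                                 \
                    .file = __FILE__,                                 \
                    .line = __LINE__,                                 \
            };                                                        \
            _printk(fmt, ##__VA_ARGS__);                              \
    })

Because the structure is static, the per-call-site cost is a bit of data in a dedicated section; the runtime path of the message itself is unchanged.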
When the kernel is built, all of those pi_entry structures are placed together in the .printk_index section. After mounting debugfs, the administrator can look at printk/index/vmlinux, which contains the entire set of formats; there are also files under printk/index for each loadable module. Paging through these files, which will contain thousands of format strings, is likely to be almost as interesting as the administrator's social-media feed, but that is not really the intended use. Instead, monitoring systems can use this information to make sure that all of their tests still match messages that the kernel will actually emit. If the "CPU on fire" test no longer finds a match, then either the CPU has been rendered fireproof or the message has been changed. Should the latter prove to be the case, the test can be updated accordingly.
There are a couple of interesting questions that have not, as of this writing, been raised in the minimal review that this patch set has seen. The first would be: why does this information need to be built into the kernel? It could be placed into a separate file that would not require memory on a running system. The answer, perhaps, is that this mechanism makes it easy to keep the correct set of format strings with the kernel as it is deployed across systems.
The other is that debugfs is explicitly not meant for production systems, but this feature looks like it is meant to be used with production kernels. Should this mechanism be accepted into the mainline, it may have to find a home in a more stable setting, such as /sys.
Whether it will be accepted remains to be seen, though. Since this mechanism does not require any additional effort from kernel developers, and since it imposes no cost when it is turned off, it might encounter less resistance than the previous efforts to ease automated parsing of kernel log messages. If so, monitoring systems will not be shielded from changing kernel log messages, but they will at least know when their tests need updating.
eBPF seccomp() filters
The seccomp() mechanism allows a process to load a BPF program to restrict its future use of system calls; it is a simple but flexible sandboxing mechanism that is widely used. Those filter programs, though, run on the "classic" BPF virtual machine, rather than the extended BPF (eBPF) machine used elsewhere in the kernel. Moving seccomp() to eBPF has been an often-requested change, but security concerns have prevented that from happening. The latest attempt to enable eBPF is this patch set from YiFei Zhu; whether it will succeed where others have failed remains to be seen.
The purpose of a BPF program under seccomp() is to make a decision about whether a given system call should be allowed; to that end, these programs have limited access to the system-call arguments. There is also a notification mechanism by which decisions can be punted to a user-space daemon if needed. By using a filter program, tools like browsers or container-management systems can place limits on what they or their subprocesses can do.
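Today, such a filter is assembled from raw classic-BPF instructions in the program's C code; this minimal sketch (error handling omitted, loosely in the style of the examples in the seccomp() man page) gives a feel for why developers would rather write filters in C and compile them for eBPF:

    /* A minimal classic-BPF seccomp() filter: kill the process if it
     * calls getpid(), allow everything else.  Error handling (and the
     * architecture check a real filter needs) is omitted for brevity. */
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    int main(void)
    {
        struct sock_filter filter[] = {
            /* Load the system-call number from struct seccomp_data. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            /* If it is getpid(), fall through to KILL; else skip to ALLOW. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* Required for an unprivileged process to install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

        syscall(__NR_getpid);   /* the filter kills the process here */
        return 0;
    }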
There are a number of reasons for wanting to use eBPF to write these programs — essentially, all of the motivations that led to the creation of eBPF in the first place. Switching to eBPF would make a number of new features available to seccomp() filter programs, including maps, helper functions, per-task storage, a more expressive instruction set, and more. Programs for eBPF can be written in C, which is not possible for classic-BPF programs — a problem that has led to the creation of special languages like easyseccomp. There is a whole ecosystem of tools for eBPF that developers using seccomp() would like to use.
Given all of that, one might think that using eBPF with seccomp() would be uncontroversial; the roadblock in this case is security worries. The current mechanism is relatively simple and easy to verify; eBPF brings a whole new level of complexity to worry about. Applying a filter program with seccomp() is an unprivileged operation, and it would need to stay that way, but the BPF developers have given up on the idea of making eBPF safe for unprivileged use. Nobody is interested in turning seccomp() into a security problem in its own right.
Zhu hopes to avoid this pitfall by adding a number of restrictions to eBPF filter programs, to the point that, for the most part, users cannot do anything with eBPF that is not already doable with classic BPF and user-space notifiers. The biggest exception, perhaps, is that access to maps and the set of standard helpers is allowed; the reasoning here is that unprivileged users can gain access to those facilities via socket-filter programs now, so nothing new is being exposed here. The patch set does add a Linux security module hook controlling access to these features from seccomp() filter programs, though, allowing them to be disabled if desired. There is also a patch to the Yama security module allowing easy control over this functionality.
The additional set of eBPF helpers that is provided for tracing programs can be made available to filter programs as well, but only if the user who loads the filter program has the necessary privileges (CAP_BPF and CAP_PERFMON) to load tracing programs. These helpers, among other things, provide access to memory in ways that could be useful in the seccomp() context — looking more deeply at system-call arguments, for example. There is also a mechanism for storing state within a task for use across filters, but it requires privilege to be truly useful.
It is worth noting that the privilege checks for these features are done at the time that a BPF program of the new type (BPF_PROG_TYPE_SECCOMP) is loaded; attachment of a filter program to a process is always unprivileged. It is thus possible for a privileged daemon to load a set of approved programs and pass them to other users, who would then be able to use a more complete set of eBPF features.
Getting a patch series like this merged will require convincing two different sets of people — the BPF maintainers and security-oriented developers. The picture on the BPF side is unclear. Alexei Starovoitov, the creator of eBPF, asserted in May 2019 that "seccomp needs to start using ebpf". Three months later, instead, his position was: "I'm absolutely against using eBPF in seccomp". The change was part of his general shift against making eBPF available to unprivileged users; he feared that it could never be made secure at a reasonable cost and there were few users for it in any case. In the discussion of Zhu's patches, though, he has only asked about details of the implementation and has not expressed opposition to the idea overall. So perhaps the BPF side is ready to accept eBPF being used with seccomp().
Convincing the security folks might be harder. Back in 2018, Kees Cook was strongly opposed to using eBPF in seccomp(); he said that it moves far too quickly and has experienced too many security issues to be usable in that setting. In the current discussion, Andy Lutomirski has let it be known that the patch set would encounter a "very high bar to acceptance". He worried that it would be harder to verify that the implementation is secure if the more complex eBPF system is used, and pointed to the continued resistance to properly supporting unprivileged use of eBPF as an ongoing problem.
The discussion stopped there. This disagreement could prove fatal to the idea of integrating eBPF with seccomp(); the BPF developers do not want to try to support unprivileged use, while the seccomp() developers are requiring that support. In the absence of some sort of solution, the current eBPF-in-seccomp() work seems likely to end up in the same place as its predecessors — and not in the mainline. That is unfortunate, as this is functionality that seccomp() users would like to have.
Top-tier memory management
Modern computing systems can feature multiple types of memory that differ in their performance characteristics. The most common example is NUMA architectures, where memory attached to the local node is faster to access than memory on other nodes. Recently, persistent memory has started appearing in deployed systems as well; this type of memory is byte-addressable like DRAM, but it is available in larger sizes and is slower to access, especially for writes. This new memory type makes memory allocation even more complicated for the kernel, driving the need for a method to better manage multiple types of memory in one system.
NUMA architectures contain some memory that is close to the current CPU, and some that is further away; remote memory is typically attached to different NUMA nodes. There is a difference in access performance between local and remote memory, so the kernel has gained support for NUMA topologies over the years. To maximize NUMA performance, the kernel tries to keep pages close to the CPU where they are used, but also allows the distribution of virtual memory areas across the NUMA nodes for deterministic global performance. The kernel documentation describes ways that tasks may influence memory placement on NUMA systems.
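One such control is the set_mempolicy() system call, which lets a task steer where its future allocations land; this minimal sketch (using the raw system call to avoid a libnuma dependency, with error handling omitted) binds the calling task's allocations to node 0:

    /* Bind this task's future allocations to NUMA node 0 with
     * set_mempolicy(2); error handling is omitted for brevity. */
    #include <linux/mempolicy.h>    /* MPOL_BIND */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned long nodemask = 1UL << 0;      /* node 0 only */

        /* set_mempolicy(mode, nodemask, maxnode) */
        syscall(__NR_set_mempolicy, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8);

        /* Pages are placed on node 0 when they are faulted in. */
        char *buf = malloc(1 << 20);
        for (size_t i = 0; i < (1 << 20); i += 4096)
            buf[i] = 1;                         /* touch each page */
        free(buf);
        return 0;
    }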
The NUMA mechanism can be extended to handle persistent memory as well, but it was not really designed for that case. The future may bring even more types of memory, such as High Bandwidth Memory (HBM), which stacks DRAM silicon dies and provides a larger memory bus. Sooner or later, it seems that a different approach will be needed.
Recently, kernel developers have been discussing a possible solution to the problem of different memory types: adding the notion of memory tiers. The proposed code extends the NUMA model with features like migrating infrequently used pages to slow memory, migrating hot pages back to fast memory, and a control mechanism for this behavior. The changes to the memory-management subsystem to support different tiers are complex; the developers are discussing three related patch sets, each building on those that came before.
Migrating from fast to slow memory
The first piece of the puzzle takes the form of a patch set posted by Dave Hansen. It improves the memory-reclaim process, which normally kicks in when memory is tight and pushes out the content of rarely used pages. Hansen said that, in a system with persistent memory, those pages could instead be migrated from DRAM to the slower memory, maintaining access to them if they are needed again. Hansen noted in the cover letter that this is a partial solution, as migrated pages will be stuck in slow memory with no path back to faster DRAM. This mechanism is optional; users will be able to enable it on demand by setting the value 8 in the bitmask held in the vm.zone_reclaim_mode sysctl (exposed as /proc/sys/vm/zone_reclaim_mode).
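On a kernel carrying these patches, turning the behavior on would be a matter of writing that value into the file; a trivial sketch:

    /* Enable the migrate-on-reclaim behavior by writing 8 into the
     * zone_reclaim_mode bitmask.  This only has an effect on kernels
     * carrying Hansen's patch set; a shell one-liner works as well. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

        if (!f)
            return 1;
        fputs("8", f);          /* bit 0x8: migrate instead of discard */
        fclose(f);
        return 0;
    }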
The patch set received some initial positive reviews, including one from Michal Hocko, who noted that the feature could also be useful in traditional NUMA systems without memory tiers.
...and back
The second part of the puzzle is a migration of frequently used pages from slow to fast memory. This has been proposed in a patch set by Huang Ying.
In current kernels, NUMA balancing works by periodically unmapping a process's pages. When there is a page fault caused by access to an unmapped page, the migration code decides whether the affected page should be moved to the memory node where the page fault occurred. The migration decision depends on a number of factors, including frequency of access and the availability of memory on the accessing node.
The proposed patch takes advantage of those page faults to make a better estimation of which pages are hot; it replaces the current algorithm, which considers the most recently accessed pages to be hot. The new estimate uses the elapsed time between the time the page was unmapped and the page fault as a measure of how hot the page is, and offers a sysctl knob to define a threshold: kernel.numa_balancing_hot_threshold_ms. All pages with a fault latency lower than the threshold will be considered hot. Correctly setting this threshold may be difficult for the system administrator, so the final patch of the series implements a method to automatically adjust it. To do that, the kernel monitors the number of pages being migrated with the user-configurable balancing rate limit numa_balancing_rate_limit_mbps, then it increases or decreases the threshold to bring the rate closer to that value.
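The auto-adjustment amounts to a simple feedback loop; the following self-contained C sketch shows the idea, with all names invented for this illustration rather than taken from the patches:

    /* Illustrative sketch of the threshold auto-tuning described
     * above: nudge the hot threshold so that the observed migration
     * rate approaches the configured limit. */
    #include <stdio.h>

    static unsigned long hot_threshold_ms = 1000;

    static void adjust_hot_threshold(unsigned long migrated_mbps,
                                     unsigned long rate_limit_mbps)
    {
        if (migrated_mbps > rate_limit_mbps) {
            /* Migrating too much: require a shorter fault latency
             * before a page is considered hot. */
            if (hot_threshold_ms > 100)
                hot_threshold_ms -= 100;
        } else {
            /* Under the limit: let more pages qualify as hot. */
            hot_threshold_ms += 100;
        }
    }

    int main(void)
    {
        /* Simulate a few sampling periods with a 100MB/s limit. */
        unsigned long observed[] = { 250, 180, 120, 80, 40 };

        for (int i = 0; i < 5; i++) {
            adjust_hot_threshold(observed[i], 100);
            printf("period %d: threshold now %lu ms\n",
                   i, hot_threshold_ms);
        }
        return 0;
    }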
Controlling memory tiers
Finally, Tim Chen submitted a proposal for management of the configuration of memory tiers, and the top-level tier containing the fastest memory in particular. The proposal is based on control groups version 1 (Chen noted that support of version 2 is in the works), and monitors the amount of top-tier memory used by the system and by each control group individually. It uses soft limits and enables kswapd to move pages to slower memory in control groups that exceed their soft limit. Since the limit is soft, a control group may be allowed to exceed the limit if top-tier memory is plentiful, but may be quickly cut back to the limit if that resource is tight.
In Chen's proposal, the soft limit for a control group is called toptier_soft_limit_in_bytes. The code also tracks the global usage of top-tier memory; if the free memory falls under toptier_scale_factor/10000 of the overall memory of the node it is attached to, kswapd will start memory reclaim focused on control groups that exceed their soft limit. If toptier_scale_factor were set to 2000, for example, reclaim would kick in once less than 20% of that node's memory remained free.
Hocko disliked the idea of using soft limits:
In the past we have learned that the existing implementation is unfixable and changing the existing semantic impossible due to backward compatibility. So I would really prefer the soft limit just find its rest rather than see new potential use cases.
The likely reasons for Hocko's dislike of soft limits come from previous attempts to change the interface (LWN covered the discussions in 2013 and 2014). The default soft limit is "unlimited", and this cannot be changed without risking backward-compatibility problems.
Further into the discussion, Shakeel Butt asked about a use case where high-priority tasks would obtain better access to the top-tier memory, which would be more strictly limited for low-priority tasks. Yang Shi pointed to earlier work that divided fast and slow memory for different tasks, and concluded that the solution was hard to use in practice, as it required good knowledge of the hot and cold memory in the specific system. The developers discussed more fine-grained control of the type of memory used, but did not reach a conclusion.
Before the discussion stopped, Hocko offered some ideas on how the interface could work: differing types of memory would be configured into separate NUMA nodes, and tasks could indicate their preferences for which nodes should host their memory. Some nodes might be reclaimed ahead of others when memory pressure hits. He further noted that this mechanism should be generic, not based on the location of persistent memory in the CPU nodes:
I haven't been proposing per NUMA limits. [...] All I am saying is that we do not want to have an interface that is tightly bound to any specific HW setup (fast RAM as a top tier and PMEM as a fallback) that you have proposed here. We want to have a generic NUMA based abstraction.
Next steps
None of the patch sets had been merged as of this writing, and it does not look like that will happen soon. Changes in memory management take time, and it seems that the developers need to agree on the way to control the usage of fast and slow memory in different workloads before a solution will appear. The top-tier management patches are explicitly meant as discussion fodder and are not intended for merging in their current form in any case. We will likely see more discussion on the subject in the coming months.
Page editor: Jonathan Corbet