LWN.net Weekly Edition for September 21, 2023
Welcome to the LWN.net Weekly Edition for September 21, 2023
This edition contains the following feature content:
- Using the limited C API for the Python stdlib?: the CPython interpreter offers a limited API for the creation of extensions that are portable across Python releases, but the standard library does not use it. Some developers would like to change that situation; not everyone agrees.
- The European Cyber Resilience Act: an overview of proposed European legislation that could create liability for free-software developers.
- Shrinking shrinker locking overhead: a proposal to improve the locking around shrinkers in the kernel.
- Moving physical pages from user space: a new system call to manage pages in the physical address space.
- Why glibc's fstat() is slow: a misunderstanding over the kernel's system-call ABI causes a GNU C Library system-call wrapper to be slower than it needs to be.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Using the limited C API for the Python stdlib?
The "limited" C API for CPython extensions has been around for well over a decade at this point, but it has not seen much uptake. It is meant to give extensions an API that will allow binaries built with it to be used for multiple versions of CPython, because those binaries will only access the stable ABI that will not change when CPython does. Victor Stinner has been working on better definition for the API; as part of that work, he suggested that some of the C extensions in the standard library start using it in an effort for CPython to "eat its own dog food". The resulting discussion showed that there is still a fair amount of confusion about this API—and the thrust of Stinner's overall plan.
The limited API comes from PEP 384 ("Defining a Stable ABI"), but that is largely a historical document at this point. The C API Stability document and developers guide both have more up-to-date information. There are several APIs available that extensions can use, but only the limited API provides ABI stability between major releases of CPython (e.g. from 3.11 to 3.12); packages using the other APIs will need to be rebuilt in order to ensure that they work with a new major (or even minor, in the case of the unstable API) release.
At the end of August, Stinner wondered
about switching some of the C-based extensions in the standard library to
use the limited API. The goal is to more extensively test the API and to
promote it by example: "Using private C API functions and the internal C
API should be the exception, not the default in Python stdlib
". While
the standard library itself is rebuilt and packaged with a new CPython
release, other
extensions will benefit from moving to the stable ABI (also known
as "abi3"), which comes from
using the limited API:
The stable ABI makes the distribution of package binaries easier. For example, binaries are already available before the new Python is being released! It makes newer Python usable since the first day of its release, because it's simply the same binary for all Python versions. (One binary per platform+architecture is still needed.)
It turns out that at least a few standard library modules are already using the limited API, so all that is needed in order to "convert" them is a line that declares that the module uses the limited API:
    #define Py_LIMITED_API 0x030d0000   /* value is version 3.13 */

Defining Py_LIMITED_API hides symbols (functions and other interface elements) that are not part of the limited API, so that they cannot be used in the extension code. Other modules can be converted with only minor changes to them, Stinner said. There are some standard library modules that will not be changed, yet, because their performance suffers from being unable to use the internal API. He reported a performance regression in the C extension for the statistics module as part of closing his pull request to switch it to the limited API; in the end, he decided that doing so made little sense. But, for many other modules, "there is no significant impact on performance".
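To illustrate what such a conversion amounts to, here is a minimal sketch, not taken from the standard library, of an extension module that restricts itself to the limited API; the "demo" module and its square() function are invented for the example:

    #define Py_LIMITED_API 0x030d0000   /* build against the 3.13 limited API */
    #include <Python.h>

    /* A trivial function that uses only limited-API calls. */
    static PyObject *
    demo_square(PyObject *self, PyObject *arg)
    {
        long n = PyLong_AsLong(arg);
        if (n == -1 && PyErr_Occurred())
            return NULL;
        return PyLong_FromLong(n * n);
    }

    static PyMethodDef demo_methods[] = {
        {"square", demo_square, METH_O, "Square an integer."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef demo_module = {
        PyModuleDef_HEAD_INIT, "demo", NULL, -1, demo_methods,
    };

    PyMODINIT_FUNC
    PyInit_demo(void)
    {
        return PyModule_Create(&demo_module);
    }

Because only stable-ABI symbols are visible during the build, the resulting binary (tagged "abi3") should keep loading on later CPython releases without a rebuild.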
Stinner tried to start converting standard library modules back in 2020, but ran into a few different problems, which have now largely been resolved. Beyond performance degradation, he has also encountered API calls that are not part of the limited API, but that perhaps should be considered for inclusion. He wondered what other core developers thought about converting some standard library modules.
Barry Warsaw liked the idea as a way for the project to test out the limited API itself. He thought that doing so might also help if it was decided to move some modules out of the standard library, since a Python Package Index (PyPI) replacement could then have a binary wheel ready and waiting for new CPython releases. Alex Gaynor was also in favor:
I like the idea of eating our own dog food. I'm also the author of several of those packages that use abi3 wheels, so I have a strong [interest] in the limited API becoming better :)

I'm also sympathetic to the people who will say, "eating our own dog food isn't a good enough reason to lose performance", so I think it would be a very good outcome of this process if, wherever we identify areas for improvement by eating our own dog food, we make the dog food taste better.
But Guido van Rossum was
less enthused with the idea; he thought it would lead to a lot of churn
and a bunch of pull requests (PRs) "that few people care to review, and
that will increase everybody's frustration (not just yours) with how hard
it is to get people to review PRs
". The standard library modules are
not broken, so he wondered why they were being "fixed":
"Eat your own dogfood" is a fine idea, and I think it's great to apply it to new modules. Just like we sometimes add [type] annotations to new code, despite our general reluctance to add annotations to existing code (especially stdlib code). I feel the same ought to apply here: let's not try to "fix" existing modules, because they aren't broken, and ultimately there is no reason for the stdlib to use the limited API.
Stinner replied with a list of links to commits and issue discussions as background about the effort, going back to 2018, but Van Rossum was concerned that the underlying motivation was somewhat suspect:
The argument seems to be "dogfooding is good" and possibly that stdlib modules are used widely as "example code" so best practices should be followed? Those aren't technical reasons though – IMO this smells like technical solutions for social problems.
But Stinner said
that there is an underlying technical reason as well: "the limited C API
is badly tested by Python itself
". That has led to finding bugs after
a release had already been made; if some parts of the standard library were
built and tested with the limited API, those bugs could be found and fixed
well before a
release is made. In addition, converting real extensions will help show
any gaps in the API functions available in the limited API.
Van Rossum suggested moving slowly with any changes to the standard library and wanted to discuss the C API at the upcoming core developer sprint in October. He also outlined his understanding of the different APIs that are available for CPython, along with what the guarantees are for extensions that use them. It is a somewhat complicated picture that Stinner is trying to clarify as part of his work.
But the stable ABI has been around for quite some time at this point,
Marc-André Lemburg said,
and has seen limited adoption, so perhaps the effort should
be redirected into helping extensions
remain compatible with a range of Python versions. Those extensions
would need to be recompiled for new CPython major versions, "but that's
easily done using
cibuildwheel
". That tool
can build and test binary wheels for multiple operating systems and Python
versions as part of a project's continuous-integration (CI) process.
The tooling for building extensions has not helped with adoption of the limited API, though, Paul Moore said. Currently, the tools default to using the full C API for extensions, so that is generally what extension authors do; if that changed, adoption might grow substantially. Beyond that, Petr Viktorin pointed out that cibuildwheel only helps with extensions that are on PyPI; applications that use CPython as a way to create their own plugins and extensions want to use the stable ABI so they can work with multiple CPython versions. A Vim commit outlined the situation well, he said.
In a lengthy message, Stinner described
the overall problem he is trying to help solve: having extensions be more
quickly available at the time a new CPython is released. He works on
Fedora, which will be shipping the newly minted CPython 3.12 (due in early
October) in Fedora 39, which is slated for mid-October; the hope is to have
up-to-date versions of most of the popular extensions available by that
time. He sees the limited API (thus stable ABI) as being a key facilitator
of that for future CPython releases. "If we can help maintainers to move
towards the limited C API, you can expect having more C extensions to be
usable at day 1 of Python 3.13 release.
" That will also help the
maintainers of the extensions, since users will not be clamoring for them
to update their extension as soon as a new Python is available.
In Van Rossum's mind, it is "the requirement that once 3.x.0 is released
all 3rd party packages should be instantly available
" that is the root
cause of the problem; he suggested resetting user expectations since that
is never going to be achievable. Viktorin wondered
what kind of time frame would be reasonable to expect most third-party
extensions to become available. Van Rossum replied
that it has generally taken a few months after the release to get to that
point, but thought that package maintainers should be encouraged to start
putting together wheels for their modules once the first release candidate
of a new CPython is released. He is also "unhappy about the pressure I
am currently feeling to make it our fault if not every 3rd party package
works on day one
".
As might be guessed, Stinner disagreed
with much of that. He has no silver bullet, but getting more packages to
use the limited API will lead to more of them being available on day one.
Meanwhile, maintainers should not be subjected to additional pressure to
update their builds; they "prefer to work on new features, or fix their
own bugs, rather than following Python C API changes
". Van Rossum tired
of the discussion, however, and wanted to wait until they could talk
about the issue face-to-face in October.
There were other sub-topics in the thread, of course, but the question of
what to do for the standard library, if anything, will presumably be the
subject of a
lively discussion at the sprint. Van Rossum seems unconvinced that the
stable ABI has much to offer ("I still feel that the Stable ABI is a
solution largely in search of a problem
"), but other core developers
(and extension authors) disagree. In the end, it seems unlikely that there
will be any movement away from supporting the limited API, though the effort to
broaden its reach—in CPython itself at least—is still up in the air.
The European Cyber Resilience Act
The security of digital products has become a topic of regulation in recent years. Currently, the European Union is moving forward with another new law, which, if it comes into effect in a form close to the current draft, will affect software developers worldwide. This new proposal, called the "Cyber Resilience Act" (CRA), imposes mandatory security requirements on all digital products, both software and hardware, that are available in Europe. While it aims at a worthy goal, the proposal is causing a stir among open-source communities.
There is a reason why the open-source world has concerns: the legislation indirectly defines who is responsible for the security of open source and who should pay to improve the current state. In addition, it puts the responsibility on individual developers and foundations hosting open-source projects instead of the manufacturers of goods embedding the software. It could have important consequences for open source across the globe.
The original proposal of the CRA (the main text
and the annexes)
brings several requirements (enumerated in Annex I) for "products on
the market
" (more on definitions a little later). Most requirements
are generally accepted best practices, such as a secure default
configuration, providing security updates, protection from unauthorized
access, and releases free of known vulnerabilities; others could be
considered somewhat vague
like "(g) minimise their own negative impact on the availability of
services provided by other devices or networks
".
Each product (which means every release for software) would need to provide
"an
assessment of the cybersecurity risks
" along with the release-related
documentation (listed in Annex II).
The security assessment covers the software itself and all its
dependencies; the way to perform the assessment could fall under one
of three categories,
from self-assessment to the involvement of a third party.
Self-assessment is the default. However, the regulation requires a
stricter approach
when the product's "core function
" falls into the category of a
"critical product with digital elements
" (listed in Annex III),
which is further
divided into
Class I (less critical) and Class II (more critical). Depending
on their class, products may have to undergo a mandatory external security assessment:
all Class II products are required to do so, while Class I
products must do so
only if they do not
follow the (not yet defined) "harmonised standards, common
specifications or cybersecurity certification schemes
".
The release-related documentation needs to cover (among other items) the
product's
expected uses, its support time frame, and the way to install security
updates.
All manufacturers must have a vulnerability-reporting procedure and
release security updates free of charge
for users. If a manufacturer learns about an
"actively exploited vulnerability
", it is expected to notify
authorities rapidly (within 24 hours
in many cases, followed by a complete analysis one month later). Finally, a
party
not complying with the requirements may be subject to fines of up to
€15 million or
up to 2.5 percent of the worldwide annual turnover (gross revenue),
whichever is higher.
Commercial activity
As the devil is in the details, definitions
have an important impact on understanding the CRA. The most
important term
is, without a doubt, "commercial activity
" as found in the following sentence:
In order not to hamper innovation or research, free and open-source software developed or supplied outside the course of a commercial activity should not be covered by this Regulation.
The definition for "commercial activity
" comes from a document called "The
Blue Guide",
which has examples on page 21 that provide a little more
explanation. The text
states that commercial activity means providing goods in "a business
related context
"
and the decision about whether a particular activity is commercial or not should
be made:
on a case by case basis taking into account the regularity of the supplies, the characteristics of the product, the intentions of the supplier, etc. In principle, occasional supplies by charities or hobbyists should not be considered as taking place in a business related context.
Without further clarification, one might consider many open-source projects as commercial activities, especially mature ones that are used widely and make regular releases. Those with developers hired to work on the project might also qualify.
The definition of commercial activity is the main point affecting open-source projects. It will not affect hobby projects run by unpaid volunteers only. It could (depending on any future judgments) affect all others. If that wording is not changed in the final version of the CRA, it could cause uncertainty, leading to less involvement in open source.
Who is the manufacturer?
The discussion of commercial activity brings us to the other
primary term of the CRA: the "manufacturer". Again, according to the
"Blue Guide" (page 34):
"The manufacturer is any natural or legal person who manufactures a
product or has a product designed or manufactured, and places it on
the market under his own name or trademark
".
The CRA also defines other roles like the distributor or importer, but
"manufacturer" will be the most important one in our case. The
manufacturer is critical
because they are responsible for fulfilling requirements and could
face the fines mentioned above.
Let us take the example of a project hosted by an open-source foundation. That foundation could be considered the manufacturer, even if it might have limited impact on the development process and concentrates on governance and organizational aspects. The definition of the manufacturer might be even more complicated when there is no formal legal entity. Is the person tagging a release the manufacturer in this case? The person with the most commits? All the maintainers together? Their employers?
If foundations or other supporting organizations are to be classified as manufacturers, they will be required to put additional constraints on projects and how they develop and release. That could cause significant tension, especially for projects with no established security culture.
Another point is the need for a budget in case an external assessment is required; currently, operating systems and networking software both require an external assessment. A typical budget for a security assessment is in the tens of thousands of dollars (or euros). The exact scope of the audit for CRA requirements will be defined further (the general description appears in Annex VI), but we might assume a need for a similar budget for every non-bugfix release. It will be vital for many projects to figure out who will pay for assessments. Many of them, including virtually all without an organization backing them, will not be able to pay that fee.
Even if a project falls under self-assessment only, its release must be accompanied by several documents, including a risk analysis. Writing and verifying this information would mean additional work for projects, though the exact content of this documentation still needs to be fully defined.
Vulnerability and exploitation reporting
Under the CRA, each manufacturer needs to have a vulnerability-reporting process and, in most cases, provide security updates. Logically, they would also have to provide updates of all dependencies, especially when they are linked into the main product. A recent discussion in the Linux kernel community showed that timely delivery of security fixes is not yet a solved problem; regulation like the CRA might actually push device vendors to publish fixes more rapidly.
One clause also requests all manufacturers to report each
"actively exploited vulnerability
" whenever they learn about
it. The report must happen
within strict time limits, such as a first notification in 24 hours
(though it is not required for small companies). The company should provide
a detailed
analysis with a fix within one month. These reports are sent to the European Union Agency for
Cybersecurity (ENISA). With that pile of 0-days, ENISA will become
a juicy target for attacks (though one might argue that a service like GitHub's
private advisories is already
such a target).
The obligation to notify about all issues also breaks normal disclosure processes. These days, vendors disclose vulnerabilities only after a fix is available. Also, the one-month limit for a complete analysis might sometimes be hard to meet. The industry typically uses 90 days, but some vulnerabilities (notable examples include hardware issues like speculative execution bugs) take months from the discovery to the fix.
The reaction
After the original proposal was published in September 2022, the open-source community started rapidly responding to it. One of the first reactions came from a foundation, NLnet Labs, in a blog post describing the impact on its domain: network and routing protocols. Many of the projects the foundation works on fall into the "critical products" category and would require significant additional work, possibly including an external security audit. They also note that some of these tools, which may have been available for decades, could be considered "commercial", so they would fall under the regulation even though the organization gets no income from that software.
Other organizations have done their own analysis as well. A blog post from Mike Milinkovich of the Eclipse Foundation lists all of the documentation and assessment that the Foundation would have to do for each released software version; he also mentions the uncertainty over which tools would be classified as critical or highly critical. During FOSDEM 2023, a panel (WebM video) took place where European Commission representatives answered questions from the community. This session mostly concentrated on the impact of the definition of "commercial activity", which is the condition for a product to fall under the scope of the CRA. It was said that, even for charities, it is going to be a case-by-case determination. Also, a Commission representative said that the goal of the regulation is to force manufacturers to do more due diligence on the components that they include; they will be obligated to do so as part of their security assessment.
In April 2023, multiple organizations released
an open letter
asking the Commission to "engage with the open source community and
take our concerns into account
". In addition, some specialized communities
such as Content Management Systems (CMSes) wrote
their own
open letter, which was signed by representatives of WordPress, Joomla, Drupal, and TYPO3. Interested
readers may also want to look at the list
of reactions that is
maintained by the Open Source
Initiative (OSI).
Possible response
If the regulation is put in place in its current form, some projects may not want to risk being classified as "commercial activity" and may decide to state that their code will not be available in the EU (enforced by technical means or not). That also raises interesting licensing questions; for example, it is not clear that GPL-covered code can have such a restriction. When code is restricted in that way, the security assessment for any downstream projects using that project in the EU will become more complicated — it could even mean that each downstream project needs to perform that audit independently, or simply remove the dependency.
Some projects might decide to stop accepting contributions from developers employed by companies, or to not take donations from companies, in order to avoid being classified as commercial. That could seriously impact those projects, reducing both their funding and their development base. But, even if such a project falls under the non-commercial category, downstream users might still be using it in a commercial context.
Finally, there may be an impact on the number of new open-source projects. Convincing a (big) company to open-source new work is a daunting task; if there is more burden related to liability and documentation, fewer companies will release projects that are not crucial to their goals. This could affect tools for testing, continuous integration, programming embedded devices, and so on. An increase in the number of forks is also likely; companies may want a version of some project's code with changes for CRA compliance. That could in fact decrease overall security, instead of improving it.
The current state
In addition to the initial version, as of August 2023 there are two sets of amendments: one from the EU Council, and another from the EU Parliament that resulted from the work of Parliament committees. The most recent vote in the EU Parliament committees took place in July 2023. The next step is negotiations (called a "trilogue") between the Council, Commission, and Parliament to come up with a final version.
The two sets of amendments change certain details, but not the
general thrust of the regulation. Both of them change the lists of
"critical product with digital elements
" (those that might require an
external audit). The amendments from the Council shorten both lists,
while ones from the Parliament move all routers
to Class II and add new categories to Class I (home automation, smart
toys, etc.).
They also both modify the "open-source exception". The Parliament
version seems to move requirements
to companies and gives a set of examples of commercial and non-commercial
activities. An indication of a non-commercial activity is a
fully distributed model and lack of control by one company.
On the other hand, if "the main contributors to free and open-source
projects are developers employed by commercial entities and when such
developers or the employer can exercise control
as to which modifications are accepted in the code base
", that is an
indication of commercial activity, the same as regular donations from
companies.
It also states that individual
developers "should not be subject to obligations pursuant to this
Regulation
".
The Council's version seems to cover more products by
the exception:
this Regulation should only apply to free and open-source software that is supplied in the course of a commercial activity. Products provided as part of the delivery of a service for which a fee is charged solely to recover the actual costs directly related to the operation of that service [...] should not be considered on those grounds alone a commercial activity.
The start of the negotiations on the final version is likely to happen after the summer break (which means in September 2023). Note that the European elections will happen in early June 2024, which means that the process is likely to be rushed to completion before that date.
The outcome?
Improving the security state of the digital world is a worthy goal, and many ideas the CRA brings are reasonable best practices. However, the impact of the current form of the regulation is difficult to predict. In the open-source world, it could put all of the burden on upstream projects. These projects are frequently underfunded, so they might not have the resources to perform all the required work; that analysis and documentation work is worth doing, but funding has to be available in order to make it happen. FOSS developers, especially those working in the embedded space, should be paying attention to this legislation, as there is more to it than our summary above covers. Readers in the EU may want to contact their representatives about the CRA, as well.
Shrinking shrinker locking overhead
Much of the kernel's performance is dependent on caching — keeping useful information around for future use to avoid the cost of looking it up again. The kernel aggressively caches pages of file data, directory entries, inodes, slab objects, and much more. Without active measures, though, caches will tend to grow without bounds, leading to memory exhaustion. The kernel's "shrinker" mechanism exists to be that active measure, but shrinkers have some performance difficulties of their own. This patch series from Qi Zheng seeks to address one of the worst of those by removing some locking overhead.

Kernel subsystems that maintain caches should register a shrinker that can be called when the kernel needs to free memory for other uses. A shrinker is described by struct shrinker; among other things, it contains a pair of callbacks that the kernel can use to query how many cached objects could be freed, and to ask that they actually be freed. Shrinkers can be asked to focus on a specific NUMA node or memory control group, but not all shrinkers implement that functionality. Since shrinkers are called from the reclaim path when memory is tight, they should be quick and refrain from allocating memory themselves.
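As a rough sketch (not taken from any particular subsystem), a shrinker under the current interface looks something like the following; the "my_cache" helpers are hypothetical:

    #include <linux/shrinker.h>

    static unsigned long my_cache_count(struct shrinker *shrink,
                                        struct shrink_control *sc)
    {
        /* How many objects could be freed right now?  A NUMA- or
         * memcg-aware shrinker would consult sc->nid and sc->memcg. */
        return my_cache_nr_cached();            /* hypothetical helper */
    }

    static unsigned long my_cache_scan(struct shrinker *shrink,
                                       struct shrink_control *sc)
    {
        /* Try to free up to sc->nr_to_scan objects; return the number
         * actually freed, or SHRINK_STOP to give up for now. */
        return my_cache_evict(sc->nr_to_scan);  /* hypothetical helper */
    }

    static struct shrinker my_cache_shrinker = {
        .count_objects = my_cache_count,
        .scan_objects  = my_cache_scan,
        .seeks         = DEFAULT_SEEKS,
    };

    /* In current kernels this would be registered with something like:
     *     register_shrinker(&my_cache_shrinker, "my-cache");
     * and removed with unregister_shrinker(&my_cache_shrinker). */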
Shrinkers can be registered and deleted as the system runs, creating a concurrency problem: a shrinker should not be deleted while it is running, and the list of shrinkers must be changed carefully given that other CPUs may be traversing it at the same time. In current kernels, the shrinker list is protected by a reader/writer semaphore (rwsem); traversing the list to run shrinkers requires read access, while changing the list requires exclusive write access. This was meant to be a fast solution; frequent traversals of the list (reads) can run concurrently, while changes to the list that would require write access are relatively rare.
This rwsem, it turns out, can be a performance bottleneck on busy systems. It is a global lock, so frequent acquisitions and releases can create a lot of cache-line bouncing, slowing the system even if the lock itself is not contended. Things can get worse if a shrinker runs (or is blocked) for a long time. If a writer comes along, it will request a write lock, which will have to wait until all existing read locks are dropped; meanwhile, the write-lock request blocks any additional read locks from being granted. In this situation, a long-running shrinker can clog up the works for some time.
Performance problems of this type come up often in the kernel, and the path to their solution is reasonably well-worn at this point; it almost inevitably involves using read-copy-update (RCU) to defer changes to existing structures until all users are gone.
In this case, the patch series starts by changing the shrinker registration interface so that all shrinkers are allocated dynamically — even those that are present from boot and cannot be removed. This change allows all shrinkers to be treated uniformly, getting rid of special cases, and sets the stage for changing how shrinker registration is handled. As seen in this patch, a new shrinker instance is created with shrinker_alloc(), made active with shrinker_register(), and released with shrinker_free().
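A conversion to the proposed interface might then look roughly like this; the exact call signatures are inferred from the patch series and could still change before merging:

    static struct shrinker *my_cache_shrinker;

    static int my_cache_init(void)
    {
        my_cache_shrinker = shrinker_alloc(0, "my-cache");
        if (!my_cache_shrinker)
            return -ENOMEM;

        my_cache_shrinker->count_objects = my_cache_count;
        my_cache_shrinker->scan_objects = my_cache_scan;

        /* Only now does the shrinker become visible to reclaim. */
        shrinker_register(my_cache_shrinker);
        return 0;
    }

    static void my_cache_exit(void)
    {
        /* Waits for any running invocations, then frees the structure. */
        shrinker_free(my_cache_shrinker);
    }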
There are a couple of implications here. One, as noted in the cover letter, is that this change will break all out-of-tree modules that implement shrinkers; they will have to be converted to the new API or they will fail to load. This is a deliberate change to ensure that, in kernels implementing the new mechanism, no old-style shrinkers are in use. A quieter change is that, while the existing register_shrinker() interface is exported to all modules, the new functions are exported as GPL-only. As a result, proprietary kernel modules that implement shrinkers will not be fixable at all.
The bulk of this 45-part patch series is focused on converting all in-kernel shrinkers to the new API, after which the old one is deleted. The real purpose of the patch set is only achieved in patch 42, where the lockless algorithm is introduced. The shrinker structure gains three new fields: a reference count, a completion to be used for removals, and an rcu_head structure.
When a shrinker is registered, its reference count is set to one, and it is added (in an RCU-safe manner) to the shrinker list; it is then available to be called when the memory-management subsystem needs to find some memory. The traversals of the shrinker list are performed with the RCU lock held, meaning that the entries in the list will not disappear at an inconvenient time. To invoke a shrinker, the kernel will first attempt to increment its reference count; that attempt will only succeed if the count is already greater than zero. The RCU lock will then be dropped, and the shrinker invoked. Once its work is done, the RCU lock will be reacquired, and the reference count decremented. Since the reclaim code held a reference, the shrinker will not have disappeared while the lock was dropped.
When the time comes to remove a shrinker, shrinker_free() will drop the reference acquired at registration time, then use the completion to wait until all other references (if any) are also dropped. At this point, the fact that the reference count is zero means that shrinker will not acquire any more users, since an attempt to increment the reference count only succeeds if that count is greater than zero. But there may still be threads traversing the shrinker list and seeing this shrinker's entry there, so its removal has to be handled with care. That, of course, is what RCU is for; the entry is taken off the list, but then handed to RCU until a grace period passes, after which it is known that the shrinker structure can be safely freed.
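In simplified form, the lockless traversal described above looks something like the following; the helper names are illustrative rather than exact:

    rcu_read_lock();
    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
        /* Succeeds only while the reference count is above zero. */
        if (!shrinker_try_get(shrinker))
            continue;
        rcu_read_unlock();

        /* The shrinker can run (and even sleep) with no global lock held;
         * the reference keeps its structure from being freed underneath us. */
        run_shrinker(shrinker, sc);

        rcu_read_lock();
        shrinker_put(shrinker);    /* dropping the last reference wakes shrinker_free() */
    }
    rcu_read_unlock();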
With these changes made, the shrinker rwsem is no longer used during the invocation of shrinkers; it is only taken for write access when changes are being made to the shrinkers themselves. The final patch in the series turns the rwsem into a lower-overhead mutex, and the work is done.
This series is in its sixth revision, and the stream of comments appears to be slowing down. Benchmark results show no regressions from this change, unlike previous attempts to address the locking bottleneck that created problems elsewhere. Unless new problems turn up somewhere — always a possibility with this kind of low-level code — it looks like lockless shrinking may be reaching a point where it is ready for wider testing in linux-next.
Moving physical pages from user space
Processes in a Linux system run within their own virtual address spaces. Their virtual addresses map to physical pages provided by the hardware, but the kernel takes pains to hide the physical addresses of those pages; processes normally have no way of knowing (and no need to know) where their memory is located in physical memory. As a result, the system calls for memory management also deal in virtual addresses. Gregory Price is currently trying to create an exception to this rule with a proposal for a new system call that would operate on memory using physical addresses.
When physical placement matters
Most of the time, user space is entirely happy to let the kernel worry about where memory should be mapped; all physical pages are alike, so it really does not matter which ones are used by a given process. That situation can change, though, when all physical pages are not alike. Non-uniform memory-access (NUMA) machines are a case in point; these machines are split into multiple nodes, each of which normally contains one or more CPUs and some physical memory. For a process executing on a given node, memory attached to that same node will be faster than memory on other nodes, so the placement of memory matters.

Kernel developers have been working on the NUMA problem for years, and have developed a number of mechanisms to try to keep processes and their memory together. System calls can be used to bind processes to specific nodes, to ask that memory be allocated on specific nodes, and to move pages of memory from one node to another when needed. There is always room for improvement, but NUMA systems work well most of the time.
Hardware engineers are creative folks, though, and they have been busily working on other ways to create different types of memory. Contemporary systems still have traditional RAM, but memory might also be located on a peripheral device, in a non-volatile RAM array, in a bank of high-bandwidth memory, or in an external CXL device. In each of those cases, the memory involved will have different performance characteristics than ordinary RAM, once again making the physical placement of a process's pages into an important concern.
Since the NUMA concept already exists and is able to represent different classes of memory, it has been extended to handle these newer memory types as well. Each bank of "different" memory is normally organized into its own CPU-less NUMA node. The existing system calls controlling memory allocation can then be used to locate pages within these special nodes. That solves the management problem at a low level, but it is really only the beginning.
In many cases, the desired result for systems with multiple memory types is some form of memory tiering, where pages are migrated between memory types depending on how heavily they are used. Ideally, heavily used pages should be located in the system's fastest memory, while rarely used pages can be put out to pasture in slower memory. Finding an optimal way to move pages between memory tiers is an area of active development, and a number of questions remain open.
Tiering in user space
In this context, Price is seeking to add a new system call to allow some of those migration decisions to be made in user space. There are some existing interfaces that can be used to determine which physical pages are (or are not) in active use; devices providing memory can also, sometimes, provide this information. Using this data, a user-space management process could decide to move pages into the type of memory that is best suited to their current usage profile.
There is a problem, though. That information, being tied to physical pages, lacks any connection to the processes using those pages. A user-space program wanting to force page migrations based on this information would first have to convert the physical page addresses into (process, virtual-address) pairs for use with the existing system calls. That is a non-trivial and expensive task. Price is looking for a way to move pages between memory types without the need for an awareness of which processes are using those pages.
The result is a new system call, move_phy_pages(), that is patterned after the existing move_pages() call (which uses virtual addresses); it is otherwise completely undocumented at this point. The interface appears to be:
int move_phy_pages(unsigned long count, void **pages, int *nodes, int *status, int flags);
This call will attempt to move count pages, the physical addresses of which are stored in the pages array; each page will be moved to the NUMA node indicated by the appropriate entry in nodes. The status array will be filled in with information about what happened to each page; on success, the status entry will contain the page's new node number. The only relevant flags value appears to be MPOL_MF_MOVE_ALL, which instructs the call to move pages that are mapped by multiple processes; otherwise only singly mapped pages are moved.
If the nodes array is NULL, the system call will, instead, just store the status of each of the indicated pages in status. There are limits to how useful that is, since the node number of physical pages is already described by their physical address and does not normally change over time.
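Since the call exists only as a proposal, the following can only be a hypothetical usage sketch; there is no glibc wrapper or assigned system-call number, so placeholders are used, and the physical address shown is a dummy value:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mempolicy.h>        /* MPOL_MF_MOVE_ALL */

    #ifndef __NR_move_phy_pages
    #define __NR_move_phy_pages -1      /* placeholder; no number assigned upstream */
    #endif

    int main(void)
    {
        /* The physical address would come from one of the utilization
         * interfaces mentioned in the article, not be hard-coded. */
        void *pages[1] = { (void *)0x100000000UL };     /* dummy physical address */
        int nodes[1] = { 1 };           /* target node, e.g. a CPU-less CXL node */
        int status[1];

        long ret = syscall(__NR_move_phy_pages, 1UL, pages, nodes,
                           status, MPOL_MF_MOVE_ALL);
        if (ret < 0)
            perror("move_phy_pages");
        else
            printf("page is now on node %d\n", status[0]);
        return 0;
    }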
In reviewing the patch, Arnd Bergmann questioned the use of the void * type for the pages array. The values provided there are not actually pointers that can be dereferenced in any context; instead, they are used by the kernel to obtain the page-frame numbers (PFNs) for the pages of interest. Since, in some 32-bit configurations, full physical addresses may not fit within a normal pointer type, Bergmann suggested using the __u64 type instead.
That conversation also raised the question of whether, instead, user space should be providing PFNs to move_phy_pages(). As Bergmann pointed out, there are no system calls that accept PFNs now, so that would be breaking new ground. That, though, reflects the fact that, until now, system calls have not dealt with physical addresses at all. If this work goes forward, finding a consensus on the best way to refer to such addresses, for move_phy_pages() and anything that might follow, will be important.
Whether this work will actually move forward remains to be seen. It is, almost by definition, an interface to move pages around without knowing which processes are using them; otherwise, move_pages() could be used instead. Perhaps the information regarding physical memory and its utilization that is available to user space (Price provided a list of information sources in this message) is sufficient to make useful decisions, but that would probably need to be demonstrated somehow. This patch provides access to functionality that is normally kept deeply within the memory-management subsystem; developers will want to see that the benefits it provides justify that intrusion.
Why glibc's fstat() is slow
The fstat() system call retrieves some of the metadata — owner, size, protections, timestamps, and so on — associated with an open file descriptor. One might not think of it as a performance-critical system call, but there are workloads that make a lot of fstat() calls; it is not something that should be slowed unnecessarily. As it turns out, though, the GNU C Library (glibc) has been doing exactly that, but a fix is in the works.

Mateusz Guzik has been working on a number of x86-related performance issues recently. As part of that work, he stumbled into the realization that glibc's implementation of fstat() is expressed in terms of fstatat(). Specifically, a call like:
result = fstat(fd, &stat_buf);
is turned into:
result = fstatat(fd, "", &stat_buf, AT_EMPTY_PATH);
These calls are semantically equivalent; by the POSIX definition, a call to fstatat() providing an empty string for the path and the AT_EMPTY_PATH flag operates directly on the provided file descriptor. But the difference in the kernel is significant; implementing fstat() in this way is significantly slower, for a couple of reasons.
One of those is that fstatat() is a more complex system call, so it does preparatory work that is not useful for the simple fstat() case. Once alerted to the problem, Linus Torvalds posted a patch that detects this case and avoids that extra work. But the result is still, according to Guzik's measurements, about 12% slower than calling fstat() directly.
That performance loss is the result of the second problem: fstatat() must check the provided path and ensure that it is empty. One might think that it makes no sense to even look at the path when the user has provided a flag (AT_EMPTY_PATH) that says there is nothing to be seen there but, as Al Viro pointed out, POSIX mandates this behavior. Checking the path means accessing user-space data from the kernel; that, in turn, can require disabling guardrails like supervisor mode access prevention. It all adds up to a significant amount of overhead to check an empty string.
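To see where the difference comes from, consider a simplified wrapper, which is not glibc's actual code and ignores the struct-layout complications a real implementation must handle, that issues the dedicated system call whenever the architecture provides one:

    #define _GNU_SOURCE
    #include <fcntl.h>          /* AT_EMPTY_PATH */
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int my_fstat(int fd, struct stat *buf)
    {
    #ifdef SYS_fstat
        /* The dedicated system call: number 5 on x86-64. */
        return syscall(SYS_fstat, fd, buf);
    #else
        /* Fallback: the kernel must fetch and check the empty path
         * string, which is where the extra overhead comes from. */
        return syscall(SYS_newfstatat, fd, "", buf, AT_EMPTY_PATH);
    #endif
    }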
Torvalds made it clear that he thought glibc's behavior made no sense and wondered why things were done that way. A bit later, though, he found a plausible explanation for this choice. On an x86-64 system, the kernel exports a number of related system calls, including fstat() (number 5) and newfstatat() (number 262). Torvalds concluded:
The glibc people found a "__NR_newfstatat", and thought it was a newer version of 'stat', and didn't find any new versions of the basic stat/fstat/lstat functions at all. So they thought that everything else was implemented using that 'newfstatat()' entrypoint.

But on x86-64 (and most half-way newer architectures), the regular __NR_stat *is* that "new" stat.
The "new" fstat(), after all, came about in the 0.97 release in 1992, so there was no reason for the x86-64 architecture (which arrived rather later than that) to use anything else. But, if Torvalds's explanation reflects reality, the glibc developers were fooled by the "new" part of the newfstatat name and passed over the entry point they should have used to implement fstat().
There are a few observations that one could make from this little bit of confusion:
- The system calls (and their names) provided at the kernel boundary are not the same as those expected by user-space programmers. The glibc people know this better than anybody else, since part of their job is to provide the glue between those two interfaces, but confusion still seems to happen.
- The fact that the kernel's documentation of the interface it presents to user space is ... mostly nonexistent ... certainly does not help prevent confusion of this type.
- Using qualifiers like "new" in the naming of functions, products, or one's offspring tends to be unwise; what is new today is old tomorrow.
Be that as it may, even with Torvalds's change (which was merged for the 6.6-rc1
release and will presumably show up in a near-term stable update),
fstat() is slower than it needs to be when glibc is being used.
In an attempt to improve the situation, Guzik raised
the issue on the libc-alpha list. Adhemerval Zanella Netto responded
that the library developers are trying to simplify their code by using the
more generic system calls whenever possible, that the
AT_EMPTY_PATH problem is likely to affect all of the
*at() system calls, and that, as a consequence, the problem would
be "better fixed in the kernel
".
Torvalds pointed out that, while other system calls have to handle AT_EMPTY_PATH, fstatat() is the only one that is likely to matter from a performance perspective; none of the others should be expected to show up as problems in real-world programs. Meanwhile, despite the misgivings expressed previously, Zanella put together a patch causing glibc to use ordinary fstat() when appropriate. Torvalds agreed that it looked correct, but complained that the implementation was messy; he seemed to prefer an alternative implementation that Zanella posted later.
As of this writing, neither version of the patch has found its way into the glibc repository; the latter version is under consideration. It is probably safe to assume that a version of this patch will be applied at some point; nobody has an interest in glibc being slower than it needs to be. This particular story has a happy ending, but it does stand as an example of what can happen in the absence of clarity around the interfaces between software components.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: RIP Abraham Raji; JDK 21; PostgreSQL 16; FOSSY videos; 40 years of GNU; Quote; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.