
Leading items

Welcome to the LWN.net Weekly Edition for January 12, 2023

This edition contains the following feature content:

  • PyTorch and the PyPI supply chain: a dependency-confusion attack hits the PyTorch nightly builds.
  • Formalizing f-strings: PEP 701 would lift longstanding f-string restrictions and move their parsing into the CPython grammar.
  • Per-extent encrypted keys for fscrypt: making fscrypt a better fit for Btrfs.
  • A vDSO implementation of getrandom(): faster random numbers for user space, and a memory-management controversy.
  • Memory-management short topics: page-table sharing and working sets.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

PyTorch and the PyPI supply chain

By Jake Edge
January 11, 2023

The PyTorch compromise that happened right at the end of 2022 was rather ugly, but its impact was not widespread—seemingly, at least. The incident does highlight some of the perils of relying on an external "supply chain" for the components that are used to build one's software. It also would appear to be another case of "security researchers" run amok, though perhaps that part of the story is only meant to cover the tracks—or ass—of the perpetrator.

Beyond that, the incident shows that the Python Package Index (PyPI) and the pip package installer act in ways that arguably assisted the compromise. That clearly comes as a surprise to many, though those behaviors are well-known and well-established in the Python Packaging Authority (PyPA) community. There is, at minimum, a need for education on that topic.

Compromise

People (or continuous-integration bots) who installed the nightly build of the PyTorch machine-learning framework using pip between December 25 and 30 got an unwelcome surprise. A malicious binary was installed along with a dependent module; the binary was executed when that module was imported into a PyTorch-using code base. The binary gathers system information (e.g. name servers, host names) and the contents of various interesting files (e.g. /etc/passwd, $HOME/.ssh/*, the first 1000 files in $HOME), then uploads that information to an attacker-controlled server using encrypted DNS queries.

In order to build PyTorch, multiple dependencies of various sorts are required. Some are regular PyPI packages that should be downloaded from that repository, while others are PyTorch-specific packages that should come from the PyTorch nightly repository. A single pip command is used to install from both PyPI and the PyTorch nightly repository given on the command line, but pip does not distinguish between the two repositories; it treats them both as equal possibilities for fulfilling the need for a given package.
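
For illustration, the nightly-build installation looks something like the following; the exact package list and index URL here are illustrative rather than quoted from the PyTorch instructions:

    pip3 install --pre torch torchvision torchaudio \
        --extra-index-url https://download.pytorch.org/whl/nightly/cpu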

If there is a dependency on, say, torchtriton from some other part of PyTorch and there is a package by that name available on PyPI, pip can choose to install it instead of the one by the same name in the PyTorch repository. That is exactly what happened, of course; an attacker registered the torchtriton PyPI package and uploaded a version of that code that functioned exactly the same as the original—except that it added the malicious payload that is executed when it is imported. It is unknown exactly how many sites were actually affected, but the malicious torchtriton package was downloaded from PyPI around 2,800 times, according to a lengthy analysis of the compromise by Tzachi Zorn.

Once the PyTorch project was alerted to the malware at PyPI on December 30, it took immediate steps to fix the problem. The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton; a placeholder project called pytorch-triton was registered at PyPI so that the problem could not recur. In addition, PyTorch nightly builds that referred to torchtriton as a dependency were removed from the repositories so that any cached versions of the malicious package would not be picked up inadvertently. The PyPI administrators were also alerted and they promptly removed the malicious package. On December 31, the project put out the advisory linked above.

The analysis by Zorn (and another by Ax Sharma at BleepingComputer) describes efforts by the perpetrator of the attack to explain their actions. At first, the domain used for the exfiltration DNS lookups served a short text message [archive link] claiming that the information was gathered simply to identify the affected companies so that they could be alerted. Another, longer message with similar claims was apparently sent to various outlets, including the claim that all of the data gathered by the malicious payload had been deleted; it can be seen in those articles. It is pretty much impossible to verify one way or the other; it could be truthful and heartfelt—or it could simply be damage control.

Dependency confusion

The type of problem being exploited here is called "dependency confusion"; the technique was popularized by Alex Birsan in 2021, but the pip bug report linked above makes it clear that the problem was known in that community back in 2020. When the ‑‑extra‑index‑url option for pip is used, it consults that index and adds all of the packages it provides to its internal list. When it comes time to install a package, pip chooses the one with the highest version (or highest version that satisfies any version constraints that were specified) regardless of which repository it comes from.

PEP 440 ("Version Identification and Dependency Specification") governs how pip chooses which version to install. One might think pinning a dependency to a specific version would be sufficient, but, as Dustin Ingram pointed out in a recent discussion, that is not true. pip and other installers "prefer wheels with more specific tags over less specific tags". That makes it relatively easy for an attacker to shadow even a version-pinned dependency.

As Ingram noted in another message, the way to truly pin a dependency is by specifying the hash values of the binary artifacts to be installed as described in the pip documentation. That thread is interesting in other ways, however.
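
A minimal sketch of what that hash-pinning looks like in a requirements file; the version number and digest below are placeholders:

    torchtriton==2.0.0 \
        --hash=sha256:<expected-sha256-of-the-artifact>

With a hash present, pip runs in hash-checking mode and will refuse any downloaded artifact that does not match; passing --require-hashes to pip install makes that mode explicit and errors out if any requirement lacks a hash.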

It starts with a request for help in convincing the security administrators at a company to unblock PyPI. Kirk Graham ran into a problem at his company, which had blocked the repository wholesale "because there were '29 malwared malicious modules' at the site". Those modules had long since been removed from PyPI, but the reputation for unreliability lingered on. Brett Cannon pointed out that there are lots of other places where malicious code can sometimes be obtained:

My first question would be whether they block every project index out there (e.g., npm, crates.io, etc.), as they all have the same problem? Or what about GitHub? I mean where does the line get drawn for protecting you from potentially malicious code?

My follow-up is how do they expect you to do use any open source Python code? If so, how are you supposed to get that code? Straight from the repositories? I mean I know lots of large companies that ban pulling directly from code indexes like PyPI, but then these are large companies with dedicated teams to get the source, store it internally, do their own builds of wheels, etc. If you block access to using what the projects provide you have to be up for doing all the work they provide in getting you those files.

Several in the thread pointed to various services and tools for managing dependencies of open-source components, which might help solve the problem at the company. Graham was clearly frustrated with the situation and his company, but once he found out about the PyTorch incident, he changed his tune to a certain extent:

Over the holidays there was malicious code added to PyTorch module on PyPi. That makes me think our Security Director is right. If there isn't better security from PyPi and GitHub those sites will be blocked by more and more companies. Open Source needs to be more secure. /sigh

That is not an entirely accurate picture of what happened, which was pointed out in the thread, but the larger point still stands. To outsiders it looks like PyTorch itself was compromised on PyPI, when what actually happened is more nuanced than that.

The pip bug report came up in the thread as well. Reading through that report makes it clear that the problem does not lend itself to a simple or straightforward fix. The root of the problem is that people do not understand that using the PyPI repository is not without risks and they fail to fully evaluate what those risks are—and what they mean for their software supply chain. As Paul Moore put it when the bug was resurrected after the Birsan posting in 2021: "But I do think that we're trying to apply a technology solution to a people problem here, and that never goes well :-("

Much of what Moore and other PyPA developers have to say in the report is worth reading for those interested in the problem. So far, the most straightforward "solution" is to remove the ‑‑extra‑index‑url option entirely, but that has its own set of problems, as Moore noted:

There really is no "good" way of securing ‑‑extra‑index‑url if you look at it that way. Allow pip to look in 2 locations and you have to accept that all of your packages are now being served as securely as the least secure index. And the evidence of the "dependency confusion" article is that people simply aren't aware of that reality. So what the pip developers need to decide is whether our responsibility ends with having documented how multiple indexes work, or whether we should view the ability to have multiple indexes as an "attractive nuisance" and remove it to ensure that people aren't tempted to use it in an insecure manner.

The clamour of voices arguing "this is a security flaw", plus the sheer stress on the maintainers that would be involved in arguing that this isn't our problem, suggests that we should remove the feature. But there's no doubt that it would penalise people who use the ability correctly - and it feels wrong to be penalising those people for the sake of the group who didn't properly assess the risks.

The bug report thread was brought to life again after the PyTorch mess, naturally. Moore describes some concrete steps that could be taken to address the problem, but it still requires someone (or some organization) willing to take on that work, make a proposal, and push it through to completion. So far there has been a lot of talk about the problem, but little in the way of action to fix it.

It really should come as no surprise that grabbing random code from the internet sometimes results in less than ideal outcomes. The flipside of that is that, usually, "sometimes" is extremely rare, which in some ways leads directly to the "attractive nuisance" argument. These kinds of problems are not new and are seemingly not going away anytime soon either. Each time we have an event like this PyTorch compromise, it gives open-source software another black eye, which is perhaps not entirely fair, but also not entirely surprising.

Comments (46 posted)

Formalizing f-strings

By Jake Edge
January 10, 2023

Python's formatted strings, or "f-strings", came relatively late to the language, but have become a popular feature. F-strings allow a compact representation for the common task of interpolating program data into strings, often in order to output them in some fashion. Some restrictions were placed on f-strings to simplify the implementation of them, but those restrictions are not really needed anymore and, in fact, are complicating the CPython parser. That has led to a Python Enhancement Proposal (PEP) to formalize the syntax of f-strings for the benefit of Python users while simplifying the maintenance of the interpreter itself.

Some history

F-strings got their start in 2015, when PEP 498 ("Literal String Interpolation") was accepted for Python 3.6. The PEP added a new way to specify strings that would have values interpolated into them:

    answer = 42
    reply = f'Obviously, the answer is {answer}'
    # reply is now "Obviously, the answer is 42"
But f-strings are far more than just that, because any arbitrary Python expression can be placed between the curly brackets; the expression will be evaluated and its value interpolated into the string:
    reply = f'More answers: { [ answer+i for i in range(5) ] }'
    # "More answers: [42, 43, 44, 45, 46]"
The expression can, of course, contain other strings, including f-strings, but PEP 498 imposed a limitation due to the (then) existing parser. Whatever type of quote was used to start the f-string could not be used inside the expression portions (i.e. the parts inside curly brackets). So, simply cutting and pasting code into an f-string may not work:
    foo = a['x'] + a['y']
    f'{foo}'     # works, of course
    f'{a['x']}'  # fails with SyntaxError
    f'{a["x"]}'  # workaround

The current implementation of f-strings simply reuses the existing machinery for handling other kinds of specialized strings, such as r'' for raw strings or b'' for byte strings. But f-strings are fundamentally different from the others because of the arbitrary expressions that are allowed. The advent of a new CPython parser for Python 3.9 in 2020 opened up some other possibilities for implementing f-strings.

In 2021, Pablo Galindo Salgado posted to the python-dev mailing list that he was working on moving the parsing of f-strings into the CPython parser. That would mean some of the restrictions could potentially be removed and "we can drop a considerable amount of hand-written code". He was asking for opinions about the idea and on the various options for restrictions that could be lifted. That resulted in a fairly brief discussion (by Python standards at least) that was generally favorable toward the idea.

At the 2022 Python Language Summit in April, Galindo Salgado gave a presentation on the idea, which was greeted with enthusiasm from various core developers, including Eric V. Smith who developed f-strings and authored PEP 498. So Galindo Salgado teamed up with Batuhan Taskaya and Lysandros Nikolaou to create PEP 701 ("Syntactic formalization of f-strings"), which was announced on the Python discussion forum in mid-December.

PEP 701

The new PEP sets out "to lift some of the restrictions originally formulated in PEP 498 and to provide a formalized grammar for f-strings that can be integrated into the parser directly". It notes that the restrictions were set to be removed in PEP 536 ("Final Grammar for Literal String Interpolation") but that PEP, written in 2016, has never been implemented and was deferred in 2019. In addition to removing the restriction on reusing the f-string quote delimiter within expressions, as mentioned above, the new PEP would dispense with a few other restrictions: escape sequences using backslashes would be permitted in the expressions as would comments in multi-line f-strings. The following examples would work as expected if PEP 701 gets adopted:

    >>> a = [ 'hello', 'world' ]
    >>> f'{"\n".join(a)}'
      File "<stdin>", line 1
	f'{"\n".join(a)}'
			 ^
    SyntaxError: f-string expression part cannot include a backslash

    >>> f'''foo {
    ... bar # a comment about bar
    ... }'''
      File "<stdin>", line 3
	}'''
	    ^
    SyntaxError: f-string expression part cannot include '#'
PEP 701 points out that other languages (such as Ruby, JavaScript, Swift, and C#) that have string interpolation mechanisms allowing expressions also allow arbitrary nesting of said expressions. The current limitations are more or less just annoyances, but they are unnecessary—removing them would also substantially simplify the code that parses f-strings.

The main objection to the PEP centers on the ability to reuse quotes within the expressions, and the arbitrary nesting that allows, because the feature can be abused in various ways. Steven D'Aprano took exception to two of the examples given in the PEP:

    f"These are the things: {", ".join(things)}"

    f"{source.removesuffix(".py")}.c: $(srcdir)/{source}"

    [...]
The first two might be perfectly understandable to the parser, but as a human reader, they make it more complicated and error-prone to work out which quotes delimit the f-string and which do not.

[...] I consider the first two examples terrible code which should be discouraged and the fact that your PEP allows it is a point against it, not in favour.

Especially since we can get the same effect by just changing one of the pairs of quotes to '. So in this regard, the PEP doesn't even add functionality. It just encourages people to write code which is harder to read and more error prone.

Galindo Salgado acknowledged those concerns and started a poll to try to gauge the sentiments of the participants in the discussion. Currently, the poll is around two-thirds in favor of allowing quote reuse in the expressions. Paul Moore said that he had voted in favor because he had encountered the problem along the way. "And I think consistency in allowing whatever can be in an expression is easier to explain and understand." He did suggest that the PEP add a warning about overusing the feature, however.

Barry Warsaw agreed with D'Aprano that the "join()" example was "challenging for me to parse", but he could see the consistency argument as well. "But maybe for consistency, the answer should be to let people write terrible, unreadable code!" Galindo Salgado pointed out that there is a cost beyond inconsistency, though:

[...] limiting quote reuse raises quite a lot the complexity. This is because when parsing the expression part now the parser needs to be aware that is parsing an expression inside an f-string with a given [quote], and that becomes even more tricky when f-strings are nested with different quotes.

This doesn't mean that this invalidates the "code smell" argument by any means: I just want to give some context on the maintenance point.

Mark Shannon thought that adding the quote-reuse restriction back in would be irritating:

Personally, I found the prohibition on reusing the same quote mildly annoying. f"You have the following in your basket: {", ".join(items)}." seems perfectly fine to me.

But I think I would find the restriction much more annoying if I knew that it was unnecessary, and that extra work had been put in just to stop me.

There are some horrific f-string abuses that are already possible, as Marc-André Lemburg demonstrated. "Should we really head on in the same direction even more ?" He supports the PEP because it simplifies the implementation and "removes some annoying bits (e.g. the backslash limitation)", but keeping the current quote restriction removes the possibility of arbitrary nesting of f-strings, which is a good thing in his mind.

"Shashwat" suggested that the possibility of misuse is not a good enough reason to disallow quote reuse, thus nesting, in the parser; "such restrictions belong in linters and code formatters rather than in the language grammar itself". Lemburg agreed with that point, but also thought the PEP should be changed to discourage that usage. "At the moment, it reads in a way which promotes reusing the same quotes inside f-string expressions as a feature."

There were also concerns expressed about difficulties in doing syntax highlighting in editors and other tools, but the consensus seems to be that those tools can and will be taught to handle the quote reuse. In fact, any of those tools that support multiple languages have probably already had to deal with this issue, since the feature is fairly widespread in other languages. In light of the poll and the discussion, the PEP authors decided to keep the lifting of the quote-reuse restriction, though a new, lengthy "Considerations regarding quote reuse" section was added to the PEP. The thread also contained some detailed discussion of the guts of the implementation for CPython and other Python dialects; the results of that have been incorporated as well.

Much of the argument around quote reuse seems to boil down to readability, which is highly subjective. But unreadable code can (and will) be written, even if people differ in their view of which particular constructs fail their criteria. As we saw in the discussions about None-aware operators, readability is often simply in the eye of the beholder.

The discussion has pretty much wound down at this point, so it would not be a surprise to see the PEP make its way to the steering council for pronouncement before long, which means it could be coming in Python 3.12 in October. It seems a foregone conclusion that the idea of formalizing f-strings and replacing the hand-written parser code will be accepted; that will reduce the code maintenance and could lead to better error messages for f-strings, like those that have been added elsewhere for CPython. The council could perhaps require that the quote constraint be retained, but that seems unlikely given the general reception; discouraging abuses of the feature via the PEP and various tools may well be enough.

Comments (20 posted)

Per-extent encrypted keys for fscrypt

By Jonathan Corbet
January 5, 2023
The kernel's fscrypt subsystem enables filesystems to store files and directories in encrypted form, protecting them against offline attacks. A few filesystems support encryption with fscrypt currently, but Btrfs is an exception, despite a number of attempts to add this feature. The problem is that, as so often seems to be the case, Btrfs works differently and does not fit well with one of the key assumptions in the design of fscrypt. With this patch series, Sweet Tea Dorminy is working to enhance fscrypt to be a better fit for filesystems like Btrfs.

Fscrypt got its start in 2015 as an ext4-specific encryption feature, but it was later generalized to be able to support other filesystems as well, with the second user being F2FS. To enable encryption, an administrator must start with an empty directory (which can be the root directory) on a filesystem and set a "master key" for that directory, after which all files and subdirectories created below the top-level directory will be encrypted. To be able to access the contents of that directory, the master key must be stored in the kernel's keyring. One master key can be used with multiple directory hierarchies, or different keys can be used with different hierarchies as needed.
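
As a rough sketch of what that looks like from user space, here is how a directory might be set up using the separate fscrypt administration tool (the device and directory names are invented, and the tool's exact options may differ):

    tune2fs -O encrypt /dev/sdb1        # enable the feature on an ext4 filesystem
    fscrypt setup                       # one-time global configuration
    fscrypt setup /mnt/data             # prepare this filesystem for fscrypt
    fscrypt encrypt /mnt/data/private   # set a master key on an empty directory
    fscrypt unlock /mnt/data/private    # load the key into the kernel keyring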

Each file within a filesystem has its own encryption key separate from the master key; this is needed to prevent two identical (plain-text) files from having the same encrypted representation. That key is stored with the file's inode data and can only be decrypted using the master key associated with the directory hierarchy. The result is secure encryption of the directory tree, which will be inaccessible in the absence of the master key. That should prevent access to the data in an offline attack, but will not prevent access from a compromised system where the master key is present.

Btrfs is, of course, a copy-on-write filesystem, and many of its features depend on that functionality. A filesystem snapshot, for example, is simply a second reference to the filesystem itself, sharing all data between both "copies". If a given file is written to after the snapshot is made, the affected extents of that file will be copied before modification, leaving the snapshot unchanged. Extents can also be explicitly shared between files using the Btrfs "reflink" functionality which, despite numerous attempts to upstream a reflink() system call, is currently implemented as a pair of ioctl() calls.

This sharing is efficient in terms of storage space, but it creates an interesting problem when fscrypt enters the picture: if an extent is shared between two files, which key should be used to encrypt its data? Those two files will have different keys, after all. One might consider adding another layer so that either key would work, but that runs into another difficulty; any solution to this problem will need to avoid adding data to the extent for each file that references it, since the number of such references can grow without bound. So approaches that add multiple keys are not going to work.

The solution that was chosen was to move the encryption key from a file's inode to each extent contained within the file. As any given file is read, the necessary decryption keys are obtained from each extent; the keys can vary from one extent to the next. As a result, a single file can contain data encrypted by multiple keys, and a given encrypted extent can appear in different (encrypted) files. Even different master keys could be used, as long as all of the required keys have been loaded into the kernel keyring.

This mechanism solves the data-sharing problem, and enables some additional interesting use cases. For example, a directory's master key could be changed after a snapshot is made without re-encrypting all of the data contained underneath that directory. New data would, thereafter, be written using the new key. As a result, possession of the master key(s) needed to read the snapshot would not give access to any data created after the snapshot is made. Perhaps more usefully, a system in possession of (only) the current key would be able to write to an encrypted filesystem without being able to read any of the pre-snapshot data. According to this design document for fscrypt support in Btrfs, "this is an important use case for Meta".

A scheme like this will bring some limitations of its own, of course. While it is theoretically possible to load all of the per-extent keys for a file prior to accessing the file, that would be problematic in practice. Files can be large and contain a huge number of extents, which would require the kernel to allocate memory for an equally large number of keys. An attacker who could create a large, highly fragmented file could thus run the kernel out of memory. So keys are loaded as extents are accessed; this works — as long as the master keys needed to obtain each per-extent key are all present. Otherwise, access to a file will fail partway through, which could be a surprising result.

There are also limitations on the sharing of encrypted extents. Perhaps most obviously, it is not possible to reflink an encrypted extent into an unencrypted file. Inline encryption, where the actual encryption and decryption work is offloaded to a suitably capable storage device, is also not supported. That is not a fundamental limitation of this approach; Dorminy just hasn't figured out how to implement it yet.

The current patch set does not add fscrypt features to Btrfs; it is, instead, a subset of a larger patch series (first posted in August 2022 and most recently in November) that is focused on the fscrypt changes. It makes sense to put that work out separately, since it may affect filesystems beyond Btrfs. Once this work clears the bar, though, the full Btrfs patch set seems certain to follow quickly.

Comments (2 posted)

A vDSO implementation of getrandom()

By Jonathan Corbet
January 6, 2023
Most developers probably do not see the generation of random numbers as being a performance bottleneck for their programs, but there are seemingly exceptions. Over the last few years, Jason Donenfeld has brought a new level of energy to the development of the kernel's random-number generator; he is now directing his efforts toward improving performance for user space with this patch series that provides an implementation of the getrandom() system call in the kernel's "virtual dynamic shared object" (vDSO) area. The result is, indeed, better performance, but not all developers see this benefit as being worth the additional complexity required to achieve it.

Traditionally, user-space processes on Linux systems have obtained random data by opening /dev/urandom (or /dev/random) and reading data from it. More recently, the addition of getrandom() simplified access to random data; a call to getrandom() will fill a user-space buffer with random data from the kernel without the need to open any files. This random data is provided with all of the guarantees that the kernel can make, including doing its best to ensure that the data is actually random and preventing repeated data sequences when, for example, a virtual machine forks.
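
For example, filling a small buffer with kernel-provided random data is a single call; glibc has shipped a getrandom() wrapper since version 2.25:

    #include <stdio.h>
    #include <sys/random.h>

    int main(void)
    {
        unsigned char key[32];

        /* ask the kernel for 32 random bytes; no files involved */
        if (getrandom(key, sizeof(key), 0) != sizeof(key))
            perror("getrandom");
        return 0;
    }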

It's worth noting that, in the BSD world, it is more common to call the arc4random() library function. The 2.36 release of the GNU C Library included an implementation of arc4random() that, in its pre-release form, included a fair amount of its own logic for the generation and management of random data. In July 2022, Donenfeld questioned the need for this function, noting that "getrandom() and /dev/urandom are extremely fast". Supporting arc4random() makes code more portable, though, so that function stayed in the library. The version that was eventually released was significantly simplified by Donenfeld, to the point that it is essentially a wrapper around getrandom() when that system call is available. As a result, the performance of getrandom() also determines how fast arc4random() will be.

The vDSO API

As "extremely fast" as getrandom() may be, some users have apparently complained that it still is not fast enough. In response, Donenfeld is now working to create a vDSO implementation. The vDSO is a special range of kernel memory that is mapped into each user-space process; it contains code that can be called to perform system-call-like functions that can be carried out without actually entering the kernel, thus avoiding the cost of a context switch. System calls implemented in this way can include getcpu() and clock_gettime() — both of which come down to simply reading some data out of the kernel.

Moving getrandom() into the vDSO has the potential to improve performance for applications that need a lot of random data, but it will clearly need to involve more than just reading some kernel data if the guarantees made by that system call are to be upheld. As a result, the patch series implementing this functionality touches 59 files and adds some 1,200 lines of code. It also complicates the interface somewhat, in that there are now two (virtual) system calls involved. The first must be called at least once prior to requesting any random data:

    void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
                           unsigned long addr, unsigned int flags);

The caller (likely to be the C library) should specify, in *num, the number of "opaque states" needed; that number is likely to correspond to the number of threads that are expected to run. This call will allocate a range of memory, returning its address. The number of states actually allocated will be stored in *num, and the size of each state in *size_per_each. The other two arguments are currently unused and must be zero.

The library is expected to assign one of the returned states to each thread in a process; the pointer to any given state can be had by offsetting the base address by a multiple of the value returned in *size_per_each. The kernel uses this space to track the state of the random-number generator for the thread; user space should not access its contents. Should more state-storage space be needed, more calls can be made to vgetrandom_alloc().

With the state space in hand, random data can be obtained with:

    ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
    		       void *opaque_state);

The first three arguments are the same as for getrandom(), while opaque_state points to one of the states returned by vgetrandom_alloc(). The intent is for this function to behave just as getrandom() would, only faster. Applications would not normally call it directly; instead, it would be invoked from the C library's getrandom() wrapper. Donenfeld said (in the cover letter) that "performance is pretty stellar (around 15x for uint32_t generation), and it seems to be working".
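
As a sketch of how those pieces might fit together in a C library, using the prototypes from the patch series (nothing here is a released ABI, and the per-thread slot assignment is a hypothetical simplification):

    #include <sys/types.h>

    /* Sketch only: vgetrandom_alloc() and vgetrandom() are the proposed
     * vDSO functions described above. */
    static void *states;               /* block returned by vgetrandom_alloc() */
    static unsigned int state_size;    /* size of each opaque state */
    static __thread void *my_state;    /* this thread's assigned slot */

    ssize_t getrandom_wrapper(void *buf, size_t len, unsigned int flags)
    {
        if (!my_state)                 /* next_free_slot() is a hypothetical
                                          allocator over the available slots */
            my_state = (char *)states + next_free_slot() * state_size;
        return vgetrandom(buf, len, flags, my_state);
    }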

VM_DROPPABLE

While some developers are clearly unconvinced about the need to optimize getrandom(), most of the patch series is relatively uncontroversial at this point. There is one significant exception, though: the management of the "opaque state" data. This data is used by the kernel to ensure that each caller to vgetrandom() gets a unique stream of random data, that each thread's random-number generator is reseeded as needed, and so on; it can be thought of as a sort of per-thread entropy storage that can be consulted without entering the kernel. For the curious, each state entry is defined (in the kernel) as:

    struct vgetrandom_state {
	union {
	    struct {
		u8	batch[CHACHA_BLOCK_SIZE * 3 / 2];
		u32	key[CHACHA_KEY_SIZE / sizeof(u32)];
	    };
	    u8		batch_key[CHACHA_BLOCK_SIZE * 2];
        };
    	vdso_kernel_ulong	generation;
    	u8			pos;
    	bool 			in_use;
    };

This structure occupies 256 bytes, which is not huge, but a system running a lot of threads could have quite a few of them. This is kernel-allocated memory that, in effect, must be locked into RAM; writing it out to secondary storage could expose the state of the random-number generator, potentially creating security problems. The state memory must thus be treated specially.

Earlier versions of the patch set used mlock() to ensure that this memory would stay in place. There were some problems with that approach, though, starting with the fact that locked memory is subject to resource limits. If the kernel and C library start using some of a process's allowed locked memory for random-number generation, once-working programs could start failing. Locked memory will also become unlocked in the child process after a fork. So using mlock() is not an ideal solution.

The memory used for states has another interesting characteristic, though: it can simply vanish at any time with minimal consequences. Since it is, at its core, a cache of random data, it can be replaced with more random data. This can already happen if, for example, the thread forks. In the worst case, should the state memory not be available at all, the vDSO function can just make a proper call to getrandom() to get its job done (albeit more slowly). So, while the state memory should be locked in the sense of not being written to swap, it does not need to be absolutely nailed down in RAM; it can be thrown away if need be.

Donenfeld decided to take advantage of the disposable nature of this memory. Rather than using mlock(), the current patch set adds a new (kernel-internal) memory flag called VM_DROPPABLE. Memory marked with this flag will never be written to swap or a core dump, and the memory-management subsystem can, if memory is tight, just reclaim it for other uses. VM_DROPPABLE memory is demand-paged, in that it is not actually allocated until an attempt is made to access it; normally the failure to allocate a page in this circumstance will be fatal for the process involved. With VM_DROPPABLE memory, instead, failures are simply ignored and any attempted writes are simply dropped — an outcome that is effected with some low-level, architecture-specific magic that simply skips the instruction attempting the write.

Dropping VM_DROPPABLE

Ingo Molnar objected strongly to this patch, saying that it adds complexity to a subsystem that needs a high level of trust. It would be better, he said, to just make mlock() work as needed or simply let the vDSO allocate a few pages outside of the existing resource limits. Donenfeld reacted poorly, questioning Molnar's motives. Molnar then vetoed the patch, leading to more complaints from Donenfeld, who did point out, though, that a process can make an arbitrary number of calls to vgetrandom_alloc(), so care needs to be taken to prevent it from being used to circumvent the limits on locked memory. Simply tweaking mlock() would not be a sufficient solution.

Andy Lutomirski also disliked VM_DROPPABLE, saying that Donenfeld was "trying to shoehorn all kinds of bizarre behavior into the core mm". He suggested that Donenfeld could, instead, create a special type of mapping, with its own local implementation, to obtain the needed semantics; the kernel provides the infrastructure needed to do this. There would then be no need to modify the core memory-management subsystem. Donenfeld had a relatively positive reaction to this suggestion and seemed ready to proceed in that direction. Linus Torvalds, though, argued that none of this effort was worth it. Speeding up random-number generation, beyond a point, is not the kernel's job, he said; instead, that should be left to the C library. After some discussion, Torvalds made his position clear:

I'm NAK'ing making invasive changes to the VM for something this specialized. I really believe that the people who have this issue are *so* few and far between that they can deal with the VM forking and reseeding issues quite well on their own.

Donenfeld has also been clear in his disagreement with Torvalds's assessment of the need for this feature, though; he claimed that it just isn't possible to create a safe random-number generator in user space (the patch cover letter also covers that ground in depth). He later went on to say:

I think the moral of yesterday's poo-poo'ing all over this cool new idea is that the Linux innercircle doesn't really care for "security things" as a motivator and just takes the shortest and easiest route toward swatting it away like a gadfly, assuming that the concerns are unreal or niche or paranoid or whatever.

It will probably come as little surprise that Torvalds disagreed with this view.

In the end, though, there do appear to be valid arguments, from both performance and security points of view, for a vDSO getrandom() implementation. So work on the vDSO patches is likely to continue, but the VM_DROPPABLE flag is clearly not going to clear the bar. Donenfeld will almost certainly return with an implementation based on Lutomirski's suggestions; that should address the main concerns that have been raised about this work. At that point, its chances of getting upstream are probably reasonably good, even if some developers remain unconvinced about how necessary it really is.

Comments (56 posted)

Memory-management short topics: page-table sharing and working sets

By Jonathan Corbet
January 9, 2023
The kernel's memory-management developers have been busy before and during the holidays; the result is a number of patch sets making significant changes to that subsystem. It is time for a quick look at three of those projects. Two of them aim to increase the sharing of page tables between processes, while the third takes advantage of the multi-generational LRU to create a better picture of what a process's working set actually is.

Revisiting msharefs

Some applications are structured as a large set of independent processes, all sharing a (potentially large) region of memory. Each of those processes will have its own set of page tables for that shared region. Duplicating page tables imposes a relatively small cost when the number of processes is low, but when that number gets large, the memory occupied by page tables may exceed the size of the memory region they refer to. In many cases, this duplication of page tables brings no extra value.

For some time, Khaled Aziz has been working on a mechanism to allow cooperating processes to share page tables referring to a shared memory area; this work has, at times, taken the form of the mshare() system call and the msharefs filesystem. There have been concerns raised with both solutions, so now Aziz is back with yet another attempt. This implementation does away with new system calls and filesystems and, instead, just adds a new flag (MAP_SHARED_PT) to the mmap() system call. If a process maps a shared segment (implying that MAP_SHARED must also be provided) with this new flag, then the page tables mapping this segment will also be shared with the other users, saving the overhead of making an independent copy of those tables.

As with the other versions, there are some interesting semantics and limitations associated with shared page tables. Any address-space changes (such as an mprotect() call) made to the shared region by one process will apply to every process sharing the page tables; that is seen as an advantage for some use cases. The memory segment must be aligned at the PMD level (2MB on many architectures), and it must be mapped at the same virtual address in all processes. The same-address requirement could perhaps be removed, Aziz said, if there is a reason to do so.
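
A sketch of how a cooperating process might create such a mapping, assuming the flag from the patch set (MAP_SHARED_PT is not defined in any released kernel, and the address, size, and file descriptor here are illustrative):

    #include <sys/mman.h>

    /* Sketch only: the region must be PMD-aligned and every
     * participating process must map it at the same virtual address,
     * hence the fixed address used here. */
    void *region = mmap((void *)0x7f0000000000UL, 1UL << 30,
                        PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_SHARED_PT | MAP_FIXED,
                        shm_fd, 0);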

Underneath the API, the implementation of page-table sharing follows the same lines as before. A separate mm_struct structure is created to manage the shared region as if it were a separate address space.

There have been no comments on the new version so far. One might expect that using mmap() would address most of the concerns about the user-space API for this feature. But this kind of page-table sharing, with its unique semantics, represents a significant memory-management change to serve a relatively rare use case. It is not yet clear that the case has been made that this functionality is worth the cost.

Copy-on-write page tables

A different, and somewhat more transparent, approach to page-table sharing can be found in this patch set from Chih-En Lin. When a process calls fork(), the new child process will share its memory with the parent. Any writable pages are marked copy-on-write (COW); should either process write to a COW page, that page will first be copied (breaking the sharing) so that the other process does not see the change. Sharing memory in this way saves a lot of copying, especially if the child process will not actually use much of the parent's memory.

While the parent's memory is not copied into the child on fork(), the parent's page tables are copied. If the parent process has a large address space, that copying can still create a significant cost, and it may be entirely useless if the child does not access that memory. Lin seeks to reduce that cost by, instead, extending the COW mechanism to the bottom (PTE) level of the page-table hierarchy.

A process must opt into the COW behavior with a new prctl() command (PR_SET_COW_PTE). Once that has been done, any new child processes will be created with shared page tables. The usual COW behavior applies here; should either process make a change to a PTE page, that page will be copied and the sharing will be broken. An mprotect() call, for example, would end up copying the affected page-table pages. Thus, COW page tables should not result in any behavioral changes visible to either side, other than fork() calls running a bit more quickly and requiring less memory.
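
In outline, and assuming the constant from the patch set (PR_SET_COW_PTE does not exist in released kernel headers, and the calling convention shown is a guess), opting in might look like this:

    #include <sys/prctl.h>
    #include <unistd.h>

    /* Sketch only: opt this process into copy-on-write page tables,
     * then fork; the child shares PTE pages with the parent until
     * either side modifies them. */
    if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0) == 0) {
        pid_t child = fork();
        /* ... */
    }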

Of course, that is not quite true. While a fork() may be a bit faster, other operations, including page-fault handling, may be slower due to the need to break the sharing of the page-table pages. Whether this sharing is beneficial overall may thus vary depending on the workload; benchmark results included in the cover letter show a 3-5% performance increase for some workloads, and a slight decrease for others. This variability of results explains the need to opt into the COW behavior; for most workloads it probably will not make enough of a difference to be worth the trouble.

Here, too, the implementation adds a certain amount of complexity to the core memory-management code. The sharing of page-table pages requires the addition of a reference count to each of those pages so the kernel knows when they are no longer in use. There are numerous operations that can require the sharing to be broken, including transparent huge-page collapse, kernel same-page merging, madvise() calls, and more. The code also has to properly handle page-table pages that cannot be shared, including those referring to pages that are pinned or mapped to device memory. The first posting of this work drew some questions about whether the added complexity was worth it (and a side discussion on better alternatives to fork()). There have been few responses to the current version, but it seems likely that this discussion has not yet reached its conclusion.

Working-set estimation

A process's "working set" is the subset of its pages that it is actually using at any given time. Identifying the working set is a key part of effective memory management; if a process's working set can be kept in RAM, that process will perform much better than if it must continually fault pages in from secondary storage. Giving a process more memory than is needed to hold the working set, though, is wasteful. So a lot of effort goes into trying to give each process just enough memory — but not too much.

In this patch set, Yuanchu Xie notes that the multi-generational LRU (MGLRU) work that was merged for the 6.1 kernel provides much of the infrastructure needed to create better working-set-size estimates. The MGLRU organizes a process's pages into "generations", with recently-used pages being placed into the youngest generation. Over time, unused pages age into the older generations, until they are eventually reclaimed.

The working set should thus be found in the youngest generations. The only problem is that the generational aging does not happen on any sort of set schedule; instead, it is done when memory pressure increases and the kernel needs to find pages to reclaim. As a result, the younger generations can accumulate pages that have not been used in some time, while pages that are part of the working set may remain stuck in the older generations; this situation can persist for some time if memory pressure is not high.

As a way of getting better working-set-size estimates out of the MGLRU, Xie adds a new mechanism to force aging to happen regularly. It takes the form of a new knob, memory.periodic_aging, that is implemented in the memory control-group controller, but for the root group only. It holds the aging interval in seconds; setting it to a non-zero value will enable periodic MGLRU aging system-wide. There is a new kernel thread, called kold, that does this aging work.

If memory.periodic_aging is set to, for example, 60 seconds, then the youngest generation for any process should contain the pages that are known to have been used within the last minute, while the second-youngest generation will hold pages that have been idle for more than one minute, but less than two. The kernel could use this information to adjust the amount of memory available to each process, but it could also be of use to user-space memory-management mechanisms. Processes could use their own working-set information to optimize their behavior and avoid using more memory than is available to them.

Before user space can use this information, though, it needs to be made available, which is not currently the case. So the patch set adds another memory-controller file called memory.page_idle_age to export generational data to user space. Reading this file will produce a table with counts of the number of pages in each of a set of fixed age ranges (ranging from one second to just over one hour), with separate lines for file-backed and anonymous pages. This information seems like it could be useful in a number of situations, including simply better understanding how the generational-aging algorithm is working.
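
In concrete terms, assuming the interface described in the patch set (neither file exists in released kernels), using it from the root control group might look like:

    # age the generations every 60 seconds, system-wide
    echo 60 > /sys/fs/cgroup/memory.periodic_aging

    # per-age-bucket page counts, file-backed and anonymous
    cat /sys/fs/cgroup/memory.page_idle_age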

This patch series is on its first posting, and has not yet drawn any review comments. It is far less invasive than the other patches examined here and seems like it should be less controversial. If nothing else, though, this work could benefit from some documentation so that potential users of the new functionality do not need to reverse-engineer its interface from the source.

Comments (6 posted)

Page editor: Jonathan Corbet


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds