
Leading items

Welcome to the LWN.net Weekly Edition for May 26, 2022

This edition contains the following feature content:

  • Improved error reporting for CPython 3.10—and beyond: PyCon coverage of Pablo Galindo Salgado's talk on better syntax errors.
  • Statistics from the 5.18 development cycle: where the code in this release came from and how it got into the mainline.
  • Sharing memory for shared file extents: an LSFMM session on reducing page-cache duplication for reflinked files.
  • Change notifications for network filesystems: an LSFMM session on making file-change notifications work for remote filesystems.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Improved error reporting for CPython 3.10—and beyond

By Jake Edge
May 24, 2022

PyCon

In a fast-paced talk at PyCon 2022 in Salt Lake City, Utah, Pablo Galindo Salgado described some changes he and others have made to the error reporting for CPython 3.10. He painted a picture of a rather baffling set of syntax errors reported by earlier interpreter versions and how they have improved. This work is not done by any means, he said, and encouraged attendees to get involved in making error reporting even better in future Python versions.

Galindo Salgado prefaced his talk with something of a warning that he has been told that he speaks rather quickly; with a chuckle, he suggested attendees prepare themselves for the ride. He introduced himself as a CPython core developer and a member of the steering council; beyond that, he is also the release manager for versions 3.10 and 3.11 of the language.

He began with a story of his days as a PhD student in physics, where he was using Python as a tool for his research. One day a friend showed him a Python syntax-error message that they could not figure out. They showed it to another student and all three of them were stumped; three physics students who were studying to try to solve the mysteries of the universe were unable to find a simple syntax error. The error message looked something like:

      File "ex.py", line 13
        def integrate(method, x, y, sol):
        ^
    SyntaxError: invalid syntax

As can be seen, there is nothing obviously wrong with the def statement, but looking at the (presumably simplified) code makes it clear where the actual problem lies:

    configuration = {
        'integrator' : 'rk4',
        'substep' : 0.0001,
        'butcher_table' : {
            1 : 1/6,
            2 : 1/3,
            3 : 1/3,
            4 : 1/6,
        }

    def integrate(method, x, y, sol):
        ...

A closing brace was left out in the definition of configuration, but the CPython error message is pointing to the following statement, which is not particularly helpful.

He showed several more examples of where the parser misleads programmers with its messages. Leaving out a comma in a list definition or omitting a closing square bracket can lead to an error on the following statement, which may be far from where the problem actually arose. These "not good" messages can confuse veteran Python coders, but they are really problematic for those trying to learn the language.

Galindo Salgado put up a slide with the "worst one of all", which is the dreaded "SyntaxError: unexpected EOF while parsing". It "helpfully" refers to a line number one past the end of the file and has a caret ("^") pointing to nothing at all. He asked for hands of people who have seen that error and most of the room raised theirs. Beyond the lack of helpfulness of the error message, "how many times do you have to explain to someone what 'EOF' means?" That is not particularly friendly to new programmers.

New parser

The poor error messages are not there "because we are lazy", he said. It was difficult to get the information needed for better messages in the parser—until Python 3.9. A new parser for CPython was introduced in that version of the language. That parser "allowed us to start thinking about how we can improve these things; can we improve the experience of people writing Python who make syntax errors?"

It is not just important for those learning the language, but also for those who use it regularly; developers at all levels have difficulty understanding many of the syntax-error messages generated by the interpreter. He has friends who are working with the language, so he has been "extremely happy that I have fixed many of these error messages" for them and others, Galindo Salgado said.

The new parser is based on a parsing expression grammar (PEG), instead of the old one based on an LL(1) parser. He was part of the group, with Guido van Rossum and Lysandros Nikolaou, who wrote PEP 617 ("New PEG parser for CPython") and implemented the parser. He noted that the commits for the original parser and the PEG parser were made almost exactly 30 years apart, in 1990 and 2020.

There were some shortcomings of the old parser, which was part of why it was replaced, but the new parser also allows new features that the old one could not support. For example, it allows multiple context managers in a with statement without having to resort to backslashes. It also allows the new match statement syntax. "This is only possible with the new parser", he said.
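As an illustration, the parenthesized multiple-context-manager form is rejected by the old LL(1) parser but accepted from Python 3.10 on. This sketch compiles the new syntax from a string so that the file itself still loads on older interpreters:

```python
import sys
from contextlib import nullcontext

# Parenthesized context managers: one 'with' statement, no backslashes.
SRC = """\
with (
    nullcontext(1) as a,
    nullcontext(2) as b,
):
    result = a + b
"""

# Compile from a string so this example also parses on Python < 3.10,
# where the parenthesized form is a syntax error.
if sys.version_info >= (3, 10):
    ns = {"nullcontext": nullcontext}
    exec(compile(SRC, "<example>", "exec"), ns)
    print(ns["result"])  # → 3
```

On an older interpreter, the same source string raises SyntaxError at compile time, which is exactly the limitation the new parser removed.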

New error messages

The PEG parser also allows for a bunch of improved error messages. Many of the examples he gave in the talk came from a section of the "What's new in Python 3.10" document. There are "quite a lot of them", Galindo Salgado said, so he would only be talking about a subset. A common mistake for new Python programmers now has a much friendlier message:

    # Python < 3.10
    >>> if x > y
      File "<stdin>", line 1
        if x > y
               ^
    SyntaxError: invalid syntax

    # Python 3.10+
    >>> if x > y
      File "<stdin>", line 1
        if x > y
                ^
    SyntaxError: expected ':'
Forgetting the colon at the end of if, for, and other similar constructs happens frequently, so getting a clearer message that specifies what is missing and where it should go will help. Users who forget to specify a value in a dictionary used to just get a generic syntax error message pointing to the closing brace, but things have improved:

    values = { 'a' : 1, 'b' : }
                              ^
    SyntaxError: expression expected after dictionary key and ':'

Beyond that, using "=" in an if statement instead of "==" will get a suggestion about what the mistake is rather than the generic "invalid syntax" complaint:

    if x = y:
       ^^^^^
    SyntaxError: invalid syntax. Maybe you meant '==' or ':=' instead of '='?

Similarly, a forgotten comma in a dictionary definition, which might not be obvious in a complicated initialization expression, will get "Perhaps you forgot a comma?" with the caret pointing to where it likely should go. "This has saved me at least ten times already", he said.
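The missing-comma case can be reproduced by compiling a small snippet; note that the exact wording of the message is version-dependent (on 3.10+ it reads "invalid syntax. Perhaps you forgot a comma?", while older interpreters just say "invalid syntax"):

```python
src = "values = [1, 2 3, 4]"   # missing comma after the 2
try:
    compile(src, "<example>", "exec")
    err = None
except SyntaxError as exc:
    err = exc

# Every interpreter version rejects this; only 3.10+ suggests the comma.
print(type(err).__name__)  # → SyntaxError
```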

The IndentationError messages have been improved as well. Now the message tells the programmer the line number of the if (or other statement) that is causing the need for indentation. In addition, the dreaded "EOF" message, which can happen when a dictionary or other similar construct is missing its closing punctuation, now gives an actually useful error message:

    vals = { 'a' : 3, 'b' : 4
           ^
    SyntaxError: '{' was never closed

"This is probably one of the ones that people like the most", he said.

Difficulties

[Pablo Galindo Salgado]

"It turns out that adding error messages is quite hard", Galindo Salgado said. For example, when looking to add the missing comma test, the first step was to teach the parser to recognize it. The first attempt at a rule might be that when it sees an expression followed by another expression, without anything in between, it is a missing comma. But that test is way too simplistic. It will trigger on a missing "in" for a for loop or, even, the new match statement; "no good, right?" He showed a bunch of bug reports that resulted from the change, all of which have been fixed at this point.

A raw PEG parser does its work in exponential time; because of backtracking, the time required can grow exponentially with the length of the input. In order to avoid that, the CPython parser uses "packrat parsing", which caches intermediate results so that the parsing runs in linear time.
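The packrat idea can be shown with a toy parser (a sketch only: the grammar, and the names parse_expr and parse_term, are invented and bear no resemblance to CPython's parser). Memoizing results by input position means each (rule, position) pair is parsed at most once, even when the parser backtracks:

```python
from functools import lru_cache

TEXT = "1+1+1+1"

# PEG rule: expr <- term '+' expr / term
# The lru_cache decorator is the "packrat" part: results are memoized by
# position, so backtracking never re-parses the same span of input.
@lru_cache(maxsize=None)
def parse_expr(pos):
    """Return the end position of an expr starting at pos, or None."""
    end = parse_term(pos)
    if end is None:
        return None
    if end < len(TEXT) and TEXT[end] == "+":
        rest = parse_expr(end + 1)
        if rest is not None:
            return rest
    return end

def parse_term(pos):
    """A term is a single digit."""
    return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

print(parse_expr(0))  # → 7 (the whole input was consumed)
```

Without the cache, the ordered-choice rule would re-derive the same sub-parses on every backtrack, which is where the exponential blowup comes from.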

But sometimes it can still get into some ugly exponential backtracking. "This is pretty funny", Galindo Salgado said. An obvious syntax error consisting of 21 open braces followed by a colon (and EOF) took Python 3.10 around two seconds to parse, while the same input with 42 open braces ran for more than an hour. He fixed the problem back in February for Python 3.10.3. "Now they [users] don't need to spare one hour to find out that's a syntax error", he said with a chuckle.

Adding error messages is difficult because it is stretching the parser in directions where it is less tested. "We have validated the real grammar of the language" many times, Galindo Salgado said, "we know it's fast, we know it works". A parser likes to know about what is correct in the language, but, when testing these error conditions, it is being used to investigate the "infinitely big world of things that are not Python".

It takes a lot more effort to validate the parser once these incorrect constructs are being recognized as well. More recently, a lot of improvements have been made to the parser and the tools used to validate it, but sometimes things still slip through. Then people make fun of them on Twitter; "Please don't make fun of us in Twitter", he said to laughter.

Suggestions and tracebacks

More syntax error message improvements are planned for 3.11 and beyond. There are, of course, other kinds of errors in Python programs, including errors at run time. Another new feature in 3.10 adds suggestions when a user misspells an identifier in their program. For example:

    >>> import collections
    >>> collections.namedtuplo
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: module 'collections' has no attribute 'namedtuplo'. Did you mean: 'namedtuple'?

    >>> schwarzschild_black_hole = None
    >>> schwarschild_black_hole
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'schwarschild_black_hole' is not defined. Did you mean: 'schwarzschild_black_hole'?
This facility "works with everything": modules, custom classes, things in the standard library, and so on. This is a highly useful feature, he said, which actually helped him as he was developing it; "it's quite cool".

Galindo Salgado went on to explain how the feature is implemented. First, they extended the AttributeError exception to add two pieces of information: the name being looked up and the target object where Python tried (and failed) to find it. Then a "word distance" function is used to try to find the closest match to the name. All of the possibilities in the object are checked and the one with the smallest word distance from the name is suggested.

But there is a problem: those kinds of exceptions (and others, like NameError, where suggestions are made) can happen in the normal functioning of a program and finding the closest match is an expensive operation. If it were computed every time the exception is raised, "it would make Python much slower". So, instead, the match is only computed when the exception is about to be printed; the exception has bubbled all the way up to the top level and the interpreter is about to exit anyway. This is part of what makes adding error messages hard, he said; it is important to ensure that the non-error paths are not penalized when adding extra information to help in the error case.
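Both halves of that design, a word-distance search that is deferred until display time, can be sketched in pure Python. Everything below is illustrative: the class and function names are invented, and difflib stands in for the distance function that CPython implements in C:

```python
import difflib

class SuggestingNameError(Exception):
    """Carries the failed name and its namespace; no search happens here,
    so raising (and catching) this exception stays cheap."""
    def __init__(self, name, namespace):
        super().__init__(name)
        self.name = name
        self.namespace = list(namespace)

def render(exc):
    # The expensive closest-match search runs only here, at display time,
    # mirroring how CPython computes suggestions just before printing
    # an uncaught exception.
    matches = difflib.get_close_matches(exc.name, exc.namespace, n=1)
    hint = f". Did you mean: '{matches[0]}'?" if matches else ""
    return f"name '{exc.name}' is not defined{hint}"

print(render(SuggestingNameError("namedtuplo", ["namedtuple", "deque"])))
# → name 'namedtuplo' is not defined. Did you mean: 'namedtuple'?
```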

Something he is excited about that is coming is better tracebacks for Python. The feature comes from PEP 657, which "has a horrible name 'Include Fine Grained Error Locations in Tracebacks'", he said, but it is much better than it sounds. He authored the PEP with Ammar Askar and Batuhan Taskaya; the feature will be added in 3.11. He gave an example similar to the following in the PEP:

    Traceback (most recent call last):
      File "test.py", line 2, in <module>
        x['a']['b']['c']['d'] = 1
    TypeError: 'NoneType' object is not subscriptable

One of those things is None, "but which one it is, you don't know". In Python 3.11, though, the traceback will show exactly where the problem lies:

    Traceback (most recent call last):
      File "test.py", line 2, in <module>
        x['a']['b']['c']['d'] = 1
        ~~~~~~~~~~~^^^^^
    TypeError: 'NoneType' object is not subscriptable

In addition, tracebacks from failing programs will show which function call failed, so multiple calls on the same line are no longer mystifying as to which caused the error:

    Traceback (most recent call last):
      ...
      File "query.py", line 24, in add_counts
        return 25 + query_user(user1) + query_user(user2)
                    ^^^^^^^^^^^^^^^^^
      File "query.py", line 32, in query_user
        return 1 + query_count(x['a']['b']['c']['user'])
                               ~~~~~~~~~~~^^^^^
    TypeError: 'NoneType' object is not subscriptable

Likewise, multiple divisions on a line will indicate which caused the division by zero, multiple uses of the same attribute name on different objects will indicate the guilty one, and so on.

This is done by adding extra information to every bytecode instruction about where the operation is in the program. Each operation stores the starting and ending line numbers along with the starting and ending column positions. The offending line is reparsed and combined with the information from the failing bytecode to produce the far more useful traceback.
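From Python 3.11 on, that per-instruction information is exposed on code objects via co_positions(), which yields (start_line, end_line, start_col, end_col) tuples; the traceback machinery combines these spans with the source line to draw its markers. A guarded peek at the data:

```python
import sys

def f(x):
    return x['a']['b']

# co_positions() exists only on Python 3.11+; some entries may contain
# None when no location information is available for an instruction.
if sys.version_info >= (3, 11):
    for position in f.__code__.co_positions():
        print(position)
```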

Helping out

"We would love for you to help us" in making the error messages for Python even better, Galindo Salgado said. Python bugs have recently migrated to GitHub, so he recommended people go there to suggest error-message improvements. Sometimes the developers will say that it is difficult or impossible to implement them, but other times, they have been able to improve an error message based on an issue of that nature. He encouraged new Python programmers, in particular, to point out error messages that have caused them problems; similarly, Python teachers should point out the errors that are causing their students the most trouble.

For those who want to work on implementing better error messages, he suggested starting with the "Guide to CPython's Parser" in the Python Developers Guide. It is "very technical but I think it reads quite nicely", he said; it will allow readers to understand how the parser works in great detail. There is a section at the end on adding and validating new error messages. That will allow developers to add an error message and a "bunch of test cases", which can hopefully then go into the CPython mainline.

There is a growing group of people working on these improvements, "which is super great". Many of the improvements were proposed and implemented by the community and not by the core developers. It is important to keep an open mind when proposing improvements, however, since some of them may not be possible or may lead to problems elsewhere. Sometimes, even if they fix the target message, the core developers have to turn them down because of other things that break.

The "moral of the story" is that if you are working on your PhD and lose your battle with syntax errors, you can study for a few years about parsers and grammars, then help replace "the parser for one of the most popular languages in the world". After that, you can become a core developer and help improve the situation with those syntax errors. Or you can wait for someone else to have that experience "and then you can use it", he concluded—to laughter and applause.

[I would like to thank LWN subscribers for supporting my trip to Salt Lake City for PyCon.]


Statistics from the 5.18 development cycle

By Jonathan Corbet
May 23, 2022
The 5.18 kernel was released on May 22 after a nine-week development cycle. That can only mean that the time has come to look at some of the statistics behind this release, which was one of the busiest in a while. Read on for a look at the 5.18 kernel, where the code in this release came from, and how it found its way into the mainline.

The 5.18 development cycle saw the addition of 14,954 non-merge changesets from 2,024 developers, 289 of whom made their first kernel contribution during this time. None of these numbers are records, though the number of developers came close to the maximum seen so far (2,062 for 5.13). This work resulted in the addition of 756,000 lines of code to the kernel.

The top contributors to 5.18 were:

Most active 5.18 developers

By changesets
    Krzysztof Kozlowski       214   1.4%
    Matthew Wilcox            164   1.1%
    Christoph Hellwig         154   1.0%
    Geert Uytterhoeven        140   0.9%
    Ville Syrjälä             135   0.9%
    Jonathan Cameron          119   0.8%
    Andy Shevchenko           118   0.8%
    Lorenzo Bianconi          117   0.8%
    Vladimir Oltean           111   0.7%
    Hans de Goede             110   0.7%
    Martin Kaiser             110   0.7%
    Colin Ian King            104   0.7%
    Sean Christopherson       100   0.7%
    Jakub Kicinski            100   0.7%
    Christophe JAILLET         89   0.6%
    Michael Straube            87   0.6%
    Jani Nikula                86   0.6%
    Trond Myklebust            81   0.5%
    Eric Dumazet               80   0.5%
    Christophe Leroy           80   0.5%

By changed lines
    Leo Li                 227676  19.4%
    Qingqing Zhuo          197757  16.9%
    Ian Rogers              72008   6.1%
    Alan Kao                15814   1.3%
    Ming Qian               12176   1.0%
    Linus Walleij            8881   0.8%
    Krzysztof Kozlowski      8844   0.8%
    Dimitris Michailidis     8791   0.7%
    Christoph Hellwig        7165   0.6%
    Matt Roper               7114   0.6%
    Jakub Kicinski           7040   0.6%
    Jacob Keller             6877   0.6%
    Geert Uytterhoeven       6039   0.5%
    Ranjani Sridharan        5768   0.5%
    Evan Quan                5232   0.4%
    Guodong Liu              4944   0.4%
    Mauro Carvalho Chehab    4816   0.4%
    Vladimir Oltean          4776   0.4%
    Brett Creeley            4660   0.4%
    Adrian Hunter            4651   0.4%

Krzysztof Kozlowski is the developer who contributed the most patches to 5.18; this work consisted mainly of device-tree updates. Matthew Wilcox managed to get another set of folio patches merged. Christoph Hellwig continues to massively refactor the block and filesystem layers. Geert Uytterhoeven contributed a large set of Renesas pin-control improvements, and Ville Syrjälä did a lot of work on the Intel i915 graphics driver.

In the "changed lines" column, Leo Li added over 200,000 lines with just five patches adding register definitions for the AMD graphics driver — and Qingqing Zhuo added nearly 200,000 more. Ian Rogers made a number of improvements to the perf tool, Alan Kao contributed a single patch removing the nds32 architecture, and Ming Qian contributed a set of Amphion media drivers.

The top testers and reviewers of patches were:

Test and review credits in 5.18

Tested-by
    Daniel Wheeler            155  11.7%
    Damien Le Moal             77   5.8%
    Konrad Jankowski           54   4.1%
    David Howells              53   4.0%
    Mike Marshall              53   4.0%
    Gurucharan                 38   2.9%
    Marc Zyngier               32   2.4%
    Vladimir Murzin            32   2.4%
    Randy Dunlap               21   1.6%
    Jiri Olsa                  17   1.3%
    Julian Grahsl              16   1.2%
    Yihang Li                  15   1.1%

Reviewed-by
    Rob Herring                  217   2.7%
    Christoph Hellwig            204   2.6%
    Andy Shevchenko              143   1.8%
    AngeloGioacchino Del Regno   110   1.4%
    Stephen Boyd                 103   1.3%
    Pierre-Louis Bossart         103   1.3%
    Alex Deucher                  98   1.2%
    Krzysztof Kozlowski           96   1.2%
    Hans de Goede                 91   1.1%
    Péter Ujfalusi                88   1.1%
    Jani Nikula                   86   1.1%
    Himanshu Madhani              85   1.1%

Daniel Wheeler continues to receive the most test credits, having applied Tested-by tags to many AMD graphics-driver patches. It's worth noting that Wheeler posts occasional summaries describing the testing that has been done. Damien Le Moal tested many of the folio patches, and Konrad Jankowski regularly tests Intel network-driver patches.

Turning to the review column, Rob Herring routinely reviews device-tree patches. Christoph Hellwig reviewed patches in the block and filesystem subsystems — and a number of the folio patches as well. Andy Shevchenko reviewed many driver patches, mostly in the I2C, GPIO, and pin-control subsystems.

In the past it has been easy to be cynical about these numbers; they didn't capture much of the test and review activity happening in the community and were easily gamed. There is still surely a lot of work going on that is not reflected above, but it would be hard to argue that the testers and reviewers on these lists don't belong there. Perhaps this reflects a greater understanding of the value of these activities on the part of developers and (especially) their employers.

Whether the same can be said for bug reporting will be left for the reader to decide:

Top bug-report credits for 5.18
    kernel test robot    232  19.3%
    Zeal Robot            76   6.3%
    Syzbot                72   6.0%
    Abaci                 62   5.2%
    Dan Carpenter         29   2.4%
    Hulk Robot            27   2.2%
    Stephen Rothwell      26   2.2%
    Igor Zhbanov          19   1.6%
    Randy Dunlap          12   1.0%
    Rob Herring            9   0.7%

Bug reporting is clearly a job for robots these days. But note that, while 2,249 5.18 patches were backported to the 5.17 stable updates (so far), only 1,075 contained Reported-by tags. That would suggest that just over half of the fixes being applied do not carry those tags and that, probably, a number of bug reports are going without credit.
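A quick check of that arithmetic, using the figures quoted above:

```python
# 5.18 patches backported to the 5.17 stable updates (so far), and how
# many of them carried Reported-by tags.
backported = 2249
with_reported_by = 1075

missing = backported - with_reported_by
print(missing, f"{missing / backported:.0%}")  # → 1174 52%
```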

The employers contributing most actively to this development cycle were:

Most active 5.18 employers

By changesets
    Intel                    1708  11.4%
    (Unknown)                1155   7.7%
    Red Hat                   958   6.4%
    Google                    886   5.9%
    (None)                    818   5.5%
    AMD                       781   5.2%
    Linaro                    560   3.7%
    Huawei Technologies       471   3.1%
    Facebook                  446   3.0%
    NVIDIA                    396   2.6%
    (Consultant)              363   2.4%
    SUSE                      344   2.3%
    IBM                       334   2.2%
    Oracle                    325   2.2%
    Arm                       294   2.0%
    Renesas Electronics       262   1.8%
    MediaTek                  249   1.7%
    NXP Semiconductors        236   1.6%
    Canonical                 227   1.5%
    Microchip Technology      201   1.3%

By lines changed
    AMD                    467642  39.9%
    Intel                  107081   9.1%
    Google                 103801   8.8%
    (Unknown)               49669   4.2%
    Linaro                  29631   2.5%
    Red Hat                 28807   2.5%
    (None)                  27989   2.4%
    NXP Semiconductors      21418   1.8%
    NVIDIA                  19203   1.6%
    MediaTek                18980   1.6%
    Facebook                16036   1.4%
    Andes Technology        15814   1.3%
    (Consultant)            14314   1.2%
    Huawei Technologies     13483   1.1%
    IBM                     11960   1.0%
    Microchip Technology    11853   1.0%
    Renesas Electronics     11427   1.0%
    SUSE                    10128   0.9%
    Canonical                8984   0.8%
    Fungible                 8791   0.7%

As usual, there are few surprises here.

[Patch-flow plot]

Patch flow and signed tags

The illegible plot on the right (click to be able to actually read it) shows the paths taken by patches into the mainline kernel. Each box represents a Git repository, with the vectors showing the movement of patches from one repository to the next. This plot, which was generated by the treeplot utility from the gitdm collection of hacks (available from git://git.lwn.net/gitdm.git), provides an overall picture of how code moves through the maintainer community.

That picture remains relatively flat; most maintainers push their changes directly to Linus Torvalds. There is, however, a steady growth in the role of intermediate repositories, with the biggest ones handling the networking, graphics, system-on-chip, and character driver subsystems. The plot is a schematic diagram of the machine that has allowed the kernel process to scale to its current size — and, presumably, beyond.

The color of each vector indicates whether that repository is using signed tags on patches being pushed to the next level in the hierarchy; red lines indicate the lack of such a tag. The use of GPG signatures on tags allows a receiving maintainer to verify that a pull request was created by the person it claims to be from. If all pull requests include signed tags, it will be significantly harder for an attacker to convince a maintainer to pull from a malicious branch.

As has been documented here over the years, that universal use of signed tags has been slow to happen. Recently, though, Torvalds has become more insistent, with explicit requests to recalcitrant maintainers to get with the program. The end result is that, for 5.18, only 714 patches did not come from a signed tag — and 565 of those were directly applied by Torvalds and didn't arrive via a Git repository at all. So, at the top level of the tree, the switch to using signed tags is nearly complete — a mere 11 years after the practice was adopted. Some of the mid-level maintainers are still clearly not requiring signed tags on pull requests, though, so there are still some holes in the process.

Older bugs

Many of the patches applied to 5.18 fix bugs; how old are those bugs? One way of approximating an answer to that question is to look at how many fixes showing up in the stable updates were first applied to 5.18. A bug fix, one would expect, will not be backported beyond the release that introduced the bug in the first place. The results for 5.18 are:

    Release           Backports
    5.17 (Mar 2022)       2,249
    5.15 (Oct 2021)       1,762
    5.10 (Dec 2020)       1,185
    5.4  (Nov 2019)         756
    4.19 (Oct 2018)         532
    4.14 (Nov 2017)         422
    4.9  (Dec 2016)         331

As can be seen above, 331 fixes (so far) have been ported from 5.18 all the way back to the 4.9 kernel, which was released over five years ago. In other words, after more than five years of intensive fixing (stable updates to 4.9 have added nearly 22,000 fixes), we are still fixing nearly five bugs in 4.9 every day. We'll get that kernel right one of these years, probably just before its end of life date.

To summarize, the kernel machine continues to move at high speed. Lots of bugs are being fixed and, beyond doubt, lots more are being introduced. The end result continues to be the kernel that we all rely on.


Sharing memory for shared file extents

By Jake Edge
May 24, 2022

LSFMM

On the second day of the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Goldwyn Rodrigues led a combined filesystem and memory-management session on saving memory when reading files that share extents. That kind of sharing can occur with copy-on-write (COW) filesystems, reflinks, snapshots, and other features of that sort. When reading those files, memory is wasted because multiple copies of the same data are stored in the page cache, so he wanted to explore adding a cache specifically to handle that.

When two files share an extent, their inodes point at the same data blocks on the disk, though they seem to be completely independent files. When those files are read, each gets copied separately into the page cache. That wastes memory, but there are also other costs: reading from the disk, computing checksums, decompressing, and so on.

His idea is to create a device cache ("not a buffer cache" because that would cause nightmares, he said) within the page cache that would only store a single copy of these pages. His RFC implementation back in October used the inode of the device special file of the underlying device, rather than that of the file in the filesystem, to store the shared extents in the page cache. He described how the cache would work for multiple scenarios (buffered read, buffered write, direct I/O, and mmap()), starting with the simplest.

[Goldwyn Rodrigues]

A buffered read would check the page cache for the file and, if the page is not found there, it would calculate the device offset from the read offset in the file and look in the shared-extent cache to see if the page lives there. If not, it would read the data from the disk and add it to the shared-extent cache. Buffered writes would always go to the page cache, because any write ends the sharing of the extent. For writes of a partial page, though, the shared-extent cache would be checked for the rest of the data for that page.
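The buffered read and write flow described above can be sketched in a few lines. This is purely conceptual: the class and attribute names are invented, and nothing here corresponds to actual kernel code:

```python
class SharedExtentCache:
    """One per device: caches pages keyed by device offset, so files that
    share an extent share a single in-memory copy."""
    def __init__(self, disk):
        self.disk = disk            # maps device offset -> on-disk data
        self.pages = {}             # device offset -> cached page

    def read(self, dev_offset):
        if dev_offset not in self.pages:
            # Only the first reader pays for the disk read (and, in the
            # real design, checksumming and decompression).
            self.pages[dev_offset] = self.disk[dev_offset]
        return self.pages[dev_offset]

class File:
    def __init__(self, extent_map, shared_cache):
        self.extent_map = extent_map    # file offset -> device offset
        self.page_cache = {}            # private pages (created by writes)
        self.shared = shared_cache

    def read(self, offset):
        # Check the file's own page cache first; on a miss, translate the
        # file offset to a device offset and consult the shared cache.
        if offset in self.page_cache:
            return self.page_cache[offset]
        return self.shared.read(self.extent_map[offset])

    def write(self, offset, data):
        # Any write ends the sharing: the page goes private (COW).
        self.page_cache[offset] = data

disk = {100: b"hello"}
cache = SharedExtentCache(disk)
f1 = File({0: 100}, cache)
f2 = File({0: 100}, cache)   # a reflinked copy sharing the extent

print(f1.read(0), f2.read(0), len(cache.pages))  # → b'hello' b'hello' 1
f2.write(0, b"world")
print(f1.read(0), f2.read(0))                    # → b'hello' b'world'
```

Both files read through one shared page until the write to f2 breaks the sharing for that page only.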

The harder problem is for direct I/O (DIO), because a shared-extent cache kind of defeats the purpose of DIO, which is to circumvent the page cache. But if the shared-extent cache were used, DIO writes would need to check that cache and remove pages from it since the extents would no longer be shared. But Matthew Wilcox cautioned that even for reads, DIO needs to actually go to the disk because of shared storage, where some other machine may have written to the device. In addition, there are applications that are trying to save CPU cycles and want the DMA from the device to occur; the alternative is to copy the data out of the shared-extent cache using the CPU and "to that application, CPU is more important than bus bandwidth". Rodrigues said that changes in shared storage will require invalidating the caches across the cluster.

Supporting mmap() is "sort of a gray area for me", he said; he is not sure that his way to do so is the right one. There is a read-only mapping for the shared pages and any writes to those pages will result in a page fault that can handle the COW operation.

He wondered if there should be some kind of differentiation for reads that are targeting shared extents or whether all reads should go through the new cache. Josef Bacik said that he thinks it should just be a new kind of inode that, for Btrfs, maps to its logical byte-number addressing, rather than to anything device-specific. From there it is just treated like any other inode, so, for example, the memory-management (MM) subsystem can ask the filesystem to shrink its inode cache and these cache objects would just be handled normally.

Beyond that, Bacik does not want to see this as a mount option as it was in the RFC patch set, "death to all mount options"; it should just always work, he said. For Btrfs, he thinks that all reads should go to the new cache because a snapshot could happen at any time. For DIO, the page cache entries should just be invalidated, and applications using DIO will not get the benefit of this feature.

There is a question of how the cache gets flushed, since closing the file does not mean that others are not using the pages or won't soon, Rodrigues said. Maybe it makes sense to wait until the inode is evicted. But Bacik said that the starting point should be to not flush these pages at all and let the MM subsystem evict pages as needed. It will not reach the point of an out-of-memory (OOM) condition because the MM will tell the filesystem to invalidate pages before that happens.

There are some questions, he said, about how to share a single page across multiple mappings for different inodes; how does the system ensure that the COW happens when writes are done and how does the page get reclaimed properly when there are a lot of inodes referencing it? Wilcox said that made for a good opportunity to talk about some plans he has for splitting struct page and struct folio apart, since currently they are aliases. He covered some of that in his LSFMM session on the previous day.

Right now, you can simply cast a folio pointer to a page pointer and vice versa; "it's a bad code smell, but it works". The page structure has a pointer in disguise called "memdesc" that points to a folio structure. But there will need to be a way to get the page frame number (PFN) of the memory referred to by the folio once this 3-5 year "gargantuan project" of switching over to folios has finished.

So there will need to be a way to go from a folio structure to the memory it is describing. Once that is working, there could be multiple folios allocated for the same PFN, but with a different mapping and index. That could lead to a solution to the problem of tracking the inodes associated with the cache entries; there could be multiple folios in different address spaces that all refer to the same memory. It is only a long-term solution, he said, because all of the filesystems will need to be changed to use folios before it can happen.

Bacik said that he liked the idea of having a folio per inode that was sharing extents. But he wondered if that solution would be unpopular with the MM developers because pages with lots of references will seem like unattractive targets for reclaim, but these pages are simply in a cache that can be reclaimed. Kent Overstreet said that there needs to be a way to get a clear understanding of what a given chunk of memory is. The page structure cannot point to multiple folios, but it could point to a special kind of shared folio type that lists all of the folios that refer to the page. That shared folio could be put onto the least-recently-used (LRU) lists. Wilcox said that made sense to him.

It is in some ways like kernel same-page merging (KSM), Johannes Weiner said; a page structure is what appears on the LRU and the MM code consults the container of that page to reclaim all of the mappings to it. But if every filesystem has to deal with walking the list of folios when reclaim needs to be done, that will make it harder to implement. Wilcox said that he originally thought it made sense for the filesystems to keep track of that information, but he was coming around to the idea that it should be done in the MM subsystem. Bacik said he would be happy as long as Wilcox did all the work to make it happen, which elicited laughter; Wilcox seemed to agree, however.

Weiner said that KSM could use that facility as well, which Wilcox said "would be fantastic". KSM would just become another example of "something that is shared between multiple files". There was general agreement in the room on that approach. "OK, we solved it, thanks everyone for coming", Wilcox said with a laugh; there is, of course, a lot of work to be done to get there.

Comments (5 posted)

Change notifications for network filesystems

By Jake Edge
May 25, 2022

LSFMM

Steve French led a discussion on change notifications for network filesystems in a session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). He is part of the Samba team and noted that both Windows and macOS clients get notified of new and changed files in a shared directory immediately, while on Linux that does not happen. He wanted to explore what it would take to add that functionality.

On Windows and macOS, a file browser automatically shows changes to files in shared network filesystems, but at some point that broke for Linux clients. The inotify mechanism (like its predecessor, dnotify) was added to the kernel to support the Samba server, he said. Remote systems that are talking to a Samba server on Linux can see those kinds of changes, but remote Linux clients cannot.

The client API changed at some point, so network filesystems have no easy way to register to receive these kinds of events. For SMB, he added an ioctl() command that can be used to wait for notifications of these changes. But in order to use that, all of the client programs would need to change to make a filesystem-specific call to get that information.

[Steve French]

The underlying problem is that the filesystem servers are not told that a Linux client wants to be notified of changes. That means Linux file browsers do not have the functionality that Windows and Mac users have come to expect. The inotify functionality does not have a hook into Ceph, AFS, or SMB to make them aware that a client wants notifications, he said. Chuck Lever noted that NFS has the notification capability in the protocol, but, like the others, it is not implemented for Linux.

There is also the fanotify API, French said, but he does not know if it would be useful for what he is looking for. Amir Goldstein said that fanotify was originally created by antivirus vendors but that, more recently, work has gone into it to add more functionality. As of about Linux 5.10, fanotify provides almost a superset of the inotify functionality.

One big feature that inotify lacks has been implemented in fanotify: watching an entire filesystem. There are not many applications that use it, because it is new, Goldstein said. He has added fanotify support to inotify-tools and its library, so there are now user-space tools that can be used to watch a filesystem or set of files using the fanotify API.

There are many types of events that an SMB client can get from the server to tell it about changes to timestamps, file creation, file name changes, file deletion, and so on, French said. Those all seem to map reasonably well to fanotify/inotify events; changes to access-control lists (ACLs) are not supported but might need to be, he said. Goldstein said that if there is enough interest, event types can be added to fanotify.

On Linux, David Howells said, the file notifications are mostly used by desktop file managers. KDE starts a daemon to monitor changes and GNOME does something similar, he said; if notifications are not available, then they poll for the information. Goldstein said that it is not that notifications are not available, just that they are not granular enough and that there may be some kinds of changes that do not have notification events, so polling is used for those cases.

Goldstein said that French had been asking for this feature for a long time. The FUSE developers "took a shot at implementing something", he said; it added inotify support for virtiofs. On the Zoom link, Vivek Goyal, who was involved in that work, said that inotify was chosen because it is simpler than fanotify. Whatever notification watches are placed on the local file are forwarded to the remote file server, which sets up inotify and forwards events back to the local filesystem. Based on the feedback on those patches, Goyal said, he has been trying to rework the patches to use fanotify but ran into a number of difficulties. There may be more limitations when using fanotify. French said that it is important to get a handle on what exactly can be supported because the alternative is "really painful": polling.

Jan Kara, also via Zoom, said that it should be fairly straightforward to add the hook for filesystems to inform them that a watch has been added; in the simplest case, the filesystem just says that it does not support the feature. The more difficult part comes when the filesystem receives an event and wants to deliver it to the client filesystem in a way that user space can receive via fanotify or inotify. For inotify, the inode number and file name are available to send to the client, but that is not true for fanotify, where you may only have the inode number. Goyal agreed that this was the problem for virtiofs.

The important thing is to provide a generic mechanism for filesystems so that applications do not have to use multiple filesystem-specific interfaces to get this information, French said. He also wants to avoid polling, which is particularly expensive when done across the network. Josef Bacik said that it seemed reasonable to add the hook to let the filesystems know when a watch has been added; it is up to French and Goyal to work out the details on that.

Howells asked about subtree watches; on Windows you can get notified for changes within a subtree. He wondered if fanotify could add support for that. Goldstein said that it is something that everyone wants, but it is not trivial to do; several attempts have been made over the years, but nothing has been added.

French said that the feature he is looking for is an asynchronous, non-perfect mechanism. Some filesystems, such as SMB and NFS, have strict approaches using delegations or leases to ensure that all events are seen, but that is not usually worth the cost. Those could be used to implement these change notifications, but it should be left up to the filesystem to decide that, he said.

As time wound down, French also wanted to mention that he had not seen any tests for inotify and fanotify in xfstests (which are being renamed to "fstests"). It will be important to have tests to ensure that nothing breaks when the remote notifications are added. But Goldstein said that the tests for notifications are part of the Linux Test Project (LTP) tests. There is a test there for every new feature and regression tests for bugs that have been fixed. Ted Ts'o said that xfstests have historically been used by the developers of different filesystems, while features that were implemented in the virtual filesystem (VFS) layer were tested in LTP. That may need to change as the network filesystems add features to support notifications.

Comments (7 posted)

Making O_TMPFILE atomic (and statx() additions)

By Jake Edge
May 25, 2022

LSFMM

Right on the heels of his previous filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Steve French led a session on temporary files and their interaction with network filesystems. The problem is that creating temporary files is not always atomic, so he was proposing changing that, which would eliminate a possible race condition and be more efficient for network filesystems. Since the temporary-file discussion did not fill the 30-minute slot, however, French took the opportunity to discuss some attributes he would like to see get added for the statx() system call.

Calling open() with the O_TMPFILE flag creates an unnamed file that, by default, is deleted when it is closed. It is not a feature that was in Linux from the outset; it was added for the 3.11 kernel in 2013. Not all filesystems implement the functionality, but the most widely used ones do. There are two types of filesystems, he said, some that have a two-step process for creating a file and others that do it in one step. In the two-step case, the file is created and then, separately, opened, while the others do both of those things in a single step.

When those operations are performed for a network filesystem like SMB, there is a problem. If there are two operations to create the temporary file, the network filesystem has to do something special or the file created will be removed before the open can occur. For some filesystems, the create operation returns an open file, which is normally closed when the create operation completes. But if the file created is a temporary file, the close will, of course, delete the file. In that case, that close operation that would normally be done at the end of the create step has to be deferred so that the open operation can succeed.

There is a small possibility of a race between the create and open operations, but it is also inefficient to make two calls across the network when one should suffice, he said. Combining the two operations, similar to what atomic_open() does, would be a better approach. He suggested adding a directory inode operation called atomic_tmpfile() that filesystems could implement if they want to support the feature.

David Howells wondered if it made sense to simply use atomic_open() and add code to it for the temporary-file case. French said he looked at that and it is possible to do it that way, but that raises an issue that he would like to discuss at next year's LSFMM. He said that the open and create paths in the virtual filesystem (VFS) code are "kind of ugly" and confusing. Beyond that, there are places where unnecessary stat operations are being performed, which causes a costly network round-trip for network filesystems. So he sees some cleanup that he thinks needs to be done in those code paths.

Christian Brauner said that it would be better, if possible, to make the change for atomic temporary files at the VFS level so that all filesystems could benefit without needing to add code. French thought that sounded like a good idea, but Howells was concerned that some filesystems might not be able to support the atomic temporary-file creation, so VFS might not be the right place. Forcing filesystems to open the temporary file at the same time they create it might be problematic for, say, overlayfs, he said. It is worth experimenting with the idea, French said.

statx()

Since there was time left in the slot, French shifted gears to talk about another idea he would like to see implemented. There are already a number of flags that are returned by statx(), he said, but he can see a need for a few more. He put up a slide listing nine attribute flags that currently can be returned for a file, but there are four additional attributes that "jump out at me" for addition, he said.

For example, it is relatively common these days for people to have "local" files that are actually stored in the cloud somewhere, so an "offline" attribute would be useful. On the flip side, a "pinned" attribute could be used to indicate a file that is backed by cloud storage but is hosted locally, so it should not be removed because of the time required to get it back. These are not attributes that network filesystems, such as SMB, would need to handle; they would simply report them. These "seem like no-brainers", he said.

The other two are "integrity" and its opposite, to indicate some kind of scratch file where file integrity is not important, which he called "no scrub" on his slides. These would ask the filesystem to either do the best it can in terms of integrity protection or to do nothing in that regard. Chuck Lever questioned whether a single bit is enough to encompass all of the complexity of Linux integrity protection, which has various configuration options and policies. But statx() already has "encrypted" and "compressed" attributes, so French sees "integrity" in the same light; it would be requesting the strongest integrity protection the filesystem can provide.

Howells wondered which of these attribute bits would actually be used by applications. Putting them into statx() implies that applications will use them frequently. He can see that "offline" might make sense, since it would provide a useful hint to desktop environments, but the others seem questionable. The filesystem may need to know about them, but it is less clear that applications need them.

Ted Ts'o said that he was hearing an assumption that there is a way to set these attributes, but that is not the case. statx() only reports them and there is no Linux system call that would allow an administrator to set them. The attribute flags originated in an ext2-specific ioctl() command, he said, that eventually got adopted by other filesystems and moved into the VFS. But the original 32-bit flag field was the actual on-disk representation for the ext filesystems so there were ext-specific flags that other filesystems were not interested in.

statx() came about to report a filesystem-independent set of attributes to user space. But there is no way for someone to change the value of those bits in a filesystem-independent way. There are various mechanisms to set them, using ioctl() commands, but no system call to set, for example, the statx() "integrity" attribute for any filesystem.

There was some discussion of what a "setinfo" facility might look like. Kent Overstreet suggested that the extended attribute (xattr) interface could be used; a special namespace would actually refer to these file attributes and statx() would be the fast path to access them. French thought that sounded reasonable, and did not think it was urgent to add the ability to set the values in a generic way.

Comments (none posted)

CXL 2: Pooling, sharing, and I/O-memory resources

By Jonathan Corbet
May 19, 2022

LSFMM
During the final day of the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), attention in the memory-management track turned once again to the challenges posed by the upcoming Compute Express Link (CXL) technology. Two sessions looked at different problems posed by CXL memory, which can come and go during the operation of the system. CXL offers a lot of flexibility, but changes will be needed for the kernel to be able to take advantage of it.

Pooled and shared memory

Hongjian Fan, who led one of Tuesday's CXL sessions, returned on Wednesday (via videoconference) for a discussion dedicated to pooled and shared memory. These are concepts that apply to memory appliances, where the goals are to share memory across multiple systems, improve memory utilization, and, naturally, reduce costs. Sharing memory from a central appliance can reduce the need to put large amounts of memory into every server; when a given machine needs more, it can get a temporary allocation from the appliance.

Pooled memory is partitioned on the appliance and allocated in chunks to servers, which only have access to the memory that has been given to them. Requesting memory from a pooled appliance creates a hotplug event, where new memory suddenly becomes addressable. Supporting pooled memory requires the ability to generate and manage the hotplug events, as well as a virtual-device driver that monitors memory use and requests or releases memory as appropriate.

Shared memory is, instead, shared across all servers, though it will probably not be possible for any given server to allocate it all. With a shared appliance, the memory is always in each server's physical address space, but it may not all be usable. The kernel can provide a sysfs file that indicates which memory is available at any given time; tracking of allocations can be done by the appliance or via communication between servers, though the latter mode can create a lot of traffic.

Dave Hansen said that CXL memory behaves a lot like RAM today, but it requires some extra care. There may be cache-coherency issues not present with RAM, and the kernel can't keep any of its own data structures in this memory since those structures cannot be moved and would thus block removal. Fan said that cache coherency is part of the CXL protocol and shouldn't be a problem. Hansen added that there is little that is new with CXL memory appliances; they are much like how memory is managed with virtualization. But now it is being done in hardware, which scares him a bit. Memory-removal success is "a matter of luck" now, he said, and calling this memory "CXL" won't change that.

An attendee asked what the benefit of the shared mode was, given that all memory will still be used exclusively by one system at any given time. Fan answered that the problem with pooled access is fast and reliable hotplugging, while the problem with shared access is communication between the systems. Hansen asked how access to shared memory is cut off when memory is reallocated, but Fan was unable to answer the question.

Dan Williams said that access control is not really visible to the kernel, and that it was necessary to "trust the switch". He added that users want to be able to manage this memory with the existing NUMA APIs, but they also want hard guarantees that it will be possible to remove memory from a system; those two goals are in conflict. It will be necessary to reset expectations about removal, he said; it will be a learning experience for the industry. Hansen said that the use of hotplug will be no different in this scenario, but Williams said there will now be a whole new level of software behind hotplug to manage the physical address space. That is something that the firmware has always done, but now the kernel will have to deal with it; the CXL specification group is still trying to figure out the details of how that will work.

Fan said some other changes will be necessary as well. There will need to be a mechanism to warn about available capacity on the appliance. Since memory can be requested and added to the system on the fly, the out-of-memory handler should perhaps wait for more memory to materialize before it starts killing processes. David Hildenbrand said that the out-of-memory scenario scares him; people think that it's possible to just wait for memory to appear, but it's not true. If the system is going into the out-of-memory state, there will be other allocations failing at the same time. What is needed is a way to determine that the system is short of memory, then wait for more memory in a safe way, before running out. Hansen added that plugging in more memory is an act that, in itself, requires allocating memory, and an out-of-memory situation is not a good time to try to do that. Williams said, as the session came to a close, that the system cannot be reactionary, and that memory requirements should be handled in user space at the job-scheduling level.

Managing the resource tree

Management of the physical address space was the topic of the second CXL session of the day. The resource structure is one of the oldest data structures in the kernel; it was added in the 2.3.11 release in 1999. Its job is to track the resources available to the system and, in the form of the iomem_resource variable, the layout of the computer's physical address space. It forms a tree structure with some resources (a PCI bus, for example) containing other resources (attached devices) within their address ranges. This tree is represented in /proc/iomem, which must be opened as root to show the actual addresses involved.

[Ben Widawsky] The kernel's I/O-memory resource tree was not designed with CXL in mind; for Linus Torvalds to have been so short-sighted in 1999 is perhaps forgivable. But, said Ben Widawsky in his session, that shortcoming is threatening to create problems now. In current systems, iomem_resource is initially created from the memory map provided by the boot firmware; architecture-specific code and drivers then modify it and subdivide the resources there as needed. Once a given range of physical address space has been assigned to a specific use, it can never be reassigned — only subdivided.

The core of the problem is that CXL memory can come and go, and it may not be present at boot time. When this memory is added, it essentially overrides a piece of the physical address space, which is something that iomem_resource is not prepared to handle. If the space used by CXL were disjoint from local system resources, Widawsky said, there wouldn't be a problem; traditional resources could be put into one range, and CXL in another. But that is not how things are going to work. RAM added via CXL will overlap the space already described by iomem_resource. What, he asked, can be done to properly represent these resources?

Mike Rapoport questioned the need to put CXL memory into iomem_resource at all. The problem, Hansen explained, is that CXL memory might be the only memory in the system. People tend to see CXL as a sort of add-on card, but it is closer to the core than that. On a system using only CXL, it would not be possible to boot without having that memory represented in iomem_resource. David Hildenbrand said that iomem_resource should describe everything in the system.

Widawsky said that there is a need to keep device-private memory from taking address space intended for CXL; this is another reason to represent CXL memory in the resource tree. He suggested that attempts to take pieces of memory assigned to CXL should be blocked. Hildenbrand suggested creating the CXL region as a device and adding some special calls to allocate space from that region. This could be tricky, Widawsky said. System RAM may already be set up in the resource tree; making it part of a special device would involve reparenting that RAM, which, he said, has never been done. Matthew Wilcox contradicted the "never been done" claim, but without details on when it had been done.

John Hubbard said that the kernel should keep iomem_resource as "the one truth" about the layout of the physical address space. Williams said that struct resource is old; there are people around who love to add new structures to the kernel, so perhaps the time has come to do that for this problem. Wilcox referenced a "20-year-old patch" in Andrew Morton's tree, but didn't identify it. Hildenbrand said that the structure as a whole is difficult to traverse and work with; any work to improve it would be appreciated.

Widawsky asked if there was a path to a solution that involved a bit less hard work. Williams suggested adding resources in smaller chunks, with a number of entries for the CXL CFMWS ("fixed memory window structures") areas. Some of those entries could later be removed, Widawsky added, if it turned out they weren't being used for CXL memory.

The session came to an end with Wilcox asking what would happen in response to a discovery that an assigned resource's range is too small. Could it be expanded somehow? Williams said it would be good to be able to update the address map as more information became available. All told, the session described a problem but did not get close to finding a solution. This is a problem that has been seen in numerous other contexts as computers have become more dynamic. Solutions have been found in the past and will surely be found this time too, but it may be challenging to find one that doesn't involve a fair amount of hard work.

Comments (15 posted)

Cleaning up dying control groups, 2022 edition

By Jonathan Corbet
May 19, 2022

LSFMM
Control groups are a useful system-management feature, but they can also consume a lot of resources, especially if they hang around on the system after they have been deleted. Roman Gushchin described the problems that can result at the 2019 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM); he returned during the 2022 LSFMM to revisit the issue, especially as it relates to the memory controller. Progress has been made, but the problem is not yet solved.

Modern systems, he began, can create and destroy vast numbers of control groups, especially if they are running systemd. The cost of creating a control group is low, but the destruction costs can be "brutal". Sometimes, the task of getting rid of an old control group never completes, leaving the system paying the cost of having a large number of dying control groups sitting around. [Roman Gushchin]

There are a number of difficulties involved in cleaning up a control group. If the memory controller is in use, the group cannot be deleted until the pages charged to it are reclaimed, and that is a costly process. The mem_cgroup structure used to represent a memory control group is large; it can occupy hundreds of kilobytes of space. On a large system, the amount of memory consumed by these structures can reach into the gigabyte range. These are old problems, he said, but they are still with us.

The problem is exacerbated by the inability to quickly find the memory that is charged to any given control group; there are statistics but otherwise the kernel has little visibility in this area, Gushchin said. Even worse, though, is when memory is shared between control groups. Then the system probably has living groups using resources that were created by (and are charged to) dying groups; the accounting will not be correct in this case. In general, the kernel has never handled memory shared between groups well; the first group to create any given page is charged for it. In a typical system, much of the working set will "belong" to older control groups; that messes up the statistics and prevents usage limits from working properly.

Some work has been done, he said, including a lot of plain fixes and optimizations. Slab reparenting, which he had described in 2019, has helped a lot by eliminating the problem of old groups being pinned by remaining slab-allocated objects. Slab accounting has been reworked in general, providing byte-resolution charging and reparenting; this work is being extended beyond the slab layer. Writeback of memory belonging to control groups has been cleaned up; it had been holding references that could keep an old group around. Statistics from the memory controller have been improved in general.

The biggest remaining question, he said, is what to do with the page cache. Memory in the page cache gets left behind when a control group exits. There is a reparenting patch set from Muchun Song in circulation, but Gushchin is not sure that the approach is correct. He wondered if reparenting page-cache pages makes sense, or whether page-cache pages need to hold a reference to the control group to which they are charged at all. There is also a patch from Waiman Long to force the early release of per-CPU memory, but Gushchin described it as a "band-aid" that adds more complexity. He mentioned, instead, the possibility of marking leftover page-cache pages with a special flag that would cause them to be charged to the next user that came along.

At another level, there is work being done in systemd to end the practice of creating and deleting control groups; that work may land soon, Gushchin said. Relying on that change is questionable, though, since it's delegating the problem to user space.

The session wound down without a lot of discussion. Johannes Weiner did remark, though, that the problem needs to be solved even if systemd changes to avoid triggering it. The problem will continue to pop up until it is fixed in the right place.

Comments (3 posted)

get_user_pages() and COW, 2022 edition

By Jonathan Corbet
May 20, 2022

LSFMM
The numerous correctness problems with the kernel's get_user_pages() functionality have been a fixture at the Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) for some years. The 2022 event did not break that tradition. The first-day discussion on page pinning was covered here. On the final day, in the memory-management track, David Hildenbrand led a session on the current status of get_user_pages() and its interaction with copy-on-write (COW) memory.

COW pages, he began, are used to share anonymous memory between processes. The memory is marked read-only; should a process write to a page, the write fault will be trapped by the kernel, which will make a private copy for the writing process if more than one reference to the page exists. COW is all relatively easy to implement and understand, at least until get_user_pages() enters the picture. That function (along with its variants) will take a reference to the indicated pages, which will then be used to access the pages themselves. There are two modes used with get_user_pages(), depending on whether the contents of the pages are to be accessed, or only the page structure describing them; not every use requests the correct mode, though.

References taken by get_user_pages() are tracked in the page_count field of struct page — not in the mapcount field used to track mappings of the page (and to decide whether to copy a page when a COW fault happens). In general, he said, the kernel knows little about these references; they are not tracked separately from any other references to pages.

In 2020, a security problem involving the vmsplice() system call was reported and became known as CVE-2020-29374. It relied on a COW page that was ostensibly only mapped once (so mapcount was one), but a second reference had been created with get_user_pages(). The full story of this vulnerability can be found in this article. In short, the vulnerability was fixed with a commit that caused other problems and was quickly reverted; this happened several times. There is now a fix of sorts in place, though the hugetlbfs filesystem is still affected. But, Hildenbrand said, nobody cares much about hugetlbfs, which is not used to share data with unprivileged child processes.

The fix that went upstream looks at page_count and will force a copy of a COW page if the value is not one. The mapcount field is no longer used for this decision. As a result, the security problem can no longer happen, but the kernel might copy pages more often than it should. There is another side effect, though: when get_user_pages() is called on a COW page, page_count will be incremented; as a result, any write to the page will force a copy to be made. The caller of get_user_pages() will be left with the older copy, though, and will not observe any changes made by the writing process. That can lead to the corruption or loss of data.

Thus, Hildenbrand said, there are two potential problems with the current solution: the cost of unnecessary copies of COW pages, and the potential for data corruption when a get_user_pages() caller ends up with the wrong copy of a COW page. There is a solution being upstreamed now that relies on the new PG_anon_exclusive (abbreviated "PAE") page flag and page_count to avoid the wrong-copy problem. This flag, if present, indicates that the page is both anonymous and exclusive to a process; Hildenbrand described those pages as "PAE pages". If a page is not PAE, that page might be shared. The rules are that any page that is writable must be PAE, and those pages should never be copied in response to COW faults; additionally, pages can only be pinned (for access to their contents) if they are PAE. If there is a possibility that a given PAE page might be pinned, it will not be shared in settings where it otherwise would be — when a process forks, for example.

There are various cases that need to be considered here. If the kernel seeks to pin a writable, anonymous page, all is well, but if the page is marked read-only, the kernel must trigger a write fault first. In the case of a read-only, anonymous, non-PAE page, that page must be unshared prior to pinning. "Unsharing" in this case can be thought of as "copy on read"; if the page has a single reference it can be reused, otherwise it will need to be copied.

There are some other tricky cases, Hildenbrand continued. Transparent huge pages are "nasty", since they can be mapped as base (non-huge) pages as well. Temporary unmapping, as happens when a page is being swapped out or migrated, can create confusion. Concurrent get_user_pages() calls (gup_fast() in particular) must be handled carefully, since they don't take the page-table lock, which is used to synchronize access to the PG_anon_exclusive flag. Care must be taken when migrating pages to ensure that the PAE status is not lost.


The end result, Hildenbrand said at the end of the session, is not optimal. It works well in the absence of fork() calls or the use of kernel same-page merging (KSM). But attempts to avoid extra copies can fail at times even if there is only one mapping, and get_user_pages() is not always reliable when called concurrently with a process fork. But it is all a step in the right direction; be sure to tune into the 2023 LSFMM for the inevitable update.

Comments (none posted)

Fixing a race in hugetlbfs

By Jonathan Corbet
May 20, 2022

LSFMM
As the memory-management track at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) neared its conclusion, Mike Kravetz ran a session remotely to talk about page sharing with hugetlbfs, which is a special filesystem that provides access to huge pages. (See this article series for lots of information about hugetlbfs.) Hugetlbfs can help to reduce page-table overhead when pages are shared between large numbers of processes, but there is a problem that he is trying to find a solution for.

One advantage to hugetlbfs, he said, is that processes can share ranges of memory at the PMD page-table level, though the size of the range must be at least 1GB. Sharing huge pages allows the kernel to dispense with the lowest-level page-table pages entirely, saving the memory that would have been used by those pages. This can make a big difference when there is a lot of sharing going on; with a 1TB shared mapping and 10,000 processes sharing it, he said, the result is 39GB of saved memory that would otherwise be used for page-table pages. Hugetlbfs, when used this way, is solving the same problem targeted by the mshare() proposal, but the mechanism is different; by using huge pages, hugetlbfs eliminates the need for the lowest-level page-table pages entirely, and the PMD-level pages that remain are shared directly.

That is nice, but there is a problem lurking therein. When a process's mapping to a hugetlbfs page is deleted, a call to huge_pmd_unshare() results. This can also happen when changing a mapped page's attributes with mprotect(). If a fault happens on a page while this unsharing is happening, though, the result is an "ugly race" that can create invalid page-table entries. The problem is easy to provoke from user space, he said.

This problem was fixed by commit c0d0381ade79 in 2020, which uses the i_mmap_rwsem semaphore to synchronize the unshare operation. It must also be held during page-fault processing, of course, to prevent the race from happening. This fix created a new problem, though, because i_mmap_rwsem is held for the duration of a number of potentially long-running operations, including truncation and hole punching. That can cause long delays, with latencies greater than two seconds in his testing.

To address this problem, he said, the previous fix should be reverted. Instead, a per-VMA reader/writer semaphore should be used to synchronize these operations. That limits the contention and makes the worst case a lot better.

He asked the assembled developers what they thought of this fix, and was greeted with resounding silence. After some time, Matthew Wilcox observed that Kravetz had "broken people's brains" with the presentation. Kravetz replied that he would post another RFC patch soon and the conversation could continue from there.

Comments (none posted)

Preserving guest memory across kexec

By Jonathan Corbet
May 20, 2022

LSFMM
The final session in the memory-management track at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) was run remotely by James Gowans and David Woodhouse. It was titled "user-space control of memory mappings", with a subtitle of "letting guest memory and state survive kexec". Some options were discussed, but the real work is clearly yet to be done.

The use case in question, Gowans began, is a live update of a hypervisor done with the kernel's kexec functionality. To carry this out, the state of all running virtual machines is serialized to persistent storage, then kexec is used to boot into the updated hypervisor. After that, the virtual machines can all be restarted. The desire is to preserve the state of guest memory over the reboot, which means this memory cannot be managed by the host kernel in the traditional way; instead, the kernel should stay away from that memory and let user space manage its allocation to virtual machines. They have been looking at "sidecar virtual machines" as a way to implement this functionality.

Most guest memory, Gowans said, should not be touched by the new kernel, meaning that the kernel will only manage a small part of the memory given to guest systems. The userfaultfd() system call is used to manage the rest; this will require changes, since userfaultfd() currently only works with anonymous memory. Future requirements will include keeping I/O memory-management unit (IOMMU) mappings in sync, keeping DMA operations running while the update happens, and improving the speed of kexec by passing more state to the new kernel.

John Hubbard asked if memory managed in this way needs to have associated page structures; the answer was that they are not needed.

A few implementation options were presented. The first was a full filesystem, implemented in the kernel, that is used to manage allocations of reserved ranges of memory. The kernel would reconstruct this filesystem after a kexec. The PKRAM mechanism, which preserves RAM contents over a kexec, would probably be used for this purpose; the PKRAM patches were posted last year, but have not been merged. How to handle other types of memory, such as PCI memory-mapped I/O (MMIO) registers, is an open question as well.

The next implementation option was a FUSE-based filesystem; mapping of guest memory to page-frame numbers could then be handled from user space. A special control process could handle many of the details, and this solution would support mapping to PCI MMIO spaces.

Finally, this feature could be implemented using a raw memory device, something along the lines of /dev/mem. The control process could use ioctl() calls to create and revoke mappings to pages in the guest process. User space would be charged with keeping mappings in place over the kexec call. There is evidently an implementation of this option running now.

Jan Kara observed that there are a number of other things that need to be restored after a kexec, including open files and more. This task resembles Checkpoint/Restore In Userspace (CRIU), which already exists. The response was that this solution does not try to recreate everything automatically; instead, hypervisor processes will be responsible for opening files again after the kexec. Woodhouse compared it to live migration to the same host. Gowans said that guests won't notice this happening; they will be paused and serialized, and their previous state pushed back into KVM by the new hypervisor.

Returning to the implementation options, Gowans said that the full-filesystem approach offers the best latency and introspection, but it's not clear how MMIO regions can be handled. The FUSE approach gives full control to user space and solves the MMIO problem. The raw-memory version is the most flexible, but it requires reconstructing everything after the kexec, and is the least transparent to introspection.

Next steps include figuring out how to handle IOMMU mappings, then picking an approach to pursue. The preferred approach looks like the FUSE version, so the plan is to put together an RFC patch implementing it and to have a polished version by the KVM Forum in September.

Dan Williams said that the FUSE and raw-memory options look like the least scary ones. That said, PKRAM does look scary; he asked about the status of those patches. David Hildenbrand answered that the last posting of that work "didn't inspire joy".

The attendees were tired and the session wound down fairly quickly. The final question had to do with the existence of other use cases for this functionality. Hildenbrand suggested that databases could be a candidate. Specifically, huge, in-memory databases can take hours to boot and load up all of the data; a mechanism like this could possibly accelerate the process.

Comments (15 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds