
LWN.net Weekly Edition for June 1, 2023

Welcome to the LWN.net Weekly Edition for June 1, 2023

This edition contains the following content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Julia 1.9 brings more speed and convenience

May 29, 2023

This article was contributed by Lee Phillips

Version 1.9 of Julia, an open-source programming language popular in scientific computing, was released in early May. There are several interesting new features this time around, including more work addressing startup-time complaints and a number of improvements to the package system. Beyond that, there are a few features from the Julia 1.8 release to catch up on.

Julia is a general-purpose programming language that is just-in-time (JIT) compiled using LLVM. Since its public release in 2012, it has rapidly been adopted for scientific research due to execution speed comparable to Fortran's, combined with the convenience of REPL-based development. Julia has an expressive syntax as well as a high degree of composability of library code.

Julia 1.8

Our last detailed description of a Julia release looked at version 1.7, so we have a little catching up to do. Version 1.8 brought some changes that merit at least a quick summary. The one new feature in that release that had the greatest potential effect on writing Julia programs was the appearance of typed globals.

It's possible to write an unnecessarily slow program in any language, and Julia is no exception. The key to creating efficient programs in Julia is to write type-stable code. One of the conditions for a type-stable program is that the compiler can infer the return type of any function just from the types of its arguments in a call. In practice this is no great burden on the programmer, who merely needs to keep a few generally intuitive guidelines in mind. One of the first guidelines that the Julia programmer absorbs is to never use non-constant global variables (i.e. those not declared using const).

It's easy to see why a non-constant global can create type instability: since the type of the variable could potentially change from an assignment elsewhere in the program, any function that uses it may have a return type that cannot be inferred. Version 1.8 allows declaring the type of a global variable, so that its value can be changed, but not its type. Programmers who want to use global variables can now do so without a performance penalty.
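
For example (a small sketch, with the names invented for illustration):

    counter::Int = 0          # a global whose type is fixed but whose value can change

    function increment!()
        global counter += 1   # type-stable: the compiler knows counter is always an Int
    end

    counter = "seventeen"     # error: a String cannot be converted to an Int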

Another 1.8 change involves Julia's structs, which are similar to structs in C; they are the basic mechanism for creating user-defined types. Before version 1.8, they came in two distinct species: mutable and immutable structs. Once an instance of the latter is created, the values of its fields cannot be changed. A plain struct declaration creates an immutable struct, while a mutable struct declaration is for the mutable variety. Julia version 1.8 introduced the ability to mix the characteristics of both kinds of struct, by declaring some of its fields to be const:

    mutable struct MS
        a::Int
        const b::Int
    end

    thing = MS(17, 43)

This creates a mutable struct called MS and a variable thing of that type. The value of the first field can be changed, but not the second: thing.b = 18 is an error.
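
Continuing that example (the exact error text varies by Julia version):

    thing.a = 99    # allowed: a is an ordinary mutable field
    thing.b = 18    # error: b was declared const and cannot be reassigned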

That covers the changes that are most relevant to daily use of the language. Most of the other significant new features in version 1.8 are related to performance, profiling, and the package system.

"Time to first x"

Precompilation delays when loading packages, which increase the latency until a loaded function can be used, are a common source of complaints from new Julia users. This is called the "time to first x" problem, often in the form of "time to first plot", referring to the delay between importing a package and seeing the result returned by one of its functions. The nature of Julia's type system and its "just ahead of time" compilation make some of this latency inevitable; it's a small price to pay for the ability to program with user-defined types, dynamic dispatch, macros, and the rest of Julia's toolbox, while still ending up with fast code operating on primitive types.

Nevertheless, any improvement in the interactive experience is welcome. Our article measuring the decreased latency in Julia version 1.6 noted the distinct improvement over previous releases. Work on various strategies for making Julia more responsive has continued, leading to a big improvement in version 1.9. See this section of the release highlights for tables and graphs of loading times for a few large, popular packages.

A big portion of this improvement is due to the arrival of native code caching. Now, after initial precompilation, the resulting machine code is cached. The consequences are longer initial precompilation times and more space needed for the cached files, but greatly reduced time to first x in subsequent sessions. Package authors can also ship their code with cached, precompiled routines. Julia users can now start a REPL, load Plots, and see their first graph in a couple of seconds. Users can also create personal "startup packages" with a set of dependent packages and compiled, cached routines relevant to their typical workflows.
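
Here is a hedged sketch of how a package author might ship a precompiled workload, assuming the PrecompileTools.jl package (the module and its function are invented): the code inside @compile_workload is compiled during precompilation, so its native code ends up in the cache.

    module MyPackage

    using PrecompileTools

    # Stand-in for an expensive library routine
    transform(x) = sum(abs2, x)

    @setup_workload begin
        data = rand(100)            # set-up work; not itself cached
        @compile_workload begin
            transform(data)         # compiled now and cached for later sessions
        end
    end

    end # module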

More options when adding packages

The package system now provides more flexibility when adding and upgrading packages. In the REPL's package mode, the add Package command, without any options, will install the latest compatible version of the package called Package into the active environment. This may require dependencies to be upgraded to maintain a consistent environment, which in turn will require sometimes-lengthy precompilation.

Julia 1.9 provides an option to only add package versions to the active environment that have already been installed on the machine. The package system is aware of everything installed across all of the environments, so that it can share resources. For example, if there are several projects that all use the same version of the Plots package, it hasn't been downloaded and compiled for each project, but only once; all the projects use the same files. This means many projects can be created without worrying about using up additional disk space.

To use this option enter:

    pkg> add --preserve=tiered_installed Package

The system will try to use installed versions first, and only download new ones if required to satisfy dependencies. An option that forbids the system from installing new versions even when they would be required, making it give up instead, is also available:

    pkg> add --preserve=installed Package

I find this option useful for saving time when I need a function from a package that I know I've already installed and don't care about having the latest version.

Tools for package authors

Packages developed for other people to use should avoid having unnecessary dependent packages. Dragging along unneeded dependencies obligates users to install them along with the package and increases the chances of conflicts. At times, when developing a package, a programmer may notice that a dependency has been added to its manifest, but not understand why it would be needed. A new package system command, why, has been added in version 1.9:

    (RBCOceananigans) pkg> why AbstractFFTs
      Oceananigans → CUDA → AbstractFFTs
      Oceananigans → CUDAKernels → CUDA → AbstractFFTs
      Oceananigans → CubedSphere → FFTW → AbstractFFTs
      Oceananigans → FFTW → AbstractFFTs
      Oceananigans → PencilFFTs → AbstractFFTs
      Oceananigans → PencilFFTs → FFTW → AbstractFFTs

In this REPL excerpt I'm in package mode with the "RBCOceananigans" environment activated; this is a project I'm working on that uses the Oceananigans fluid dynamics package. I noticed from precompilation messages that my package uses AbstractFFTs, but I didn't know why that was needed. The why command tells me that AbstractFFTs is used by some other packages that Oceananigans needs.

I can't do anything about that, since Oceananigans is essential to the project. However, if the dependency were a heavy or troublesome module that was only needed because I had included an inessential package, the why command would help me ferret this out. Perhaps I could extract what I was using from the guilty package instead of including the whole thing, and thus snip that branch of the dependency tree.

I'm impressed by Julia's package system, but didn't fully appreciate an inherent flaw until I contemplated the advantages brought by the new package extensions feature. Glancing again at the output of the why command above, we see that Oceananigans depends, for example, on CUDA. This is a package for working on graphics processing units (GPUs), which can be useful for accelerating fluid simulations. It's great that Oceananigans has support for GPUs, but if the user doesn't plan on using one (or one from the right manufacturer), it's dead weight. CUDA is a big package that takes several minutes to download and precompile. This time is added to the already significant time for Oceananigans itself and its other dependencies.

The package-extension mechanism allows the developer to segregate the parts of the (in this case) Oceananigans code that actually require CUDA into a separate module. The extension module is just source code, shipped with the main Oceananigans module code (and any other extensions). CUDA is removed from the dependency list, so when Oceananigans is installed, it does not come along for the ride. Those who really want to put their calculations on a GPU would install CUDA, which would then trigger the loading of the extension module that uses it. Extension loading can also be triggered by the installation of a specified set of packages into the environment.
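
As a hedged sketch of what this looks like from the package author's side (the package MyFluids, its accelerate() function, and the placeholder UUID are invented for illustration): the trigger package is listed under [weakdeps] in Project.toml, the extension module is named under [extensions], and its source lives in the package's ext/ directory.

    # Project.toml of the hypothetical MyFluids package
    [weakdeps]
    CUDA = "00000000-0000-0000-0000-000000000000"   # CUDA.jl's real UUID goes here

    [extensions]
    MyFluidsCUDAExt = "CUDA"

    # ext/MyFluidsCUDAExt.jl -- loaded automatically once CUDA is installed
    module MyFluidsCUDAExt

    using MyFluids, CUDA

    # Add a GPU method to a function defined in the parent package
    MyFluids.accelerate(a::CUDA.CuArray) = 2 .* a

    end # module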

This gives users more control over what packages are installed and avoids loading module code that will never be used. These benefits require package authors to reorganize their libraries into main and extension modules, which will take some time, but this is already happening with some large and popular packages. To see what extensions are available for the installed packages, the package status command has a new flag:

    (RBCOceananigans) pkg> status -e
    Project RBCOceananigans v0.1.0
    Status [...]
    [91a5bcdd] Plots v1.38.12
                ├─ FileIOExt [FileIO]
                ├─ UnitfulExt [Unitful]
                ├─ GeometryBasicsExt [GeometryBasics]
                ├─ IJuliaExt [IJulia]
                └─ ImageInTerminalExt [ImageInTerminal]

The -e flag means that I only want to see information about extensions. Here, I've learned that, of the packages installed in my project, only Plots comes with extensions. The output shows the names of the available module extensions and, in square brackets, which packages they depend on. Plots will load faster for me if I do not use some of its optional features, because I can avoid installing and precompiling unneeded packages.

Numbered REPL prompt

A new option in the REPL changes its familiar prompt to numbered red and green input and output prompts, in affectionate imitation of IPython. The figure below shows the result of activating the numbered prompt.

[Julia REPL numbers]

The feature is turned on by calling REPL.numbered_prompt!() from the REPL package. As the figure above shows, previous returned results are available by indexing the Out vector using the displayed prompt numbers. The special REPL modes and their prompts are unchanged, so the package, help, and shell modes are unaffected by activating the numbered prompt. A convenient feature of the Julia REPL is that the user can paste text copied from the shell history or any other source, and any prompts or returned results that are part of the pasted content are automatically stripped away, leaving only the actual input commands. This feature still works with numbered prompts turned on: In and Out prompts are deleted from pasted code.
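
A brief, illustrative sketch of a session with the numbered prompt enabled:

    julia> using REPL

    julia> REPL.numbered_prompt!()

    In [1]: 2 + 3
    Out[1]: 5

    In [2]: Out[1]^2
    Out[2]: 25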

Interactive tasks

The new interactive tasks feature in Julia is my favorite from the new release, because it enables a convenient mode of working through the REPL or in other interactive contexts. Beginning with version 1.9, Julia can be started with the flag -t m[,n] to create two "threadpools": a normal one with m threads and, optionally, an interactive one with n threads. See my concurrency in Julia article for an introduction to the use of threads and tasks in the language.

After invoking Julia with this flag, tasks started with the Threads.@spawn macro will be confined to the "normal" threads. They're started on one of the m threads and may not migrate to any of the "interactive" threads. Tasks launched with Threads.@spawn :interactive are assigned to one of the interactive threads and given priority in scheduling. To prevent them from starving the other threads, these interactive tasks should be written so that they yield frequently. Since a common application for interactive tasks is user interaction, yielding occurs as a matter of course: waiting for input causes an automatic yield.
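
A minimal sketch of the two kinds of spawn (the compute function is invented), assuming Julia was started with julia -t 2,1, which provides two default threads and one interactive thread:

    using Base.Threads

    crunch() = sum(abs2, rand(10^8))   # CPU-heavy and rarely yields

    # Confined to the default threadpool
    work = @spawn crunch()

    # Scheduled on the interactive threadpool with priority; sleep() yields,
    # so this task will not starve the default pool
    monitor = @spawn :interactive begin
        while !istaskdone(work)
            print(".")
            sleep(0.5)
        end
    end

    fetch(work)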

To test this new capability, I launched the REPL on a two-core machine, first with julia -t 2. Then I spawned a half-dozen compute-heavy function tasks that yielded infrequently. As expected, the response in the REPL became sluggish, taking one to several seconds to return a prompt after hitting return. I repeated the experiment after invoking Julia with julia -t 1,1. After spawning the same half-dozen CPU-hungry tasks, the CPU meters showed 100% utilization, but on only one thread. This time, the REPL, which was provided with an interactive thread, continued to respond instantly, since the churning of the other tasks had no effect.

The ability to launch computations "in the background" and continue development work undisturbed is particularly convenient. Other purposes that suggest themselves include web applications, GUI applications, or other programs that respond in real time to the user.

Other improvements

Aside from the major improvements described above, the new release brings a handful of other useful new features. We can now give Julia a "heap size hint" on startup, defining a limit above which the garbage collector will try harder to reclaim memory. The hint is transmitted through a flag:

    $ julia --heap-size-hint=2G

This sets a memory limit of two gigabytes, for example.

The default sorting algorithm has been replaced by a faster one. In a simple test of sorting arrays of 10^8 native floats and ints, I found that version 1.9 was twice as fast as version 1.8, but also allocated twice as much memory to carry out the sort.
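
A rough sketch of that experiment (exact timings and allocations will vary by machine):

    x = rand(10^8);
    @time sort(x);    # prints the elapsed time and memory allocated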

For some time Julia has provided both a @fastmath macro and a --math-mode=fast startup flag to turn on the fastmath LLVM option. These perform floating-point arithmetic more quickly, but less predictably and accurately. Fortran and C compilers have similar options. The macro marks individual blocks for the fastmath treatment, while the flag applies it to the whole program. In the current version the startup flag has been disabled (it's accepted but doesn't do anything, so deployment scripts need not be changed), due to unacceptably inaccurate results from some functions. Those with the courage to use fastmath must decide where it should be applied.
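
For example, a sketch of block-level use, which relaxes IEEE semantics only inside the marked loop:

    function sumsq_fast(v)
        s = zero(eltype(v))
        @fastmath for x in v
            s += x * x
        end
        return s
    end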

Until now, standard-library packages could not be upgraded separately from the Julia version. The new release experiments with upgradable standard-library packages, which are distributed with Julia (so can be used without installation, as now), but otherwise are treated just as "normal" packages. As of now, only the DelimitedFiles package, a small library for reading and writing arrays as delimited text files, is getting this treatment, but, if the experiment succeeds, upgradable standard-library packages may become the norm.
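
The package itself continues to work as before; a small sketch:

    using DelimitedFiles

    writedlm("data.csv", rand(3, 3), ',')   # write a matrix as comma-delimited text
    A = readdlm("data.csv", ',')            # read it back as a Matrix{Float64}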

This one is rather esoteric (I haven't tried it), but seems to be important to some people: starting with version 1.9, Julia libraries can be called from multi-threaded C++ programs. This article shows how the feature can be used. Beyond that, support for the half-precision (Float16) floating-point hardware present on the Apple M-series computers has been added for those who want the enhanced performance and can tolerate the loss of precision.

Conclusion

Since version 1.0, all of the changes to the Julia language have been backward-compatible, so any programs that worked with the first public release will work with this latest version, aside from possible package incompatibilities. On a personal note, I am preparing a code repository for a book that I began writing when Julia was at version 1.6. I'm packaging and testing all of the programs in the book with the current Julia release, which means that Julia's package system updates any dependencies to their latest compatible versions. I'm over halfway through the book, and, so far, all of the programs have worked without modification, with the exception of one specific problem with a plotting backend that's already fixed. This was a pretty good test of the robustness of the package system and the backward compatibility of the language; the result is, for me, relieved surprise.

Rumor has it that version 1.10 is less than a year away, and that it will bring more improvement in package-loading times. Among other features coming in the next release are new and improved mathematical functions in the standard library (for example, fourthroot(), also called ∜); enhanced formatted printing; more useful display of matrices with Rational elements; easier-to-read stack traces; better function dispatch for signatures with Union types; the ability to choose how many threads will be used by the garbage collector; and optional display of per-package precompilation timings. If all of that holds true, it will be another steady, incremental improvement in the language and its ecosystem.

Comments (4 posted)

Zoned storage and filesystems

By Jake Edge
May 25, 2023

LSFMM+BPF

Issues around zoned storage for filesystems were the topic of a combined storage and filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit led by Bart Van Assche, Viacheslav A. Dubeyko, and Naohiro Aota. Zoned storage began with the advent of shingled magnetic recording (SMR) devices, but is now implemented by NVMe zoned namespaces (ZNS) as well. SMR devices can have multiple zones with different characteristics, with some zones that can only be written in sequential order, while other, conventional zones can be written in any order. The talk was focused on filesystems using the sequential type of zones, since the conventional zones are already well-supported in Linux and its filesystems.

[Bart Van Assche]

Van Assche began by giving an overview of zoned storage and its advantages; he quickly went through some bullet points from the talk slides. For NAND flash devices, having sequential zones means that they can have a smaller logical-to-physical (L2P) mapping table, which improves performance. In addition, these zones eliminate internal garbage collection and the consequent write amplification, which allows the host to have better control over the latency of writing to the device. Read performance can also be improved because filesystems can allocate a contiguous logical-block-address (LBA) range for files.

He then turned to the zoned-storage interface. Zones are contiguous LBA ranges that do not overlap with other zones; multiple zones can be written simultaneously. There are four states for a zone: empty, open, closed, or full. Zones that are either open or closed are considered active; devices may have limits on the number of active zones.

Powers of two

He stated that the NVMe standard specifies that zone sizes are always a power of two, but was corrected by several attendees. Linux imposes that restriction, not the standard. Multiple NAND flash vendors want to be able to have non-power-of-two (npo2) zone sizes. In particular, vendors of Universal Flash Storage (UFS) devices want more flexibility in the zone sizes.

Pankaj Raghav of Samsung has posted patches for supporting zone sizes that are not a power of two. Android also needs this support, Van Assche said. He wondered if the patches were ready to go upstream at this point. He was hoping that block maintainer Jens Axboe would be present for the discussion, but that was not the case.

Josef Bacik wondered if the Linux filesystems community really cared one way or the other. He asked Johannes Thumshirn if Btrfs cared, for example. Thumshirn said that he thought it would be messy to support npo2, but that the problems could perhaps be considered bugs and get fixed. Bacik asked how many of these devices actually exist today. Damien Le Moal said that effectively everything on the market today has zones that are sized as a power of two.

[Viacheslav A. Dubeyko]

Le Moal said that his view is that flash-based zones should look like the existing SMR sequential zones, all of which have sizes that are a power of two. As yet, there are few deployed flash-based zoned-storage systems, so avoiding confusing things between SMR and flash devices was desirable. The UFS vendors are trying to push npo2 to avoid having to add more functionality in their firmware, he said. "Do we want to take the burden of dealing with the non-power-of-two, instead of the drive vendors doing it?"

Van Assche said it is more than just UFS vendors that would like to do this. Le Moal would still prefer that the drive vendors handle this and he does not see why there would be performance problems in doing so, as has sometimes come up. Others disagreed, or at least thought that there was enough push for npo2 from customers of various sorts that something should be done. One attendee suggested a middle layer that would mediate between the filesystems and devices; extracting maximum performance is not really needed for these devices. "Let's just be done with it, please." From the frustration expressed, it is clear that the topic has come up a lot without getting resolved.

Bacik said that he truly did not care, and thought that was generally true for filesystem people, but he would also like to see this problem resolved in some fashion. He looked briefly at the patches, which did not seem too invasive to him; "I'm not the block-layer guy, so I could be wrong, and Jens isn't here to yell at me". He does not understand "why we are fighting about this, if it's not that big of a deal to support".

Someone pointed out that Christoph Hellwig was adamantly opposed to the npo2 support; "now I understand it", Bacik said with a laugh. Hannes Reinecke suggested that even the middle-layer approach that was suggested would get strong opposition from Hellwig (who was at the summit, but not at this discussion). Le Moal said that so far all of the reasons he has heard for supporting npo2 in the kernel were wrong and demonstrate a misunderstanding of zoned storage on the part of device makers. If that support goes into the kernel, it should only be done if there are sensible reasons to do so, he said.

There was a fair amount of disagreement in the room, with people talking over each other and several simultaneous side conversations taking place. It was not particularly heated, but was somewhat chaotic and hard to follow. Van Assche said that there were not good arguments either for or against the npo2 support in his mind, but Android, at least, is being pushed hard by the storage vendors. The ultimate decision is Axboe's, Bacik said; more discussion of it in the room is not really going to change anything, so he suggested moving on.

Zoned Btrfs

[Naohiro Aota]

At that point, Aota switched over to the status of zoned-storage support in Btrfs, which he has been working on for a number of years now. Btrfs supports both SMR and ZNS devices, with the latter added for the 5.16 kernel. SMR works well, but there are some problems with the ZNS support, he said.

Currently, Btrfs on ZNS can report ENOSPC even when there is still space on the device due to zones being activated at reservation time, rather than only while data is being written. That means there may be no zones available to be activated when data needs to be written. There can also be slow performance because metadata overcommit is disabled in Btrfs on zoned storage. He is reworking some of the code to address these problems, he said, which will allow the metadata overcommit to be re-enabled.

Zone sizing

Dubeyko then shifted gears to another topic: what is the best zone size based on the differing needs of filesystems and SSD devices? Smaller zones (hypothetically 128KB) are more complicated for the device because they require a huge mapping table and a complex mapping scheme. But, for a filesystem, a small zone can have smaller extents, with faster reclaim, lower garbage-collection overhead, and faster read I/O, he said. Larger zones (2GB, for example) have a lot of negatives for filesystems, but are much easier for the devices. He wondered if it might make sense to allow filesystems to choose among a few different zone sizes for a device.

Le Moal said that the zone size and overall capacity of the device have to work together. A 16TB drive with 128KB zones is "going to suffer"; the number of zones in the device makes a difference. He said that it is also not something that can be changed at the software level; it is up to the drive vendors to choose a zone size that makes the most sense for the most use cases of their hardware.

One attendee said that they think the next generation of ZNS drives will generally have zones of around 50 or 100MB, and wondered if that was a reasonable size for filesystems. He believes that the 1-2GB zones used in current devices will likely be around 100MB in devices for high-volume deployments. Ted Ts'o said that he was confused why the zone size was even being discussed in the room, "because, ultimately, I don't think it's up to us". The market will dictate its needs to the vendors, so if a high-volume handset maker, such as Samsung, were to say that it wants UFS devices with zones of a certain size, that's what will be built, he said. Others generally agreed with that as time ran out on the session.

Comments (7 posted)

Cloud-storage optimizations

By Jake Edge
May 26, 2023

LSFMM+BPF

"I/O hints" for storage devices, which are meant to improve performance by giving the devices extra information about the nature of the I/O, have a long history with Linux. But the code for write hints was "ripped out last year", according to a message from Ted Ts'o proposing a discussion about new optimizations for cloud-storage devices. That discussion took place in a combined storage and filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. In it, Ts'o proposed that the Linux community define its own set of hints rather than just following along with the hints in the standards—which have largely been ignored by the vendors in any case.

Background

He began by pointing out that "we have been talking about this set of storage extensions for freaking ever"; the earliest LSFMM discussion that he found was from the 2013 event, but he wondered if that was actually the earliest. There is mention of I/O hints in the report from day two of LSFMM 2012 that indicates the topic had already been around for a while, but perhaps not before that at LSFMM. "Here we are ten years later—possibly longer", he said, to knowing chuckles around the room; he wanted to reflect a bit on why that was.

[Ted Ts'o]

Working with the standards committees is "slow and expensive"; he has done it and would not necessarily recommend it for others. It requires a lot of travel and there are several bodies involved, which multiplies the problem, especially in times when budgets are tight. But, then, even if a spec gets approved, hardware vendors rarely actually implement these features in easily available devices; they are often only available in high-end, extremely expensive drives.

As a result of that, I/O hints were added to the kernel but were removed around 18 months ago because no one was using them, he said. "They're back" was heard from the audience to more chuckles. But Ts'o thinks things can be a bit different this time around because of the prevalence of cloud-based emulated block devices, which are essentially software-defined storage. Those devices can be updated with new features much more easily and quickly than waiting for hardware vendors to decide to implement something. In addition, in the past "hardware vendors would care about $OTHER_OS" and did not care what Linux people thought; but these days the dominant OS running on cloud virtual machines is Linux.

Ts'o said that there is a weekly call among ext4 and other filesystem developers that coincidentally has attendees from Oracle, Amazon, and Google, who are, of course, cloud vendors. Many of the call attendees are thinking about doing similar things with their filesystems, which involve "making assumptions about how the emulated block device in the cloud works". It occurred to him that they could do more than that; "the somewhat radical idea" that he wanted to propose is that the Linux community could add its own vendor extensions that could be used by these devices.

Instead of some storage vendor being responsible for the extension, it would come from the Linux community. A reference implementation could be created for QEMU and if one or more cloud vendors could be convinced to adopt it, "then it could be purposely built for us". Developers would not have to try to figure out how to map the SCSI I/O hints from a decade ago to Linux, he said.

Storage-track organizer Martin Petersen pointed out that in his hints work from ten years ago, he had mapped posix_fadvise() flags to SCSI and NVMe hints; he shopped that around to various storage vendors as what would make sense for Linux "and it went nowhere". He is strongly in favor of reviving the effort and calling it a "Linux cloud" extension; "it makes a ton of sense, it fixes a ton of performance problems, and it is like 150 lines of code".

Cloud optimizations

Given that attendees seemed to be in favor of the overall plan, Ts'o wanted to talk about specific optimizations that he and others are thinking about. The cloud vendors have observed that MySQL and PostgreSQL both use 16KB database pages and would like to be able to write those in all-or-nothing fashion. That guarantee could come from the kernel or the hardware, he said, but the requirement is for no "torn writes" (i.e. partial writes).

NVMe already has an atomic-write extension and one is being added to SCSI, but with slightly different semantics, Ts'o said. But, today, "as an accident of implementation", due to the flags that get passed in the BIO for a direct I/O write, the block layer will not tear an aligned 16KB write; it "will not split them apart in awkward places".

Buffered I/O is not treated that way, he said, which can lead to torn writes. But for direct I/O, he and others have "desk-checked the code" as well as run torture tests to try to cause torn writes. There are some who are thinking of deploying this as it stands, but others are looking for a guarantee from the operating system rather than just relying on an accident of the implementation.

An OS guarantee is a reasonable request, Ts'o said; in addition, getting some kind of atomic solution for buffered I/O would be great because PostgreSQL only does buffered I/O. This would allow database systems to eliminate their double-buffered writes. So far, it seems to work fine for the cloud-storage devices; "maybe there are some weird semantics between NVMe and SCSI, but we don't care".

It would be nice if the block layer could find out whether the device guarantees that it will not tear for aligned writes of, say, 16, 32, or 64KB, so that the block layer can also split on those boundaries. Storage-track organizer Javier González pointed out that there is an upcoming LSFMM session on support for large block sizes; there are already patches for some of that support available.

Luis Chamberlain, who would be leading the large-block discussion the next day, wondered about the limit of the size of the atomic writes that users want and how that relates to the block size that the device specifies. Keith Busch said that for NVMe SSDs today, the sizes for atomic guarantees range from 4KB up to 64KB. But Fred Knight pointed out that there is a large storage vendor that guarantees atomic writes of "hundreds of megabytes", but the block size is 4KB. Since a large vendor has done that, he suspects that others will too. Chamberlain concluded that there would be value in supporting block sizes beyond 64KB.

Ts'o said that providing information that a set of blocks is associated with a particular inode could be used by storage devices for, say, garbage collecting all of them together. He does not know how practical that actually is, but as a filesystem developer he has no problem adding the inode information if it will help. Petersen said that he and Christoph Hellwig had a proposal like that, using a hash of the inode number, around ten years ago that also did not go anywhere. But James Bottomley wondered if it even mattered; since there are mostly extent-based filesystems that write large extents, can't the storage devices just use the large write as a signal that the blocks go together? Ts'o said that was probably workload-dependent, but that this particular optimization was not really one of his priorities.

A more interesting optimization in his mind is giving the device hints about whether a read is actually synchronous from an application or whether it is coming from the block layer doing a readahead of some kind. But Petersen and Josef Bacik said there is already a flag being used for that; Petersen said that it is needed because a failed readahead is not treated the same as a failed application read.

Another optimization, which has probably seen the most work over the years, Ts'o said, is to provide a hint that a given write is for data, metadata, or a journal. That journal indication could be for a filesystem journal or a database journal. That could allow the storage devices to prioritize the writes that are truly important versus those from background activities like backups.

Working group

He thinks that a working group including cloud-vendor representatives could define something along those lines, which could be implemented in QEMU. Using that to demonstrate the benefits could lead the cloud vendors to start implementing the feature. Bart Van Assche asked that Android be included in any such working group; the project is working on a proposal to standardize write hints to distinguish between data and metadata writes. González said that the NVMe device in QEMU is only used for compliance testing, not for performance, so there has been talk of creating another NVMe device for QEMU with a fast path that could go directly to a VFIO passthrough device.

There was some fast-paced disagreement about whether the NVMe and SCSI standards bodies needed to see an open-source implementation before actually standardizing something. In the end, that may not matter, Ts'o said; if there is a "Linux cloud" vendor extension, things that fall under it do not need to work for the hardware vendors. He has observed that sometimes those vendors are more interested in throwing sand in the gears of the standardization process than they are in adding features—especially if they perceive it might give competitors an advantage. That statement was met with laughing denials from various parts of the room.

In fact, the Linux community can move much more quickly without having to go to standards meetings in far-flung places multiple times per year, Ts'o said. "We can just simply make something that works"; people who can go to the standards meetings can take that work and standardize it if they want. He thinks it might be easier to align the cloud-storage people, which can result in a quicker turnaround on these kinds of features.

González asked if Ts'o had some kind of governing or organizing body in mind for this work, but Ts'o said he had not gotten that far. He thought that something informal, which resulted in something that works in QEMU, would be sufficient, but if a more formal organization is needed, the Linux Foundation would be an obvious possibility. His suggestion would be to keep the process as lightweight as possible, though, and he liked Petersen's idea that the linux-fsdevel mailing list be the "organization".

Comments (10 posted)

Atomic block-write operations

By Jake Edge
May 30, 2023

LSFMM+BPF

Martin Petersen and John Garry led a session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit on work they have been doing to implement atomic block writes of various sizes for SCSI and NVMe. The idea is to support devices that can guarantee atomic operations for sizes larger than their block size. It is an attempt to "find common ground" between the two standards, Petersen said, because the two have slightly different semantics, depending on the device type, and different restrictions, which has made for an "interesting project". It has been a challenge to find an abstraction layer that can work with the "five different variants of SCSI and NVMe implementations that may or may not be out there".

Currently, they have a QEMU and a scsi_debug implementation of this work, Petersen continued. It is plumbed into pwritev2() and io_uring, so that it can be used from applications. A special fallocate() call can be made to tell the underlying filesystem that the application wants its file allocations to be aligned with whatever the hardware requires in order to provide atomic guarantees. An fallocate2() call was added to XFS for their testing. There is an interface for an application to query the hardware for the range of block sizes that it supports for atomic operations. The application can then do atomic block operations on the file using direct I/O.

[Martin Petersen]

Garry then described some more of the details. There is a new RWF_ATOMIC flag for pwritev2() that is part of the patches he posted in early May. He said that the patch developers come "from a database point of view", where the database has fixed block sizes, so the flag requests the kernel to write each database block atomically, not that the entire write (potentially consisting of multiple database blocks) is done atomically.

The patches are "not too intrusive"; without counting the documentation, there are around 1200 lines of changes. About half of those changed lines are in scsi_debug.c because the locking model in that driver needed to change. There are about 300 changed lines in the block layer; atomic writes fit in the existing block layer pretty well, he said. There were also changes in the XFS code, which may or may not stay; he is not sure.

Damien Le Moal asked why fallocate() was used rather than simply adding a direct I/O flag. Ted Ts'o said that the key is to ensure that the filesystem allocates the file data in a way that is aligned properly for the hardware. For ext4, that can already be done with a flag when the filesystem is created; it will then always allocate file data on the proper alignment boundaries. He fully supports the fallocate() approach and would add that to ext4 if it goes upstream; the advantage is that you do not have to create a specific filesystem in order to access the atomic capabilities of the device.

Ts'o wondered about the need for the pwritev2() flag, however. His understanding of the NVMe spec is that devices advertise that they will not do partial writes (i.e. torn writes) for power-of-two sizes up to, say, 16KB or 32KB. So he was hoping for a simple change to the block layer to note that fact and not split BIOs (i.e. struct bio instances) at any other boundary.

There are four new request-queue limits in the block layer, two of which are complementary, Garry said. There are unit minimum and maximum values, which are the smallest and largest sizes that are supported by the device for atomic writes; those are both powers of two and the expectation is that any block size used by applications will be as well.

There is also an atomic write boundary value that is specific to NVMe; any I/O that crosses it will be split by the device. Petersen gave an example of a 128KB boundary value; any write that crosses the boundary that exists every 128KB will become two I/Os. That means that the block-allocation path needs to be careful about the boundary in order to avoid torn writes, he said. SCSI has its own boundary, but it is different, Garry said. Fred Knight pointed out that as long as the I/Os start at the "right" place for the atomic-block alignment, they will not span the boundary.

[John Garry]

The fourth value "is max bytes, just to confuse things", Garry said. It may be different than the unit maximum; it specifies the total maximum size for an I/O consisting of atomic-write operations. Petersen said that the overall I/O might consist of, say, 16 chunks, each of which must be written atomically. It is the unit minimum and maximum that user-space applications need to be aware of, Garry said; those can be retrieved using statx(). The other two are only used internally to the block layer.

Ts'o said he had mostly ignored those parameters because the database developers do not seem to care about anything other than 16KB atomic writes. The database may send some large number of those 16KB chunks in a single write, but it only needs a guarantee that those chunks are not torn; the whole I/O can be torn on those boundaries without a problem. Hearkening back to his talk, he said that the cloud providers could simply have their block devices support the 16KB requirement, without the extras, though he understands why a more general solution might be needed for other use cases.

In response to a question from Jan Kara, Garry said that user-space does not explicitly choose an atomic-block size. The atomic-block size is inferred by the block layer from the alignment and size of the write. Kara is concerned about partition alignment and device-mapper interactions that will interfere with that and wondered if some kind of offset will be needed from the user-space side, but Petersen said that these partition-alignment problems have already been solved for other reasons; the intent is to keep things simple.

Javier González asked about the range of sizes being considered in the work; there are limits at various levels, so he wondered what use cases are being targeted. Petersen said that database systems generally have 8KB, 16KB, or 32KB blocks and typically do their writes in chunks of 512KB to 1MB, which is what they want to facilitate. Le Moal said it will probably be difficult for the devices to support much more than that.

Garry said that once the block layer has inferred the block size for the application, it uses that whenever a write is done. It fills BIOs to that size or a multiple of it; when BIOs are split, the inferred block size is used. Kara wondered about what happens if user space submits unaligned or incorrect-length writes; Garry said that the code does rely on "careful user-space programming". Knight said that one of the differences between NVMe and SCSI is that NVMe will simply perform that kind of write non-atomically, while SCSI returns an error.

Ts'o said that he understood why the initial implementation is only for direct I/O, but he would like to find a way to support PostgreSQL, which uses buffered I/O. He is hoping there could be some way to teach the writeback code that some set of contiguous page-cache pages correspond to a user-space block that should be written atomically.

Petersen asked Darrick Wong, who was dialed in remotely, if he had thoughts on how to make that work. Wong said that he was unsure how to do atomic writes for page-cache pages, but thought perhaps there could be some kind of mode that indicated that a file should only be written with atomic writes, "then try to do it right". He does not think it would be impossible to do using the iomap interface, but it "would be a pain" because the folio sizes and the atomic-block sizes may not be the same.

Bart Van Assche suggested raising the overall block size to 16KB, so it is guaranteed that the writes are aligned and are a multiple of that size. If it is necessary to do smaller writes, sub-block write operations could be added. Ts'o thought that the block size could be set on a per-inode basis using fcntl(), then the writeback code would know to do atomic writes on properly aligned and sized sets of dirty pages; pages that were not aligned and appropriately sized would get no guarantee with regard to atomicity. It would be somewhat fragile and would not be as good as the direct I/O implementation, he said, but would not require any code changes for PostgreSQL to take advantage of it.

Wong said that perhaps XFS could support 16KB blocks; for years, Dave Chinner has been heard to mutter that the filesystem is "really really close" to being able to do so. It mostly requires changes to iomap to handle multiple pages in a block and then fixing the size of folios to 16KB, Wong said.

Luis Chamberlain said that the work that is going on to support larger block sizes should sort out what is needed to support atomic writes for buffered I/O. He would be leading a discussion on that topic the next day and thinks that a good outcome from LSFMM would be to flesh out the different use cases and to come up with test cases for all of them. His main concern is for memory fragmentation if the underlying folios are not being created and freed at the same rate. González thought that the atomic-block size would generally be the same throughout the system, but Petersen said that there are common use cases where the database-block size is different between databases on the same filesystem.

The conversation continued on for a ways, going in several different directions. The feature is fairly small and works now for direct I/O, Petersen said; certainly people in the room were interested in seeing it in the kernel and had plenty of ideas for where it could go from there.

Comments (1 posted)

Flexible-order anonymous folios

By Jonathan Corbet
May 25, 2023

LSFMM+BPF
The conversion to folios is intended to, among other things, make it easy for the kernel to manage chunks of memory in a number of different sizes. So far, though, that flexibility is not being used in the kernel's handling of anonymous pages. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Yu Zhao and Yang Shi ran a session in the memory-management track aimed at charting a path toward support for anonymous pages in a variety of sizes.

Zhao began by pointing out that the kernel currently only handles two sizes of anonymous pages — the base (usually 4KB) size and the PMD size (usually 2MB huge pages). Recent CPUs, though, have added support for the coalescing of translation-lookaside-buffer (TLB) entries, allowing a single entry to cover as many as four physically contiguous pages, often transparently. Taking advantage of that capability would improve performance on those CPUs.

[Yu Zhao]

Managing memory in base-page units is not scalable, he said. It makes the system deal with a large number of page faults, least-recently-used (LRU) lists containing millions of pages, and increased TLB-flushing costs. Using a larger base-page size can improve performance, but at the cost of increased internal fragmentation and, perhaps, forcing user-space changes (a recompile if nothing else). But mid-sized folios might just be "a sweet spot" for a number of reasons. Internal fragmentation is reduced, and the presence of an "accessed" bit for each base page means that sparsely used folios can be broken up. Larger folios are entirely transparent to user space, and easier than huge pages to allocate; the ability to use TLB coalescing will also make them perform better.

Implementing larger anonymous folios will require solving some problems, though, starting with finding a suitable policy for when they should be used at all. Some heuristics, including looking at alignment and sizing, can be used to find suitable virtual memory areas (VMAs). Should a single large size be used, he asked, or should it vary from one VMA to the next? For an allocation policy, he suggested attempting to allocate huge pages first, then falling back to a folio at the TLB-coalescing size, then base pages if all else fails.

Behavior under memory pressure is another thing that should be thought through, he said. But, even then, trying to allocate the largest sizes first is probably the best policy, as long as care is taken to avoid forcing excessive reclaim when the larger sizes are not available. Pasha Tatashin said that reclaim from larger folios could also be tricky, since freeing them might require first making another allocation.

Shi asked whether the allocation process should try all of the page orders from the huge-page size on down, or if it should, instead, skip some orders. Matthew Wilcox said that, with the right allocation flags, trying all orders might be fine, but it might be better to modify the page allocator to return the largest size available up to a given order and avoid making multiple allocation calls.

There was a brief digression into the proper use of the mapcount field of struct page. In theory it tracks the number of contexts that have the given page mapped, but the use of this field has led to bugs in the past and there is disagreement over what its semantics should be. Wilcox said that the use of mapcount for higher-order pages needs to be rethought.

With a new virtual-memory feature comes a need for statistics to track it; at least one new counter will be needed to track large-folio use, but Zhao wondered if more would be required. Wilcox suggested just counting folios as single pages, but Tatashin said that would make it impossible for users to understand what was happening in their programs. Wilcox answered that they'll know that large folios are in use when their programs run faster. Tatashin said that users would want to be able to debug why their programs aren't going faster; Wilcox suggested /proc/pid/smaps or tracepoints, but Shi said additional counters would also be helpful there.

Zhao discussed the reclaim process briefly, repeating that the base-page access bits can be used to detect internal fragmentation, where only part of a large folio is actually being used. A heuristic can be used to determine when a sparsely used folio should be split. He asked whether large folios should be swapped as a unit; the answer from the group was "yes".

There were a few other complications to be rushed through as the session ran out of time. The memory-compaction code currently skips large folios; it needs to learn how to work with them as well. Collapsing individual pages into large folios could improve performance, but it has to be done carefully to keep khugepaged from working against the reclaim code. When a large page needs to be split, the question of whether to split all the way to base pages or to keep larger sizes arises.

Zhao concluded by saying that there is an RFC patch set implementing large anonymous folios in circulation, and that the group should have a look at it.

Comments (2 posted)

Optimizing single-owner memory

By Jonathan Corbet
May 26, 2023

LSFMM+BPF
The kernel's memory-management subsystem is optimized for the sharing of resources to the greatest extent possible. But, as Pasha Tatashin pointed out during a memory-management session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, a lot of memory has a single owner and will never be shared. He presented some ideas for optimizing the management of that memory to a somewhat skeptical crowd.

The problem he is trying to solve, Tatashin began, is specific to Google, but he thought that others might be experiencing it too. The memory he is talking about is anonymous memory that is never shared. A process may allocate a substantial amount of memory, then never fork, or it might have used madvise() to tell the kernel not to share a range of memory with any children. At Google, 90% of memory is never shared, since Google heavily favors the use of threads rather than independent processes.

The kernel's memory management is not as efficient as it could be for this kind of workload, he said. About 1.6% of a system's physical memory is dedicated to page structures to manage that memory. Google's server fleet contains petabytes of memory, which is its most expensive component, so the expense of that 1.6% overhead is considerable. The page structure is there to manage all types of memory, he said; it's not needed for single-owner anonymous memory.

Eliminating the possibility of sharing for that memory would bring some advantages. He has had to debug problems where memory is falsely shared, often as the result of driver bugs. Without sharing, those bugs won't happen. Since single-owner memory is always migratable, it can always be assembled into 1GB huge pages, which helps performance.

Tatashin described a single-owner-memory driver that would have two components. The memory pool would manage 1GB chunks of memory that can come from a number of sources, including hugetlbfs, device DAX, or a CXL memory pool; this memory need not have page structures associated with it. The memory pool would take pains to separate movable allocations from those that cannot be moved. The other piece is the driver itself, which manages memory in smaller chunks and makes it available to processes. It implements a new type of virtual memory area (VMA) that is marked as being page-frame-number (PFN) mapped, so the kernel will not expect to find page structures behind it. This driver can support most memory-oriented calls like madvise().

User-space processes can then open /dev/som to allocate single-owner memory for their use. Tatashin has run into some problems with the implementation, though, including the fact that PFN-mapped VMAs are not supported everywhere in the kernel. In particular, the get_user_pages() family of functions will not work with them, making it impossible to use single-owner memory in a number of contexts.

Page aging, he said, is hard to manage since, without page structures, there is no least-recently-used (LRU) list to consult. One solution here would be to create a new, smaller variant of struct page for this purpose; he has been resisting that approach so far, but does not have a better solution. Swapping is not supported with this memory, and neither are NUMA placement or hardware poisoning.

Tatashin said he realizes that the work implementing folios and page descriptors will eventually solve many of the same problems. But, he said, this work is still expected to take some years to complete, and he would like to have a solution sooner than that. Matthew Wilcox interjected that the folio work would happen more quickly if Tatashin helped, and said that the single-owner-memory work is trying to eliminate the memory-management subsystem, but will end up reimplementing it all. This effort, he said, is likely to end up like hugetlbfs, which duplicates much memory-management functionality; he questioned whether anything would be gained in the end.

As the session wound down, others in the group expressed similar feelings. There will never be an end to the addition of features to this special-purpose allocator, so it would always be growing. John Hubbard expressed the consensus in the room when he suggested that Tatashin just work to make the core memory-management code better suit his needs.

Comments (none posted)

Mitigating vmap lock contention

By Jonathan Corbet
May 26, 2023

LSFMM+BPF
The "vmap area" is a range of kernel address space used when the kernel needs to virtually map a range of memory; among other things, memory allocations obtained from vmalloc() and loadable modules are placed there. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Uladzislau Rezki, presenting remotely, explained a performance problem related to the vmap area and discussed possible solutions.

The problem, he said, is that the vmap area is protected by three global spinlocks. The global free-space area is covered by free_vmap_area_lock, the tracking of mapped areas by vmap_area_lock, and the list of lazily freed areas by purge_vmap_area_lock. These locks, he said, can turn into a significant bottleneck on systems with a large number of CPUs. The vmap_area_lock controls access to a red-black tree that can be used to find an allocated area using an address within it. These areas can be seen by looking at /proc/vmallocinfo. The free_vmap_area_lock, instead, regulates access to free space and can experience high lock contention.

[Uladzislau Rezki]
The allocation path has to acquire both free_vmap_area_lock (to find a free range) and vmap_area_lock (to mark that range as busy). The freeing path, instead, needs vmap_area_lock and purge_vmap_area_lock. This pattern means that the three areas cannot be accessed concurrently. Running some tests on a "super-powerful computer", Rezki measured a basic vmalloc() call as taking about 2µs when a single thread was running. With 32 threads calling vmalloc() simultaneously, that time grew to 50µs — 25 times greater. That slowdown is the result of contention on the vmap-area locks.
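In rough outline, the two paths look like the sketch below, which is heavily simplified from the real mm/vmalloc.c code; the helper functions are placeholders rather than actual kernel interfaces. The point is simply that vmap_area_lock sits in the middle of both paths, so allocations and frees from all CPUs serialize on it.

    /* Heavily simplified sketch of the current locking pattern; the helper
     * functions named here are placeholders, not real kernel code. */
    static struct vmap_area *alloc_vmap_area_sketch(unsigned long size)
    {
        struct vmap_area *va;

        spin_lock(&free_vmap_area_lock);
        va = carve_free_range(size);      /* take space from the free tree */
        spin_unlock(&free_vmap_area_lock);

        spin_lock(&vmap_area_lock);
        insert_busy_area(va);             /* record it in the busy tree */
        spin_unlock(&vmap_area_lock);

        return va;
    }

    static void free_vmap_area_sketch(struct vmap_area *va)
    {
        spin_lock(&vmap_area_lock);
        remove_busy_area(va);             /* no longer allocated */
        spin_unlock(&vmap_area_lock);

        spin_lock(&purge_vmap_area_lock);
        add_to_purge_list(va);            /* actual unmapping happens lazily */
        spin_unlock(&purge_vmap_area_lock);
    }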

The biggest problem, he said, is vmap_area_lock. This is partly due to a fair amount of fragmentation in the allocated areas, he said; the free and purge lists have fewer, larger areas and, as a result, less contention. Rezki proposed addressing this problem by adding a per-CPU cache; each CPU would pre-fetch some address space into its cache, then allocate pieces of that space to satisfy requests.
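The details of the cache were not presented, but one way to picture it is as a per-CPU slice of address space, prefetched from the global pool, from which requests on that CPU can be satisfied without touching the global locks. The structure below is purely illustrative:

    /* Illustrative only: a per-CPU chunk of vmap address space from which
     * allocations can be carved without taking the global locks. */
    struct vmap_cpu_cache {
        spinlock_t    lock;     /* contended only within this CPU's cache */
        unsigned long start;    /* beginning of the prefetched range */
        unsigned long end;      /* end of the prefetched range */
        unsigned long next;     /* next free address within the range */
    };

    static DEFINE_PER_CPU(struct vmap_cpu_cache, vmap_cache);

    /* When next reaches end, another chunk would be prefetched from the
     * global free tree, briefly taking free_vmap_area_lock. */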

An attendee pointed out that the problem of allocating vmap-area space looks similar to allocating user-space address space and asked whether the same infrastructure could be used for both. Rezki answered that user-space allocation is a bigger problem, so the solution is heavier, and optimized implementations are still in development. The real problem with the vmap area is the serialization of requests across CPUs, which is amenable to a simpler solution.

Liam Howlett said that the vmap_area_lock is used for both allocation and freeing operations; if it could be avoided in one of the two paths, that could reduce contention. Rezki said that is true in theory, but that the bookkeeping has to be done somehow regardless. Howlett repeated that the problem is similar to the allocation of virtual-memory areas for user space. Memory-management developers should learn from each other, he said, rather than going off and doing their own things.

Rezki moved on to the management of free space in the vmap area. Under his proposal, when a range in that area is freed, the address would be mapped to the appropriate per-CPU zone; that zone would then be locked and the allocation removed. Then the lazy-free zone could be locked, and the newly freed area added there. A separate context would occasionally drain that lazy list; in his patch set it is drained to the global area for now.
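A sketch of that freeing path, with invented names, might look like this:

    /* Sketch of the proposed freeing path; zone_for_addr() and the other
     * helpers are invented names, not code from the patch set. */
    static void vmap_free_range_sketch(unsigned long addr, unsigned long size)
    {
        struct vmap_zone *zone = zone_for_addr(addr); /* per-CPU zone owning addr */

        spin_lock(&zone->lock);
        remove_allocation(zone, addr, size);
        spin_unlock(&zone->lock);

        spin_lock(&zone->lazy_lock);
        add_lazy_free(zone, addr, size);   /* drained later by a separate context */
        spin_unlock(&zone->lazy_lock);
    }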

He concluded by asking what his next steps should be; the answer was to post patches and follow the usual process. He was asked for performance numbers, but had none available. When asked where this contention has been observed, he said it shows up on Android systems during video playback. The session ended with Michal Hocko suggesting that Rezki join his work with the efforts to improve user-space address allocation if possible.

Comments (3 posted)

Improving page-fault scalability

By Jonathan Corbet
May 29, 2023

LSFMM+BPF
Certain topics return predictably to development conferences every year, usually because developers are still struggling to find a viable solution to a specific problem. One such topic is the lack of scalability in the kernel's page-fault-handling code, so it was no surprise to see this problem on the agenda for the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. Matthew Wilcox led a session in the memory-management track to discuss the state of page-fault handling and what can be done to improve it further.

He started by noting that Suren Baghdasaryan has been doing the bulk of the work in this area over the last year. There are two big issues when it comes to page-fault scalability: contention for the per-process mmap_lock lock and priority inversions between monitoring tasks and the real workload (which can come about, for example, when the monitoring task reads data from /proc/pid/smaps). In the 2022 discussion, he said, a number of options for addressing these issues were discussed, including the longstanding work to implement speculative page-fault handling, using read-copy-update (RCU) for much of the handling path, locking at the virtual-memory-area (VMA) level, and finer-grained locking using the recently added maple tree data structure.

[Matthew Wilcox]
Since then, the maple tree has replaced the red-black tree for managing process address spaces. Baghdasaryan has implemented the RCU lookup and per-VMA locking and gotten it upstream. Work that is in progress includes a set of patches to handle faults on file-backed VMAs with up-to-date pages in the page cache (which can be handled without starting I/O). There is also work underway to improve fault handling for pages in the swap cache. Both of those cases have to fall back to the mmap_lock now.
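The fast path that resulted looks roughly like the following, loosely modeled on the 6.4 fault-handling code; this is a simplified sketch with error handling omitted:

    /* Simplified sketch of the per-VMA-lock fault fast path; error handling
     * and architecture-specific details are omitted. */
    static vm_fault_t fault_fast_path(struct mm_struct *mm, unsigned long address,
                                      unsigned int flags, struct pt_regs *regs)
    {
        struct vm_area_struct *vma;
        vm_fault_t fault;

        /* RCU lookup of the VMA plus a per-VMA read lock; no mmap_lock. */
        vma = lock_vma_under_rcu(mm, address);
        if (!vma)
            return VM_FAULT_RETRY;    /* caller falls back to mmap_lock */

        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
        vma_end_read(vma);
        return fault;
    }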

In other words, work continues to grow the number of cases that can be handled without resorting to the mmap_lock. Future projects will add more of these cases, including initiating and waiting for I/O, providing the data for /proc/pid/smaps, faults handled via userfaultfd(), faults in VMAs created by device drivers, and so on. Someday there will be no cases left to convert, and mmap_lock can be removed entirely.

Expanding on the "waiting for I/O" case, Wilcox said that he is looking at both swap-backed and file-backed pages. The current plan is to take a reference to the file being read, drop the lock, then sleep. Once the I/O completes, the fault-handling process would restart from the beginning to catch up with any other address-space changes that may have taken place. He asked the group whether this was the correct model, or whether it would be better to simply block changes to the faulting VMA while the I/O is underway.

Michal Hocko answered that the two cases are different and should perhaps be handled differently. In the swap case, there is always the possibility that the owning process could unmap the memory while waiting for the faulting page to be read in. This problem could be avoided by simply holding the VMA lock while waiting. This approach would not work as well for the file-backed case, he said, where the VMAs do not map to process-visible objects.

Another potential problem, Wilcox said, can arise when two threads in the same process call malloc() at the same time. Each of those calls could end up calling mmap() to get more memory to satisfy the allocation request. Those two calls would normally create two VMAs, but the kernel might get clever and combine the two; that, in turn, could create contention on the VMA lock that the application is not expecting. Steve Rostedt suggested using tracing to get some real numbers showing whether this is a real-world problem, but Hocko said that he sees regular bug reports involving "an unnamed database product" showing this kind of contention.

Wilcox said that the case of initiating I/O without mmap_lock held is easier. It has been established that calling into drivers with memory-management locks (such as a per-VMA lock) held is a safe thing to do, even in the absence of mmap_lock.

The monitoring case presents its own challenges, he continued. It is possible to walk the VMA lists holding just the RCU lock, and a number of /proc interfaces can work in that mode; "it's just a matter of programming". But the smaps file is more complicated; to collect its information, it must be able to keep page tables from being freed, and that requires taking the mmap_lock. On the x86 architecture, it is also necessary to disable interrupts — the result of a "stupid legacy that we should just get rid of", he said. Vlastimil Babka said that this behavior is the result of the need to block inter-processor interrupts that flush translation lookaside buffers.

Jason Gunthorpe said that page-table freeing could maybe be protected by RCU, but that would require embedding an rcu_head structure in the page structure; Wilcox answered that it's already there, but that page-table freeing is not using it. Mike Rapoport said that RCU freeing of page tables is feasible; Wilcox replied that he'd like to see it done, but that "there might be demons" there. Hocko, though, said that this could be a good low-hanging-fruit project for somebody to look into.
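A minimal sketch of what RCU-freed page tables might look like, assuming the existing rcu_head field in struct page is used; the function names are invented:

    /* Sketch only: deferring the freeing of a page-table page through RCU.
     * The function names are invented; struct page already carries an
     * rcu_head that could be used for this purpose. */
    static void pt_free_rcu_cb(struct rcu_head *head)
    {
        struct page *page = container_of(head, struct page, rcu_head);

        __free_page(page);
    }

    static void pt_free_deferred(struct page *pt_page)
    {
        /* Walkers holding only rcu_read_lock() can keep dereferencing the
         * table until a grace period has elapsed. */
        call_rcu(&pt_page->rcu_head, pt_free_rcu_cb);
    }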

For the userfaultfd() case, where the fault is being reported to user space for resolution, Wilcox allowed that he lacked ideas. Baghdasaryan said that it looks similar to the swap case, and that dropping the lock before notifying user space could work.

For device-driver VMAs, there is the problem that the drivers themselves might be depending on mmap_lock being held, so just dropping it is likely to lead to unpleasant bugs. Wilcox suggested that he inexplicably lacks the desire to audit every driver in the kernel for this kind of problem. Instead, drivers will need to explicitly indicate that they are prepared to handle faults without the lock held. Drivers would also have to indicate that they do not drop the mmap_lock in their fault handlers. Drivers could possibly implement the map_pages() VMA operation to map their pages ahead of time, which is the most efficient way to map pages into user space. map_pages() is protected by the RCU read lock, though, meaning that drivers cannot sleep while using it.
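A driver opting into this model might simply provide the operation in its vm_operations_struct, along the lines of the sketch below; my_fault() and my_map_pages() are hypothetical driver functions:

    /* Sketch of a driver providing map_pages(); the my_*() functions are
     * hypothetical.  map_pages() runs under the RCU read lock, so it must
     * not sleep or take any sleeping locks. */
    static vm_fault_t my_fault(struct vm_fault *vmf);
    static vm_fault_t my_map_pages(struct vm_fault *vmf,
                                   pgoff_t start_pgoff, pgoff_t end_pgoff);

    static const struct vm_operations_struct my_vm_ops = {
        .fault     = my_fault,        /* slow path; may sleep */
        .map_pages = my_map_pages,    /* populates a range of PTEs; may not sleep */
    };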

Gunthorpe said that drivers have to be prepared for a process to fork, creating two independent copies of the driver-provided VMA. Since each mapping will be protected by a separate mmap_lock, drivers can't rely on that lock in any case. Michel Lespinasse said that there are, nonetheless, drivers that depend on mmap_lock, so some sort of allowlist will be needed to be able to call into drivers without that lock. For now, though, the per-VMA locks are used for anonymous VMAs only, so the lack of mmap_lock is not an issue for drivers.

Finally, Wilcox turned to the idea of removing mmap_lock entirely, which is an objective he would like to reach someday. It remains a multi-year project, though, much like the removal of the big kernel lock, which was finally completed in 2011. An important step in that direction would be to stop using the lock to protect the VMA tree, splitting each use into its own lock. Lespinasse said that he couldn't see how that could be done; the interactions with reverse mapping, in particular, would complicate things.

As the session concluded, it was pointed out that the quest for scalability does not end with the removal of mmap_lock; Wilcox is already looking forward to handling faults without taking the VMA locks either. It is reasonable to expect that the VMA locks will cause cache-line contention, though some profiling with perf will be needed to verify that. There is a path toward lock-free fault handling, he said, but it involves a fair amount of complexity and he'll only pursue it if performance requires it.

Comments (none posted)

Code tagging and memory-allocation profiling

By Jonathan Corbet
May 31, 2023

LSFMM+BPF
The code-tagging mechanism proposed last year by Suren Baghdasaryan and Kent Overstreet has been the subject of a number of (sometimes tense) discussions. That conversation came to the memory-management track at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, where its developers (Baghdasaryan attending in-person and Overstreet remotely) tried to convince the attendees that its benefits justify its cost.

Baghdasaryan started by saying that the use case for code tagging was memory-allocation profiling — accounting for all kernel allocations in order to monitor usage and find leaks. Any solution in this area, he said, must have a low-enough overhead that it can be used on production systems. It must also produce enough information to be useful; achieving both objectives can be hard. The proposal is a two-level solution, providing a high-level view with low overhead and the ability to get a detailed view for specific call sites.

[Suren Baghdasaryan]
The proposed implementation uses code tagging, which works by injecting a structure into a specific code location to identify that location. Application-specific fields can be attached to these tags; they can be used for allocation profiling, fault injection, latency tracking, and more. A special macro is used to put the structures into a separate executable section, but some inline code is also needed to associate the structure and the call.
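A greatly simplified, user-space flavored sketch of the idea is shown below; the structure layout, macro names, and use of malloc() as a stand-in allocator are all illustrative rather than the proposed kernel API:

    /* Greatly simplified sketch of code tagging: a macro places a static
     * structure into a dedicated section and bumps a per-site counter
     * inline.  Everything here is illustrative, not the proposed API. */
    #include <stdlib.h>

    struct codetag {
        const char    *file;
        int            line;
        unsigned long  bytes;    /* application-specific field: bytes allocated */
    };

    #define DEFINE_CODETAG(name)                                   \
        static struct codetag name                                 \
        __attribute__((section("codetags"), used)) = {             \
            .file = __FILE__,                                      \
            .line = __LINE__,                                      \
        }

    /* Each call site gets its own tag; a tool can later walk the
     * "codetags" section to enumerate and report every site. */
    #define alloc_tagged(size) ({                                       \
        DEFINE_CODETAG(__tag);                                          \
        __tag.bytes += (size);        /* the inline accounting code */  \
        malloc(size);                 /* stand-in for the real allocator */ \
    })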

The performance overhead of this mechanism, he said, is 36% for slab allocations and 26% for page allocations. That may seem high, but he argued that developers should consider that the code in question is highly optimized. Enabling memory control groups adds ten times the overhead that allocation profiling does. The memory overhead depends on the number of CPUs in the system; it was about 0.3% of memory on an eight-core Android device with about 10,000 allocation call sites.

The prepared part of the discussion ended with Baghdasaryan asking the developers in the room if they would use this tool.

Steve Rostedt said that he had been asked whether it might be possible to implement the tagging more efficiently with static calls, which can be patched in or out at run time. The proposed code-tagging feature, he said, is adding macros around other macros, and must be explicitly added for every interface to be profiled. It injects code into every call site, which will lead to poorer locality and worse performance. He suggested that an alternative could possibly be created using objtool; it could find all of the call locations for a function of interest and create a trampoline for each in a separate section. That trampoline would log the data of interest, then call the target function. In normal operation, the trampoline would be unused; to turn the monitoring on, the call sites would be patched at run time to, instead, jump to the trampoline.

Overstreet responded that this solution replaced magic macros with something even more magic; Rostedt answered that this is how ftrace and a number of other functionalities work now. It is well-tested and can be expected to work.

Overstreet said that there is value to placing annotations in the source code; it allows the programmer to choose which functions are annotated and serves as a sort of documentation. The code tags can also be used for fault injection, allowing, for example, the writing of a test that would exercise the error handling code at each call site. Rostedt answered that all of this could be done in the trampoline as well; there could even be a BPF hook to make it more flexible.

John Hubbard said that 36% is a high overhead; the ability to turn that off would be an important feature. He said that he prefers the approach taken by tools like bpftrace, which attaches probes at run time. Overstreet said that one can't enable counters at run time and expect them to have any meaning; Baghdasaryan added that, if the counters are not enabled at boot, the system would see — and potentially be confused by — memory being freed that had been allocated before monitoring was enabled. Rostedt said that this problem can be addressed by booting the system with monitoring enabled, then turning it off once the needed data had been collected.

Overstreet complained that the trampoline idea would impose a greater overhead when it was turned on; Rostedt disagreed, pointing out that the trampoline would be entered with a direct jump, so no extra function calls are added. There followed an extended and sometimes heated discussion on the details that, in your editor's opinion, is not really worth reproducing here.

Michal Hocko brought the discussion to a close by noting that those details were not the important issue at hand; the developers needed to consider the overall design of any instrumentation mechanism and decide which would work best. Overstreet did not help his case by saying, at this point, that he would like to add some counters to struct page for more data collection. That idea was summarily rejected by the group.

The session ended with nobody, seemingly, satisfied with how it went. This seems like a conversation that is destined to continue for some time.

Comments (none posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Bcrypt at 25; LSFMM+BPF videos; GCC 13 static analysis; Python Language Summit; RustConf keynote fiasco; Quote; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds