Leading items
Welcome to the LWN.net Weekly Edition for June 16, 2022
This edition contains the following feature content:
- Rethinking Fedora's Java packaging: packaging Java for Fedora is not an easy task; the developers involved think they have found a way to reduce the work required.
- Vetting the cargo: a new Cargo tool to help track source-code auditing.
- /dev/userfaultfd: a proposed new interface to userfaultfd() functionality.
- The LSFMM article stream continues:
- Zoned storage: the challenges with supporting zoned storage and sequential-write-only zones in particular.
- Retrieving kernel attributes: a discussion on a proposed interface for getting information out of the kernel using the extended-attribute API.
- A discussion on readahead: how much data should the kernel be speculatively reading ahead into the page cache?
- Remote participation at LSFMM: a discussion with the remote participants on what worked—and didn't—for them.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Rethinking Fedora's Java packaging
Linux distributors are famously averse to shipping packages with bundled libraries; they would rather ship a single version of each library to be shared by all packages that need it. Many upstream projects, instead, are fond of bundling (or "vendoring") libraries; this leads to tension that has been covered here numerous times in the past (examples: 1, 2, 3, 4, 5, ...). The recent Fedora discussion on bundling libraries with its Java implementation might look like just another entry in a long series, but it also shines a light on the unique challenges of shipping Java in a fast-moving community distribution.
As is often the case with Fedora, the conversation started with the announcement of a Fedora change proposal; this one is authored by Jiri Vanek. The proposal is aimed at the Fedora 37 release, which is currently planned for late October. Vanek proposes to change Fedora's builds of the OpenJDK Java development kit (JDK) to link against several of that project's bundled libraries (including zlib, FreeType, libjpeg, and HarfBuzz) rather than the versions packaged separately in Fedora. The JDK would also be statically linked against Fedora's build of the libstdc++ library.
This proposal, however, is only the first step in a longer project to change how the JDK is built. In short, once the JDK (along with the Java runtime environment, or JRE) has been made to use its own libraries, the way these components are packaged will change. The RPM for a given JDK release will contain a single tarball that will be unpacked on the target system. Notably, this tarball will be the same for all Fedora releases, from the oldest supported release through Rawhide; the use of bundled libraries will help to ensure that this package will work the same way on all target distributions.
Motivation
One might well wonder why the JDK maintainer wants to move to this somewhat unorthodox (for Fedora) packaging method. The answer, in short, is that packaging Java is painful, and trying to package it with dynamic libraries makes things worse; the change proposal leads off this way:
To introduce properly working dynamically linked JDK to Fedora took several excellent engineers several years. Even after a decade, there are remaining unfixed issues. And new issues are appearing.
Any project that bundles libraries tends to become difficult to build against any outside version of those libraries; the insulation from changes in the library itself and the ability to add localized changes are usually why the bundling is done in the first place. Any distributor attempting to unbundle those libraries will encounter a number of incompatibilities and is, as a result, courting a certain amount of pain. In the case of OpenJDK, it seems, that pain can be severe.
There are a couple of complications beyond that, though. One is that, before a Java build can actually call itself Java, it must have been run through the gauntlet of tests known as the technology compatibility kit (TCK). Every binary build, even one that just fixes a small bug, must pass the TCK and, as Vanek described, that is not a simple matter of typing "make test". TCK runs take time, a fair amount of hardware, and a fair amount of human attention. Simply doing successful TCK runs is an expensive endeavor; it gets a lot worse when test failures must be addressed. That is another reason why the use of OpenJDK's bundled libraries is appealing: there will be fewer TCK failures to deal with.
All of the problems described so far are complicated by the fact that Fedora does not just ship one version of the JDK — it ships four of them (versions 1.8.0, 11, 17, and "latest"). Each of those except the last is a long-term-support version that is depended on by other packages; together, they quadruple the pain of shipping Java in Fedora.
It is thus unsurprising that developers like Vanek feel the need to somehow reduce the amount of work that is required to keep Java working in the Fedora distribution. That is the ultimate goal of the packaging proposal. A single package for a given JDK version that works on all Fedora releases only has to be tested once, reducing the effort required by at least a factor of three.
Discussion
Predictably, there was no shortage of developers who were opposed to the idea of bundling libraries with OpenJDK. Vitaly Zaitsev said that "bundled versions are always outdated and may be even vulnerable"; Florian Weimer added that bundling could have the effect of reintroducing vulnerabilities that have been fixed in Fedora's versions of the libraries. Neal Gompa pointed out that Fedora's FreeType and HarfBuzz libraries have changes that improve font rendering; circumventing those libraries, he said, would "negatively impact the user experience". He also suggested that bundling could impede the efforts to improve Wayland support in Java.
Andrew Hughes, instead, said: "It's not quite as simple as bundled is always worse". He also complained about "the number of times we've released OpenJDK updates for Fedora on unembargo date, only for them to sit waiting for karma"; the Fedora project's delays in releasing the security fixes that are available for bundled libraries make him less sympathetic toward the arguments against bundling in general, he said.
Vanek said that the OpenJDK project fixes issues in bundled libraries quickly because it "can not allow itself to have security holes", and that those libraries have sometimes gotten fixes before Fedora's versions have. With regard to the advantages of using the system libraries in general, he said:
This is the holy grail we have been pursuing for last 10 years. But now we gave up. To keep java alive in Fedora, we have to take this one step back.
It did not take participants long to realize, though, that bundling was not the real issue needing to be decided; as noted above, it is only the first step in a larger program. If the bundling is allowed to proceed, the other steps will surely follow. As a result, the discussion covered a lot of ground beyond the bundling proposal itself.
One concern that was raised several times is that the Fedora OpenJDK maintainers intend to simply repackage binaries obtained from upstream rather than building them specifically for Fedora. That appears to be the result of a communication failure, though; Vanek was clear that the plan has always been to build the binaries in-house. The intent is to reduce the number of binaries that must be built, but not to start shipping binaries built by somebody else.
Gompa decried what he sees as the poor state of Java support in Fedora in general:
Both in the wider Red Hat world and within Fedora itself, Java has received less and less love, to the point that it has been brutally starved. It has gotten so bad that Debian (!!!) beat us to newer Java and we had to *beg* for this to be fixed. So much for Features and First, eh?
He went on to say that Fedora was once "the leader in the Linux Java ecosystem", but that more recently the distribution has, instead, "lost" packages like Eclipse, IntelliJ, and NetBeans. The bundling proposal, he said, just looked like yet another reduction in investment in Java.
Stephen Smoogen replied that Fedora's position in the lead was brief at best, and that developers who have tried to work on Java have been driven away by the "toxic reaction" they received from other developers. Peter Boy, instead, said: "We have - as for years - a very stable and current and highly usable JRE / JDK". He disputed that packages like Eclipse have been lost; instead, since packaged Java applications run on any JRE, they can be downloaded directly from the source. "We switched from RPM to another distribution method".
There were suggestions that, if running the TCK is a big part of the burden of maintaining Java in Fedora, perhaps the solution is to not run the TCK as often. Without TCK runs, the build cannot be certified, but some developers (such as Tomasz Torcz) argue that this certification is not needed. Vanek, though, replied that putting an uncertified JDK into Fedora cannot be done. Daniel Berrangé added that "saying that we don't need certification of JDK is effectively saying that we don't need to do testing of JDK in Fedora".
The problem is that, without certification, a JDK build cannot be called "Java" and cannot use that term (which is trademarked) in any way. That problem can, in theory, be worked around by stripping out all of the trademark usage in the OpenJDK code, which is otherwise free software; this is the core of the "IcedTea" solution that has been seen occasionally in the past. This job is not easy, though, and fraught with hazards; Vanek predicted that "sooner or later a swarm of lawyers would appear" if it were tried.
Another idea that was suggested by developers like Fabio Valentini was to stop shipping so many versions of the JDK. Perhaps, he said, 1.8.0 (released in March 2014) could be dropped, with a corresponding reduction in the work required. Vanek said, though, that dropping old versions would be hard; 1.8.0 (also known as JDK8) is still supported upstream, and if it were dropped "some legacy applications will be unhappy". That is the only version that could even be considered for dropping at this time, he said; there is not a lot of savings to be found there.
Michael Catanzaro, perhaps playing the devil's advocate, suggested that Fedora could stop shipping OpenJDK entirely. "It's starting to sound like Fedora would be providing very little value here on top of what is offered by upstream". This idea did not get far in the discussion; it seems that there is little appetite for a Fedora without native Java support.
Decision
The discussion on the list eventually wound down as the participants got tired, but the issue did then move to the May 31 meeting of the Fedora Engineering Steering Council (FESCo). Bundling of libraries didn't sit well with the FESCo members either, and they worried that, if they said "yes" to this proposal, they would be implicitly accepting the rest of the plan as well. Even so, at the end of the discussion, FESCo approved the change proposal by a five-to-zero vote (with one abstention). So bundled libraries appear to be in the plans for Fedora 37 (with possible backports to earlier releases as well). As for the rest of the plan, there are undoubtedly more lengthy email threads to endure before that can be decided.
Vetting the cargo
Modern language environments make it easy to discover and incorporate externally written libraries into a program. These same mechanisms can also make it easy to inadvertently incorporate security vulnerabilities or overtly malicious code, which is rather less gratifying. The stream of resulting vulnerabilities seems like it will never end, and it afflicts relatively safe languages like Rust just as much as any other language. In an effort to avoid the embarrassment that comes with shipping vulnerabilities (or worse) by way of its dependencies, the Mozilla project has come up with a new supply-chain management tool known as "cargo vet".
The problem
The appeal of modern environments is easy enough to understand. A developer working on a function may suddenly discover the need to, say, left-pad a string with blanks. Rather than go through the pain of implementing this challenging functionality, our developer can simply find an appropriate module in the language-specific repository, add it to the project manifest, and use it with no further thought. This allows our developer to take advantage of the work done by others and focus on their core task, which is probably something vital like getting popup windows past ad blockers.
There is an obvious problem with this approach: our developer knows nothing about what is inside this newly added module — or in any of the other modules that this one might quietly pull in as dependencies. This is true when the module is first imported, and becomes even more so as those dependencies are updated, perhaps by somebody other than the original author. It is a recipe for security problems.
The Mozilla project has been trying to increase the safety of the Firefox browser for years in numerous ways; one of those is rewriting much of the browser in the Rust language — which, itself, has its origins at Mozilla. At this point, though, much of the code shipped with Firefox originates outside the project; from the announcement:
Firefox’s Rust integration makes it very easy for our engineers to pull in off-the-shelf code from crates.io rather than writing it from scratch. This is a great thing for productivity, but also increases our attack surface. Our dependency tree has steadily grown to almost four hundred third-party crates, and we have thus far lacked a mechanism to efficiently audit this code and ensure that we do so systematically.
Nearly four hundred third-party crates does indeed seem like a significant attack surface. A bug in any one of them could lead to the shipping of a vulnerable browser, and the consequence of a crate containing malware could be quite a bit worse. It is good that the project is thinking about how to address this threat.
Tracking code audits
There are many ways to improve confidence in the security of a chunk of code. Writing that code in a memory-safe language is one such way; in a Rust program without unsafe blocks, there are whole classes of problems that simply cannot exist. But more than that is required and, in the end, there is no substitute for simply looking at the code and understanding what it does. If a program like Firefox is built only from code that has been diligently audited, the confidence in its security will be higher.
The cargo vet mechanism, built into Rust's Cargo dependency manager and build system, is meant to help with the task. It can't do the tedious and demanding work of actually auditing code, but it can help to keep track of which code has been audited and ensure that unaudited code does not find its way into a production build.
The initial cargo vet setup creates a new directory, called supply-chain, in the source directory; this new directory contains a couple of files called audits.toml and config.toml. The setup also looks at all of the project's dependencies (which are already tracked by cargo in the Cargo.lock file) and marks them all as being unaudited.
A developer can mark a module as being audited (after, presumably, having actually audited it) by adding a block to audits.toml like the following:
```toml
[[audits.left-pad]]
version = "1.0"
who = "Alice TheAuditor <NothingGetsPastMe@example.com>"
criteria = "safe-to-deploy"
```
This entry says that version 1.0 (and only that version) of the left-pad crate was audited and deemed to be "safe to deploy" in a production build. There are two "audit criteria" defined by cargo vet, the other being "safe-to-run"; others can be added as need be. There are ways of indicating that a range of versions has been audited, or that the patch from one version to the next has been. It is also possible to put in a violation line with a version range; that indicates that those versions have failed the audit and should not be used. Other examples of audits.toml entries can be found on this page.
Once these audits are in place, cargo vet can be run to ensure that all code in the build has been audited. If some dependencies have been updated, the tool will indicate that they require auditing and cause the build to fail. It can also fetch the source for the dependencies in question from crates.io (rather than, say, the project page on a public forge site) to ensure that the code being audited is the same as the code being deployed.
The cargo vet tool, in other words, can help a project keep track of the vetting of its dependencies, and it can help prevent the shipping of unaudited code to users. But it doesn't change the fact that auditing all of that code is a lot of work in the first place. A lot of that work could perhaps be saved, though, if projects could collaborate and share the audits that they have done.
Bringing in the community
One other key objective driving cargo vet is to spread the work of auditing around the community. Since a project's audits.toml file will be a part of its source repository, it will be available to anybody else who can see that repository; that is the whole world for most open-source code. In other words, the results of a project's auditing work will normally be available for the rest of the world to see — and make use of. After all, if one project has audited a dependency and found nothing amiss, and if that project's judgment is to be trusted, then there is little reason for any other project to repeat that work.
To take advantage of another project's auditing work, cargo vet can be told to import its audits.toml file and accept the audit results found therein. Needless to say, a certain degree of trust should exist before delegating one's auditing tasks to others on the Internet. There is currently no mechanism for discovery of available audits, and no way (in cargo vet at least) to verify that the person listed in the audits.toml file actually claims to have done an audit — anybody who can write the file can add any text they want. If the use of this mechanism takes off, though, such features can be added in the future.
The overall goal of this work is to take away excuses for not properly auditing dependencies:
Each new participant automatically contributes its audits back to the commons, making it progressively less work for everyone to secure their dependencies. We’ve learned many times that the best way to move an ecosystem towards more-secure practices is to take something that was hard and make it easy, and that’s what we’re doing here.
The hope is that, as the amount of audited code increases, the use of cargo vet will grow as well. The infrastructure may be a good start but, as the announcement notes, there is a remaining problem that could be hard to overcome: "there is no way to independently verify that an audit was performed faithfully and adequately". Creating a system of sharing audits across the community looks like a difficult task in the absence of some sort of reputation system that lets users decide which audits they should actually trust.
This project is quite new, though, so it is not surprising that some gaps remain. There can be no doubt that cargo vet is trying to address a pressing problem, so it is good to see this work being done. If this approach pans out, the use of random modules by unknown authors from a central software repository might just become a slightly more rational thing to do.
/dev/userfaultfd
The userfaultfd() system call allows one thread to handle page faults for another in user space. It has a number of interesting use cases, including the live migration of virtual machines. There are also some less appealing use cases, though, most of which are appreciated by attackers trying to take control of a machine. Attempts have been made over the years to make userfaultfd() less useful as an exploit tool, but this patch set from Axel Rasmussen takes a different approach by circumventing the system call entirely.
A call to userfaultfd() returns a special file descriptor attached to the current process. Among other things, this descriptor can be used (with ioctl()) to register regions of memory. When any thread in the current process encounters a page fault in a registered area, it will be blocked and an event will be sent to the userfaultfd() file descriptor. The managing thread, on reading that event, has several options for how to resolve the fault; these include copying data into a new page, creating a zero-filled page, or mapping in a page that exists elsewhere in the address space. Once the fault has been dealt with, the faulting thread will continue its execution.
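As a rough illustration of that flow, here is a minimal sketch of the existing userfaultfd() interface: create the descriptor, negotiate the API, register a region, and resolve a fault by copying data in. Error handling is omitted, and a real fault handler would poll the descriptor and read struct uffd_msg events from a separate thread; the sketch simply assumes 4KB pages.

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 4096	/* assumes 4KB pages for simplicity */

int main(void)
{
	/* Create the userfaultfd descriptor and negotiate the API version.
	   On systems with vm/unprivileged_userfaultfd set to zero, this call
	   needs CAP_SYS_PTRACE or the UFFD_USER_MODE_ONLY flag (see below). */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	/* Register a one-page region; missing-page faults there will be
	   reported on uffd rather than being resolved by the kernel. */
	void *area = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = PAGE_SIZE },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* The managing thread would normally poll() uffd and read() a
	   struct uffd_msg describing the fault; one way to resolve it is
	   to copy data into the faulting page: */
	static char source[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
	strcpy(source, "data for the faulting page");
	struct uffdio_copy copy = {
		.dst = (unsigned long)area,
		.src = (unsigned long)source,
		.len = PAGE_SIZE,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);
	return 0;
}
```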
A thread will normally encounter a page fault while running in user space; it may have dereferenced a pointer to a not-present page, for example. But there are times that page faults can happen within the kernel. As a simple example, consider a read() call; if the buffer provided to read() is not resident in RAM, a page fault will result when the kernel tries to access it. At that point, execution will be blocked as usual, but it will be blocked in the kernel rather than in user space.
Blocking on page faults within the kernel is a normal experience when dealing with user-space memory, and everything works as it should. There is one little problem, though. If an attacker can force a page fault at a known point in the kernel — which is often not hard to do — they can use userfaultfd() to block the execution of a thread in the kernel indefinitely. That, in turn, can expand a race window that would otherwise be difficult or impossible to hit, giving the attacker a chance to change the world in potentially surprising ways while the kernel is waiting.
This abuse of userfaultfd() is not just a theoretical possibility; various exploits (example) using userfaultfd() have been disclosed over the years. The problem was deemed serious enough that some restrictions were added in 2020. If the vm/unprivileged_userfaultfd sysctl knob is set to zero (as it is on many distributions), then one of two conditions must apply for a userfaultfd() call to succeed: either the calling process has the CAP_SYS_PTRACE capability, or it supplies the UFFD_USER_MODE_ONLY flag to the system call. In the latter case, page faults encountered while running in the kernel will not be processed via the userfaultfd() mechanism, even if they occur within a registered area.
This change was merged for 5.11 at the end of 2020. It closes off this use of userfaultfd() by attackers, but it also makes the full functionality unavailable to legitimate (but unprivileged) processes. As Rasmussen notes in this patch from the series, that problem can be worked around by giving the process in question the CAP_SYS_PTRACE capability, but that enables a number of actions that have nothing to do with userfaultfd(). Specifically, it could allow the process to read data from or inject code into any other process on the system, which may be undesirable. It would be good, instead, to be able to enable the full userfaultfd() functionality for a process without granting it wider, unneeded privileges.
Rasmussen's solution is to create a new special file called /dev/userfaultfd that gives access to this functionality without the need to call userfaultfd(). One might think that opening this file would yield a file descriptor that acts just like the descriptor from userfaultfd(), but it is not quite as simple. Instead, the only thing that can be done with a /dev/userfaultfd file descriptor is to call ioctl() with the USERFAULTFD_IOC_NEW command; that will create a userfaultfd()-style file descriptor.
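Under the proposal, obtaining the descriptor might look something like the sketch below. The device path and the USERFAULTFD_IOC_NEW command come from the patch series and could still change before merging; whether the ioctl() takes a flags argument mirroring userfaultfd() is an assumption here.

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>	/* for USERFAULTFD_IOC_NEW, per the patches */
#include <sys/ioctl.h>
#include <unistd.h>

/* Return a userfaultfd()-style descriptor obtained via the proposed
   character device, or -1 on failure. */
int userfaultfd_from_dev(void)
{
	/* Access is gated purely by the permissions on the device node. */
	int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
	if (dev < 0)
		return -1;

	/* Passing userfaultfd()-style flags here is an assumption about the
	   proposed API; the returned descriptor is then used exactly like
	   one from the system call (UFFDIO_API, UFFDIO_REGISTER, ...). */
	int uffd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC);
	close(dev);
	return uffd;
}
```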
A file descriptor created in this way will behave like one from userfaultfd() in every way, with one exception: the handling of kernel faults will be allowed regardless of the calling process's privilege level or the setting of the global sysctl knob. The effect, in other words, is to circumvent the 2020 patch, making full userfaultfd() features available again to all processes. The catch is that a process must be able to open /dev/userfaultfd in the first place to gain access to the feature it provides. By setting the access permissions on this file, an administrator can control who is able to open it and use userfaultfd() in this way.
In other words, /dev/userfaultfd allows an administrator to give the ability to handle kernel faults to specific processes without the need to grant any other privileges. This patch series is in its third revision, and it would appear that the review comments received so far have been addressed. Barring some sort of surprise, this new tweak to the security policy surrounding userfaultfd() seems likely to find its way into the kernel during a near-future merge window.
Zoned storage
Zoned storage is a form of storage that offers higher capacities by making tradeoffs in the kinds of writes that are allowed to the device. It was the topic of a storage and filesystem session led by Luis Chamberlain at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). Over the years, zoned storage has been a frequent topic at LSFMM, going back to LSFMM 2013, where support for shingled magnetic recording (SMR) devices, which were the starting point for zoned storage, was discussed.
Chamberlain began with the news that a zoned storage microconference had been accepted for this year's Linux Plumbers Conference (LPC). He encouraged attendees to submit topics and hoped it was an opportunity to introduce more user-space developers to zoned-storage concepts. LPC will be held September 12-14 in Dublin, Ireland.
![Luis Chamberlain](https://static.lwn.net/images/2022/lsfmm-chamberlain2-sm.png)
In a "really fast intro" to zoned storage, he quoted from the zoned-storage
home page linked above: it is "a class of storage devices that enables
host and storage devices to cooperate to achieve higher storage capacities,
increased throughput, and lower latencies
". It comes in different
form factors, including SMR and "the latest and trendiest one", which is
based on SSDs using NVMe zoned
namespaces (ZNS). The storage device is divided into zones; "sequential
zones" are those where all
writes in a zone must be sequential and the only way to overwrite the zone is to
reset its write pointer to the beginning. Both SMR and ZNS devices can
optionally also have "conventional zones" (or namespaces) that support random
writes.
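For readers unfamiliar with how zones appear from user space, here is a hedged sketch using the kernel's zoned-block ioctl() interface (linux/blkzoned.h): it reports the first zone's extent and write pointer, then resets that zone so it can be rewritten from the beginning. The device path is hypothetical and error handling is minimal.

```c
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDWR);	/* hypothetical zoned device */
	if (fd < 0)
		return 1;

	/* Ask for a report covering the first zone. */
	struct blk_zone_report *rep =
		calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
	rep->sector = 0;
	rep->nr_zones = 1;
	if (ioctl(fd, BLKREPORTZONE, rep) == 0 && rep->nr_zones == 1) {
		struct blk_zone *z = &rep->zones[0];
		printf("zone start %llu, len %llu, write pointer %llu\n",
		       (unsigned long long)z->start,
		       (unsigned long long)z->len,
		       (unsigned long long)z->wp);

		/* For a sequential zone, the only way to make its space
		   writable again is to reset the write pointer back to
		   the start of the zone. */
		struct blk_zone_range range = {
			.sector = z->start,
			.nr_sectors = z->len,
		};
		ioctl(fd, BLKRESETZONE, &range);
	}
	free(rep);
	close(fd);
	return 0;
}
```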
There can be drives that only have sequential zones, however, which is something that filesystem developers need to keep in mind, he said.
There are some "ecosystem considerations" to also keep in mind. For one, replacing a zoned-storage device can only be done with another device that has the same zone sizes. Also, ZNS requires manually switching the kernel I/O scheduler to mq-deadline. That is only true for some filesystems, though, an audience member said. Damien Le Moal said that the underlying issue is that the order of writes needs to be guaranteed and the scheduler was the easiest place to ensure that.
Ted Ts'o said there is a philosophical question on whether the kernel should be tracking the write pointers for the zones or whether user space should be doing that. At Google, there is an out-of-tree patch set that hacks the CFQ scheduler so that it will not merge requests and gives user space the responsibility to track the write pointer, but he knows that this out-of-tree approach is not sustainable long term. It was generally agreed that there is much to discuss around these topics at LPC.
Moving on, Chamberlain noted that the patches for supporting zone sizes that are not powers of two (npo2) had been posted and that more revisions should be coming soon (v6 was posted on May 25). He also said that btrfs check currently does not work for image dumps of any zoned storage, probably due to a lack of zone information on the disk. There are also some questions about how to write the superblocks for Btrfs on not-power-of-2 devices.
Hearkening back to an earlier session, Chamberlain said that the lack of support for ioctl() and direct I/O (i.e. O_DIRECT) in Java causes problems for using it with zonefs, which requires direct I/O. Le Moal said that the long-term goal is to remove the direct I/O requirement, but it is currently needed to maintain the write-order guarantees.
After Chamberlain asked about the status of bcachefs, Kent Overstreet said that it already has native zoned-device support. The allocation mechanism for bcachefs has always been bucket-based, going back to bcache, which is a good match to zoned storage. Those devices will work just as well as regular block devices for bcachefs, he said.
Chamberlain said he has been using kdevops to "test the hell out of zoned-storage devices" with both fstests and blktests. There are two modes of testing that can be done, either with real hardware or using QEMU virtual devices. There are some problems, currently, with the QEMU driver for ZNS devices, Le Moal said.
Bart Van Assche wondered if there was a need to keep an eye on the standards committees because it seems like there is a gap between the feature sets offered by SCSI and NVMe. Today the zoned-device code is shared for SCSI and NVMe in the kernel, but he worries that may have to change. In particular, filesystems might be forced to choose which of the two types they support. Le Moal said that there are features in ZNS that are not available for SCSI; Btrfs has been trying to adapt to those differences, for example.
Van Assche is working on Android devices with zoned storage. The devices are currently SCSI, but will eventually be NVMe-based; he is concerned that the filesystem being used will have to change because of that. Le Moal said that the goal is that the filesystems should not have to care about the type of the underlying device. Several others said that because the underlying storage technology is quite different, though, the kernel may end up with two APIs for zoned storage. To a certain extent, the differences in the I/O scheduler reflect that, one attendee said.
Josef Bacik said that there should be only minimal differences that filesystems need to be aware of in order to support zoned storage. There are already changes in Btrfs to support the idea of zones, but that is about as far as things should go; anything further should be pushed out to user space, he said. Ts'o said that the block layer is the right place to handle these extra features and to do so in a way that hides the underlying differences from filesystems. He noted that features like discard are used by filesystems through the block-layer interface, which hides the device differences.
If there are new features that can provide a large benefit, storage-device developers should have a conversation with the filesystem community about them, Ts'o said. If, for example, providing a hint that some data is for a journal, such as for a filesystem or database, would make a big performance difference, there may be a way to change the filesystems and applications to take advantage of it. But the benefit needs to be large and the interface to the feature needs to be stable. "It would probably be an ioctl()", he said to chuckles, but he could imagine adding a feature of that sort.
Retrieving kernel attributes
At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Amir Goldstein and Miklos Szeredi led a discussion on a new interface for extracting information from kernel objects using the filesystem extended-attributes (xattr) interface. Since Szeredi was not present in Palm Springs, he co-led the session virtually over Zoom audio, which was the only filesystem session with a virtual leader at LSFMM this year. Szeredi's proposal for an interface of that sort had been posted just the day before the session.
Goldstein started things off by noting that there are several use cases where there is a need for a new API to obtain bits of information from the kernel, so it seems like a good idea to create a common API that can meet those needs. Szeredi proposed the getfattr mechanism, which builds on the xattr interface; Goldstein said that he was happy with the idea, as were Szeredi and Dave Chinner, who suggested the idea a year or so ago. In addition, other than an objection to binary data, Greg Kroah-Hartman was "not unhappy" with the idea.
Szeredi took over to describe the proposal in more detail. The intent is to be able to get attributes from some kernel objects; those could be mounts or inodes, but processes or other objects are possible as well. There are several existing interfaces for getting this kind of information, but each has a different way to access the attributes, so it would be nice to have a unified interface, he said.
The xattr API was repurposed for his proposal. It uses a different namespace for the new attributes, however, in order to ensure that legacy code will not break due to unexpected new attributes. For example, listxattr() would not return attributes from the new namespace. One objection to the interface is that it is not efficient enough if there is a need to retrieve multiple attributes. Szeredi said that would need to be tested to see if it is truly a problem, but if so, the API could be extended with a bulk-retrieval mechanism.
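As a very rough sketch of what such a query could look like from user space, the call below simply goes through the ordinary getxattr() path; the attribute name and its namespace prefix are purely hypothetical, since the actual naming scheme was one of the things still being discussed.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void)
{
	char value[4096];

	/* "fattr.mnt.mountpoint" is an invented name for illustration only. */
	ssize_t len = getxattr("/some/file", "fattr.mnt.mountpoint",
			       value, sizeof(value) - 1);
	if (len < 0) {
		perror("getxattr");
		return 1;
	}
	value[len] = '\0';
	printf("mount point: %s\n", value);	/* string-valued, per the proposal */
	return 0;
}
```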
![Amir Goldstein](https://static.lwn.net/images/2022/lsfmm-goldstein2-sm.png)
Goldstein said that the same interface could be used for a "setfattr" tool that could set attributes; he wondered if there were any objections to the general idea. David Howells said that he had some "potential objections" that are likely surmountable: for example, getfattr does not have the right security checks. It should either have no security checks or ones like the statfs() system call has. The checks required could be based on the namespace being queried, so that Linux security module (LSM) checks could be accommodated as needed.
Howells would also rather see the information be returned as binary data, rather than strings, especially for things that need to be retrieved quickly. He has gotten messages from some developers who liked his fsinfo() proposal because it returned data in binary form, so there was no need to parse it. Goldstein said that others want to be able to use getfattr in shell scripts, however.
The idea is to have a simple and flexible generic interface, he said. If there is a need for higher performance, then once that has been demonstrated, a different interface can be added. Howells said that there is a need for systemd to be able to read a list of thousands of mounts; it will need higher performance. But Goldstein said that most systems do not have thousands of mounts; another interface that is less simple and generic can be added for those kinds of use cases.
Ted Ts'o said that the interface being proposed is not for reading lists of thousands of mounts, but is, instead, for getting information like: "what is the mount point for that particular file or directory?" Thousands of mount points are a reality on production systems at Google, he said, but this interface is not meant for that case. In his mind, getfattr is the non-controversial part; it is the setfattr piece that has not been specified which requires a lot more consideration. There are questions of which attributes can be set, what the permissions required are, how the interface can be introspected, and so on. If the setting interface is not done right, he said that Luis Chamberlain would eventually have to give a talk to complain that it is just as bad as ioctl() is. If getfattr is the camel's nose under the tent for an unseen set interface, that worries Ts'o.
Christian Brauner said that the systemd developers should get an opportunity to weigh in on the proposal. There are longstanding bugs and serious performance issues that the tool has experienced when gathering mount properties on thousands of mounts on production workloads. That is part of what is driving the fsinfo()-style of interface. Brauner thinks it is important to address those problems in any kind of proposal of this sort.
Howells said that some kind of "get mountlist" call might be sufficient to solve the specific problems that systemd is experiencing. Goldstein said that it was not necessary for this proposal to solve all of the problems, however. The problem for systemd is that a single change to the mounted filesystems requires it to rescan the mount tree because it does not get notified of what the change is, only that something has changed, Szeredi said. If the notification could somehow be improved, that might solve systemd's problem.
Goldstein said that the proposal email showed a static hierarchy of attributes but that the hierarchy can be extended flexibly such that each filesystem type could have its own namespace. The CIFS filesystem already does that for both getting and setting attributes in a cifs.* namespace. Brauner asked what the new system call underlying getfattr looked like. Goldstein replied that it was simply using getxattr(). The difference is in the interpretation of the namespaces that are included in the path name argument.
Ts'o said that it did not really make sense for ext4 to switch to this xattr-based mechanism, since it already has a way for programs to retrieve ext4-specific information via sysfs. That code must be maintained for backward compatibility, so adding more code to support the xattr-based mechanism is not attractive. Trying to force all filesystems and applications to use the proposed interface for filesystem-dependent information is probably a bad idea, he said. Any filesystem that wants to use it, should go ahead and do so, however. He just does not see any real value for ext4.
Brauner suggested making it a different system call, even if it is actually using the same getxattr() code underneath. The current expectation for xattrs is that they are stored on disk associated with a file, which is not the case for "fattrs". Goldstein agreed that it probably makes sense to do so. Two other things to consider are adding a getxattrat() system call and, perhaps, a way to get multiple xattrs in a single call, he said. XFS has an ioctl() command for getting multiple xattrs, which could perhaps be generalized. With that, the session ran out of time, but it seems that the xattr-based approach will continue to be pushed forward.
A discussion on readahead
Readahead is an I/O optimization that causes the system to read more data than has been requested by an application—in the belief that the extra data will be requested soon thereafter. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Matthew Wilcox led a session to discuss readahead, especially as it relates to network filesystems, with assistance from Steve French and David Howells. The latency of the underlying storage needs to factor into the calculation of how much data to read in advance, but it is not entirely clear how to do so.
Wilcox began by describing readahead a bit. If user space is reading a file one byte at a time, Linux does not actually read the data that way; instead, it issues reads for a bigger chunk, say 64KB, which gets stored in the page cache. There is a certain amount of latency between the time a page is requested from the storage and when it appears in the page cache; that latency varies greatly over the wide variety of storage types that Linux supports. For network storage, those types can range from locally stored data on multi-gigabit Ethernet to data stored halfway around the world over decidedly slower links. Similarly, for local storage it can range from a 5GB-per-second NVMe SSD to some "crappy USB key picked up from a vendor at a trade show". There is "a lot of stuff to contend with there".
![Matthew Wilcox](https://static.lwn.net/images/2022/lsfmm-wilcox-sm.png)
In his experience, block-layer developers tend to do all of their testing using direct I/O; they think that "the page cache sucks" so they avoid it in their testing. James Bottomley said they are often trying to exclude the effects of the page cache from their testing in order to eliminate variables outside of their control. Wilcox said that it was unfortunate, since the performance including the page cache is "what the users are actually seeing"; it would be nice to notice problems where either too much or too little readahead is affecting the performance—before users do.
He said that he has a KernelNewbies wiki page where he is collecting his thoughts about readahead and the page cache in general. The page cache "is awesome in some ways, in other ways it's terrible", but it would be good to fix the problems that it has. The Android developers encountered a problem with readahead, but "they worked around it in the worst way possible". They changed a setting for readahead, moving it from 256KB to several hundred megabytes, he said, in order to shave "some fraction of a second" from application startup time. That has other effects, of course; when they tried to upstream the patches, the memory-management developers said "no".
Howells suggested that the Android developers should be using fadvise() (or, he amended, madvise()) to indicate that the file should have more aggressive readahead. But Wilcox did not agree: "can we stop trying to pretend that user space knows what it is doing?" Android is a specialized environment, Bottomley said; it uses a log-structured filesystem, so the workaround may actually make sense. Wilcox expressed some skepticism on that score.
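The hints Howells had in mind presumably look something like the calls below: posix_fadvise() with POSIX_FADV_SEQUENTIAL roughly doubles the kernel's readahead window for a file descriptor, while madvise() with MADV_WILLNEED asks for a mapped range to be read in ahead of use. Whether either would have satisfied the Android use case is, of course, exactly what was being debated; the file path is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/data/app/base.apk", O_RDONLY);	/* hypothetical file */
	if (fd < 0)
		return 1;

	/* Tell the kernel this file will be read sequentially, which
	   increases its readahead window. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	/* For a mapped file, MADV_WILLNEED triggers readahead of the range. */
	struct stat st;
	fstat(fd, &st);
	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map != MAP_FAILED)
		madvise(map, st.st_size, MADV_WILLNEED);

	/* ... actual reads proceed as usual ... */
	close(fd);
	return 0;
}
```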
Overall, there are a bunch of readahead problems to solve. "I'm going to say the 'f' word; folios play into this a bit." The use of larger folios is driven by readahead, currently; the larger the readahead data gets, the larger the folios to hold it get. That is useful for testing. Filesystems that support larger folios, which is only XFS for now (though AFS and v9fs patches are queued), will allocate four-page (i.e. order-2) folios.
Wilcox said that French made him aware that Windows does huge reads for readahead on CIFS. French agreed with that, noting that because of the expected network latency, Windows can read ahead up to 16MB, though that seems to degrade performance. He tested with a number of different sizes (256KB, 512KB, 1MB, 4MB) and found better performance with each of those, the most dramatic being seen when going to 512KB. On the Azure cloud, the value was set to 1MB because there were performance decreases for some workloads at 4MB.
The Linux CIFS server defaults to 4MB, he said, based on the results of his testing. It is clear that anything less than 1MB performs worse unless there is a fast network in between. The problem he sees is how this value can get changed sanely, throttled or raised as appropriate. Sometimes the page cache knows more than the filesystem, or the reverse can be true, and the network layer needs to factor in as well. It is not clear to him how that can all be resolved.
There is a mechanism to communicate this kind of information from the filesystem to the virtual filesystem (VFS) layer and page cache using the BDI (struct backing_dev_info), Wilcox said. That is where the VFS looks to find out the performance characteristics of the underlying storage. It may not currently have all of the right information, he said, but that's the place to put it.
When user space is reading a file one byte at a time, a 64KB read is issued instead; a "was this readahead useful" marker is placed around 20KB into the buffer and when it is reached, another 64KB read is issued. The intent is that the second read completes before user space consumes the remaining 44KB, but the filesystem has no idea of what the latency is for the read. One could imagine measuring how long it takes to do the read and comparing it with the user-space consumption rate to better determine when to schedule the next read, he said, but that is not done.
That second 64KB read has its marker placed right at the beginning of the buffer; when that marker is reached, it decides to grow the readahead buffer and reads 128KB. It will increase once more (to 256KB), but not any further, though it should probably go up from there. The 256KB limit has been that way for 20 years and "I/O has hardly changed at all in that time", Wilcox said sarcastically; increasing that limit is many years overdue at this point. Josef Bacik read a chat comment from Jan Kara that said SUSE has had a limit of 512KB for years. Given that, Wilcox thought an immediate move to a 1MB maximum was in order.
But, rather than increasing the size of the read directly, Wilcox would rather issue multiple 256KB reads back to back because that will help the page cache track if the data that is read ahead is actually being used. French said that would be preferable for CIFS, as multiple smaller reads are better performance-wise. Jeff Layton came in via Zoom to say that he thought a single, larger read would be better for NFS and was surprised to hear that smaller reads were better for CIFS.
Howells said that which of the options is better is totally filesystem-dependent. Ted Ts'o said that it will also depend on the underlying storage; increasing the readahead size on memory-constrained devices may also be problematic. There is not going to be a single "magic readahead formula" that can be used in all situations. He suggested making it possible to experiment with different algorithms using BPF; that way a grad student, for example, could easily try out different ideas to see which worked best. Wilcox said that he had wanted to argue with that idea but then decided he liked it because it would make it someone else's problem, which was met with much laughter.
Chuck Lever said that he thought the readahead situation was being greatly oversimplified in the discussion. In the original example of a program reading a byte at a time, there is no reason to increase beyond 64KB, since that will keep up with the program just fine. There are problems with requesting too much data and filling the page cache with pages that are not actually going to be read.
Another problem is that queueing up a 1MB read, for example, on a network filesystem will hold up other smaller requests, like metadata updates, while the read is ongoing. He agrees with the need for experimentation and thinks that should take precedence over any immediate increase in the kernel's readahead size. There is a need for testing with a lot of different workloads, filesystems, and so forth to determine what the overall systemic effects of that kind of change would be.
Remote participation at LSFMM
As with many conferences these days, the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) had a virtual component. The main rooms were equipped with a camera trained on the podium, thus the session leader, so that remote participants could watch; this camera connected into a Zoom conference that allowed participation from afar. In a session near the end of the conference, led by conference organizer Josef Bacik, remote participants were invited to share their experiences—on camera—with those who were there in person. It was an opportunity to discuss what went right—and wrong—with an eye toward improving the experience for future events.
Ric Wheeler was first up; he said that aside from the first few minutes where he could not hear anyone, "it was a good virtual experience". Mel Gorman echoed that and noted that it "was infinitely superior to not being able to participate at all". One thing he noted is that the "raised hands" in the Zoom interface were not monitored in some sessions. There were two sessions that he remembered where the speaker asked for objections or other comments and he was left "screaming at the mic". It is difficult to keep an eye on that when leading a session, he said; it was only a minor detraction from the overall great experience.
Bacik agreed that there were problems with ensuring that remote participants were heard. James Bottomley, who was onsite, asked why the "curated mute" feature was chosen; "since we vaguely trust ourselves", why not allow participants to unmute on their own? Bacik said that he had not really thought about it much going into LSFMM; eventually the filesystem track switched to "unmute at will", which seemed to work well.
Remote participant Jan Kara agreed that being able to unmute when he wanted to say something worked well, though, overall, the experience was "better than I was expecting". Bottomley said that it would have been better if Kara had turned on his video when he was going to talk, so that there was someone to focus on rather than having a "disembodied voice from around the room". Kara said he was unsure whether there was a way to see his video in the room if he had done so; in fact, the video side of the Zoom conference was only available if attendees were signed into it. Unlike this session, most of the others did not display the Zoom video on the conference-room screen.
Matthew Wilcox, who was present in Palm Springs, wanted to thank the remote participants for putting in the effort to attend, especially those for whom it was being held at highly inconvenient times. "You have made this such a better conference."
A remote attendee said that LSFMM "went way better than I expected". He is more comfortable interacting via text than voice, so it was nice to have the Zoom chat feature available. Sometimes the chat got read to the room, which was "really great". If people introduced themselves before they made comments on the microphone, he said, it would help, especially for those who are new to the community.
Bottomley asked if it would be better to have a chat that everyone in the room could see (rather than just those logged into Zoom), but the attendee said that giving the session leader the option to decide if the comment was worth passing on made it less worrisome for him to make a comment. Onsite participant Ted Ts'o wondered if having a chat channel separate from the videoconference application, as it is for the Linux Plumbers Conference (LPC), made sense. Sometimes having a chat back-channel is useful even when there is another foreground conversation going on in the audio channel, he said.
Another remote attendee said that the sessions with remote presenters, where people could just comment without raising their hands, seemed to work better. One problem they noticed was that the video resolution was not sufficient to be able to read the slides or screen sharing much of the time; larger fonts would be helpful. An LSFMM virtual presenter noted that he could not see which of the local attendees was speaking, which was less than ideal. Bottomley had pointed his laptop camera at the room for the session and added it to the videoconference, but said that it was hard to see who might be speaking. For LPC, there will be a camera operator at the front of the room who will zoom in on speakers, but that "costs a fairly fantastic amount of money"; the laptop-camera option might be better if it is at all useful, he said.
A first-time attendee, who was remote, said that "content-wise it was perfect", though there was a problem with the volume oscillating periodically. Another remote attendee had several suggestions based on other events and meetings they had attended over the last few years. When the session leader wanders away from the podium, they are not just leaving the video feed, he said, they are also stepping away from the microphone so it can be hard to hear them at that point. A separate wiki or shared notes site that can be updated by everyone is helpful, as is a "watcher of the tech space" to monitor hands being raised, chat messages, and so on. He thought the laptop camera was helpful from a feedback perspective; if nothing else, it gives an indication of when might be a good time to jump in with a remote comment.
One of the A/V people said that, based on doing other events of this sort, he recommended having a separate screen at the front dedicated to the Zoom conference instead of trying to share it with the slides and such. That way, remote people can turn on their cameras and speak to the room when they have comments or questions. In addition, he agreed that having someone in charge of paying attention to the videoconference side of things (a "Zoom ambassador") is important. Bacik thanked the A/V staff, who did a "fantastic job", he said, using a modifier on that phrase that anyone who knows him can guess. That led to a loud round of applause for the staff.
There was some discussion of the problems inherent in being a virtual participant. For example, the more-or-less nine-to-five schedule can be even more difficult over video, which is why all-virtual events tend to stretch out the schedule over more days. That is not really possible for a hybrid event, since the in-person attendees cannot typically stay for many more days; the expense of renting the conference facility also factors into that. But virtually attending a conference in a different time zone and then trying to put in a full day at the office "is brutal", Gorman said.
In addition, virtual attendees miss out on the hallway track, which is often cited as half the value of technical conferences, and the social events, which are also valuable. Gorman said that it might be kind of creepy to try to somehow involve virtual attendees in those. There are tradeoffs that people are making when they attend virtually; missing out on those pieces is just part of that, he said. He is concerned that going too far in trying to accommodate virtual attendees would reduce the value of the in-person portion too much. There was general agreement with that.
Overall, it would seem that LSFMM went well, both locally and virtually. Much of that was due to the help of the Linux Foundation events staff, who Bacik has been working with for three years as the conference was scheduled—then canceled—several times. He specifically thanked those folks in the conference closing session a few hours later. It would seem there are some fairly minor improvements to be made, so a hybrid LSFMM next year should be even better.
Page editor: Jonathan Corbet