
LWN.net Weekly Edition for June 9, 2022

Welcome to the LWN.net Weekly Edition for June 9, 2022

This edition contains the following feature content:

  • Per-file OOM badness: giving the out-of-memory killer visibility into memory attached to files.
  • What constitutes disclosure of a kernel vulnerability?: the linux-distros list adjusts its policies to fit kernel practice.
  • 5.19 Merge window, part 2: the rest of the changes merged for the next kernel release.
  • Maintainers don't scale: an LSFMM session on maintainer burnout and expectations.
  • Best practices for fstests: collaborating on filesystem-test infrastructure and expected failures.
  • ioctl() forever?: the multiplexing system call's downsides and the lack of alternatives.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Per-file OOM badness

By Jonathan Corbet
June 2, 2022
The kernel tries hard to keep memory available for its present and future needs. Should that effort fail, though, the tool of last resort is the dreaded out-of-memory (OOM) killer, which is tasked with killing processes on the system to free their memory and alleviate the problem. The results of invoking the OOM killer are never going to be good, but they can be distinctly worse if the wrong processes are chosen for an untimely end. As one might expect, properly choosing those processes is the subject of ongoing work. Most recently, Christian König has proposed a new mechanism to address a blind spot in the OOM killer's deliberations.

When the system runs out of memory, the OOM killer's job is to try to resolve the problem while causing the least possible amount of collateral damage; a number of heuristics have been applied to the victim-choosing logic toward that end. One obvious rule is that it is generally better to kill fewer processes than many, and the way to do that is to select the processes that are currently consuming the most memory. Often, a single out-of-control process is responsible for the problem in the first place; if that process can be identified and killed, the system can get back into a more stable condition.

The OOM killer, thus, scans through the set of running processes to find the most interesting target. At the core of this calculation is a function called oom_badness(), which sums up the amount of memory (and swap space) being used by a candidate process. That sum is then adjusted by the process's oom_score_adj value (which is a knob that an administrator can tweak to direct the OOM-killer's attention toward or away from specific processes) before being returned as the process's score. The process with the highest score as determined by this function will be the first on the chopping block. Any process's score can be seen at any time by reading its oom_score value from its /proc entry.
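
As a concrete illustration of those knobs, the following minimal sketch reads a process's current score and then lowers its oom_score_adj to steer the OOM killer away from it; the /proc files are the real interface, but the adjustment value is only an example, and lowering it below the current setting normally requires privilege.

    #include <stdio.h>

    /* Read this process's current OOM score, then lower its
     * oom_score_adj; the adjustment range is -1000 (never kill) to 1000. */
    int main(void)
    {
        int score;
        FILE *f = fopen("/proc/self/oom_score", "r");

        if (f && fscanf(f, "%d", &score) == 1)
            printf("current oom_score: %d\n", score);
        if (f)
            fclose(f);

        f = fopen("/proc/self/oom_score_adj", "w");
        if (f) {
            fprintf(f, "-500\n");   /* example value only */
            fclose(f);
        }
        return 0;
    }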

One problem with this algorithm, as identified by König, is that oom_badness() does not take into account all of the memory used by a process. Specifically, memory associated with files is not counted; consider, for example, any extra memory that a device driver must allocate when a device special file is opened and operated upon. For some workloads, this memory can be significant, with the result that the processes accounting for the most memory use might not look like attractive OOM-killer targets.

As a simple example, he said in the patch-series cover letter, a malicious process can call memfd_create(), then just write indefinitely to the resulting memfd; the memory consumed by the memfd will not be seen as belonging to the offending process so, when the memfd ends up consuming all of the available memory, the OOM killer will pass over that process. This sequence "can bring down any standard desktop system within seconds". Another problem area, he said, is graphics applications that allocate significant amounts of memory within the kernel for graphical resources.
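
The scenario he describes needs nothing exotic; a sketch along the following lines (an illustration of the cover letter's description, not code from the patch set) will consume memory that is charged to the memfd rather than to the process itself:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Keep appending zero-filled blocks to an anonymous memfd; the memory
     * backing the file is not counted against this process's badness score. */
    int main(void)
    {
        static char buf[1 << 20];               /* 1MiB, zero-filled */
        int fd = memfd_create("hog", 0);

        if (fd < 0)
            return 1;
        while (write(fd, buf, sizeof(buf)) > 0)
            ;                                   /* grow until memory runs out */
        return 0;
    }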

The solution is to give the OOM killer visibility into the memory resources that are consumed in this way. That, in turn, involves adding yet another member to the ever-growing file_operations structure:

    long (*oom_badness)(struct file *file);

Documentation is lacking, but the intent seems to be that this function, if it exists, should return the amount of extra memory attached to the given file, in pages. This function will be called from within the global oom_badness() function to take that extra memory usage into account; if the file involved is shared between processes, the memory usage will be divided equally among those processes.
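
The patch series adds the real implementations in the graphics and DMA-buf code; purely as a hypothetical sketch of what a driver-side implementation of the proposed hook might look like (the mydrv names and private-data layout are invented for illustration), it could be as simple as:

    /* Hypothetical driver sketch, not taken from the patch set: report how
     * many pages of driver memory are attached to this open file. */
    static long mydrv_oom_badness(struct file *file)
    {
        struct mydrv_buffer *buf = file->private_data;

        return buf ? buf->size >> PAGE_SHIFT : 0;
    }

    static const struct file_operations mydrv_fops = {
        .owner        = THIS_MODULE,
        .open         = mydrv_open,
        .release      = mydrv_release,
        .oom_badness  = mydrv_oom_badness,
    };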

Implementations of the new function have been added to the shared-memory filesystem code, the DMA-buf subsystem, and to most graphics drivers. With this mechanism in place, the system has a better idea of where the OOM killer's wrath should be directed to maximize the chances of freeing up significant amounts of memory and bringing the system back to a stable state.

Of course, the hazards of any new heuristic can be seen in this claim in the cover letter: accounting for this memory, König says, provides "a quite a bit better system stability in OOM situations, especially while running games". Accounting for memory used by graphics drivers is likely to point the finger at graphics-intensive applications — games, for example — as the source of an out-of-memory problem. Having the OOM killer take its vengeance on that game may restore the system, but the user, whose nearly complete quest would be abruptly terminated thereby, might be forgiven for thinking that the situation was better before.

In other words, there is still no truly good solution to the OOM problem other than not getting into that situation in the first place. After all, the OOM killer is still, as Andries Brouwer suggested in 2004, like choosing passengers to toss out of a crashing aircraft. When the system runs out of memory anyway, though, it is important to free memory quickly, and that is most likely to happen if the OOM killer has an accurate picture of which processes are using the most memory. Properly accounting for memory attached to files seems like a useful step in that direction.

Comments (37 posted)

What constitutes disclosure of a kernel vulnerability?

By Jonathan Corbet
June 3, 2022
Opinions differ on the best way to disclose security vulnerabilities, but there is a general consensus in our community that vulnerabilities should, indeed, be made public at some point. What happens between the discovery of a vulnerability and its disclosure can be more controversial. A recent discussion on the handling of kernel vulnerabilities has led to a change in the policies of the linux-distros mailing list — all based on the question of what constitutes "disclosure".

There are two mailing lists that are commonly used for the discussion of vulnerabilities in the Linux community; they are not limited to kernel problems. The first of these, linux-distros, is a closed list that is used to coordinate the response to non-public security bugs. The second, oss-security, is a public list which is used for, among other things, the public disclosure of vulnerabilities. Both are administered by Alexander "Solar Designer" Peslyak.

There is a long list of policies that apply to postings on linux-distros, including one that requires the public disclosure of all vulnerabilities reported there within a relatively short period of time. That rule is there to ensure that companies don't sit on vulnerability reports indefinitely, no matter how embarrassing they are. Another list policy, though, says that vulnerabilities that are already public have no place on linux-distros; all discussion of public vulnerabilities belongs on oss-security instead. The implementation of these policies has often proved to be tricky, especially when dealing with kernel vulnerabilities; see this 2021 article for a recent example.

In mid-May, Peslyak wrote to oss-security in search of a solution for the ongoing mismatch between the list policies and how the kernel project does business. The core problem is how security problems are often handled in the kernel community:

For Linux kernel maintainers, it is customary to post a fix technically publicly but without indication of its security relevance, then work on getting it merged into the various trees, and expect that its security relevance wouldn't be clearly indicated publicly for a while.

Such patches tend to look like this (though the exploitability of that particular bug has not been verified here). According to the linux-distros list policy, this public posting of a fix makes a particular vulnerability ineligible for discussion there — the vulnerability has already been disclosed. But distributors of the Linux kernel still often want a way to discuss the real problem, which has not actually been disclosed, under embargo, and to coordinate shipping the fix to their users. That cannot be done on oss-security, which is public, and it cannot be done on linux-distros because posting a patch is seen as having disclosed the vulnerability.

Increasingly, Peslyak said, the linux-distros policy is simply being ignored when it comes to kernel vulnerabilities; he asked what should be done about that problem. One option, he said, would be to continue to look the other way when a vulnerability for which a public patch is available shows up on linux-distros. Alternatives would include strictly enforcing the policy (and thus forcing kernel vulnerabilities off the list entirely), changing the list policy, or even just shutting down the list entirely.

As one might expect, a variety of opinions was expressed — though nobody seemed to be in favor of just killing the list. Jason Donenfeld suggested enforcing the policy, since kernel developers have little interest in anything but fixing the bug anyway. The dominant view, though, seemed to be in favor of adjusting the list's policies to better fit how the kernel project operates. As Donenfeld put it, kernel developers have little interest in the "security game" and are unlikely to start playing it, but Greg Kroah-Hartman described another reason why the kernel project handles security fixes the way it does:

As you know, there are different "grades" of attackers. There's a huge range from "run metasploit that I just downloaded" to "look at this kernel change and figure out how to abuse the system that does not have it". By delaying a small bit of time from publicly posting a patch to telling the world that "hey, that was a security fix over there" that allows the community that works in the public added time for review and testing as our testing infrastructure that is NOT public is quite limited and reviews are limited given the huge range of needed developers to do that review.

That delay can allow users to have the fix on their system first before the "metasploit" package is updated to attack it, which reduces the amount of vulnerable systems out there. Yes, it does not solve the "prevent readers of all commits" issue, but I don't know what we can really do about that except switch to a closed source development model, which isn't a good thing overall anyway.

The review issue is not a small one. Security fixes are not immune from the ills that plague software development in general; they can easily introduce bugs (including security-related bugs) of their own, cause user-space regressions, and more. Like all other changes, they benefit from more review (and extensive testing) before being applied. There are limits to how much of that review and testing can happen without posting the patch in public.

The benefits of obscuring the security problem motivating a specific patch may not be quite so clear. It may well reduce the number of casual attacks from the people often known as "script kiddies" but, as Kroah-Hartman pointed out, it does little to defend against capable attackers who are reading the commit stream for the purpose of finding vulnerabilities. Even so, it seems clear that there are developers and companies that see good reasons to keep security problems under wraps, at least for a short period.

Given the perceived value of posting patches without explicitly disclosing the underlying security issue, Kroah-Hartman said, the best thing to do would be to amend the list policy to allow such posts. He suggested that other large projects might also benefit from such a policy. Peslyak wasn't convinced that those projects, most of which do not use linux-distros at all, would be interested, but he did, in the end, decide to amend the linux-distros policy to accommodate the kernel's way of doing things. Issues with a public fix are themselves considered public, the policy reads, except when they aren't:

There can be occasional exceptions to this, such as if the publicly accessible fix doesn't look like it's for a security issue and not revealing this publicly right away is somehow deemed desirable. In particular, we grant such exceptions to Linux kernel issues concurrently or very recently handled by the Linux kernel security team.

Given the lack of subsequent discussion, it seems likely that this change is acceptable to the list members as a whole. Meanwhile, Vegard Nossum is working on some changes to the kernel's documentation to make the policies for the reporting of security bugs more clear. None of this will definitively end the discussions around vulnerability reporting, disclosure, and mailing-list policies but, with luck, it will make things work a bit more smoothly than they do now.

Comments (5 posted)

5.19 Merge window, part 2

By Jonathan Corbet
June 6, 2022
The 5.19 merge window was closed with the 5.19-rc1 release on June 5 after the addition of 13,124 non-merge changesets to the mainline kernel. That makes this merge window another busy one, essentially matching the 13,204 changesets seen for 5.18. The approximately 8,500 changesets merged since our first 5.19 merge-window summary contain quite a bit of new functionality; read on for a summary of the most interesting changes that were pulled during the second half of this merge window.

Architecture-specific

  • The remaining 32-bit Armv4T and Armv5 systems have finally been dragged into the multi-platform world. As Arnd Bergmann noted in the merge message: "This series has been 12 years in the making, it mostly finishes the work that was started with the founding of Linaro to clean up platform support in the kernel".
  • The h8300 architecture has been removed — again. As noted in the merge message, it was deleted once in 2013 and reinstated two years later. Since then, it has seen almost no maintenance, so now it is gone again.
  • Changes to the riscv architecture include the addition of support for "supervisor-mode page-based memory types" (allowing pages to be marked with attributes like "non-cacheable"), support for running 32-bit binaries on 64-bit systems, and an implementation of kexec_file_load().
  • The initial support for Loongson's "LoongArch" CPU architecture has been merged.

    LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V. LoongArch includes a reduced 32-bit version (LA32R), a standard 32-bit version (LA32S) and a 64-bit version (LA64).

    This documentation commit has more information about this architecture.

  • There is a new generic ticket spinlock type that can be implemented on most architectures that cannot support a full qspinlock implementation. It is being used by openrisc, csky, and riscv.

Core kernel

  • A new proactive reclaim mechanism has been merged that gives user space some control over working-set management. The memory controller has a new control file called memory.reclaim; writing a number there will initiate an attempt to reclaim that many bytes from the indicated control group. See this commit for more information; a brief usage sketch appears after this list.
  • The longstanding problems with copy-on-write and get_user_pages() have gotten a bit better with the merging of an extensive set of fixes; this article describes the changes.
  • The kernel can better account for (and control) the use of memory when compressed swapping with zswap is in use; see this changelog for a bit more information.
  • The kernel can now keep track of which modules (if any) tainted the kernel, even after those modules are unloaded.
  • The sysctl knobs for the System V interprocess communication mechanisms have been reworked to be properly associated with each IPC namespace. This paves the way toward allowing unprivileged processes to change them within user namespaces, but that last step has not yet been taken.
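
Returning to memory.reclaim mentioned above, a minimal sketch of its use (assuming cgroup v2 is mounted at /sys/fs/cgroup and a control group named "example" exists) is simply a write of a byte count to the new file:

    #include <stdio.h>

    /* Ask the kernel to try to reclaim 100MiB from the "example" control
     * group; the group name and amount are placeholders, and the kernel may
     * reclaim less than requested. */
    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/example/memory.reclaim", "w");

        if (!f)
            return 1;
        fprintf(f, "%llu\n", 100ULL << 20);
        fclose(f);
        return 0;
    }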

Filesystems and block I/O

  • The fanotify mechanism implements a new flag (FAN_MARK_EVICTABLE) that does not pin the targeted inode in the inode cache. If the inode is evicted for any reason, the associated mark will be lost. The purpose of this feature appears to be to allow applications to mark subtrees to be ignored without actually pinning parts of those subtrees in the cache; a rough usage sketch appears after this list.
  • The XFS filesystem has gained the ability to store billions of extended attributes with any given inode. Evidently, there are people out there who actually want to be able to do that. While they were in the neighborhood, the XFS developers raised the maximum number of extents per file from a measly 4 billion to 2^47.
  • XFS has also gained a feature called "logged attribute replay"; it allows multiple extended attributes in a file to be modified together in an atomic fashion. This merge message has a bit more information about both changes.
  • The NFS "courteous server" feature will avoid purging lock state for an unresponsive client for up to a day — unless some other client requests a contending lock. Without this feature, an unresponsive client's locks will be unconditionally purged after 90 seconds.
  • The overlayfs filesystem can now handle ID-mapped mounts.
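
Returning to the FAN_MARK_EVICTABLE flag mentioned above, a rough sketch of its intended use might look like the following; the flag combination and path are assumptions based on the description of the feature rather than code from the kernel, so the fanotify documentation should be consulted for the exact rules:

    #define _GNU_SOURCE
    #include <sys/fanotify.h>
    #include <fcntl.h>

    /* Sketch: ignore open and access events under /srv/cache without pinning
     * its inode in the cache; if the inode is evicted, the mark goes with it. */
    int main(void)
    {
        int fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

        if (fd < 0)
            return 1;
        return fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_EVICTABLE |
                                 FAN_MARK_IGNORED_MASK |
                                 FAN_MARK_IGNORED_SURV_MODIFY,
                             FAN_OPEN | FAN_ACCESS, AT_FDCWD, "/srv/cache");
    }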

Hardware support

  • Clock: Airoha EN7523 SoC system clocks, Renesas RZ/G2UL clock controllers, R-Car V4H clocks, MediaTek MT8186 clock controllers, STMicroelectronics STM32MP13 reset clock controllers, Qualcomm SC7280 LPASS core & audio clock controllers, Qualcomm SC8280XP global clock controllers, Renesas RZ/N1 realtime clocks, and HPE GXP timers.
  • GPIO and pin control: Marvell 98DX25xx and 98DX35xx pin controllers, Qualcomm SC7280 LPASS LPI pin controllers, Renesas RZ/G2UL pin controllers, and Mediatek MT6795 pin controllers.
  • Graphics: NewVision NV3052C RGB/SPI panels, Lontium LT9211 DSI/LVDS/DPI bridges, Synopsys Designware GP audio interfaces, Intel MEI graphics system controllers, Freescale i.MX8MP LDB bridges, and Rockchip VOP2 visual output processors.
  • Input: Azoteq IQS7222A/B/C capacitive touch controllers and Raspberry Pi Sense HAT joysticks.
  • Miscellaneous: Apple SART DMA address filters, Apple ANS2 NVM Express host controllers, Microchip PolarFire random number generators, NVIDIA Tegra GPC DMA controllers, Renesas RZ/N1 DMA routers, Qualcomm light-pulse generators, Xilinx LogiCORE IP AXI timers, Sunplus pulse-width modulators, Apple eFuses, and Qualcomm SC8280XP and SDX65 interconnect buses.
  • Sound: Generic serial MIDI devices, Realtek ALC5682I-VS codecs, NVIDIA Tegra186 asynchronous sample rate converters, Cirrus Logic CS35L45 smart speaker amplifiers, Mediatek MT8186 audio DSPs, and Analog Devices MAX98396 speaker amplifiers.
  • USB: ON Semi FSA4480 analog audio switches.
  • Watchdog: Sunplus watchdogs, Renesas RZ/N1 watchdogs, and HPE GXP watchdogs.
  • Also: the new "hardware timestamp engine" subsystem supports devices that can record timestamps in response to events. The NVIDIA Tegra 194 timestamp provider is the first device supported by this subsystem.

Miscellaneous

  • The kernel's firmware loader can now handle firmware files that have been compressed with Zstandard. The firmware loader also has a new sysfs file that allows firmware loads to be initiated from user space; there is a bit of documentation in this commit.

Virtualization and containers

  • The virtio-blk driver now supports polled I/O, an enhancement that, according to this commit message, improves latency by about 10%.

Internal kernel changes

All of those changes inevitably brought a lot of bugs with them; the time has now come to try to identify and fix those problems. Assuming that 5.19 turns out to be a normal nine or ten-week cycle (and it has been a long time since anything else has happened), the final 5.19 kernel release will happen on July 24 or 31.

Comments (2 posted)

Maintainers don't scale

By Jake Edge
June 6, 2022

LSFMM

In something of a grab-bag session, Josef Bacik led a discussion about various challenges that Linux kernel maintainers face, some of which lead to burnout. The session was originally going to be led by Darrick Wong, but he was unable to come to LSFMM, so Bacik gathered some of Wong's concerns and combined them with his own in a joint storage and filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). As part of the discussion, Bacik presented his view on what the role of a kernel maintainer should be, which seemed to resonate with those present.

Fuzzing and CVEs

He started by noting that some of the areas that Wong wanted to discuss had already come up in other sessions, including the difficulties in setting up and running fstests and the need for backports of fixes to multiple stable releases. One topic that had not come up, though, was the increasing prevalence of people running fuzzers against filesystems, then filing for CVEs on the resulting problems. The CVE then triggers the kernel security process, which limits the amount of time available to make a fix and also limits which people can be involved in the investigation and bug-fixing process.

[Josef Bacik]

In Bacik's opinion, a fuzzed filesystem does not constitute a security bug; "I know I'm probably a heretic for saying that". Filesystems can only be mounted by the root user, he said, but that is often countered with the example of a USB drive; "turn off automount" is his answer for that. In any case, the problem Wong described is not one that he has personally experienced, but he asked Ted Ts'o what he knew about it.

Ts'o said that companies that want to sell their products into certain markets, such as to the US government, have special requirements with regard to fixing security problems. There are various time frames on how quickly the bugs must be fixed on production systems based on their severity score. If those are not met, then there is an auditing process where arguments like "we're not crazy enough to let random container people mount untrusted filesystems" can be made to explain why the CVE does not apply. But that involves describing the situation to a government bureaucrat, which, unsurprisingly, security teams do not enjoy, he said.

The good news is that generally those types of bugs do not have a high severity score because they are not remotely exploitable. Chris Mason said that he did not think the security@kernel.org team was opening CVEs for the reports it receives; Ts'o agreed, but noted that the research labs that find these bugs often have a financial incentive to open CVEs.

Luis Chamberlain said that the process followed depends on the subsystem maintainer to a large extent. Some subsystems, such as networking, do not fix bugs behind closed doors, while others do. The problem, of course, is that public fixes can lead to zero-day exploits until the fix is made and rolled out to distributions and into production. Ts'o said that it may also depend on the employer; there may be pressure applied to a particular employee who happens to also be the maintainer of a subsystem. That is not directly "maintainer stress" but is instead targeting an employee, he said.

Expectations

Bacik said that he wanted to discuss "what we expect from maintainers". Traditionally, the expectation has been that they merge, write, test, review, and backport patches; that is a lot of work for one person. Many maintainers have come up with ad hoc solutions to try to scale that back, such as making sure that the developers for the subsystem are also reviewing patches. Ensuring that there are good testing setups is important to that effort as well.

He would like to ensure that maintainers are also actively working on maintaining the community itself. Linux developers are passionate about the work that they do, but that passion sometimes leads to conflicts. "We get a little short with each other", which does not help maintain good working relationships, he said.

Getting together in the same room, for example at LSFMM, is helpful, but he would like to see more done to get out ahead of these kinds of problems. It would be great to resolve these difficulties before they blow up and before they require getting together at a conference to fix them. A maintainer cannot follow all of the email threads, Bacik said, but it is fairly easy to spot "the big stuff that may be contentious". In those cases, he would like to encourage maintainers to either be a mediator, or find someone in the community who is good at mediating, to try to keep developers from getting "overly invested in the code; in the end, this is just code".

The intent is to find a way to ensure that those who are butting heads do it respectfully and in a way that will allow them to continue working together. If the maintainers could be "a little bit more proactive" about keeping an eye on contentious discussions, it would help head off these kinds of problems.

Ts'o said that he has been having weekly ext4 development meetings that have been really helpful in reducing friction for filesystem development. Wong attends those meetings, which allows them to informally discuss things, maintainer to maintainer, that will ultimately need to be resolved on the linux-fsdevel mailing list. Jan Kara and other filesystem developers, who bring other useful perspectives, attend as well. Though Ts'o is sure that no one needs more meetings, he does wonder if some kind of monthly or quarterly gathering of developers could make a difference.

Christian Brauner agreed with the idea of periodic meetings; he thought it would be "very useful". He has run into mailing-list conflicts along the way and said that it is only "in very very rare circumstances" that a third party comes along to calm things down on the list. Bacik said that it is important to ensure "that a random person can pop into a conversation" to ask the participants to calm down. That is not necessarily the case currently.

It is tempting to suggest that Linus Torvalds (or some other individual) should be the one to step in and calm things down, but depending on a single person is not likely to work. Sometimes people are not paying attention at the right time; Bacik said he did not see the problems Brauner was referring to until well after the fact. Bacik wondered if people who are involved in a conflict of that sort should put out a call for a maintainer or other third party to step in and help calm things down.

There are two areas that he would like to see maintainers take more of a lead in: conflict resolution and technical direction. Sometimes there are two developers who are disagreeing about the best way forward; "that's where I really want maintainers to step up more" and point out the direction that the project should take. That will help prevent the developers from continuing to butt heads and allow the project to progress.

Ts'o said that he agrees that the main goal should be conflict resolution, but that is in an ideal world. In his experience, a maintainer generally picks up all of the tasks that the rest of the community is unwilling to do. Since no one steps up to do test engineering for ext4, for example, he takes care of it because it needs to be done.

He is also spending quite a bit of time in recruiting new people to the ext4 community and in encouraging companies to staff positions for ext4 developers. That includes "working with an unnamed cloud company, not my own" to try to get some of its developers working on ext4 and, hopefully, to join that weekly development call. That is something he suggested other maintainers consider doing; other subsystem developers can also help the maintainer with that effort, which ultimately will help reduce burnout. A big problem is that "we simply do not have enough people on some of the filesystem teams", he said.

Bacik said that it is his goal to reduce the number of things that maintainers are doing so that they can focus on the areas where their experience and knowledge are needed. Automation, especially in testing, is a good way to reduce the burden for maintainers. The important jobs for maintainers, conflict resolution and technical direction, cannot be automated; that is the piece that requires a human.

Moving faster

He concluded the session by saying that he would like to see the development of Btrfs (and, by extension, other parts of the kernel) move more quickly than it does today. In order for that to happen, though, there need to be more and better tests that are being continuously run to detect when bugs are introduced. Testing and test engineering are the kinds of tasks that can be handed off to more junior engineers, he said, though the bar for test quality needs to be maintained.

In general, he said that he wants to foster an environment where things move quickly, but mistakes will be made "and that's OK". Developers need to understand that they can make mistakes, those mistakes will be detected early on, and the change will simply be reverted. Then they can try again. That philosophy can be applied to the tests and testing infrastructure, as well as to new features for the kernel.

Comments (1 posted)

Best practices for fstests

By Jake Edge
June 7, 2022

LSFMM

As a followup to a session on testing challenges earlier in the day, Josef Bacik led a discussion on best practices for testing in a combined storage and filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). There are a number of ways that developers can collaborate on improving the testing landscape using fstests and blktests, starting with gathering and sharing information about which tests are expected to pass and fail. That information depends on a lot of different factors, including kernel version and configuration, fstest options, and more.

Existing testing

Bacik began by briefly describing the testing setup that he uses for Btrfs, which continuously runs fstests. There is a dashboard web site that shows the test runs and failures, along with which configurations are affected. "That has been huge", he said; it has been running for a year at this point.

He noted that Luis Chamberlain runs tests in a loop a thousand times, but Bacik's methodology is to simply run them once. Over the year they have been running that way, he has gathered a list of which tests are flaky; he then fixed many of the flaky tests to make them more reliable. Both types of testing are valid and useful, Bacik said, but his approach has given him the "best bang for the buck" for his needs.

Ted Ts'o said that for finding bugs, he trusts fstests much more than "soaking in linux-next"; he only rarely finds bugs from linux-next testing. Fstests are "way more nitpicky", so a 15% failure rate on an fstest will correspond to a real bug, generally, but not one that users will encounter, even on the most stressful production workloads. That is part of the reason that he is not hugely concerned when there is an fstest failing sporadically. He would love to be able to fix all of the problems that he sees, but realistically, that is not possible; he does not have the time or the engineers on his team to do so. It is an "uncomfortable truth" that is difficult for new users of fstests to understand.

Meanwhile, Ts'o said that he has files with lists of which tests he excludes, along with the reasons they are excluded. They also describe why the tests fail, which can be due to fstests bugs or simply flaky ext4 behavior. He would like to figure out where those files can be checked in so that others can benefit from that information. They are in his GitHub repository, but perhaps adding them to the kernel documentation with a "freshness date" would make more sense.

James Bottomley asked about the 0day testing bot for testing the linux-next kernels. Ts'o said that the bot only runs fstests for a single configuration for ext4, which generally runs cleanly. That configuration, which uses 4KB block sizes, is also the most-used configuration for production systems, but there are a total of 12 configurations that he tests. Bottomley suggested that making more use of the 0day bot would be worthwhile, but Ts'o worried about flaky tests causing lots of spurious regression reports "because the 0day bot got unlucky". But it is worth looking into, he said.

Bacik said he would like to see the community "move towards this new reality" where it is easy to tell what tests are expected to fail. For example, he has no idea what tests are expected to fail for ext4, so when he makes a change that impacts ext4, he does not know if any failures are due to that change. Chamberlain has exclude files for kdevops, but it would be good to have a place where filesystem developers can obtain the exclude files and update them as needed. The tests listed in those files can also be useful as an exercise for onboarding new engineers, he said; they can be asked to track down why a test is failing.

When he gets some time to do some fstests development, Ts'o said, he is going to add a mode that will immediately rerun a failed test 25 or 100 times to establish a failure percentage. He does not want to auto-exclude a test that is failing, say, 15% of the time because he will stop caring about it at that point; ext4 developers need those tests to continue to run. But there are others who simply want to try to determine if they broke anything with their patch, so there needs to be a way to address both types of testing.

Omar Sandoval asked if it made sense to put these kinds of exclude lists directly into the fstests repository. Ts'o said that the lists are configuration specific; Chamberlain elaborated on that, noting that the lists depend on the kernel version, fstests configuration, and what type of underlying device is being used for the test. Tests on loopback filesystems can have different failure modes, for example. There was some discussion of the need to organize the lists based on all of the different factors. Agreeing on naming to describe the fstests configuration will be helpful to allow the associated exclude lists to be portable among various test runners.

Standardization

Chamberlain thought that the kernel configuration could be standardized for test environments, as kdevops has a single configuration that can be used for KVM locally or for a variety of cloud providers. Ts'o said that his test runner (kvm-xfstests/gce-xfstests) also had a standardized configuration, so he and Chamberlain should compare notes. But there are still options that Ts'o needs when building kernels, for example enabling KASAN or kernel modules, which are needed when also running blktests. It is possible that the tool building Kconfig files should be standardized between the test runners, he said.

Bacik said that he would like to see the filesystem developers get out of the business of running nightly and continuous tests. He currently has four systems at home that he uses to do that but would like to retire them and have the Linux Foundation or someone take over doing that job. Ts'o thought that even having a centralized dashboard to report the test results from various developers on their physical or cloud-based systems would be a step in the right direction; having one place to see the current state of fstests would be helpful.

There is a problem with follow-through on these types of efforts, one attendee said. This topic comes up frequently at LSFMM, and ideas for solutions are discussed, but nothing really comes of it all. Christian Brauner agreed that follow-through has been lacking over the years. But Chamberlain said that kdevops came out of discussions at previous LSFMMs; he has spent a lot of time getting that working and it is available for use now. The only reason he did not use Ts'o's test runner is because he wanted to target multiple clouds; gce-xfstests only targets the Google Cloud. Kdevops already has support for exclude lists, Chamberlain said, which he updates regularly. "Kernel configuration that works on all cloud solutions, what else do we need?"

Bacik agreed, noting that he had tried Ts'o's solution, but that it had not worked for Btrfs at the time, whereas kdevops does. He adopted it because he cannot sustain Btrfs development without community support for testing infrastructure; he had been trying to do test wrangling and Btrfs development without success. He would like to see more people coalesce around that solution, work out the bugs and kinks, then turn it over to someone else to either run it or to fund it running in the cloud.

Chris Mason said that the Linux Foundation is not really set up to fund these kinds of efforts directly. Instead, it channels money from interested companies into projects of that sort. He said that it should just be a matter of getting the companies that are funding filesystem development to sign up. The funding is the easy part, he said; the hard part is to get everyone on the same set of tools.

Ts'o said that there are actually two hard parts; the other is that there is a lack of engineering time to analyze the failures that occur. He would happily give Google Cloud credits to people who would run gce-xfstests if they would commit to spending time analyzing the failures that they find. There is perhaps a need to gather requirements, Ts'o said, because he has looked at kdevops and it does not address some critical requirements for his filesystem-development workflow. For example, kvm-xfstests can pick up a local kernel he just built on his laptop, toss it into a virtual machine, and run fstests on it, but kdevops targets the QA use case, so that kind of test is not supported. He thinks that unifying on things like kernel configuration and exclude-list handling would be a good place to start.

Ts'o said that it may turn out that there are different tools for different use cases; the local kernel testing case is important for him. Bacik agreed that local kernel testing is critical, but thought it was possible that kdevops could add that capability. Chamberlain said that it should not be difficult to do so.

Bacik said that he wanted to get his nightly testing systems switched over to using kdevops before the Linux Plumbers Conference in Dublin in September, which is when many of the same people will be together in one room again. It was generally agreed that collaborating on requirements, kernel-configuration handling, fstests exclude-list handling, and things of that nature would continue, perhaps as threads on the fstests mailing list.

Comments (none posted)

ioctl() forever?

By Jake Edge
June 8, 2022

LSFMM

In a combined storage and filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Luis Chamberlain and James Bottomley led a discussion about the use of ioctl() as a mechanism for configuration. There are plenty of downsides to the use of ioctl() commands, and alternatives exist, but in general kernel developers have chosen to continue using this multiplexing system call. While there is interest in changing things, at least in some quarters, the discussion did not seem to indicate major changes on the horizon.

Problems

Chamberlain began with a history lesson with "some rants" thrown in. ioctl() is still used a lot by filesystems and the block layer in Linux, but the wireless-networking subsystem, which he used to work on, successfully shifted away from ioctl(). The system call "wasn't really originally designed for what we think it was", he said; it is "essentially a hack". In Douglas McIlroy's history of Unix, it was called "a closet full of skeletons" that was mainly used to prevent the addition of too many new system calls. Those things should be kept in mind when thinking about ioctl(), he said.

[James Bottomley]

The first version of Linux did not have ioctl(); it was added in Linux-0.96a in May 1992. A small patch in 1993 changed a type from unsigned int to unsigned long, which eventually led to compatibility headaches for 32-bit ioctl() calls issued on 64-bit systems. The Unix idea that everything is a file is useful because it allows for flexibility, but it also allows for lazy API design, he said. Beyond that, ioctl() commands are not well documented and the interface does not allow for introspection.

Lack of introspection abilities is not a problem that Chamberlain has encountered directly, so he asked Bottomley to elaborate on that. The problem crops up in the container world, Bottomley said, and it is not just for ioctl() commands but also introspection for system calls. For example, securing a Docker container by limiting the system calls it can make does not really secure anything if there are opaque ioctl() commands that can be used to circumvent the restrictions. So there is a lot of concern about non-introspective interfaces because they "can't be policed properly by the tools we usually use for containers, like seccomp and even eBPF".

The specific problem with ioctl() is that there is a "dense binary packet" that gets passed into the call, which makes it difficult for external tools to deduce what the packet contains. In theory, the kernel could switch to using XML or JSON, but that does not really change the underlying problem much, he said. The introspection problem remains "almost regardless of which interface we choose".
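
As a small illustration of the point, the sketch below (using libseccomp; the blocked command is only an example) shows what such a filter can and cannot see: it can match the ioctl() request number passed as the second argument, but it has no way to examine the structure behind the third.

    #include <errno.h>
    #include <seccomp.h>          /* libseccomp; link with -lseccomp */
    #include <linux/fs.h>

    /* Allow everything except one specific ioctl() command.  The filter can
     * match the request number (argument 1) as an integer, but nothing here
     * can look inside the buffer that argument 2 points to. */
    int block_setflags(void)
    {
        int rc;
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

        if (!ctx)
            return -1;
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ioctl), 1,
                         SCMP_A1(SCMP_CMP_EQ, FS_IOC_SETFLAGS));
        rc = seccomp_load(ctx);
        seccomp_release(ctx);
        return rc;
    }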

There are other problems with ioctl(), Chamberlain said. For example, he asked Arnd Bergmann about ioctl() support for different architectures. He got back an itemized list of caveats. "The world is not peachy for architecture support as well".

Greener grass

The Linux wireless-networking configuration underwent a shift from the ioctl()-based wireless extensions to the netlink-based nl80211 interface. Chamberlain invited attendees to compare include/uapi/linux/wireless.h with include/uapi/linux/nl80211.h to see how much cleaner the new interface is. The netlink interface is not designed to be generic, so it may not be the right choice for filesystems and the block layer. But he is sure that it is possible to find something better than the ioctl()-based interface we have now.

Chamberlain handed the microphone back to Bottomley so that he could talk about configfd as a possibility. But Bottomley said that he was not going to promote configfd, though he did describe it a bit. It came out of his efforts on the shiftfs filesystem, which was eventually supplanted by ID-mapped mounts. Configfd was based on the fsconfig() system call, which allows setting a bunch of configuration information on a filesystem atomically, but configfd was bidirectional. David Howells, who developed fsconfig() and the related new mounting API, interjected that fsconfig() was originally bidirectional as well, though Al Viro removed that piece before it was merged.
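
For readers who have not used it, the mount API that fsconfig() belongs to works roughly as sketched below; raw syscall() invocations are used since C-library wrappers may not be available, and the filesystem type, device, and mount point are only examples.

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>
    #include <fcntl.h>

    /* Stage a filesystem configuration one key at a time, then create and
     * attach it.  Error handling is omitted for brevity. */
    int mount_example(void)
    {
        int fsfd = syscall(__NR_fsopen, "ext4", FSOPEN_CLOEXEC);

        /* Each fsconfig() call stages one setting; nothing happens yet. */
        syscall(__NR_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", "/dev/vda1", 0);
        syscall(__NR_fsconfig, fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
        /* CREATE applies the accumulated configuration in one step. */
        syscall(__NR_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

        int mfd = syscall(__NR_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
        syscall(__NR_move_mount, mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
        return mfd;
    }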

Instead of defending configfd, Bottomley said, he wanted to talk about the necessity of ioctl(). When there is a need for "an exception to the normal semantic order of things", an ioctl() command can provide it. And there will always be a need for exceptions, no matter how tightly regulated that semantic order is. There will always be a requirement that two parties be able to communicate data that cannot be structured using the existing mechanisms—an exception. Whether that data is sent as JSON, XML, or binary data, it is, effectively, an ioctl().

The introspection problem is real, but is one that he thinks could be handled with documentation. Christian Brauner said that the problem goes beyond just ioctl(); there are a number of different problems with seccomp() filtering because of the need to inspect the system call arguments to help make filtering decisions. Pointer arguments have been discouraged for new system calls because seccomp() cannot follow the pointers. But using pointers to structures is a technique for creating extensible system calls, so seccomp() also needs to change. It is a problem "slightly to the side" of the ioctl() problem, but it needs to be solved as well, he said.

Bottomley said that this shows that even if it were decreed that ioctl() commands should all move to new system calls, the problem with introspection would just move with it. Ted Ts'o said that kernel developers rightly keep a tight grip on new system calls and their interfaces. So adding a new system call involves an enormous bikeshedding exercise with lots of additional requirements, including documentation and working with features like seccomp() filtering. Often, the feature developer does not care about the container use case, even if they should, so they move it to an ioctl() command "so they can dodge the bikeshedding".

The more perfect the kernel community tries to make the system call interface, the more incentive there is for developers to route around it, Ts'o said. He has heard people talk about adding a feature via a filesystem-specific ioctl() command as a way to avoid the "fsdevel bikeshed party". That is unfortunate, since there is plenty of useful architectural review that might come with trying to make the feature more widely usable, but it is understandable that people take the expedient approach. No one has infinite resources, Ts'o said.

Alternatives

Josef Bacik asked what the alternative is. "You're going to pry ioctl()s from my cold dead hands unless you give me something else." The Btrfs developers have "wasted a lot of time" in grand architectural discussions that ended up with the community saying that a feature should just be put into an ioctl() command—after a year of discussion. Bottomley said that he would argue ioctl() commands, used judiciously, are just fine.

Kent Overstreet said that ioctl() commands are simply a driver-specific system call; there is a real need for that. It provides a mechanism to try out a feature in a more private way before it gets promoted to a system call, where it becomes permanent. Amir Goldstein agreed, noting that the "chattr" ioctl() command was implemented by two different filesystems before it was determined to be a generally useful feature and moved into the virtual filesystem (VFS) layer.
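
As a reminder of what that interface looks like, the small sketch below sets a file attribute through the commands behind chattr; the chosen flag is only an example, and setting it requires privilege.

    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Set the append-only flag on a file, the equivalent of "chattr +a".
     * The command is nominally defined with a long argument, but the kernel
     * transfers an int; an int is used here, as e2fsprogs does. */
    int set_append_only(const char *path)
    {
        int flags = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
            flags |= FS_APPEND_FL;
            ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);
        return 0;
    }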

There are multiple existing mechanisms for configuration in the kernel, Ts'o said. ioctl() commands are just system calls in disguise, both of which provide ways to do configuration, but procfs and sysfs files can also be used for that. Beyond those, the new mount API or configfd provide other configuration mechanisms. But which gets used depends in part on how much pain there is in trying to change the mechanism for a new task, he said. If the pain of adding ioctl() commands rises to the same level as for system calls, developers will simply find a "different escape hatch".

But Chamberlain said that adding new wireless commands did not require additional system calls or ioctl() commands because it uses netlink. Those changes can be made in a domain-specific place without all of the problems that come from the other mechanisms. Brauner said that he had a hard time seeing what could replace the ioctl() interface, however; he wondered if Chamberlain was suggesting switching to something netlink-based. Chamberlain said that it was just one idea, but Howells noted that netlink could not be used because it depends on networking being configured into the kernel, which is not always the case.

There was some discussion of alternatives, but it is clear that ioctl() itself is not going away and that it fills a need. Finding ways to make the ioctl() arguments more introspectable would be useful, as would better documentation. But if requiring those things causes the friction level for adding new commands to rise too much, it will have the opposite of its intended effect. No real solution seemed to be forthcoming from the discussion, though no one seems entirely satisfied with the status quo either.

Comments (49 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Linux 5.19-rc1; Pipe speed; Fedora 34 EOL; NixOS 22.05; openSUSE Lead 15.4; Tails 5.1; Mozilla translation plugin; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds