LWN.net Weekly Edition for January 21, 2021
Welcome to the LWN.net Weekly Edition for January 21, 2021
This edition contains the following feature content:
- Installing Debian on modern hardware: why does Debian make it so hard to install on systems needing proprietary firmware?
- MAINTAINERS truth and fiction: how well does the kernel's MAINTAINERS file correspond to what actually happens?
- Fast commits for ext4: speeding some ext4 filesystem operations with smarter journaling.
- Resource limits in user namespaces: a proposal to move some global resource counters into user namespaces.
- An introduction to SciPy: a collection of Python modules for scientific computing.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Installing Debian on modern hardware
It is an unfortunate fact of life that non-free firmware blobs are required to use some hardware, such as network devices (WiFi in particular), audio peripherals, and video cards. Beyond that, those blobs may even be required in order to install a Linux distribution, so an installation over the network may need to get non-free firmware directly from the installation media. That, as might be guessed, is a bit of a problem for distributions that are not willing to officially ship said firmware because of its non-free status, as a recent discussion in the Debian community shows.
Surely Dan Pal did not expect the torrent of responses he received to his short note to the debian-devel mailing list about problems he encountered trying to install Debian. He wanted to install the distribution on a laptop that was running Windows 10, but could not use the normal network installation mechanism because the WiFi device required non-free firmware. He tracked down the DVD version of the distribution and installed that, but worried that Debian is shooting itself in the foot by not prominently offering more installation options:
"The current policy of hiding other versions of Debian is limiting the adoption of your OS by people like me who are interested in moving from Windows 10."
The front page at debian.org currently has a prominent "Download" button that starts to retrieve a network install ("netinst") CD image when clicked. But that image will not be terribly useful for systems that need non-free firmware to make the network adapter work. Worse yet, it is "impossible to find" a working netinst image with non-free firmware, Sven Joachim said, though he was overstating things a bit.
Alexis Murzeau suggested adding a link under the big download button that would lead users to alternate images containing non-free firmware. He also pointed out that there are two open bugs (one from 2010 and another from 2016) that are related, so the problem is hardly a new one.
While they are hard to find, there are unofficial images with non-free firmware for Debian, as Holger Levsen noted; he also pointed to his 2017 blog post that he uses to rediscover those images when he needs them. It is a rather strange situation; Emanuele Rocca put it this way:
This absurdly damages our users without improving the state of Free Software in any way, while Ubuntu puts the firmware back into the images and can rightly claim to be easier to install.
But Jeremy Stanley took exception to that characterization:
That is, of course, the crux of the matter. Debian has a set of ideals about the kinds of software it distributes, enshrined in the Debian Free Software Guidelines (DFSG); non-free licenses do not fit within those ideals. In addition, the Debian Social Contract (which contains the DFSG) specifically notes that "non-free works are not a part of Debian". But the problem at hand is that potential users may not even be able to install Debian (or use it once installed) if they cannot access the network; it is hard for some to see how that advances the cause of free software, which is also a part of the contract.
In response to Stanley, Russ Allbery pointed out that there is a middle ground. No one had suggested removing the official images that do not have the non-free firmware, but there are some interested in making it easier to find the images needed for much of today's hardware.
The official installer does offer the option of installing non-free firmware from a USB drive, "but very few people use it", Andrew M.A. Cater said.
Allbery described the process he goes through to try to use that mechanism; it is far from straightforward even for someone quite familiar with Debian:
One can only imagine that new users who encounter this wall are unlikely to continue down the Debian path. Allbery said that an installer with non-free firmware would work much better for him, but he wasn't able to find the specific one he needed (for the "testing" version of the distribution). Andrey Rahmatullin said that the inability to find these images is caused by a "failing of the Debian websites"; there should be an easier path to find the alternate installation images. Russell Stuart said that he always runs into the same problem that Allbery reported and that, even though Stuart is a strong proponent of the separation of free and non-free in Debian, firmware is a different beast:
After Paul Wise pointed out that there actually are unofficial images with non-free firmware for the testing distribution, Holger Wansing suggested some changes (also provided as a patch) for the web site to make it easier for users to find these images when needed.
As Marc Haber said, though, the installation experience is likely the first impression a potential new user will get; "we should not be trying THIS hard to be a failure in this very important part of the relationship our product is building with the user".
But pointing users at the unofficial images is different from Debian officially distributing this non-free firmware, as Steve McIntyre pointed out:
Haber feels strongly that being purists about firmware is only leading to fewer new users. Wise agreed in part: "the current situation wrt hardware and software freedom is pretty catastrophic". He suggested making things clearer for users and potential users, perhaps by way of an "installer launcher app". That app would analyze the needs of the existing hardware to help guide (and presumably educate) users in their installer choice.
While there were lots of ideas of how to make things better, this problem has existed for a long time in Debian. Marco d'Itri said that he had raised the issue back in 2004, but it likely goes much further back than that. Ansgar Burchardt said that in 2016 he had proposed creating a new section in the repository to hold the non-free firmware (separate from the rest of the non-free software), which might be a preliminary step. But consensus was not reached and that effort died on the vine. As with the open bugs, these accounts show that the distribution has been struggling with this issue for quite some time.
At this point, it is not at all clear what will happen. The discussion may just fade away, only to be picked up again down the road. The problem is real and making the situation better, at least, does not seem all that difficult, nor particularly harmful to Debian's overall goals. But that has been true all along and here we are. It would seem that there has simply not been enough "push" to make progress, but with any luck, this time around things will be different.
MAINTAINERS truth and fiction
Since the release of the 5.5 kernel in January 2020, there have been almost 87,000 patches from just short of 4,600 developers merged into the mainline repository. Reviewing all of those patches would be a tall order for even the most prolific of kernel developers, so decisions on patch acceptance are delegated to a long list of subsystem maintainers, each of whom takes partial or full responsibility for a specific portion of the kernel. These maintainers are documented in a file called, surprisingly, MAINTAINERS. But the MAINTAINERS file, too, must be maintained; how well does it reflect reality?
The MAINTAINERS file doesn't exist just to give credit to maintainers; developers make use of it to know where to send patches. The get_maintainer.pl script automates this process by looking at the files modified by a patch and generating a list of email addresses to send it to. Given that misinformation in this file can send patches astray, one would expect it to be kept up-to-date. Recently, your editor received a suggestion from Jakub Kicinski that there may be insights to be gleaned from comparing MAINTAINERS entries against activity in the real world. A bit of Python bashing later, a new analysis script was born.
Digging into MAINTAINERS
There are, it turns out, 2,280 "subsystems" listed in the MAINTAINERS file. Each of those subsystems includes a list of covered files and directories. One can look at the commits applied against those files to see who has been working in any given subsystem; writing patches obviously qualifies as this sort of work, but so do other activities like handling patches (as indicated by Signed-off-by tags) or reviewing them (Reviewed-by or Acked-by). By making use of a bit of CPU time diverted from cryptocurrency mining, it is possible to come up with an approximation of when a given subsystem's listed maintainers last actually did some work in that subsystem.
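As a rough illustration of the approach (this is not the script used to generate these results), a check of a single subsystem might look something like the Python sketch below; the subsystem entry is hypothetical, and a real version would parse MAINTAINERS itself rather than hard-coding the M: and F: data.
# A rough sketch (not the actual analysis script) of checking one subsystem:
# look at commits touching its covered files and record the most recent one
# in which a listed maintainer shows up as author or in a Signed-off-by,
# Reviewed-by, or Acked-by tag. The entry below is hypothetical.
import subprocess

entry = {
    'name': 'EXAMPLE NETWORK DRIVER',             # made-up M:/F: data
    'maintainers': ['maintainer@example.org'],
    'files': ['drivers/net/ethernet/example/'],
}

def last_activity(entry):
    # One record per commit: a 0x01 separator, then "date email", then body.
    log = subprocess.run(
        ['git', 'log', '--format=%x01%aI %ae%n%b', '--'] + entry['files'],
        capture_output=True, text=True, check=True).stdout
    latest = None
    for record in log.split('\x01')[1:]:
        lines = record.splitlines()
        date, author = lines[0].split()[:2]
        tags = [l for l in lines[1:] if l.startswith(
            ('Signed-off-by:', 'Reviewed-by:', 'Acked-by:'))]
        if any(m == author or any(m in t for t in tags)
               for m in entry['maintainers']):
            if latest is None or date > latest:
                latest = date
    return latest

print(last_activity(entry))   # ISO date of last maintainer activity, or None
Run from the top of a kernel tree, repeating something like this over every entry yields the sort of activity dates shown in the tables below.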
The full results of this analysis are available for those wanting to see the details.
There are, however, ways of narrowing down the data a bit to pick out some of the more interesting artifacts in this file. For example, there are 367 subsystems for which there is no maintainer or the maintainer has never been seen in the entire Git history (excluding "subsystems" with no files — see below). In many of these cases, the subsystem itself is well past the prime of its life; there simply isn't a lot of work for a 3c59x network-card maintainer to do these days. The networking developers are not buried in ATM patches, the Palm Treo hasn't seen much support work, Apple has released few M68k systems recently, there aren't many Arm floppy drives still in use, and S3 Savage video cards just aren't the must-have device they once were. Many of these entries are likely to point to code that could be removed altogether.
Similar lessons can be drawn from the list of subsystems with no listed maintainers at all. Of course, some of those are rather vague in other ways as well; one subsystem is simply called "ABI/API" and points to the linux-api mailing list. There is actually one file associated with this "subsystem"; it's kernel/sys_ni.c, which handles calls to non-implemented system calls. This entry is thus an attempt to get developers to copy the linux-api list when they add new system calls. A similar entry exists for "Arm subarchitectures".
Some maintainerless subsystems, such as the framebuffer layer, could probably benefit from somebody willing to take them over. The reiserfs filesystem lacks a maintainer but still seems to have some users. Others, like DECnet or the Matrox framebuffer, are probably best left alone (or removed) at this point.
Some "subsystems" listed in the MAINTAINERS file have no files to maintain; one interesting example is "embedded Linux", allegedly maintained by Paul Gortmaker, Matt Mackall, and David Woodhouse. Given the success of embedded Linux, one can only assume that they are doing an outstanding job. The "device number registry" claims to be maintained, but the entry contains only a pointer to a nonexistent web page. The URLs in the "disk geometry and partition handling" entry still work, but the pages do not appear to have been updated for well over a decade; not much is happening with Zip drive geometry these days, it would appear. The man pages, instead, are actively maintained, but they do not exist within the kernel tree.
Help needed
There are a couple of conclusions that can be drawn from the results so far. One is that many kernel subsystems are not really in need of maintenance at this point; some of them, instead, may be in need of removal. Another is that perhaps the MAINTAINERS file itself is in need of a bit of cleanup in spots. But it is also worth asking whether this data can be used to spot subsystems that could benefit from a new maintainer. To answer that question, some additional CPU time was expended to find all subsystems meeting these criteria:
- There is either no listed maintainer or the alleged maintainers have been inactive in that subsystem for at least six months.
- At least 50 commits have touched that subsystem since the release of the 5.5 kernel in January 2020.
The idea behind this search was to find subsystems that are still undergoing some sort of active development, but which do not have an active, listed maintainer. The results can be divided into a few different categories.
Some MAINTAINERS entries have broad lists of covered files that make the commit count seem larger than it really is. For example, the subsystem named "ASYNCHRONOUS TRANSFERS/TRANSFORMS (IOAT) API" includes all of drivers/dma, which is also claimed by "DMA GENERIC OFFLOAD ENGINE SUBSYSTEM". That subsystem, in turn, is actively maintained by Vinod Koul. There are two subsystems that fall into this category; in the tables below "Activity" indicates the last observed activity by the listed maintainers (if any), while "Commits" shows the number of commits affecting the subsystem since 5.5:
Subsystem                                       Activity      Commits
ASYNCHRONOUS TRANSFERS/TRANSFORMS (IOAT) API    ——            536
HISILICON NETWORK SUBSYSTEM DRIVER              2019-11-16    258
These subsystems either do not exist as a separate entity, or they should have their lists of covered files reduced to match reality.
Then, there are the subsystems where the maintainers hide behind a corporate email alias. The listed maintainer for "DIALOG SEMICONDUCTOR DRIVERS" is support.opensource@diasemi.com, which is obviously not an address that will appear in any actual commits. A look within that subsystem shows active reviews from diasemi.com addresses, though, so the subsystem cannot really be said to be unmaintained. This category contains:
Subsystem                                 Activity    Commits
DIALOG SEMICONDUCTOR DRIVERS              ——          120
QUALCOMM ATHEROS ATH9K WIRELESS DRIVER    ——          65
WOLFSON MICROELECTRONICS DRIVERS          ——          146
Related to the above are subsystems where the maintainer entry is simply out of date; the listed maintainer is inactive, but somebody else, often from the same company, has picked up the slack and is acting as a de-facto maintainer. These include:
Subsystem                                      Activity      Commits
HISILICON NETWORK SUBSYSTEM 3 DRIVER (HNS3)    2019-11-16    234
HISILICON SECURITY ENGINE V2 DRIVER (SEC2)     2020-06-18    55
LINUX FOR POWER MACINTOSH                      2018-10-19    71
MELLANOX ETHERNET INNOVA DRIVERS               ——            93
MELLANOX MLX4 IB driver                        ——            70
OMAP HWMOD DATA                                2016-06-10    102
QCOM AUDIO (ASoC) DRIVERS                      2018-05-21    125
TEGRA I2C DRIVER                               2018-05-30    56
Finally, there are the subsystems that truly seem to lack a maintainer; they typically show patterns of commits either merged by a variety of subsystem maintainers, or passing through one of a few maintainers of last resort. They are:
Subsystem                                          Activity      Commits
ARM/UNIPHIER ARCHITECTURE                          ——            73
DRBD DRIVER                                        2018-12-20    51
FRAMEBUFFER LAYER                                  ——            402
HMM - Heterogeneous Memory Management              2020-05-19    54
I2C SUBSYSTEM HOST DRIVERS                         ——            434
MARVELL MVNETA ETHERNET DRIVER                     2018-11-23    65
MEDIA DRIVERS FOR RENESAS - VIN                    2019-10-10    56
MUSB MULTIPOINT HIGH SPEED DUAL-ROLE CONTROLLER    2020-06-24    54
NFC SUBSYSTEM                                      ——            72
PROC FILESYSTEM                                    ——            171
PROC SYSCTL                                        2020-06-08    51
QLOGIC QLGE 10Gb ETHERNET DRIVER                   2019-10-04    77
STAGING - REALTEK RTL8188EU DRIVERS                2020-07-15    121
STMMAC ETHERNET DRIVER                             2020-05-01    174
UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER     ——            277
USB NETWORKING DRIVERS                             ——            119
X86 PLATFORM DRIVERS - ARCH                        ——            120
Most of the above will be unsurprising to people who have been paying attention to the areas in question. The framebuffer subsystem is a known problem area; the "soft scrollback" capability was recently removed from the framebuffer driver due to a lack of maintainership. Quite a few people depend on this code still, but it is increasingly difficult to integrate with the kernel's graphics drivers and few people have any appetite to delve into it.
The I2C host drivers do, in fact, have a de-facto maintainer; it's Wolfram Sang, who also maintains the core I2C subsystem. He has long wished for help maintaining those drivers but none seems to be forthcoming, so he takes care of them in the time that is available. /proc is an interesting example; everybody depends on it, but nobody has taken responsibility for its maintenance. HMM, too, is interesting; its creator went to a lot of effort to get the code merged, but appears to have moved on to other pursuits now.
All of the above look like places where aspiring kernel developers could lend a welcome hand.
What about subsystems that have no entry in the MAINTAINERS file at all? If one were to bash out a quick script to find all files in the kernel tree that are not covered by at least one line in MAINTAINERS, one would end up with a list of just over 2,800 files. These include the MAINTAINERS file itself, naturally. Of the rest, the vast majority are header files under include/, most of which probably do have maintainers and should be added to the appropriate entries. Discouragingly, there are 72 files under kernel/ without a listed maintainer — a situation which certainly does not reflect reality. The SYSV IPC code is unmaintained, reflecting its generally unloved nature. Most of the rest of the unmaintained files are under tools/ or samples/.
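Such a "quick script" might look roughly like the sketch below, which compares the F: patterns from MAINTAINERS against the list of tracked files; it is illustrative only (and slow), since it ignores the X: exclusions, N: regexes, and other pattern types the real file format supports.
# Sketch: list kernel files not matched by any F: pattern in MAINTAINERS.
import fnmatch
import subprocess

patterns = []
with open('MAINTAINERS') as maint:
    for line in maint:
        if line.startswith('F:'):
            pat = line[2:].strip()
            # A trailing slash covers everything below that directory.
            patterns.append(pat + '*' if pat.endswith('/') else pat)

tracked = subprocess.run(['git', 'ls-files'], capture_output=True,
                         text=True, check=True).stdout.splitlines()

uncovered = [path for path in tracked
             if not any(fnmatch.fnmatch(path, pat) for pat in patterns)]
print(len(uncovered), 'files with no MAINTAINERS coverage')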
A harder case to find is that of files that are covered by a MAINTAINERS entry, but which are not actually maintained by the named person; this will happen often with entries that cover entire directory trees. Your editor is listed as handling all of Documentation, but certainly cannot be said to be "maintaining" many of those files, for example; this is a situation that will arise in many places in the kernel tree.
If one were to try to draw some overall conclusions from this data, they might read something like the following. The MAINTAINERS file definitely has some dark corners that could, themselves, use some maintenance (some of which is already being done). There are some parts of the kernel lacking maintainers that could definitely use one, and other parts that have aged beyond the point of needing maintenance. For the most part, though, the subsystems in the kernel have designated maintainers, and most of them are at least trying to take care of the code they have responsibility for. The situation could be a lot worse.
[As usual, the script used to generate the above tables can be found in the gitdm repository at git://git.lwn.net/gitdm.git.]
Fast commits for ext4
The Linux 5.10 release included a change that is expected to significantly increase the performance of the ext4 filesystem; it goes by the name "fast commits" and introduces a new, lighter-weight journaling method. Let us look into how the feature works, who can benefit from it, and when its use may be appropriate.
Ext4 is a journaling filesystem, designed to ensure that filesystem structures appear consistent on disk at all times. A single filesystem operation (from the user's point of view) may require multiple changes in the filesystem, which will only be coherent after all of those changes are present on the disk. If a power failure or a system crash happens in the middle of those operations, corruption of the data and filesystem structure (including unrelated files) is possible. Journaling prevents corruption by maintaining a log of transactions in a separate journal on disk. In case of a power failure, the recovery procedure can replay the journal and restore the filesystem to a consistent state.
The ext4 journal includes the metadata changes associated with an operation, but not necessarily the related data changes. Mount options can be used to select one of three journaling modes, as described in the ext4 kernel documentation. data=ordered, the default, causes ext4 to write all data before committing the associated metadata to the journal. It does not put the data itself into the journal. The data=journal option, instead, causes all data to be written to the journal before it is put into the main filesystem; as a side effect, it disables delayed allocation and direct-I/O support. Finally, data=writeback relaxes the constraints, allowing data to be written to the filesystem after the metadata has been committed to the journal.
Another important ext4 feature is delayed allocation, where the filesystem defers the allocation of blocks on disk for data written by applications until that data is actually written to disk. The idea is to wait until the application finishes its operations on the file, then allocate the actual number of data blocks needed on the disk at once. This optimization limits unneeded operations related to short-lived, small files, batches large writes, and helps ensure that data space is allocated contiguously. On the other hand, the writing of data to disk might be delayed (with the default settings) by a minute or so. In the default data=ordered mode, where the journal entry is written only after flushing all pending data, delayed allocation might thus add more delay between the actual change and writing of the journal. To ensure that data is actually written to disk, applications use the fsync() or fdatasync() system calls, causing the data (and the journal) to be written immediately.
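From an application's point of view, the pattern being described looks something like this minimal (and hypothetical) Python example, which appends a record and forces it to stable storage before declaring the operation complete:
import os

# Append a record and make sure it has reached stable storage before
# reporting success; until fsync() (or fdatasync()) returns, delayed
# allocation means the data may still exist only in memory.
def append_record(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)     # force out the data and the journal commit
    finally:
        os.close(fd)

append_record('/tmp/fsync-demo.log', b'transaction complete\n')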
Ext4 journal optimization
One might assume that, in such a situation, there are a number of optimizations that could be made in the commit path; that assumption turns out to be correct. In this USENIX'17 paper [PDF], Daejun Park and Dongkun Shin showed that the current ext4 journaling scheme can introduce significant latencies because fsync() causes a lot of unrelated I/O. They proposed a faster scheme, taking into account the fact that some of the metadata written to the journal could instead be derived from changes to the inode being written, and it is possible to commit transactions related to the requested file descriptor only. Their optimization works in the data=ordered mode.
The fast-commit changes, implemented by Harshad Shirwadkar, are based on the work of Park and Shin. This work implements an additional journal for fast commits, but simplifies the commit path. There are now two journals in the filesystem: the fast-commit journal for operations that can be optimized, and the regular journal for "standard commits" whose handling is unchanged. The fast-commit journal contains operations executed since the last standard commit.
Ext4 uses a generic journaling layer called "Journaling Block Device 2" (JBD2), with the exact on-disk format documented in the ext4 wiki. JBD2 operates on blocks, so when it commits a transaction, this transaction includes all changed blocks. One logical change may affect multiple blocks, for example the inode table and the block bitmap.
The fast-commit journal, on the other hand, contains changes at the file level, resulting in a more compact format. Information that can be recreated is left out, as described in the patch posting:
During recovery from this journal, the filesystem must recalculate all changed blocks from the inode changes, and modify all affected data structures on the disk. This requires specific code paths for each file operation, and not all of them are implemented right now. The fast-commits feature currently supports unlinking and linking a directory entry, creating an inode and a directory entry, adding blocks to and removing blocks from an inode, and recording an inode that should be replayed.
Fast commits are an addition to — not a replacement of — the standard commit path; the two work together. If fast commits cannot handle an operation, the filesystem falls back to the standard commit path. This happens, for example, for changes to extended attributes. During recovery, JBD2 first performs replay of the standard transactions, then lets the filesystem recover fast commits.
fsync() side effects
The fast-commit optimization is designed to work with applications using fsync() frequently to ensure data integrity. When we look at the fsync() and fdatasync() man pages, we see that those system calls only guarantee to write data linked to the given file descriptor. With ext4, as a side effect of the filesystem structure, all pending data and metadata for all file descriptors will be flushed instead. This creates a lot of I/O traffic that is unneeded to satisfy any given fsync() or fdatasync() call.
This side effect leads to a difference between the paper and the implementation: a fast commit may still include changes affecting other files. In a review, Jan Kara asked why unrelated changes are committed. Shirwadkar replied that, in an earlier version of the patch, he did indeed write only the file in question. However, this change broke some existing tests that depend on fsync() working as a global barrier, so he backed it out.
Ted Ts'o commented that the current version of the patch set keeps the existing behavior, but he can see workloads where "not requiring entanglement of unrelated file writes via fsync(2) could be a huge performance win." He added that a future solution could be a new system call taking an array of file descriptors to synchronize together. For now, application developers should base their code on the POSIX definition, and not rely on that specific fsync() side effect, as it might change in the future.
Using fast commits
Fast commits are activated at filesystem creation time, so users will have to recreate their filesystems to use this feature. In addition, the required support in e2fsprogs has not yet been added to the main branch, but is still in development. So interested users will need to compile the tool on their own, or wait until the feature is supported by their distribution. When enabled, information on fast commits shows up in a new /proc/fs/ext4/dev/fc_info file.
On the development side, there are numerous features to be added to fast commits. These include making the operations more fine-grained and supporting more cases that fall back to standard commits today. Shirwadkar is also working on fast commits with byte-granularity (instead of the current block-granularity) support for direct-access (DAX) mode, to be used on persistent memory devices.
The benchmark results given by Shirwadkar in the posted patch set show 20-200% performance improvements with filesystem benchmarks for local filesystems, and 30-75% improvement for NFS workloads. We can assume that the performance gain will be larger for applications doing many fsync() operations than for those doing only a few. Either way, though, the fast-commits feature should lead to better ext4 filesystem performance going forward.
Resource limits in user namespaces
User namespaces provide a number of interesting challenges for the kernel. They give a user the illusion of owning the system, but must still operate within the restrictions that apply outside of the namespace. Resource limits represent one type of restriction that, it seems, is proving too restrictive for some users. This patch set from Alexey Gladkov attempts to address the problem by way of a not-entirely-obvious approach.
Consider the following use case, as stated in the patch series. Some user wants to run a service that is known not to fork within a container. As a way of constraining that service, the user sets the resource limit for the number of processes to one, explicitly preventing the process from forking. That limit is global, though, so if this user tries to run two containers with that service, the second one will exceed the limit and fail to start. As a result, our user becomes depressed and considers a career change to goat farming.
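As a rough sketch of that setup (the service binary named here is made up), the launcher might use Python's resource module to set RLIMIT_NPROC to one in the child before the service starts:
import resource
import subprocess

# Start the service with RLIMIT_NPROC set to one, so that any fork() it
# attempts will fail with EAGAIN. The catch described above: the count
# behind that limit is global to the user, not to the container, so a
# second instance run as the same user will trip over the same limit.
def limit_processes():
    resource.setrlimit(resource.RLIMIT_NPROC, (1, 1))

subprocess.run(['/usr/bin/some-service'],    # hypothetical service binary
               preexec_fn=limit_processes)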
Clearly, what is needed is a way to make at least some resource limits apply on a per-container basis; then each container could run its service with the process limit set to one and everybody will be happy (except perhaps the goats). One could readily imagine a couple of ways to do this:
- Turn the resource limits that apply globally (many are per-process now) into limits that can also be set within a user namespace. The global limit would still apply, but lower limits could be set within a namespace to get the desired effect.
- Create a new control-group controller to manage resource limits in a hierarchical manner. This kind of control, after all, is just what control groups were created for.
Gladkov's patch set, though, takes neither of those approaches. Instead, this patch set moves a number of global resource-usage counters (processes, pending signals, pages locked in memory, bytes in message queues) into the ucounts structure associated with user namespaces. That makes the tracking of the use of these resources specific to each namespace.
User namespaces are arranged hierarchically up to the "initial namespace" at the root, and there is a ucounts structure allocated for each. The resource-usage counts are managed all the way up the hierarchy. So, if a process creates a new process within a user namespace, the process count in that namespace will be incremented, but so will the counts in any higher-level namespaces. The resource limit (which is still global) is checked at every level in the hierarchy; exceeding the limit at any level is cause to block an operation.
If one is slow and undercaffeinated like your editor, one might wonder how this is supposed to work. Yes, each user namespace will now have its own count for resources like processes. If the global limit is set to one, each user namespace can contain one process without exceeding the limit at that level. But the counts propagate upward; if both namespaces have a common parent, then the limit will be exceeded at that level and our user is left no happier than before.
A look at the test code provided with the patch set gives an answer. In the test program, the "server" processes are created by root before changing user and group IDs and moving into a separate user namespace. The parent namespace thus belongs to root and is not subject to any resource limits set after the user-ID change. It all works as long as one's use case matches this pattern.
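To make the counting logic concrete, here is a toy Python model of the scheme described above; it is a simplification for illustration, not kernel code, and it treats the limit as a per-namespace attribute where the patches actually check the (still global) rlimit at each level.
# Toy model (not kernel code): each namespace keeps its own count, and a
# charge must succeed at every level up to the root. The root namespace,
# owned by root, has no effective limit, so two sibling containers can
# each hold one process without interfering with each other.
class UserNamespace:
    def __init__(self, parent=None, limit=float('inf')):
        self.parent = parent
        self.limit = limit
        self.count = 0          # e.g. the number of processes

    def try_charge(self):
        charged, ns = [], self
        while ns is not None:
            if ns.count + 1 > ns.limit:
                for c in charged:     # roll back on failure
                    c.count -= 1
                return False
            ns.count += 1
            charged.append(ns)
            ns = ns.parent
        return True

root_ns = UserNamespace()                            # no limit
container_a = UserNamespace(parent=root_ns, limit=1)
container_b = UserNamespace(parent=root_ns, limit=1)
print(container_a.try_charge())   # True: first process in container A
print(container_b.try_charge())   # True: container B is counted separately
print(container_a.try_charge())   # False: A's limit of one is exceeded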
Still, one might wonder why the other approaches weren't taken. Having the limits propagate downward (rather than counts propagating upward) would seem to address this problem as well in a more flexible way that doesn't require root privileges. In fact, Linus Torvalds asked why this approach wasn't taken in response to a previous version of the patch set. Eric Biederman answered that the limit approach "needs to work as well", but then reiterated the use case without really clarifying why the count-based approach is needed.
Using control groups for this purpose was discussed back in 2015. At that time, control-group maintainer Tejun Heo rejected the idea, calling it "pretty silly". He continued:
If you want to prevent a certain class of jobs from exhausting a given resource, protecting that resource is the obvious thing to do.
That particular conversation went fairly badly downhill from there, but this specific outcome has stood over time: control-group controllers are not used for control of resource limits within containers.
For users who are facing this problem now, the only apparent solution is Gladkov's patch set. Whether these patches are merged will, however, depend on whether the rest of the kernel community thinks that this approach is the correct one. That conversation has not yet happened, and may depend on a clearer description of the semantics of this change (and its motivation) being posted first. Resource limits within containers is a problem that has remained unsolved for years; it may take longer yet to get to the real solution.
Update: as explained in the comments, resource limits are already per-process, so nothing has to be done on that side to make them adjustable on a per-container basis. The counts used to enforce those limits, though, are global, causing the sort of interference described above. So the proposed solution — making the counts local while still aggregating them upward — appears to make sense.
An introduction to SciPy
SciPy is a collection of Python libraries for scientific and numerical computing. Nearly every serious user of Python for scientific research uses SciPy. Since Python is popular across all fields of science, and continues to be a prominent language in some areas of research, such as data science, SciPy has a large user base. On New Year's Eve, SciPy announced version 1.6 of the scipy library, which is the central component in the SciPy stack. That release gives us a good opportunity to delve into this software and give some examples of its use.
What is SciPy?
The name SciPy refers to a few related ideas. It is used in the titles of several international conferences related to the use of Python in scientific research. It is also the name of the scipy library, which contains modules for use in various areas of scientific and numerical computing. Some examples of these are:
- scipy.integrate, with routines for numerical integration and for solving differential equations
- scipy.fft, for Fourier transforms
- scipy.linalg, for linear algebra
- scipy.stats, for statistics
- scipy.special, providing many special functions used in mathematical physics, such as the Airy, elliptic, Bessel, and hypergeometric functions
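As a small, hypothetical taste of what these modules look like in use, here is a snippet that solves a linear system with scipy.linalg and summarizes a sample with scipy.stats:
import numpy as np
from scipy import linalg, stats

# Solve the linear system A x = b with scipy.linalg.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(linalg.solve(A, b))        # [2. 3.]

# Basic summary statistics for a small sample with scipy.stats.
sample = np.array([2.1, 2.5, 1.9, 2.4, 2.2])
print(stats.describe(sample))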
Beyond scipy, the SciPy collection includes a number of useful components packaged together for use in scientific research. NumPy, the fundamental library that adds an array data type to Python and wraps the C and Fortran routines that make numerical work with Python practical, is one of those components, as is pandas, for data science, and the Python plotting library Matplotlib. The user interfaces IPython and Jupyter are also part of the collection, along with Cython, for building C extensions; there are others as well. Used in this sense, SciPy refers, then, to a software ecosystem: a large number of modules that can be used together to build solutions to research problems and to share them with others.
The idea behind SciPy is to allow the scientist, with a single install command, to have a big toolbox available. Aside from a few glitches, this mainly works as promised. Even with the residual version conflicts that crop up from time to time, the existence of the project largely frees the researcher from the arduous task of hunting down compatible versions of a dozen different libraries and installing them individually. In addition, SciPy provides a community as well as a central place to seek documentation and assistance.
Using SciPy
The range of applications for SciPy is enormous, so the small number of examples that there is space for here will naturally leave out almost everything that it can do. Nevertheless, a few examples will give a feel for how people make use of the project. The main example is similar to problems addressed in recent articles on Pluto and Julia, so that the reader can compare the methods of solution offered by the different projects.
Of course, the first step is to install everything. The project provides official installation instructions that go over a handful of ways to get the libraries in place, using more recent versions than will be provided by most distributions' package managers. I chose to use the pip install option within a fresh virtual environment. Afterward, I was left with an IPython that failed to do tab completion and crashed frequently. Some focused Google searching turned up a version conflict, something that the use of virtual environments is supposed to help prevent. The problem was fixed by following the advice to downgrade the completion module with the command:
$ python -m pip install jedi==0.17.2
The three main ways to use Python, and therefore the SciPy libraries, are by creating script files using any editor, by using a Jupyter notebook, or by using the IPython read-eval-print loop (REPL). Of course, the "ordinary" REPL is always available by typing python in the shell, but since IPython provides significant enhancements over that, there is little reason not to use it.
Our first example will be the numerical solution of the ordinary differential equation (ODE) that we attacked in the Julia article. The problem, called the logistic equation, is:
f' = f - f²
where the prime signifies the time derivative.
The listing below shows the steps from the IPython session where this problem was set up and solved, interspersed with explanations of what the code is doing:
from scipy.integrate import solve_ivp
First we import solve_ivp, which is SciPy's basic ODE numerical solver. scipy.integrate is a module that includes various routines for approximating integrals and finding numerical approximations to the solutions of ODEs.
# Here we define a function that returns the RHS of the ODE:
def f(t, f):
    return [f - f**2]
The f() function, which describes the right-hand side of the equation, needs to return a list. In this case it just contains a single element, because the equation is first-order, meaning that it just has a first derivative. One must rewrite higher-order equations as systems of first-order equations, in which case the returned array will have an element for each equation in the system.
# We'll store the time interval for the solution here:
tint = [0, 8]
import numpy as np
# linspace creates an array of equally-spaced values:
t = np.linspace(0, 8, 100)
The variable t will contain an array from 0 to 8 in 100 equally spaced steps. These will be the times at which the solution gets calculated. If I only wanted to know the value of the solution at the end of a time interval, I wouldn't need to set up an array like this, but I intend to plot the solution, so I want its value at many points.
# Now we can solve, storing the solution in sol:
sol = solve_ivp(f, tint, [0.05], t_eval=t)
The third argument in the call to the solver is the initial condition at the first time value, which in this case is 0.05. It's in an array, like the function itself, to handle the case of a system of multiple equations. The return value, stored in sol, is an object with a handful of fields with information about the solution. The solution itself is stored in sol.y, with corresponding independent variable values in sol.t. In this case, sol.t should be identical to t, but if the solver is not supplied with an input time array, it will decide at what times to provide solution values, and sol.t will need to be consulted.
import matplotlib.pyplot as plt
# A no-frills plot, using the array stored in fields of sol:
plt.plot(sol.t, sol.y[0])
# Here is the command to display the plot:
plt.show()
The plot is shown below. If we had a system of equations rather than just one, we could plot the variables of interest by passing additional elements of the sol.y array into the plotting function.
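For example, a second-order equation like the harmonic oscillator x'' = -x can be rewritten as the first-order system x' = v, v' = -x; a hypothetical continuation of the session above would solve it and plot both components like this:
# (solve_ivp, np, and plt were already imported earlier in the session.)
from scipy.integrate import solve_ivp
import numpy as np
import matplotlib.pyplot as plt

# The harmonic oscillator x'' = -x rewritten as a first-order system:
#   x' = v,  v' = -x
def oscillator(t, y):
    x, v = y
    return [v, -x]

t2 = np.linspace(0, 20, 500)
sol2 = solve_ivp(oscillator, [0, 20], [1.0, 0.0], t_eval=t2)

# sol2.y has one row per equation; plot both components:
plt.plot(sol2.t, sol2.y[0], label='x')
plt.plot(sol2.t, sol2.y[1], label='v')
plt.legend()
plt.show()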
The scipy component of SciPy contains some interesting modules focused on various domains. For example, let's explore parts of the image processing library scipy.ndimage. The listing below is from another IPython session, where I read in an image and apply four transformations to it.
from scipy import ndimage
import matplotlib.pyplot as plt
# Read an image file from disk
bogy = plt.imread('bogart.jpg')
The transformations in the ndimage library operate on matrices; the imread() function reads the image and stores it in the variable bogy as a 1200×1200×3 numerical matrix: the original image dimensions, and one plane for each of the primary colors. Each of the image transformation functions used below accepts the input image as a first argument, followed by parameters for the transformation. The sigma parameter in the calls to gaussian_filter() and gaussian_laplace() determines the strength of the transformation. The gaussian_laplace() filter transforms a smooth image by using second derivatives to detect changes in intensity.
Shown directly below is the original image; after each transformation, the resulting image is shown.
# The first transformation: Gaussian blurring:
bogyBlurred = ndimage.gaussian_filter(bogy, sigma=17)
# The rotation function takes degrees:
bogyTilted = ndimage.rotate(bogy, 45)
# This transformation can create sketch-like images
# from photographs:
bogySketched = ndimage.gaussian_laplace(bogy, sigma=3)
# The geometric_transform routine uses a mapping function
# to transform coordinates:
def mapping(t):
    return (t[0], t[1]**1.3, t[2])
bogySqueezed = ndimage.geometric_transform(bogy, mapping)
# Each processed image is stored in its own variable. We could
# apply additional transformations to them or save them
# to disk with the following:
plt.imshow(bogyTilted)
plt.savefig('bogyTilted.png')
The ndimage library has many more functions available; it has applications that go well beyond transforming images of noir movie stars. It can usefully be applied to such problems as automatically counting the number of bacteria on a microscope slide or measuring the dimensions of objects in a photograph.
Changes in 1.6
In the release announcement, linked above, version 1.6 is described as "the culmination of 6 months of hard work. It contains many new features, numerous bug-fixes, improved test coverage and better documentation". As is usual in a large project of this sort, each item in the long list of changes is of interest only to those who happen to be using the module to which it applies. But it's worth noting here that many of the changes were for the ndimage library explored in the previous section, including an overhaul of the library's documentation.
Help and documentation
SciPy's wide adoption and long history mean that there is plenty of help and reference material available. The starting page for the official documentation is a convenient index to the documentation for each of SciPy's major components. Each of these components is its own project with its own team, and this index leads to the various project sites. For example, one of the components is SymPy, which I reviewed here recently; the index leads directly to the documentation maintained on the SymPy site.
The SciPy team maintains Planet SciPy, which is an aggregator of articles and posts about SciPy from around the web. Many of these are tutorials on various topics, both for the beginner and the expert. The SciPy Cookbook should not be overlooked. This is a collection of detailed, and sometimes extensive, code samples giving examples of the use of SciPy libraries to attack various problems. The examples range from starting points for the beginner to highly focused advanced applications.
Moving beyond official sources, the SciPy Lecture Notes is a comprehensive document that starts with overviews of the Python language and the SciPy ecosystem and continues to advanced topics in many areas. It has many contributing authors and seeks contributions from the community through GitHub. These well-illustrated Lecture Notes are called that because they are designed to be usable by teachers as projected slides in addition to being self-contained reading material. To this end, code is displayed in boxes with widgets that collapse each code sample into a more concise form, suppressing prompts and some output. This makes the slides less busy and more effective as accompaniments to lectures, while keeping all the details at hand for the online reader.
Physicist Robert Johansson maintains a set of lectures on scientific computing with Python on GitHub, which serves as a useful source of information on the use of SciPy. The lectures are in the form of Jupyter notebooks for download, but the author offers a PDF version as well.
For those who prefer books, Johansson's volume Numerical Python covers much of this ground. The Python Data Science Handbook is well regarded by data scientists, and the author, Jake VanderPlas, offers the full text free for reading both in conventional form and as a set of Jupyter notebooks. Another book by a physicist is Learning Scientific Programming with Python. Its author, Christian Hill, maintains a GitHub page with interesting case studies, in Jupyter form, related to topics mentioned in his book.
There is certainly a wealth of reference and tutorial documentation available in various forms. And when one gets stuck, one advantage of SciPy's wide adoption is that there is likely to be a way to get unstuck that can be found in sites like Stack Overflow.
SciPy or not?
Python, NumPy, and the other components that eventually became consolidated into SciPy paved the way for free software in the sciences. Although some other languages and libraries may now be superior, Python and SciPy laid the foundation for the adoption of them, partly due to the attraction of IPython/Jupyter notebooks. The Python ecosystem opened scientists' eyes to the advantages of free software as tools for research. Using free software means not only avoiding licensing fees, which are real obstacles in many parts of the world, but avoiding black boxes as well, thereby supporting openness, reliability, reproducibility, and confidence in results. The benefits to science are substantial.
If one has a good reason to use the Python ecosystem—perhaps the need for special capabilities of a particular library or being part of a group already programming in Python—then SciPy will be the obvious choice. But SciPy may not be the best choice for every scientist looking for free-software tools. Those who need only a free, fast, and sophisticated program for matrix computations, equation solving, and similar tasks, and prefer the simplicity of installing one thing that doesn't depend on other pieces, should consider Octave, reviewed here back in December. Mathematicians looking to supplement their pencil-and-paper skills may not need most of what's in SciPy. They may be better served by installing either SymPy or Maxima, both of which can do some numerical computation and graphing in addition to symbolic manipulations.
A scientist who is starting a new project and who doesn't need to interoperate with colleagues using Python should seriously consider taking up Julia as a high-level language. The Julia article linked above outlines some of the reasons for Julia's ascendancy in the sciences. The main attractions of Python, especially in fields such as data science, are its widely used and well-tested libraries, but those libraries are not a compelling reason to avoid Julia, which makes access to them almost as easy as using them from Python code.
A scientist with a laptop and an internet connection can install these tools and try them all out. The depth and breadth of free software available for scientific research these days would have been unimaginable a decade or so ago. Furthermore, there are no real obstacles to keeping all of that software installed, thus maintaining the ability to use the best tool to assist in any research task.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Debian Kubernetes packaging; No-cost RHEL; Elasticsearch license change; Wine 6; Quotes; ...
- Announcements: Newsletters; conferences; security updates; kernel patches; ...
