Leading items
Welcome to the LWN.net Weekly Edition for May 4, 2023
This edition contains the following feature content:
- Namespaces for the Python Package Index: the Python community considers an improvement to its primary package repository, but there are a lot of details to work out.
- Unprivileged BPF and authoritative security hooks: BPF developers return to the idea of unprivileged access, but run afoul of the rules for Linux security modules.
- 6.4 Merge window, part 1: the first set of changes to arrive for the 6.4 kernel release.
- A kernel without buffer heads: one of the oldest data structures in the kernel may finally be on its way out — eventually.
- Ruff: a fast Python linter: a new tool for quickly finding problems with Python programs.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Namespaces for the Python Package Index
The Python packaging picture is generally a bit murky; there are lots of different stakeholders, with disparate wishes and needs, which all adds up to a fairly large set of multi-faceted problems. Back in the first three months of the year, we looked at various discussions around packaging, some of which are still ongoing. A packaging summit was held at PyCon 2023 to bring some of the participants of those discussions together in one room. One of its sessions was on adding a namespaces feature to the Python Package Index (PyPI). It provides a look into some of the difficulties that can arise, especially when trying to accommodate a long legacy of existing practices, which is often a millstone around the neck of those trying to make packaging improvements.
PyPI namespaces
PyPI has long been the go-to source for various kinds of Python modules, libraries, and applications. One of the PyPI maintainers, Dustin Ingram, wanted to discuss some ideas for providing namespaces for PyPI packages. The basic problem is that everyone is sharing a single global namespace on PyPI, he said. There are around 250,000 packages on PyPI currently, but many of them are old packages that may not really be in use any longer; overall, there is a lot of contention for package names.
![Dustin Ingram](https://static.lwn.net/images/2023/pycon-ingram-sm.png)
The PEP 541 ("Package Index Name Retention") process for handling name conflicts "does not really scale", he said. It provides a means to acquire the package name from an abandoned or unmaintained project, but it is an involved process that requires a fair amount of manual review. The PyPI administrators have not prioritized that work, so there is a big backlog. Meanwhile, the task also requires "extremely high trust permissions" in order to transfer the ownership of a package, so it cannot just be handed off to new volunteers.
Typosquatting on package names is another problem that PyPI faces. There is code in place to prevent people from registering names that are confusingly similar to existing package names, but it is restrictive, of necessity. So it stops people who are trying to do legitimate things, which is not desirable.
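To get a feel for the problem, consider how a similar-name check might look in principle. The sketch below is not PyPI's actual code: the normalization follows the published PEP 503 rules, but the similarity measure and the 0.9 threshold are arbitrary choices made for illustration.

```python
# A rough sketch of a "confusingly similar" name check.  normalize() follows
# PEP 503; the edit-distance threshold is invented for the example and is not
# what PyPI actually uses.
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Canonical form per PEP 503: lowercase, runs of -, _ and . become a single -."""
    return re.sub(r"[-_.]+", "-", name).lower()

def too_similar(candidate: str, existing: str, threshold: float = 0.9) -> bool:
    """Reject names that normalize identically or are near-identical strings."""
    a, b = normalize(candidate), normalize(existing)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(too_similar("Requests", "requests"))   # True: same canonical name
print(too_similar("requessts", "requests"))  # True: likely typosquat
print(too_similar("flask", "django"))        # False
```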
Meanwhile, there are companies and others that want to be able to restrict the prefixes of package names so that packages coming from an "official source" can be distinguished from those coming from elsewhere. An easy way to handle that would be to hand out a namespace that only the organization can publish packages to. During PyCon, the PyPI project announced the addition of organizations for the index. The idea is that companies, projects, and, well, organizations can register with PyPI in order to be able to maintain a collection of projects as part of a single PyPI page, such as the one for the Python Cryptographic Authority (PyCA). But support for namespaces is not part of the new feature.
When Ingram mentioned the new organizations feature, it was met with a round of applause from the two dozen or so summit attendees. Organizations had previously been the number-one feature request for PyPI; now that it has been implemented, namespaces move up into that slot, he said. It is no surprise that organizations see namespaces as a means to ensure that it is clear when code is coming from them.
"This has been done before", he said. One way would be to have GitHub-style namespaces, so that users and organizations could have the same project names that do not collide. Also, the npm ecosystem for JavaScript implemented a namespace feature in 2015 when it was at a similar size to what PyPI is today. There may be some lessons to be learned there. He is not close enough to that community to say whether it has been a success, but it does seem to be a widely used feature at this point.
One important requirement for the PyPI namespaces feature is that it not cause any breaking changes. Current installers should continue to function as they do today even if the target package has moved into a namespace. While he was speaking for himself, he believes that the other PyPI administrators are generally of the same mind. He wondered what attendees thought about the feature, whether it should be pursued, and, if so, how to make it happen.
Discussion
Former PyPI project manager (PM) Sumana Harihareswara (who wrote about PyPI for LWN back in 2018) wondered if Shamika Mohanan, the current PM, had gathered information on use cases and user-experience expectations for the namespaces feature as Mohanan had done for the organizations feature. Mohanan was not at the summit, but Ingram said that all of the work done so far has been targeting the organizations feature, not namespaces, at least not yet.
Harihareswara said that, during the pip overhaul in 2020, the team did a lot of user-experience research, including user testing and interviews. She suggested doing that kind of research for the namespaces feature because understanding the mental models of the users, developers, and companies will result in a better feature. Ingram agreed and noted that he did not expect that the feature would be completed quickly; like organizations, it is a complicated feature that will require quite a bit of work. He expects that it will require funding and thinks that the funding should include money for research of that sort.
There are already namespaces, of a sort, in use; for example, pytest has a convention that the names of its plugins start with pytest-, as pytest developer Phebe Polk pointed out. Having namespace support in PyPI would be nice, she said, but it is unclear how the feature would work with existing conventions like that; there may also be conflicts with names that are being used in private PyPI mirrors used internally. One attendee suggested that registering empty namespaces, in order to reserve them for internal use, should be supported.
Bernát Gábor said that he works at Bloomberg, which has a lot of internal packages, but he also maintains public packages as well. He wondered about adding existing packages into a new namespace once it gets created. Companies will probably want to own everything under their namespaces, but open-source projects may want to allow other projects to publish packages under their namespaces. Ingram thought that the owner of the namespace would be able to make those kinds of decisions; he said that following the GitHub model makes sense.
The large companies already have problems with package names that refer to their names or products, but are not official packages, such as tools for NVIDIA or Amazon Web Services (AWS), Peter Wang said. "Names are a hard problem, we all know that". Organizations are going to expect that they completely control their namespace, but he wondered about packages from others that add some functionality on top of the official package.
Many companies and other organizations have their own internal repositories and mirrors that they use, so there is a question of how those interact with the PyPI namespaces, he continued. How the resolution order will be determined when searching for a package is one example. To his eye it seems like a federated name system of some sort, perhaps modeled on the Domain Name System (DNS) or public-key infrastructure (PKI), may be needed. That may also mesh well with future plans for package signing in order to address supply-chain-security concerns.
Ingram agreed that package names that refer to another organization (e.g. a google-* package from a FOSS developer) are going to be problematic. He thinks it will be important to make a clear distinction between packages that are in a specific namespace versus those that simply have a prefix or other part of the name that refers to an organization. Currently, there are packages whose prefixes accurately identify the organization behind them, others that are unaffiliated with the prefixed organization, and still others that follow no real naming convention at all. It is likely that new syntax will be needed to make the distinction clear and that a mapping layer will be required to map between names outside and inside of namespaces.
An attendee asked how many of those present were familiar with how npm had added its namespaces. Only about five people raised their hands, so it probably makes sense to put together a report on how that was done, he said. That will help inform any decisions based on the npm experience.
Clashes do not only exist at the PyPI package-name level, Toshio Kuratomi said; there is also the question of what name will be used in the Python import statement. The problem already exists today, but he thinks it will get worse with namespaces. For example, Fedora had a mock package at one time, but it was not on PyPI, so a mock package that was added to the index effectively pushed Fedora's package aside. Ingram agreed that mismatches between the package name and the imported module name are a real problem, but that problem already exists, so it is a bit out of scope for a discussion of the namespaces feature.
Organizations may wish to have aliases so that they can, essentially, typosquat themselves, an attendee said. For example, Meta may want to have aliases for its name, Instagram, Facebook, and other variants of those names. Ingram asked if being able to reserve multiple empty repositories would suffice. That would take care of the security problem, the attendee said, but not the usability problem; users may want to get the same package under the canonical name and its variants. Ingram agreed that some kind of aliasing would need to be incorporated into the feature.
Another attendee asked about whether there are efforts to target heavy users to get them to help finance PyPI and its infrastructure. Ingram said that while there are some heavy users, they are not really causing problems for PyPI right now. There are some users, for example those providing large GPU binaries for TensorFlow models, who may be heading in that direction, but the PyPI administrators are working with those projects to ensure that the problem does not get out of control. There are also already some protections in place to prevent overconsumption of PyPI's resources.
Harihareswara suggested that the financial tradeoffs in various potential implementations for namespaces be clearly aired in the upcoming discussions. If there are, for example, somewhat suboptimal designs that would end up saving an enormous amount of time (and money), that should be factored into the decision-making. Knowing what budget is available for the project will also help guide the community in making the appropriate choices. Ingram agreed, noting that the person who fills the newly announced position for a PyPI Safety and Security Engineer will likely have a role to play in the design of the feature as well. PyPI namespaces are pretty clearly part of the safety and security story for the package index and the language as a whole.
[I would like to thank LWN subscribers for supporting my travel to Salt Lake City for PyCon.]
Unprivileged BPF and authoritative security hooks
When the developers of the Linux security module (LSM) subsystem find themselves disagreeing with other kernel developers, it tends to be because those other developers don't think to — or don't want to — add security hooks to their shiny new subsystems. Sometimes, though, the addition of new hooks by non-LSM developers can also create some friction. Andrii Nakryiko's posting of a pair of BPF-related security hooks raised a couple of interesting questions, one of which spurred a fair amount of discussion, and one that did not.

Nakryiko proposed the addition of two new LSM hooks to control access to BPF functionality. The first would govern the creation of BPF maps, while the second was meant to control the loading of BPF type format (BTF) data that describes functions and data structures within the kernel. The plan is to not stop there, though:
This patch set implements and demonstrates an overall approach starting with BPF map and BTF object creation, first two steps in the lifetime of a typical BPF applications. Next step would be to do similar changes for BPF_PROG_LOAD command to allow BPF program loading and verification.
There is nothing in this part of the plan that is inherently controversial; if there are use cases for access control over these features beyond checking for the CAP_BPF capability, then the addition of these hooks to enable the creation of a policy to implement that control can make sense. But that is not quite how these hooks are meant to operate. Instead, they can be used to bypass the CAP_BPF check entirely, meaning that they can make the covered functionality available to processes that lack that capability.
Authoritative hooks
The LSM subsystem has its origin in the first Kernel Summit in 2001. At that time, there was a desire to get an early version of SELinux into the kernel, but Linus Torvalds pointed out that there were other approaches to increased security under development, and he did not want to commit the kernel to any one of them. Instead, he asked for the creation of a framework that would allow multiple security mechanisms to be supported.
That framework, implementing an extensive set of hooks that can make security decisions at the relevant points in the system-call paths, eventually was merged as the Linux security module subsystem. But, before that could happen, there was a heated discussion (covered in LWN at the time) over whether the LSM subsystem should support hooks that could grant privileges that a process did not have, or whether they would only be able to add restrictions to those already implemented by the kernel's other access-control mechanisms. In the end, the decision was made that "authoritative hooks" — those that could increase privilege — would not be allowed. Among other things, this rule was seen as a way of keeping security modules from introducing security holes in their own right.
There have been a number of security modules added in the 21 years since that decision was made, but they have all been held to that rule. Easing the ban on authoritative hooks has occasionally been discussed over those years, but has never really been considered. So, when Nakryiko proposed adding a couple of authoritative hooks, LSM maintainer Paul Moore quickly responded:
One of the hallmarks of the LSM has always been that it is non-authoritative: it cannot unilaterally grant access, it can only restrict what would have been otherwise permitted on a traditional Linux system. Put another way, a LSM should not undermine the Linux discretionary access controls, e.g. capabilities.
The real solution, he said, would be to revise how the BPF code implements the CAP_BPF capability. Kees Cook disagreed, suggesting that these hooks could be seen as "fine-grained access control" rather than actually bypassing enforcement, but Moore stood firm in his opposition to the idea.
Nakryiko protested that the idea was to increase security by making it finer-grained than the single CAP_BPF capability allows. The restriction-only model, he said, would be more brittle in the end. He also added that there are a couple of real problems with capability-based enforcement when user namespaces are involved. The first is that many BPF programs, such as those that interact with tracing, inherently have a view of the entire system and cannot really be contained within a namespace. So a capability check for CAP_BPF cannot be namespace-aware.
Beyond that, though, it is currently not even possible to give a process CAP_BPF if it's running within a user namespace due to the way that the capability checks are implemented in the BPF subsystem. As a result, he argued, it is not really possible for programs running within a user namespace to make use of BPF at all. The proposed hooks were intended to provide a way around this shortcoming.
Casey Schaufler, who had been in favor of authoritative hooks back in 2001, was unsympathetic now:
This doesn't sound like a problem, it sounds like BPF is explicitly designed to prevent interference by namespaces. But in some cases you now want to limit it by namespaces. It appears that the desired uses of BPF are no longer compatible with its original security model. That's unfortunate, and likely to require a significant change to the implementation of BPF.
Or, as Moore put it: "Changing the very core behavior of the LSM layer in order to work around an issue with another access control mechanism is a non-starter". Nakryiko has received the message and has promised to come back with a different approach. It thus seems that a complete solution to the problems encountered by the BPF community is a somewhat distant prospect at this point.
Unprivileged BPF
The quiet part of the discussion is an apparent change within the BPF community with regard to security. Quoting again from Nakryiko's cover letter:
Such LSM hook semantics gives ability to have safer-by-default policy of not giving applications any of the CAP_BPF/CAP_PERFMON/CAP_NET_ADMIN capabilities, normally required to be able to use BPF subsystem in the kernel. Instead, all the BPF processes could be left completely unprivileged, and only allowlisted exceptions for trusted and verified production use cases would be granted permission to work with bpf() syscall, as if those application had root-like capabilities.
In the early days of extended BPF, some effort went into making it possible to use BPF without any special privileges. By 2019, though, the idea of unprivileged BPF use had been explicitly deprecated. BPF co-maintainer Alexei Starovoitov described Linux as "a single-user system" and proclaimed that no further attempts would be made to enable use of BPF without privilege. The amount of pain involved in keeping the system secure had simply become too much; the advent of the Spectre vulnerabilities just made things worse.
So it is interesting to see the BPF developers talking about unprivileged operation again, even if done under the watchful eye of a security policy. There does not appear to have been any discussion on the BPF list about changes in the privilege model overall, so it is not entirely clear how this all came about.
What does seem clear is that, if the BPF developers want to move away from the simple CAP_BPF check, they are going to have to revisit many of the security-related decisions that they have made so far. The method of adding authoritative LSM hooks does not appear to be viable for mainline inclusion, so some thought is going to have to be put into other solutions, including perhaps rethinking the user-namespace issue. This does not look like a problem that is amenable to a quick solution.
6.4 Merge window, part 1
As of this writing, nearly 7,500 non-merge changesets have been pulled into the mainline repository for the 6.4 kernel release. The 6.4 merge window is thus clearly off and running, with a number of significant changes merged already. Read on for a summary of the most significant changes pulled so far.
BPF
- It is now possible to store kptrs in more map types (specifically per-CPU hashmaps, LRU hashmaps, and local-storage maps).
- BPF programs can now use absolute time values in bpf_timer_start().
- There are improved kptr types for use with packet and XDP buffers. Other new kptr types include support for RCU-protected kptrs and reference-counted kptrs.
- Developers have added an awareness of Android APK packages for uprobe programs. This makes it easier to attach uprobes to code stored in an APK package.
- The generic iterators patch set has been merged, with the eventual goal of making it easier to write loops in BPF programs.
- The BPF verifier log, which contains vital information about why the verifier has rejected a program, can now be used in a rotating mode. This makes it more likely that the information actually needed by developers is still in the log when they look for it.
Core kernel
- There are two new ptrace() operations — PTRACE_GET_SYSCALL_USER_DISPATCH and PTRACE_SET_SYSCALL_USER_DISPATCH — which allow one process to manipulate the system-call user dispatch settings of another. The target use case for this feature is the Checkpoint/Restore in Userspace mechanism.
- The io_uring subsystem can perform multiple direct-I/O writes to a file in parallel if the underlying filesystem supports it; currently, ext4 and XFS have that support. There is also a new "multishot" timeout option that repeatedly generates timeouts without the need to re-arm the timer.
Filesystems and block I/O
- Calls to open() with both the O_DIRECTORY and O_CREAT flags have strange semantics that have varied over the years. As of 6.4, this flag combination will simply cause the call to fail with an EINVAL error.
- The F2FS filesystem can now support zoned block devices where the sizes of the zones are not a power of two.
- The command codes for the ublk driver have changed. This change will obviously break any programs using the old codes; for them, there is a configuration option (UBLK_LEGACY_OPCODES) that will cause the old codes to continue to work as well.
Hardware support
- GPIO and pin control: Loongson 64-bit GPIO controllers, Fairchild FXL6408 8-bit I2C-controlled GPIO expanders, and Intel Elkhart Lake PSE GPIO controllers.
- Graphics: Magnachip D53E6EA8966 DSI panels, Sony TD4353 JDI panels, Novatek NT36523 panels, Freescale i.MX LCD controllers, and Samsung MIPI DSIM bridges.
- Hardware monitoring: Starfive JH71x0 temperature sensors and ACBEL FSG032 power supplies.
- Miscellaneous: Qualcomm Cloud AI accelerators, Freescale i.MX8 image sensor interfaces, MSI laptop embedded controllers, Lenovo Yoga tablet-mode switches, Richtek RT5739 regulators, Richtek RT4803 boost regulators, and HiSilicon STB random-number generators.
- Networking: Realtek RTL8710B(U) wireless interfaces, MediaTek MT7530 and MT7531 switches, STMicroelectronics STM32 basic extended CAN controllers, StarFive dwmac Ethernet controllers, AMD/Pensando core-device adapters, Realtek 8822BS, 8822CS, and 8821CS SDIO wireless network adapters, NXP 100BASE-TX PHYs, and Microchip 10BASE-T1S Ethernet PHYs.
Miscellaneous
- Noteworthy documentation additions include the kernel contribution maturity model and a detailed tutorial on how to build a trimmed kernel.
- The nolibc library has gained LoongArch support.
Networking
- The kernel now supports the fair capacity and weighted fair queuing stream schedulers for the SCTP protocol.
- There is now generic support for binding LEDs to network switches or PHYs in the system devicetree.
- There is a new, netlink-based API for calling out to user space for helper functions. See this commit for an overview of the functionality and this commit to see how it is used to implement the TLS-handshake request.
- It is now possible to attach BPF programs to netfilter hooks, where they can make packet-forwarding decisions; this merge commit has some more information.
Security-related
- As expected, the SELinux runtime disable feature has been removed. This feature has been deprecated for years, and most distributions have long since disabled it themselves, so chances are good that nobody will notice.
- The SELinux "checkreqprot" functionality, which could be used to circumvent policy restrictions on the creation of executable memory mappings, has also been removed. Here, too, there was a lengthy deprecation cycle and it seems unlikely that anybody will be affected.
- The kernel can now restrict the .machine keyring, which holds machine-owner keys, to keys that are properly signed by a recognized certificate authority. The intent is to allow this keyring to be used with the Integrity Measurement Architecture (IMA) subsystem.
Internal kernel changes
- There is a new, generalized mechanism to enable the creation of kernel worker processes from user space; see this commit message for some more information.
- As expected, the SLOB memory allocator has been removed.
Assuming that the usual two-week schedule holds, the 6.4 merge window can be expected to remain open until May 7. Once the window closes, you will of course find a summary here on LWN.
A kernel without buffer heads
No data structures found in the Linux kernel — at least, in any version that escaped from Linus Torvalds's development machine — are older than the buffer head. Like many other legacies from the early days of Linux, buffer heads have been targeted for removal for years. They persist, though, despite the problems they present. Now, Christoph Hellwig has posted a patch series that enables the building of a kernel without buffer heads — but the cost of doing so at this point will be more than most want to pay.

The first public release of the Linux kernel was version 0.01, and struct buffer_head was a part of it:
```c
struct buffer_head {
	char * b_data;			/* pointer to data block (1024 bytes) */
	unsigned short b_dev;		/* device (0 = free) */
	unsigned short b_blocknr;	/* block number */
	unsigned char b_uptodate;
	unsigned char b_dirt;		/* 0-clean,1-dirty */
	unsigned char b_count;		/* users using this block */
	unsigned char b_lock;		/* 0 - ok, 1 -locked */
	struct task_struct * b_wait;
	struct buffer_head * b_prev;
	struct buffer_head * b_next;
	struct buffer_head * b_prev_free;
	struct buffer_head * b_next_free;
};
```
While the best disk drives available decades ago were nominally "fast", accessing data on disk was still slower, by several orders of magnitude, than accessing data in main memory. So the importance of caching file data was well understood long before Linux was born. The approach that was generally in use at that time was to cache disk blocks, with filesystem code operating on data in that cache; Torvalds followed that model with Linux. Thus, from the beginning, the Linux kernel included a "buffer cache" that held copies of blocks found on the system's disks.
The buffer_head structure was the key to managing the buffer cache. The combination of the b_dev and b_blocknr fields uniquely identified which block a given buffer cache entry referred to, while b_data pointed to the cached data itself. The other fields tracked whether the block needed to be written back to disk, how many users it had, and more. It was a core part of the kernel's block I/O subsystem — and of its memory management code as well.
Over time, it became clear that file caching could be done better if it were implemented as a cache of file data, rather than of disk blocks. During the 1.3 development cycle, Torvalds began implementing a new feature known as the "page cache", which would manage pages of data from files, rather than disk blocks. A number of advantages came from that change; many operations on file data could avoid calling into the filesystem code entirely if that data could be found in the cache, for example. Caching data at a higher level better matched how that data was used, and the ability to cache full pages (generally eight times larger than the 512-byte block size typically found at that time) improved efficiency.
The only problem was that the buffer cache was deeply wired into both the block subsystem and the filesystem implementations, so this cache continued to exist, alongside the page cache, for several more years until the two were unified. Even then, the buffer cache was at the core of the API used for block I/O. This was not optimal: filesystems worked hard to store data contiguously on disk, and the page cache could keep that data together in memory with at least page granularity, but the buffer-head interface required every I/O operation to be broken down into 512-byte blocks — each with its own buffer_head structure. That was a lot of overhead, much of which just added work for storage drivers, which had to try to reassemble larger chunks for reasonable I/O performance.
The 2.5 development series (the last of the odd-number development kernels under the older model) addressed this problem by reworking the block layer around a new data structure called the "bio" that could represent block I/O requests more efficiently. Over the years, the bio has evolved considerably as the need to support ever-higher I/O rates has grown, but it still remains the way that block I/O requests are assembled and managed.
Meanwhile, though, struct buffer_head can still be found in current kernels. And, more to the point, a number of filesystems still use it. The role that buffer heads once played in cache management has long since ended, but they still handle an important task in parts of the kernel: tracking the mapping between data cached in memory and the location on persistent storage where that data lives. The kernel has a rather more modern interface (iomap) for this purpose, but not all subsystems are using it.
One of the holdouts is ext4, which still makes heavy use of buffer heads. This filesystem, of course, is derived from ext2, which first entered the kernel with the 0.99.7 release in early 1993. Ext2 was based on block pointers; each file would have a list associated with it containing the numbers of the blocks on disk holding that file's data. Such a layout, where each block on disk is a separate entity (even if the filesystem tries to keep them together), fits the buffer-head model reasonably well. So it is not surprising that buffer heads were embedded deeply within ext2, and are still there 30 years later in ext4, even though ext4 gained support for extents — a rather more efficient representation of large files — in 2006.
Buffer heads, clearly, still work, but they still add overhead to file I/O. They also present an obstacle to changes that developers want to make to the memory-management and filesystem layers, including the ongoing folio work. So the desire to get rid of buffer heads, which has been present for a long time, seems to be getting stronger.
But, as Hellwig's patch series shows, ext4 is not the only place where buffer heads persist. That series, after a bit of refactoring, adds a new BUFFER_HEAD configuration option that controls the compilation of buffer-head support. Any code that needs buffer heads will select that option; if a kernel is built without any code needing buffer heads, then the resulting kernel will not have that support. Such a kernel will be lacking a few important features, though: not just the ext4 filesystem, but also F2FS, FAT, GFS2, HFS, ISO9660 (CD-ROM), JFS, NTFS, NTFS3, and the device-mapper layer. On the other hand, it is possible to build a buffer-head-free kernel that supports Btrfs and XFS.
It seems unlikely that there will be many kernels built without buffer-head support in the near future. This work does, however, make it easier to see where the remaining users are, which should help to focus work toward getting rid of buffer heads for real. That job is still likely to take some time — one does not perform major surgery on a heavily used filesystem in a hurry — and it may accelerate the removal of some old and unloved filesystems (like JFS). One of these years, though, it will become possible to drop this core kernel data structure that has been there since the beginning.
Ruff: a fast Python linter
Linters are tools that analyze a program's source code to detect various problems such as syntax errors, programming mistakes, style violations, and more. They are important for maintaining code quality and readability in a project, as well as for catching bugs early in the development cycle. Last year, a new Python linter appeared: Ruff. It's fast, written in Rust, and in less than a year it has been adopted by some high-profile projects, including FastAPI, Pandas, and SciPy.
Linting tools are often part of an integrated development environment, used in pre-commit hooks, or as part of continuous-integration (CI) pipelines. Some popular linters for Python include Pylint, Flake8, Pyflakes, and pycodestyle (formerly called pep8), which are all written in Python as well. Each linter checks whether the code violates a list of rules. Ruff reimplements a lot of the rules that are defined by these other popular Python linters, and combines them into one tool.
Orders of magnitude faster
In August 2022, Charlie Marsh announced Ruff, which he called "an extremely fast Python linter, written in Rust". He showed how Ruff is 150 times faster than Flake8 on macOS when linting the Python files in the CPython code base, 75 times faster than pycodestyle, and 50 times faster than Pyflakes and Pylint. While the exact speed gains aren't that important (and Ruff has become even faster since then), it's clear that it's orders of magnitude faster than its competitors, as Marsh explained:
Even a conservative 25x is the difference between ~real-time feedback (~300-500ms) and sitting around for 12+ seconds. With a 150x speed-up, it's ~300-500ms vs. 75 seconds. If you edit a single file in CPython and re-run ruff, it's 60ms total, increasing the speed-up by another order of magnitude.
In his example, Marsh touches on Ruff's caching. When re-running Ruff on a code base, it only lints the files that have been changed since the previous run. In contrast, Flake8 re-lints all of the files every time, except when running it in a wrapper such as flake8-cached.
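The specifics of Ruff's cache are internal to the tool, but the general idea is easy to sketch: fingerprint each file (together with the active configuration) after a run and skip anything whose fingerprint has not changed. The layout below, including the lint_file() callback and the cache-file name, is invented for illustration; a real cache would also store the previous diagnostics so that they can be reported again for skipped files.

```python
# A minimal sketch of per-file lint caching, assuming a hypothetical
# lint_file() function supplied by the caller.
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path(".lintcache.json")

def fingerprint(path: Path, config: str) -> str:
    # Hash the file contents together with the configuration, so that a
    # settings change also invalidates the cache.
    return hashlib.sha256(path.read_bytes() + config.encode()).hexdigest()

def lint_changed_files(paths, config, lint_file):
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    for path in paths:
        key = str(path)
        digest = fingerprint(path, config)
        if cache.get(key) == digest:
            continue                      # unchanged since the last run: skip
        lint_file(path)                   # only re-lint what actually changed
        cache[key] = digest
    CACHE_PATH.write_text(json.dumps(cache))
```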
The gist of his message is: when linting a code base happens almost instantaneously, there's really no reason to not do it. This means that more developers will add the linter to their pre-commit or CI configuration. So speed is not just a nice-to-have, but is an essential element of improving code quality.
One reason why Ruff is faster than the alternatives is that it's compiled into machine code instead of running in an interpreter. However, a second reason for its speed is that it runs all of its checks in a single pass over the code. Marsh contrasts this with how Flake8 works:
Flake8 is really a wrapper around other tools, like pyflakes and pycodestyle. When you run Flake8, both pyflakes and pycodestyle are reading every file from disk, tokenizing the code, and traversing the tree (I might be wrong on some of the details, but you get the idea). If you then use autoflake to automatically fix some of your lint violations, you're running pycodestyle yet again. How many times, in your pre-commit hooks, do you read your source code from disk, parse it, and traverse the parse tree?
Ruff uses RustPython's abstract syntax tree (AST) parser. For every file, Ruff generates the AST exactly once, then traverses all of the nodes in the tree, applying the linter rules in a single pass as it goes.
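Ruff itself is written in Rust, but the single-pass idea can be illustrated with Python's standard-library ast module: parse each file once, then apply every rule during one traversal of the tree. The rule codes and the checks below are made up for the example.

```python
# Illustration of single-pass linting: one parse and one tree traversal per
# file, with all rules applied along the way.  Not Ruff's implementation.
import ast
import sys

class SinglePassLinter(ast.NodeVisitor):
    def __init__(self, filename):
        self.filename = filename
        self.violations = []

    def report(self, node, code, message):
        self.violations.append((self.filename, node.lineno, node.col_offset, code, message))

    def visit_FunctionDef(self, node):
        # Hypothetical rule "X001": mutable default arguments.
        for default in node.args.defaults:
            if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                self.report(default, "X001", "mutable default argument")
        self.generic_visit(node)

    def visit_Compare(self, node):
        # Hypothetical rule "X002": "magic" numeric constants in comparisons.
        for comparator in node.comparators:
            if (isinstance(comparator, ast.Constant)
                    and isinstance(comparator.value, int)
                    and comparator.value not in (0, 1)):
                self.report(comparator, "X002",
                            f"magic value {comparator.value} in comparison")
        self.generic_visit(node)

for path in sys.argv[1:]:
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)   # one parse per file
    linter = SinglePassLinter(path)
    linter.visit(tree)                              # one traversal, all rules
    for filename, line, col, code, message in linter.violations:
        print(f"{filename}:{line}:{col}: {code} {message}")
```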
Ruff rules
Ruff has over 500 built-in rules, many of them inspired by popular tools such as Flake8, isort, and pyupgrade, as well as including some of its own rules. There's a category for each of these linters, and each category comes with a collection of rules. By default, Ruff enables all rules from the F category (Pyflakes) and a subset of the E (Flake8's errors) category. Ruff also reimplements some of the functionality in the most popular Flake8 plugins as well as in other code-quality tools.
For its configuration, Ruff uses pyproject.toml. The various categories of rules can be enabled and/or configured in this file. For example, this snippet of the configuration enables the rules from four linters, ignores two specific rules, and enforces the Google style for docstrings:
```toml
[tool.ruff]
select = [
    "ANN",  # flake8-annotations
    "D",    # pydocstyle
    "E",    # pycodestyle
    "F",    # Pyflakes
]
ignore = [
    "ANN101",  # missing-type-self
    "D107",    # undocumented-public-init
]

[tool.ruff.pydocstyle]
convention = "google"
```
For strict linting, select = ["ALL"] enables all built-in linters. On most code bases, this will result in a lot of errors, but specific categories or rules can be ignored, as seen in the example configuration above. Alternatively, individual violations for specific rule codes can be ignored by adding # noqa: {code} at the end of a line in the source file.
Not all of the rules of the original tools are reimplemented in Ruff. The documentation has a comparison with Flake8, with Pylint, and a list of implemented Flake8 plugins. Ruff also supports import sorting comparable to isort, as well as linting docstrings based on various conventions.
Unlike the popular Flake8, Ruff doesn't support plugins. So users who want to extend the linter need to get their code accepted into Ruff's repository. There's a GitHub issue about plugins, with a comment by Marsh that adding support for plugins is not a top priority now because he considers the unification of multiple tools into Ruff as a feature, not a bug. But this means that developers who want to enforce their own quirky rules, which won't be accepted into Ruff because they're too specialized, will need to run another tool.
Using Ruff
Despite being written in Rust, Ruff can simply be installed with pip install ruff, at least if there's a wheel built for the environment. This is based on the Maturin project that allows building and publishing Rust binaries as Python packages. Most users shouldn't even notice that Ruff isn't written in Python. Ruff supports any Python version from 3.7 onward.
Running Ruff manually is as easy as executing ruff check src/ when the Python files are in the src directory. It then shows a list of files with lines and columns where the linter finds an error, followed by a code and short description of the rule. For example:
```python
def is_uint16(number: int) -> bool:
    """Check whether a number is a 16-bit unsigned integer."""
    return isinstance(number, int) and 0 <= number <= 0xFFFF
```

With the Pylint refactor rules enabled, ruff check gave me the following warning:

```
src/bluetooth_numbers/utils.py:192:55: PLR2004 Magic value used in comparison, consider replacing 65535 with a constant variable
```

I found that complaint to be overly pedantic, so I silenced it by adding # noqa: PLR2004 to the end of the return statement.
A command like ruff rule PLR2004 shows some more information about a specific rule. For some rules, the information provides a clear example. Unfortunately, for many others the information is rather sparse, only telling the developer what is not allowed, but not why or how to solve the problem. Often there's a reference to the original project the rule is derived from, so searching that project's home page or repository can help.
There's also a --watch option that continuously re-runs the linter when source files change. Ruff designates some errors as fixable: it can resolve them automatically when running ruff check with the --fix option. Some examples of these fixable errors are things like unused imports or invalid unescaped characters in strings.
Most projects benefit from using a configuration file for Ruff, where specific linters and/or rules are enabled. For some of my own Python projects, I was able to migrate from a combination of isort, pyupgrade, Pylint, and Flake8 (with a lot of plugins) to Ruff with a short configuration file.
What worked for me to get strict linting—all rules enabled—for my relatively small Python projects was to look at the violations rule by rule. Pick a rule that the code violates a lot, restrict the output to only this rule with:
```
$ ruff check --select RULECODE src
```

Then fix the issues one by one or add a # noqa: RULECODE comment where needed. If none of the violations for a rule seem relevant, add it to the ignore list. Then run Ruff again without restrictions, pick another rule, look into that one, and so on, until there are no violations left.
Apart from running Ruff manually, users can also integrate the linter directly into their code editor (e.g. Visual Studio Code, Vim, Neovim, Emacs) or it can be used with the language server protocol for any tool that supports it. The linter can be used as a pre-commit hook or as a GitHub action as well.
Ruff-like developer tools
In his announcement of Ruff, Marsh had already hinted at the possibilities of other Python developer tools using the same approach as Ruff:
The question I keep asking myself is: could we take the Ruff model and apply it to other tooling? You could probably give autoformatters (like Black and isort) the same treatment. But what about type checkers? I'm not sure! Mypy is already compiled with mypyc, and so is much faster than pure Python; and Pyright is written in Node. It's something I'd like to put to the test.
In mid-April, Marsh announced that he has started a company, Astral, to continue building high-performance developer tools for the Python ecosystem:
Some of the things we build will look like natural extensions of Ruff (e.g., an autoformatter); others will diverge from the static analysis theme. But our North Star is pretty simple: make the Python ecosystem more productive by building tools people love to use — tools that feel fast, robust, intuitive, and integrated.
What won't change, according to Marsh, is the open-source and permissively licensed nature of Ruff (which has an MIT license) and other tools that will be created by Astral.
Conclusion
For developers who are now using Flake8 with various plugins, Pylint, isort, pyupgrade, and many other tools to check their code quality, migrating to Ruff can greatly simplify their development environment. Not only does this result in fewer dependencies, it also makes linting the code base faster. Black for code formatting, Ruff for linting, and mypy for type checking seems to be the sweet spot for many projects these days. It remains to be seen whether the Astral team will create a code formatter and type checker to complement Ruff.