LWN.net Weekly Edition for February 22, 2024
Welcome to the LWN.net Weekly Edition for February 22, 2024
This edition contains the following feature content:
- A Spritely distributed-computing library: the Spritely project is building a new federation protocol; early versions of its libraries are available today.
- Sudo and its alternatives: there are several software projects seeking to build a more secure alternative to sudo.
- Windows NT synchronization primitives for Linux: the kernel is adding new synchronization mechanisms to make ported Windows programs more performant.
- A proposal for shared memory in BPF programs: new primitives for communication between BPF programs and user-space programs may soon be available.
- A modest update to Qubes OS: Qubes OS makes continuing usability improvements.
- Open-source AI at FOSDEM: presenters at FOSDEM discuss terminology around AI model licensing; there are projects dedicated to gathering open data sets to train open models.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A Spritely distributed-computing library
Spritely is a project seeking to build a platform for sovereign distributed applications — applications where users run their own nodes in order to control their own data — as the basis of a new social internet. While there are many such existing projects, Spritely takes an unusual approach based on a new interoperable protocol for efficient, secure remote procedure calls (RPC). The project is in its early stages, with many additional features planned, but it is already possible to play around with Goblins, the distributed actor library that Spritely intends to build on.
The Spritely project states its long-term goal as building "a new architecture for the internet: removing the necessity of client-server architecture, replacing it with a participatory peer-centric model". The project is working toward a future where many different distributed applications communicate over the same protocol, and users' data is contained on their own devices. The Spritely Institute — the charitable organization which provides funding for the Spritely project — presents this vision as "social media done right".
An early example is visible in the form of Fantasary, a distributed, collaborative, text-based virtual world initially built for the 2023 Spring Lisp Game Jam. There is also Goblin Chat, a simple chat application demonstrating what it might be like to write a distributed chat application using Spritely's libraries.
Christine Lemmer-Webber, the founder of the project and a co-author of the ActivityPub specification, initially phrased the goal of the project as "building the next generation of the fediverse as a distributed game". She noted that while ActivityPub was a great success, "there are a few things that bother me". She views the Spritely project as a chance to correct some of the shortcomings of ActivityPub and focus on building a social environment that permits rich interactions.
Lemmer-Webber is not alone in believing that a better foundational architecture for distributed systems is possible. The Spritely Institute has a grant from the NLnet Foundation to work on creating a specification for a protocol called OCapN (short for "Object Capability Network"). Jessica Tallon, another ActivityPub co-author, is one of several people being paid by that grant to further develop Spritely. In an interview, she said that she hoped the project could make OCapN a new foundational layer of the internet which not only permits the creation of performant peer-to-peer applications, but also makes decentralization "the natural way to develop" applications on the web.
OCapN
OCapN is not yet a standard and is subject to change. Unusually for a distributed-computing effort, however, the OCapN pre-standardization group has support from several different projects with existing code that is at least theoretically compatible. The Spritely Institute is the primary contributor to OCapN, though Agoric and the performance-oriented open-source RPC library Cap'n Proto are two other regular contributors.
OCapN is designed around the concept of a capability — an unforgeable token representing the bearer's right to take some action. They should not be confused with Linux capabilities — which are based on POSIX capabilities, themselves inspired by the same security research as OCapN-style capabilities. Linux capabilities have notably different security properties from those discussed in capability-based-security research. In the POSIX model, a program running as a user has permission to do everything that user can do. In a capability-based model, the program only has permission to do the specific things that it has been given a capability for. This is especially useful in a distributed system, because it allows a program to securely delegate certain operations to remote systems, without giving those remote systems any ability to act beyond their allowed interfaces.
In OCapN, a capability is a reference that grants permission to send messages to a remote object. When a local object sends a message to a remote one, the local protocol implementation immediately returns a "promise", representing the eventual response to that message. If the remote computer responds to the message, the promise will be "kept" and associated with the returned value. If the remote computer runs into an error or goes offline, the promise will be "broken".
Representing responses from remote machines in this way permits an important performance optimization: promise pipelining. When a promise will resolve (in the successful case) to another reference to a different object, OCapN permits sending messages to that eventual object using the promise, even if the object has not been created yet. This lets applications avoid unnecessary round trips, by streaming batches of related messages to remote objects before receiving the replies. The Cap'n Proto project lists this as one of the major benefits of its existing RPC mechanism in comparison to other RPC libraries.
Another important performance advantage of OCapN is third-party handoff. In a scenario with three different nodes, A, B, and C, A can send a capability that references an object on B to C, and then C will connect directly to B to make use of it. This is in contrast to some other RPC mechanisms where node A would continue to act as a relay, requiring node A to remain online for any continuing communication to occur between nodes B and C.
OCapN is not the first attempt to build a mechanism along these lines. Spritely is directly inspired by the E programming language, from which it adopts some terminology as well as the basic design of OCapN. In order to make implementing compatible libraries easier, however, the Spritely project will intentionally avoid including some features of E.
What exists today
Despite continuing work on the OCapN protocol, the Spritely project has a long way to go before reaching its goal of a complete platform on which to build distributed social applications. The project's discussion forum has many plans for additional network transports, a distributed debugger, serialization of distributed networks of objects, and more. Right now, only one component of the eventual Spritely ecosystem has been written: the actor-based concurrency and RPC library Goblins. There is a version for Guile (the GNU project's Scheme implementation), and a version for Racket. The documentation for Goblins explains how to begin experimenting with the library.
These different versions of the library are kept in sync with each other, and are interoperable — code written for Racket can call code written for Guile and vice versa. The 0.12.0 versions of the library support connecting over TLS-encrypted TCP sockets, although the project has not yet implemented Network Address Translation (NAT) traversal, making communicating with nodes that are not on the local network difficult. To address this, the libraries also support communicating via Tor, which can connect even between networks that are both behind NATs. This was actually the first network transport the project implemented, to permit testing connections between remote collaborators right from the beginning of the project. Work on a network transport using libp2p (the communication library that powers IPFS) is planned for version 0.14.0.
Libp2p support will theoretically set the stage for Goblins to be usable in the browser, since libp2p permits WebSocket-based connections between nodes. One barrier to adopting Goblins in the browser is the fact that neither Guile nor Racket supports compiling to WebAssembly. Spritely is looking to fix that with Guile Hoot, a compiler that targets WebAssembly. Guile maintainer and Spritely contributor Andy Wingo has written a series of posts about the challenges that he has encountered trying to compile Scheme to WebAssembly. The recent 0.3.0 version of Hoot is the first version to support whole-program compilation.
While still young, Spritely has a charitable institute, a growing body of supporters, increasingly usable prototypes, and a long history of supporting research. The project provides an interesting alternative to other distributed-systems projects, with a focus on interoperability, performance, and interacting directly with remote systems. The monthly standardization meetings are open to everyone with an interest, and scheduled through the project's GitHub issues.
Sudo and its alternatives
Sudo is a ubiquitous tool for running commands with the privileges of another user on Unix-like operating systems. Over the past decade or so, some alternatives have been developed; the base system of OpenBSD now comes with doas instead, sudo-rs is a subset of sudo reimplemented in Rust, and, somewhat surprisingly, Microsoft also recently announced its own Sudo for Windows. Each of these offers a different approach to the task of providing limited privileges to unprivileged users.
The origins of sudo go back to the beginning of the 1980s and 4.1BSD running at the State University of New York, Buffalo. The full history of the program will not be repeated here, but a nice overview of it is available on the sudo website. That history is sparse on details about the first release of "CU sudo", which is the lineage that prevails today; it simply says that CU sudo version 1.3 was released in 1994. Judging from a post to comp.unix.admin, the exact date appears to have been February 9, 1994 — just a bit over 30 years ago.
Sudo has been through multiple iterations and reimplementations over the years. CU sudo was named after the University of Colorado, where it was created by Todd C. Miller. He still maintains it, although naturally many people have contributed to it over the decades. The "CU" prefix was dropped from the name in 1999.
What does sudo do?
Sudo works by authenticating the user and then executing the program given as a command-line parameter, with the effective user ID of the user indicated in the -u parameter (root by default). Most commonly, a system administrator configures sudo by listing specific users or groups and their allowed capabilities in the /etc/sudoers configuration file. Sudo supports fine-grained control to, for example, allow a user to only run specific commands with the identity of a specific other user, rather than any command as any user. A user could run sudo cat foo to read the file foo that their regular account does not have permission to read, or sudo -i to get an interactive superuser shell.
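For illustration, a sudoers file implementing that kind of policy might contain entries like these (the user, group, and command names are invented for the example):

    # Members of the wheel group may run any command as any user.
    %wheel  ALL=(ALL:ALL) ALL

    # alice may restart the web server as root, without a password,
    # but may run nothing else via sudo.
    alice   ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx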
As an important contrast to the related su utility, users do not have to know the superuser password of the system; instead they authenticate with their own password. Authentication without a password can also be set up. For convenience, sudo keeps "ticket files" for recently authenticated users, so that they do not have to re-input their password for a short time after their last sudo invocation.
OpenBSD and doas
To assume the identity of another user, sudo must invoke some privileged system calls, and therefore its file mode must be setuid root. Those types of programs have extraordinary security requirements and must do their job carefully in order to prevent unintended privilege escalation.
Having identified that sudo is such a program, and a complicated one, OpenBSD developers came to the conclusion that it was too risky. Starting with OpenBSD 5.8, released in October 2015, the default user-identity switcher tool has been "doas", or "Dedicated OpenBSD Application Subexecutor". It was developed by Ted Unangst, who had "personal issues" with the configuration of sudo. He wrote a blog post in July 2015 explaining his reasons for the creation of doas. One of those was that sudo was simply too complicated for most people's needs:
The core of the problem was really that some people like to use sudo to build elaborate sysadmin infrastructures with highly refined sets of permissions and checks and balances. Some people (me) like to use sudo to get a root shell without remembering two passwords. And so there was considerable tension trying to ship a default config that would mostly work with the second group, but not be too permissive for the first group.
Another was that it contained too much code for a privileged process:
There were some concerns that sudo was too big, running too much code in a privileged process. And there was also pressure to enable even more options, because the feature set shipped in base wasn’t big enough. (As shipped in OpenBSD, the compiled sudo was already five times larger than just about any other setuid program.)
Prior to the blog post, there was a thread on the openbsd-ports mailing list announcing a decision by Miller, who is also heavily involved with OpenBSD, and Theo de Raadt to move sudo from OpenBSD's base repository to the so-called ports tree. BSD operating systems generally come with the core operating system in a repository called "base", along with a "ports" tree offering third-party software in source-code format. Apart from a separate openbsd-tech thread, there isn't much overt discussion to be found about sudo's problems prior to the removal from base. It appears that one day Unangst just thought it best to make a slimmer replacement; De Raadt and Miller were seemingly on board from the start.
Doas is a bare-minimum sudo-like tool; it has a simplified configuration file syntax and does not have support for authentication schemes other than system-local BSD Authentication — see Wikipedia for an overview or the login.conf man page for details. Taking a look at the short man page for doas.conf gives a good idea of its scope.
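To give a sense of that simplicity, a doas.conf along these lines (user names invented for the example) covers many common setups:

    # Allow alice to run any command as root; "persist" caches the
    # authentication for a short time, much like sudo's tickets.
    permit persist alice as root

    # Allow bob to reboot the machine without a password, and nothing else.
    permit nopass bob as root cmd /sbin/reboot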
A Linux port of doas also exists by the name OpenDoas and is available for many distributions.
Vulnerabilities and "rewrite it in Rust"
As with any software, sudo has had its share of security problems. It is not an egregious stream of vulnerabilities by any means, but something pops up every now and then. Since sudo is written in C, a portion of those bugs involve memory safety. Most recently, in CVE-2023-27320, a double-free bug was patched, albeit one that only affected rare configurations. Another, more severe, vulnerability from last year was CVE-2023-22809, though it is unrelated to memory safety. It concerned a mishandling of environment variables by the sudoedit command, which allowed a local attacker to append extra files to be edited with extra privileges.
Sudo-rs is an effort to write a drop-in replacement for "all common use cases of sudo" in Rust; its GitHub README calls it "a memory safe implementation of sudo and su". The project only targets Linux-based systems with a 5.9 kernel or newer. Development is sponsored by the Prossimo project, which is part of the Internet Security Research Group (ISRG); there is an announcement blog post for the project from April 2023.
The project is also affiliated with Ferrous Systems, a company offering a safety-standards-qualified Rust compiler and consulting, which reported on a security audit of sudo-rs in November 2023. The audit discovered one moderate-risk and two low-risk issues; the moderate one was a path-traversal vulnerability that was found to affect Miller's sudo as well.
Similarly to doas, sudo-rs also only targets a subset of sudo's capabilities. From the README:
Some parts of the original sudo are explicitly not in scope. Sudo has a large and rich history and some of the features available in the original sudo implementation are largely unused or only available for legacy platforms. In order to determine which features make it we both consider whether the feature is relevant for modern systems, and whether it will receive at very least decent usage. Finally, of course, a feature should not compromise the safety of the whole program.
Sudo-rs does not seem to get much use currently. At the time of this writing, the Crates.io statistics show a figure of 663 all-time downloads. The most notable deployment of sudo-rs that was found is in Wolfi OS — a minimalist distribution (or "undistro" as the project calls itself) from Chainguard, Inc., focused on solving supply-chain issues in container images.
Cutting down on features
Both doas and sudo-rs achieve a portion of their goals by intentionally omitting features that sudo supports. This is a sensible way to minimize attack surface, and is reminiscent of the OpenBSD-originated LibreSSL project. After the major OpenSSL vulnerability dubbed Heartbleed in 2014, OpenBSD forked OpenSSL into LibreSSL and removed substantial amounts of legacy and esoteric functionality in an effort to improve the security of the library.
While the vulnerabilities found in sudo have not been as severe as Heartbleed, it might be prudent to get ahead of such a hypothetical event by switching to a streamlined alternative, especially when some of the more advanced or more complicated features of sudo are not required. Though sudo is not exposed to the network like OpenSSL, many of the same concerns that led to LibreSSL were factors in the development of doas — and in the same time frame.
Doas supports only a core subset of sudo's feature set, so it cannot really be recommended for anyone who has even slightly more complicated authentication needs than local user accounts. There is no support for integration with LDAP or Kerberos, for instance. However, sudo-rs does call out to Pluggable Authentication Modules (PAM) to authenticate the user, so it can support non-local authentication schemes such as LDAP and Kerberos via the usual Linux mechanism for that.
Notably, sudo-rs maintains a list of past sudo CVEs with an estimation of their applicability to sudo-rs. Most of them are listed as not applicable because the affected functionality is not implemented in sudo-rs.
Others
Once an organization starts to get larger, it quickly becomes advisable to maintain privilege and identity information in a centralized system such as LDAP or Active Directory. Some su- or sudo-like tools exist precisely with these use cases in mind. Sudo itself, for instance, has support for LDAP integration.
For the Kerberos network-authentication protocol, there is ksu or "Kerberos su", provided by the MIT Kerberos package. Sudo's website lists various other sudo alternatives as well. Most of these are outdated or otherwise not noteworthy, and some are system-specific tools for non-Linux systems or commercial products. Some on the list, such as priv, GNU userv, and ssu, look like long-abandoned pet projects or academic research from roughly 25 years ago.
As a surprise to many, Microsoft announced "Sudo for Windows" on February 7 as part of a Windows 11 insider-preview build. The blog post claims that Microsoft is "open-sourcing this project" on GitHub, but the only code available in the repository at the time this article was written is a PowerShell wrapper that calls out to sudo.exe. It is unclear where the source for that binary is hosted.
The announcement outlines a few different ways to configure the tool's behavior; it can either launch the privilege-elevated process in a new terminal window or in the existing window. The actual privilege elevation looks to be handled by the User Account Control (UAC) subsystem, complete with the graphical confirmation dialog. Sudo for Windows is not a port or fork of Miller's sudo, nor does it work the same way. The blog post also links to a separate sudo-like tool for Windows called gsudo, which it says has more features.
Identity and access management is certainly a rich and complicated topic, not to mention a delicate one. The tools and frameworks that we rely on daily for security in authentication and authorization are under constant scrutiny. The 30-year-old sudo has had a long run as the most popular tool for what it does, but perhaps the security diehards of OpenBSD, along with the memory-safety-focused Rust developers behind sudo-rs, are onto something. We shall have to wait and see what the future holds for sudo and its alternatives.
Windows NT synchronization primitives for Linux
The futex mechanism provided by the kernel allows for the creation of efficient and flexible locking primitives in user space. Futexes work well for many applications, but not all. One of the exceptions, it seems, is that perennially difficult-to-support use case: Windows games. With this patch series, Elizabeth Figura seeks to provide the sort of locking that those games need, by way of a special-purpose virtual device.
The performance of a futex can be hard to beat when it is used as intended; in the uncontended case, there is no need for a system call at all to acquire one. Unsurprisingly, though, the Windows NT locking primitives were not designed with the objective of being easy to implement efficiently with futexes; as a result, there are certain operations supported by Windows that are not straightforward to implement on Linux. At the top of the list are operations requiring the simultaneous acquisition of multiple locks.
Applications written for Unix-like systems normally do not suffer from the lack of Windows-style locking primitives, but Windows applications that have been made to run on Linux often will. Until now, these applications have been supported in Wine by creating a special process to arbitrate access to locks. That solution can work, but it adds an interprocess-communication overhead to every locking operation, which hurts performance. The new device takes the place of that process, handling locking in the kernel without the communication overhead.
To use this feature, a process opens the new special file /dev/ntsync. Every open of that file creates a new instance that is distinct from all of the others, so the intended use case is a single process that shares an instance across multiple threads. Each instance provides a whole set of ioctl() operations (all described in this patch). The first step in using those operations will be to create the locks to be managed by the device; they come in three flavors:
- A mutex is similar to the kernel equivalent; it is a lock that can be held by a single owner at a time. Locking calls can nest, though: once a thread has acquired a mutex it can do so again any number of times. Once all of the acquisition calls have been matched with release calls, the mutex is freed.
- A semaphore is a counter, as one would expect. Every acquisition decrements the counter by one; as long as the counter is nonzero, the semaphore remains available.
- An event is a condition variable; it has a boolean value, and threads can wait until it becomes true. If the event is marked for auto-reset, it will be reset to false as soon as a wait is satisfied, meaning that only one thread will see the event become true. Otherwise, an event, once set to true, stays that way until explicitly reset.
The NTSYNC_IOC_CREATE_MUTEX, NTSYNC_IOC_CREATE_SEM, and NTSYNC_IOC_CREATE_EVENT ioctl() calls can be used to create a mutex, semaphore, or event, respectively. On success, each of these operations returns a file descriptor that can be used to operate on the created object. The API is a bit different than one might expect, in that the file descriptor is not the return value from ioctl(); instead, it is stored in a structure passed by user space.
For example, to create a mutex, a thread starts with this structure:
    struct ntsync_mutex_args {
        __u32 mutex;
        __u32 owner;
        __u32 count;
    };
On entry to the NTSYNC_IOC_CREATE_MUTEX call, the value of mutex is ignored. The owner field is set to the (application-defined) ID of the initial owner of the mutex, while count is set to the number of times the mutex has been acquired by that owner. To create a mutex that is not yet owned by anybody, both of those fields will simply be set to zero. On a successful return, the file descriptor corresponding to this mutex will be stored in the mutex field.
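Putting that together, creating an unowned mutex might look like the sketch below; the <linux/ntsync.h> header name is an assumption based on the UAPI header added by the patch series:

    #include <sys/ioctl.h>
    #include <linux/ntsync.h>   /* assumed name of the series's UAPI header */

    /* Create a new, unowned mutex on an already-open /dev/ntsync
     * instance; returns the mutex's file descriptor, or -1 on error. */
    int create_unowned_mutex(int dev)
    {
        struct ntsync_mutex_args args = {
            .owner = 0,     /* no initial owner... */
            .count = 0,     /* ...and so an acquisition count of zero */
        };

        if (ioctl(dev, NTSYNC_IOC_CREATE_MUTEX, &args) < 0)
            return -1;
        /* The new object's file descriptor comes back in args.mutex. */
        return args.mutex;
    }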
A number of operations are provided for manipulating these objects. For mutexes, NTSYNC_IOC_READ_MUTEX will return the current state of a mutex, while NTSYNC_IOC_MUTEX_UNLOCK will unlock a (currently locked) mutex. A slightly strange one is NTSYNC_IOC_KILL_OWNER, which doesn't actually kill anything; it takes a thread ID as an argument and, if that ID is the owner of the mutex, that mutex will be freed and marked as "abandoned". The next attempt to acquire the mutex will return an error status of EOWNERDEAD, but the acquisition will have succeeded anyway.
For semaphores, NTSYNC_IOC_READ_SEM will read the current state, and NTSYNC_IOC_SEM_POST will add a given amount to the semaphore's count (perhaps releasing the semaphore). Events can be queried with NTSYNC_IOC_READ_EVENT and modified with NTSYNC_IOC_SET_EVENT, NTSYNC_IOC_RESET_EVENT, and NTSYNC_IOC_PULSE_EVENT. That last operation acts like an instantaneous set and reset of the event, allowing one or more waiting threads to proceed but never causing the event to appear to be set. The "pulse" operation is one of those that is hard to implement with futexes.
To actually acquire a mutex or semaphore involves calling either NTSYNC_IOC_WAIT_ANY (which will return as soon as it is able to acquire any one of a list of mutexes and semaphores or one of the indicated events is set) or NTSYNC_IOC_WAIT_ALL, which will only return when it is able to atomically acquire all of the indicated resources. The latter operation will make an attempt whenever one of the resources is freed, but will only succeed if all of them happen to be available. It will not hold a partial set of resources while waiting for the rest, so it could be subject to starvation if the resources are heavily contended. Both wait operations include an optional timeout.
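To give a sense of the shape of those wait operations, here is a hypothetical sketch of a wait on two mutexes. The request structure below is invented for illustration (the real layout is defined in the patch series); only the ioctl() names come from the patches:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>

    /* Hypothetical request layout, for illustration only; the real
     * structure is defined in the patch series. */
    struct wait_request {
        __u64 timeout;  /* optional timeout */
        __u64 objs;     /* pointer to an array of object file descriptors */
        __u32 count;    /* number of objects in that array */
        __u32 owner;    /* ID that will own any acquired mutex */
        __u32 index;    /* set on return: which object was acquired */
    };

    /* Block until either of two mutexes can be acquired. */
    int wait_for_either(int dev, int mutex_a, int mutex_b, __u32 me)
    {
        int fds[2] = { mutex_a, mutex_b };
        struct wait_request req = {
            .objs = (uintptr_t)fds,
            .count = 2,
            .owner = me,
        };

        /* NTSYNC_IOC_WAIT_ALL would instead return only once both
         * objects could be acquired atomically. */
        if (ioctl(dev, NTSYNC_IOC_WAIT_ANY, &req) < 0)
            return -1;
        return req.index;   /* which of the two was acquired */
    }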
The motivation behind this work becomes clear after a look at the benchmark results provided in the patch cover letter:
The gain in performance varies wildly depending on the application in question and the user's hardware. For some games NT synchronization is not a bottleneck and no change can be observed, but for others frame rate improvements of 50 to 150 percent are not atypical.
The question that has not been directly answered in the cover letter is whether the futex API could have been enhanced to provide the needed functionality without introducing an entirely new API. It would seem (though your editor, needless to say, has not tried to implement it) that the "pulse event" functionality would be relatively straightforward to add. Some aspects of the multi-resource wait operations were provided by the addition of futex_waitv() to the 5.16 kernel, but more work would clearly have to be done. It may well be that adding a standalone virtual device for this niche functionality is easier and less intrusive than trying to coerce futexes into doing the job.
The comments on the first version of the patch set were focused on the details of the API rather than whether a separate device was needed; they resulted in a number of changes leading to the API described here. Subsequent versions, the last of which was posted on February 14, have received relatively few comments so far. So, perhaps, the community is happy with this proposal in its current form, and Linux gamers can look forward to a 131% faster Lara Croft in the near future.
A proposal for shared memory in BPF programs
Alexei Starovoitov introduced a patch series for the Linux kernel on February 6 to add bpf_arena, a new type of shared memory between BPF programs and user space. Starovoitov expects arenas to be useful both for bidirectional communication between user space and BPF programs, and for use as an additional heap for BPF programs. This will likely be useful to BPF programs that implement complex data structures directly, instead of relying on the kernel to supply them. Starovoitov cited Google's ghOSt project as an example and inspiration for the work.
BPF programs already have several ways to communicate with user space, including ring buffers, hash maps, and array maps. However, there are drawbacks to each of these methods. Ring buffers can be used to send performance measurements or trace events to user-space processes — but not to receive data from user space. Hash maps can be used for this purpose, but accessing them from user space requires making a bpf() system call. Array maps can be mapped into a user-space process's address space using mmap(), but Starovoitov notes that their "disadvantage is that memory for the whole array is reserved at the start". Array maps (and the new arenas) are stored in non-pageable kernel memory, so unused pages have a noticeable resource-usage cost.
His patch series allows BPF programs to create arenas of up to 4GB in size. Unlike array maps, these arenas do not allocate pages up front. BPF programs can add pages to the arena using bpf_arena_alloc_pages(), and pages are automatically added when a user-space program triggers a page fault inside the arena.
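A rough sketch of what the BPF side might look like follows; the map definition, flags, and the bpf_arena_alloc_pages() declaration are assumptions drawn from the patch posting and its selftests, and could well change before merging:

    /* Sketch only: the map type, flags, and kfunc declaration here are
     * assumptions based on the patch posting and may change. */
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    #define __arena __attribute__((address_space(1))) /* 32-bit arena pointers */
    #define NUMA_NO_NODE (-1)

    void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr,
                                        __u32 page_cnt, int node_id,
                                        __u64 flags) __ksym;

    struct {
        __uint(type, BPF_MAP_TYPE_ARENA);   /* the new map type */
        __uint(map_flags, BPF_F_MMAPABLE);  /* user space may mmap() the arena */
        __uint(max_entries, 1000);          /* maximum arena size, in pages */
    } arena SEC(".maps");

    SEC("syscall")
    int populate(void *ctx)
    {
        /* Ask for one page now; more pages can be added later, or faulted
         * in on demand by user-space accesses. */
        void __arena *page = bpf_arena_alloc_pages(&arena, NULL, 1,
                                                   NUMA_NO_NODE, 0);

        return page ? 0 : 1;
    }

    char _license[] SEC("license") = "GPL";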
Seamless pointers
The patch series handles pointers inside the arena in an unusual way, ensuring that structures inside the arena can have pointers to other areas of the arena, and that this works seamlessly for both user-space programs and BPF programs. Neither kind of program needs to be aware that there are implicit conversions happening — even though the two programs have entirely different pointer representations. BPF programs represent pointers into the arena as 32-bit pointers in a separate address space (which the verifier ensures are not used as normal pointers or vice versa), but user-space programs see the pointers as normal pointers for their architecture. The user-space representation is the one that is actually stored in the arena memory. The kernel maps space for the arena such that the lower 32 bits of the user-space pointer always matches the BPF pointer, to keep conversions between the two representations fast.
For example, the series includes a program as part of the test suite that implements a hash table in BPF which uses linked lists to hold items that fall in the same bucket. The hash table can be populated in the kernel and then consumed from user space, or vice versa, with both being able to follow the pointers in the data structure.
The patch series introduces two functions, bpf_cast_kern() and bpf_cast_user(), to cast between the kernel representation of a pointer and the user-space representation. There is an associated patch to LLVM's BPF backend to insert these conversions automatically where appropriate, to ensure that the user-space version is the one stored in memory in the arena. The patch series does introduce a new flag (BPF_F_NO_USER_CONV) to let BPF programs turn off this behavior. Arenas that do not perform pointer conversion can still be mapped into user space, but user-space programs will not be able to follow pointers contained therein.
Review concerns
Barret Rhoden pointed out a problem with one detail of the implementation of this conversion. The initial version of the patch series leaves a hole in the arena (depending on where the arena is mapped in user space), so that BPF won't produce an object with a pointer ending in 0x00000000. Such an object would have an all-zero representation in the BPF program when converted into a 32-bit pointer, which could be confused with a null pointer and cause problems. Rhoden noted that "we'll have more serious issues if anyone accidentally tries to use that zero page", pointing out that if the BPF program tries to access the missing page, it will trigger a page fault and then die. Starovoitov agreed, saying that he would remove the missing page in version 2 of the series and that the logic was "causing more harm than good".
With the hole in the arena removed, BPF programs will need to avoid putting an object at the zero address and then trying to take a pointer to it, which is easily accomplished by adding some padding to the start of the arena. Ensuring that the kernel and user space agree on the lower 32 bits of arena pointers is useful because it keeps the code generated by the BPF just-in-time (JIT) compiler simpler and therefore faster. If user space could map the arena at any address — as was the case in the initial version of this patch series — the representation of the arena in the kernel would be somewhat more complex, and additional logic could be required to handle wraparound of arena addresses cleanly.
Rhoden and Starovoitov continued discussing this detail, and eventually concluded that there was no reason to support mapping arenas to truly arbitrary addresses. Rhoden remarked that "the restriction of aligning a 4GB mapping to 4GB boundary is pretty sane."
Lorenzo Stoakes objected to the way in which the patch series allocates pages: it uses vmap_pages_range(), a function internal to the kernel's virtual-memory allocator, to map pages into the arena. Stoakes said: "I see a lot of checks in vmap() that aren't in vmap_pages_range() for instance. [Are we] good to expose that, not only for you but for any other core kernel users?"
Johannes Weiner responded to say that the "vmap API is generally public", and that the "new BPF code needs the functionality of vmap_pages_range() in order to incrementally map privately managed arrays of pages into its vmap area". He went on to note that the function used to be public, and was made private when other external users of the function disappeared. Christoph Hellwig expressed dissatisfaction in another branch of the conversation: "We need to keep vmalloc internals internal and not start poking holes into the abstractions after we've got them roughly into shape."
While reviewing the changes internal to the BPF code, Andrii Nakryiko raised concerns about how the new arenas calculate their size. Existing BPF maps keep track of the size of keys, the size of values, and the total number of entries that can fit in the map. This works well for hash maps and array maps, but is not a good fit for the new arenas. Starovoitov decided to represent the arenas as having a key size and a value size of 8 bytes "to be able to extend it in the future and allow map_lookup/update/delete to do something useful". Nakryiko asserted that they "should probably make bpf_map_mmap_sz() aware of specific map type and do different calculations based on that", going on to point out that arenas are unlikely to be operated on using the normal BPF interfaces for looking up map entries.
Donald Hunter questioned why arenas were being represented in the code as a new kind of map at all, asking whether this was "the only way you can reuse the kernel / userspace plumbing?" Starovoitov replied that many of the existing maps usable by BPF programs don't support some map operations. Bloom filters and ring buffers in particular (two existing map types similar in some ways to the new arenas) do not support lookup, update, or delete operations. He went on to say that arenas "might be one the last maps that we will add, since almost any algorithm can be implemented in the arena".
Starovoitov quickly incorporated this feedback, and published version 2 of the patch series. He had not addressed Hellwig's concerns about exposing the low-level details of the virtual-memory allocation code, however. Hellwig reiterated his position, saying: "The vmap area is not for general abuse by random callers". Starovoitov responded that Hellwig ought to suggest an alternative if exposing the vmap_pages_range() function is unacceptable. Linus Torvalds chimed in to say that it is "not up to maintainers to suggest alternatives"; "The onus of coming up with an acceptable solution is on the person who needs something new".
Discussion of this version of the patch series is ongoing but, apart from Hellwig's objection to exposing low-level details of the virtual-memory allocation code, the remaining concerns are relatively minor or have already been addressed. Being able to seamlessly share memory between BPF programs and user-space code is an attractive proposition, so it seems likely that this work will eventually make it in, even if doing so will require finding a different way for the BPF arena to allocate pages on demand.
A modest update to Qubes OS
Qubes OS is a security-focused desktop Linux distribution built on Fedora Linux and the Xen hypervisor. Qubes uses virtualization to run applications, system services, and device access via virtual machines called "qubes" that have varying levels of trust and persistence, providing an open-source "reasonably secure" operating system with "serious privacy". The Qubes 4.2.0 release, from December 2023, brings a number of refinements to make Qubes OS easier to manage and use.
A quick overview
Qubes OS is designed to be a single-user desktop operating system that provides strong security out of the box through isolation between applications and services, rather than trying to ensure that the applications or services are secure in and of themselves. The vision for Qubes is laid out in the Qubes OS architecture document written in 2010. While that specification isn't fully implemented yet, each release brings Qubes a bit closer to the ideal.
As currently implemented, Qubes uses the Xen hypervisor to run a Fedora-based admin qube (dom0) with direct hardware access, which provides administration and orchestration of unprivileged guest domains (domU). Those domains are based on templates (VM data stored as LVM volumes) and are used to run applications (app qubes) or to provide services (service qubes) like networking, USB access, and more to the app qubes. For example, networking and firewall services are each provided by separate system qubes ("sys-net" and "sys-firewall", respectively), and access to USB devices is through "sys-usb". Note that the Qubes website and documentation tend to use the terms "VM" and "qube" interchangeably.
Templates are the starting point for app and system qubes—app qubes take their root file system (that is, programs and system files) from templates. Any software that users want to persist in an app qube should be installed in a template, rather than an app qube, otherwise it will be discarded when the app qube restarts. If a user wants Emacs or LibreOffice, the Qubes way is to install it into one of the templates and then spin up an app qube based on that template to use the application.
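For example, adding LibreOffice for use by Fedora-based app qubes might look like this from a dom0 terminal (the template name is illustrative; actual names vary by release):

    # Install into the template, not into an app qube:
    qvm-run --user root fedora-39 'dnf install -y libreoffice'
    # Shut the template down; app qubes based on it will see the
    # new software the next time they start.
    qvm-shutdown --wait fedora-39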
Each qube has a level of trust somewhere between "unsafe and untrusted" and "safe and ultimately trusted". The admin qube, for example, is considered safe and ultimately trusted. The sys-net and sys-usb qubes are considered untrusted, and the firewall qube is considered moderately trusted. Qubes OS ties all of that together and presents the user with a coherent desktop experience. To the user, it is meant to feel like using a regular desktop environment and applications, rather than using a half-dozen or more VMs that are unaware of one another. Qube windows are displayed with colored borders, to give users visual cues about which qube is running the application and its safety level.
LWN last looked at Qubes ahead of the 4.1.0 release in October 2021. That release made major overhauls to the Qubes architecture, splitting out display handling to its own domain and making changes to the Qrexec policy system. This release follows up those changes with a number of more user-visible changes such as rewrites of several Qubes GUI management tools, simpler split GPG management (which lets users store private GPG keys in a trusted qube and make use of them in less trusted qubes), changes to default Fedora and Debian templates, and more.
Qubes's approach to security means a more complex, and sometimes cumbersome, user experience. Moving from a Linux distribution like Fedora or Debian to Qubes OS will take more adjustment than one might expect. For example, installing software on a Fedora desktop is usually as simple as "dnf install package". But installing software to use within a Fedora-based qube requires several additional steps on Qubes OS, plus restarting VMs. Other activities, such as configuring a Bluetooth input or audio device, are much more complicated and not well-documented. Then again, it's also not encouraged—Bluetooth isn't considered secure, so why focus on making it easier to configure? But when it comes to using Qubes OS as intended, this release includes some major work to add polish and improve the user experience.
GUI application improvements
One of the first improvements users will notice is the redesigned application menu, first made available as a preview in Qubes 4.1, and now the default. On a "normal" Linux distribution, the menu of applications generally only has to display one version of Firefox, one terminal, one file manager, and so forth. Qubes, however, helps users work more securely by compartmentalizing applications to qubes by task or profile. How users organize their work is up to them, but Qubes offers "work", "personal", and "untrusted" qubes by default—each qube with its own installation of Firefox, terminal, and file manager. (These are color coded when running, so users might see a yellow border for personal applications, a blue border for work, and red for untrusted.)
The Qubes model of separating activities into isolated compartments is good for security—users can visit untrusted sites in the untrusted qube, restrict banking to another qube, and separate work in yet another qube—but more challenging to present in a user-friendly fashion. Prior versions of Qubes had a single-menu layout that was unwieldy as the number of applications, templates, and services grew. The current application menu organizes application qubes, template qubes, and service qubes separately, and breaks out Qubes tools like the global configuration and policy editor into their own menu. The effect is still busy compared to a "regular" desktop distribution, but it does seem a marked improvement over the old menu. The ability to add applications from various qubes to a Favorites menu is a great improvement, though there is no obvious way to configure the application menu to display favorites immediately when first opened. Perhaps this will show up in the next Qubes release—if it does, it will probably appear in the Qubes global configuration application.
The global configuration application in 4.2.0 represents work that the project started discussing in September 2021. In the ticket discussing the design, Nina Eleanor Alter described the target demographics for the global UI as non-technical, high-risk users, and technical users "excited about Qubes but lacking the attention span or time to copiously read whitepapers or the docs". Alter said that Linux users may be comfortable with multiple applets to configure system behaviors but "it delivers a poor execution and discovery experience to all users"; and users coming from Windows or macOS expect a single settings UI.
The idea is to make Qubes more discoverable, and the new UI does this by bringing together settings for file access, clipboard handling, updates, USB devices, URL handling, miscellaneous general settings, and device information. Users have a single GUI for working with system-wide settings that were not particularly discoverable in prior versions, such as setting up split GPG.
The Create New Qube application has been updated too, though Qubes 4.2.0 seems to have shipped with both the old and new applications under different labels in the Applications Menu. The new application is titled "Create New Qube" and the old application is listed as "Create Qubes VM", though both show "Create New Qube" in the title bar when running.
As shown in the screenshot, the new and improved version provides access to more options and settings, as well as some guidance provided via tooltips. (One note on tooltips in Qubes—while working in Qubes, tooltips displayed in various applications lingered long after moving the mouse, switching windows, or even navigating to another workspace.) The current iteration of the Create New Qube application does seem more intuitive than the old, and provides the ability to choose the default applications available, set initial RAM for the qube, and more.
The Qubes Update application (appropriately) received an update in this release as well. Qubes includes Fedora, Debian, and Whonix templates as part of the default installation and provides access to many others. Over time it would be trivial to have half-a-dozen template OSes that need regular updates. The Update application streamlines this by checking in the background for updates and then notifying of updates for running qubes at regular intervals. It will also attempt to perform updates every seven days for templates that are not used in that timeframe, though this interval is configurable, or users can update them manually. After updates have been staged, the updater will offer to restart qubes based on the updated templates. Qubes that have running applications will not be targeted for restart by default, so users can run updates without fear that Qubes will unceremoniously shut down their work.
Template updates
Another interesting change with this release is the use of Xfce editions of Fedora and Debian instead of GNOME, to reduce memory usage and provide a better selection of default applications. Marek Marczykowski-Górecki said that Fedora's GNOME template has too many "problematic" packages that "either conflict with something or simply don't work with our GUI agent". The project had been looking for ways to slim memory usage in Fedora qubes for some time, with a number of GNOME packages targeted for exclusion, including GNOME Tracker. Note that the Qubes OS default desktop has been Xfce since the 3.2 release in September 2016.
Support for SELinux in Fedora templates has been a long time in coming. The issue tracking the work was opened in 2018, while the work finally landed in February 2023 and then made its way into the 4.2.0 release. One might wonder why exactly users might need or want SELinux in Fedora qubes, given that Qubes OS is meant to be a single-user system. Each qube is already isolated from the others, and the user has full run of each qube. Templates, for example, allow sudo with no password because all of the user data in a running qube is available to the same person anyway, so there's little sense in forcing them to type a password every time they use sudo. Even though Qubes does little to restrict user privileges within each qube, Marczykowski-Górecki noted that the addition of SELinux is useful for applications that provide sandboxing inside a Fedora template, like Podman or bubblewrap, and also helps provide extra hardening when using qvm-copy to send files between qubes.
A modest update
Overall, 4.2.0 is a somewhat modest update in terms of new features—though it does contain plenty of the usual version updates and bug fixes. But the focus on improving Qubes OS usability is important. While popular Linux distributions like Fedora or Ubuntu count users in the millions, the Qubes project counts its users in the tens of thousands. Surely more users need what Qubes has to offer, but security tools that are too hard to use tend not to be used. Bolstering Qubes usability is just as important as striving toward implementing the Qubes architecture specification.
Open-source AI at FOSDEM
At FOSDEM 2024 in Brussels, the AI and Machine Learning devroom hosted several talks about open-source AI models. With talks about a definition of open-source AI, "ethical" restrictions in licenses, and the importance of open data sets, in particular for non-English languages, the devroom provided an overview of the current state of the domain.
An AI model is a program that has been trained on a data set to recognize patterns, mimic the learned data in its output, or to make some kinds of decisions autonomously. Most notably, large language models (LLMs), which are extensive neural networks capable of generating human-like text, were a recurrent subject at FOSDEM. This report comes from the live-streams of the talks, as the flu unfortunately prevented me from attending FOSDEM in-person this year.
Typically, an LLM incorporates up to several hundred billion "weights", which are floating-point numbers that are also referred to as "parameters". Companies developing large language models are not inclined to release their models and the code to run them as open source, since training the models requires significant computing power and financial investment. However, that doesn't stop various organizations from developing open-source LLMs. Last year, LWN looked at open-source language models.
License restrictions
Niharika Singhal, project manager at the Free Software Foundation Europe (FSFE), talked about the trend of imposing ethical restrictions on AI models through licensing. Singhal provided several instances of added restrictions of that sort, related to field of endeavor, behavior, or commercial practices. One is the Hippocratic License, which restricts the licensee from executing numerous actions deemed harmful based on various "international agreements and authorities on fundamental human rights norms". There's also the Llama 2 use policy, which prohibits use of the LLM for violent or terrorist activities, as well as "any other criminal activity". Similarly, BigScience's OpenRAIL-M License imposes restrictions on the use of models for various harmful activities.
According to Singhal, these additional restrictions have serious implications: "They create barriers against the use and reuse of the models, which also makes it more difficult to adapt and improve the models." She believes that to preserve "openness" in AI, the licenses of AI models must be interoperable with free-software licenses, which isn't the case with these restrictions. She concludes that licenses can't be a substitute for regulation: "Restrictive practices to comply with ethical rules shouldn't be in licenses: these belong to the domain of regulations."
A definition of open-source AI
Stefano Maffulli, executive director of the Open Source Initiative (OSI), described OSI's efforts to define open-source AI. In 2022, the OSI started contacting researchers, other "open" organizations, technology companies, and civil-rights organizations, to ask them about their ideas for an open-source AI system.
As a general principle, Maffulli maintains that the GNU Manifesto's Golden Rule should be applicable to AI: "If I like an AI system, I must be free to share it with other people." For an AI system to be categorized as open-source, it needs to grant us adaptations of the four basic freedoms applicable to open-source software: to use, study, modify, and share.
We need to be able to use the system for any purpose and without having to ask for permission. We need to be able to study how the system works and inspect its components. We need to be able to modify the system to change its recommendations, predictions, or decisions to adapt to our needs. And we need to be able to share the system with or without modifications, for any purpose.
According to Maffulli, a pertinent question to pose in this context is: "What is the preferred form to make modifications to an AI system?" To get an answer to this question, OSI has created small working groups to analyze some popular AI systems. "We're starting with Llama 2 and Pythia, two LLMs. After this, we'll repeat the same exercise with BLOOM, OpenCV, Mistral, Phi-2, and OLMo." For each of these AI systems, the working group will identify the requirements to guarantee the four basic freedoms. For example, understanding why a given input produces a particular output is necessary to be able to study an AI system.
In 2024, the OSI will release a new draft of the open-source AI definition monthly, based on bi-weekly virtual public town halls. "Our goal is to have a 1.0 release by the end of October", Maffulli said. Everyone is welcome to partake in the discussions regarding the drafts in OSI's public forum.
According to Maffulli, there can't be a spectrum when it comes to open-source AI: either an AI system is open source, or it isn't. Nevertheless, many players within the domain of large language models misuse the term "open source". For example, one of the most popular "open" LLMs is Meta's Llama 2. When Meta's Yann LeCun announced this model on Twitter last year, he wrote: "This is huge: Llama-v2 is open source, with a license that authorizes commercial use!". However, the Llama 2 license has limitations on its commercial use that are based on the number of active users. It also forbids using Meta's model to improve other LLMs. Both limitations are at odds with the OSI's Open Source Definition.
Open data sets
Julie Hunter, a research engineer at the French software company Linagora, discussed building open-source language models. According to Hunter, the LLMs developed by Meta, as well as those by MosaicML and the Technology Innovation Institute's Falcon models, are so-called "open-weight models": the weights of the neural networks are published. This allows a choice of how the model is run and the model can be fine-tuned by adapting the weights with additional training. However, the weights don't explain why something works or doesn't work. "Without access to the data the model is trained on, it leaves a lot open to guesswork", Hunter said.
There has been a push for open training data and, as a result, a lot of data sets have been added to web sites like the one run by Hugging Face. "Anyone can train their new LLM on these data sets", Hunter said. "However, there are several problems with many of these data sets. They are often crawled from the web, packed with personal information, toxic language, and low-quality sentences. Furthermore, they are predominantly English."
The OpenLLM France consortium aims to build open-source AI models and technologies for the French language. For its first model, Claire, the main goal was to create a French data set with traceable licenses. The Claire French Dialogue Dataset is a corpus containing 140 million words from transcripts and stage plays in French, as well as from parliamentary discussions. This data set, Claire-Dialogue-French-0.1, mostly uses the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, though some parts have other (traceable) licenses.
The data set was used to fine-tune an open-weights model, Falcon-7B. "The main purpose of this approach was to evaluate the impact of a good data set on the performance of the model", Hunter said. Michel-Marie Maudet, Linagora's general manager, added that the company's idea of developing a language model based on a small and high-quality corpus of data was inspired by Microsoft Research's paper "Textbooks Are All You Need". He continued:
The quality of a data set is more important than its quantity. A small and high-quality corpus results in a compact, specialized model with superior control over its responses in terms of interpretability and reliability. It also makes training faster, which allows us to continuously update it.
In October 2023, the model Claire-7B-0.1 was published on Hugging Face. The code to train the model was also made public, under the AGPLv3.
Beyond English
OpenLLM France is now working on a 100% open-source language model, Lucie, slated for release in April 2024. Maudet explained: "This model is trained with 100% open-source data sets of French, English, German, Spanish, and Italian texts, as well as some computer code." The data sets include the archives of the French national library and academic publications with open access.
Maudet's talk presented some details about OpenLLM France and its mission. The community, which started in July 2023, boasts over 450 active members, ranging from academic institutions to companies. Why is a France-focused LLM consortium required? Maudet explained that an exploration into the geographical distribution of LLMs with more than a billion parameters since 2018 reveals that nearly 70% of them are created in North America, and only 7.5% in Europe. Upon examining the language distribution in Llama 2's training data, the figures seem even more dismal: "While English comprises almost 90% of the data, European languages such as German and French account for just 0.17% and 0.16% of the data, respectively." Because European languages are underrepresented in their data sets, models like Llama 2 exhibit subpar performance in these languages.
There have been similar initiatives in other parts of Europe to build a European-language open-source LLM, such as LAION and openGPT-X in Germany and Fauno in Italy. At FOSDEM, Maudet announced that OpenLLM France is renaming itself to OpenLLM Europe (though the web site is not available yet). "Our mission is to develop an open-source LLM for each European language."
Conclusion
The fact that organizations call their AI systems "open source" even if their license is at odds with the four basic freedoms is a sign that we really need to have a clear definition of open-source AI. Hopefully, OSI's definition—expected by the end of 2024—will also help stop the proliferation of licenses with various well-meant but detrimental ethical restrictions. Beyond that, it would be beneficial for a consortium such as OpenLLM Europe to attract enough members to build powerful open-source LLMs beyond English.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: KeyTrap vuln; User-space scheduler in Rust; Agama; Hare 0.24; RawTherapee 5.10; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.