LWN.net Weekly Edition for November 10, 2022
Welcome to the LWN.net Weekly Edition for November 10, 2022
This edition contains the following feature content:
- Moving past TCP in the data center, part 2: a description of the proposed "Homa" network protocol for data-center use.
- Using certificates for SSH authentication: a relatively unused SSH feature that can ease login administration.
- Two performance-oriented patches: epoll and NUMA balancing: a pair of patches taking different approaches to performance optimization.
- Better CPU selection for timer expiration: even old kernel subsystems have room for improvement.
- A report from the 2022 Image-Based Linux Summit: a recently held meeting charting out new ways to build secure Linux distributions.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Moving past TCP in the data center, part 2
At the end of our earlier article on John Ousterhout's talk at Netdev 0x16, he had concluded that TCP was unsuitable for data-center environments for a variety of reasons. He also argued that there was no way to repair TCP so that it could serve the needs of data-center networking. In order for software to be able to use the full potential of today's networking hardware, TCP needs to be replaced with a protocol that is different in almost every way, he said. The second half of the talk covered the Homa transport protocol that he and others at Stanford have been working on as a possible replacement for TCP in the data center.
The Homa project set out to design, from scratch, a protocol that would be ideal for the needs of data-center networking today. It turned out to be different from TCP in each of the five aspects that he had covered in the first half of the talk; the choices Homa makes in those areas work well together to "produce a really really high-performance data-center protocol". But, he stressed, Homa is not suitable for wide-area networks (WANs); it is only for data centers.
Homa overview
To start with, Homa is message-based, rather than byte-stream-based, which means that the "dispatchable units" are recorded in the protocol. Each message is made up of multiple packets on the wire that get assembled into a full message for application processing. This solves the thread load-balancing problems that TCP has, because multiple threads can safely read from a single socket; it will also allow network interface cards (NICs) to directly dispatch messages to threads in the future should NICs ever gain that support.
Homa is connectionless; its fundamental unit is a remote procedure call (RPC), which consists of two messages: a request and a response. "The notion of 'round trip' is explicit in the Homa protocol." RPCs are independent with no ordering guarantees between them; multiple RPCs can be initiated and can finish in any order. There is no long-lived connection state stored by Homa; once an RPC completes, all of its state is removed. There is a small amount (roughly 200 bytes) of state stored for each peer host, for IP routing information and the like, however. There is also no connection setup overhead and a single socket can be used to send any number of RPCs to any number of peers. Homa ensures that an RPC either completes or an error is returned, so application-level timers are not needed.
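To make the model concrete, here is a hypothetical sketch (in C) of what a connectionless, message-based RPC interface might look like from an application's point of view; the rpc_socket(), rpc_send(), and rpc_recv() functions are invented for illustration and are not the actual interface of the Homa kernel module:

    /* A hypothetical sketch of a connectionless, message/RPC-oriented
     * socket API as seen from the application side.  These functions are
     * invented for illustration; they are not the Homa module's actual
     * interface. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <netinet/in.h>

    extern int      rpc_socket(void);                      /* hypothetical */
    extern uint64_t rpc_send(int fd, const struct sockaddr_in *dest,
                             const void *msg, size_t len);  /* start an RPC */
    extern ssize_t  rpc_recv(int fd, void *buf, size_t len,
                             uint64_t *completed_id);       /* any finished RPC */

    void example(const struct sockaddr_in *server_a,
                 const struct sockaddr_in *server_b)
    {
        char response[4096];
        uint64_t done;
        int fd = rpc_socket();        /* one socket, no connection setup */

        /* Two independent RPCs to two different peers over the same socket;
         * there is no ordering guarantee between them. */
        uint64_t id_a = rpc_send(fd, server_a, "request A", 9);
        uint64_t id_b = rpc_send(fd, server_b, "request B", 9);

        /* Responses can complete in either order; each RPC either completes
         * or returns an error, so no application-level timers are needed. */
        rpc_recv(fd, response, sizeof(response), &done);  /* done is id_a or id_b */
        rpc_recv(fd, response, sizeof(response), &done);
        (void)id_a; (void)id_b;
    }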
Congestion control in Homa is receiver-driven, which has major advantages over the sender-driven congestion control used by TCP. Ousterhout described how flow control works in Homa. When a message needs to be sent to a receiver, the sender can transmit a few "unscheduled packets" immediately, but additional "scheduled packets" must each wait for an explicit "grant" from the receiver. The idea is that there are enough unscheduled packets to cover the round-trip time between the hosts; if there is no congestion, the sender will have received some grants by the time it has sent all of the unscheduled packets. That allows sending at the hardware line speed.
A receiver can choose not to send grants if it detects congestion in its top-of-rack (TOR) switch; it can pause or slow the grants until that condition subsides. It can also prioritize messages by sending or denying grants based on the remaining size of the message. Having the message size available in the protocol is helpful because "message sizes allow us to predict the future". Once a single packet of a message is received, the receiver knows how many unscheduled packets are coming, how many scheduled packets remain after that, and it can quickly decide on a strategy to minimize the congestion that could result from all of the different in-progress messages it is receiving. It can react much more quickly than TCP can.
Homa also takes advantage of the priority queues that modern switches have; each egress port in those switches typically has 8-16 priority queues. It uses the queues to implement the "shortest remaining processing time" (SRPT) algorithm; when a receiver has multiple messages in progress, it allows the shorter messages to use the higher-priority queues on the switch. In addition, the queues provide a way to achieve a "much better tradeoff between throughput and latency".
"Buffers are funny; you can't live with them, but you can't live without them either." Some amount of buffering is needed at the switch to ensure that link bandwidth is not wasted if the higher-priority senders stop sending for any reason. Throughput is maintained by having lower-priority packets queue up and be sent when there is a lull from a higher-priority sender; Homa can then send grants to other senders to get them ramped up. This "overcommitment" of granting more messages than can be handled, coupled with the prioritized buffering leads to higher throughput with low latency on the important, shorter messages, he said.
One might think that long messages could suffer from starvation on a loaded network. He said that he tried to produce that behavior in Homa so that he could study it, but found it really hard to do; he could create starvation scenarios, but had to "contort the benchmarks" to do it. Nevertheless, to guard against starvation, Homa takes a small portion of the receiver's bandwidth (typically 5-10%) and uses it for the oldest message, rather than the shortest message as strict SRPT would dictate. That guarantees that those messages make progress; eventually their remaining size gets small enough to be prioritized with the other short messages.
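As a rough, self-contained illustration of that receiver-side policy (this is not code from the Homa implementation), a grant scheduler might pick the shortest remaining message most of the time while reserving a small share of grants for the oldest one:

    /* Conceptual sketch of SRPT-with-anti-starvation message selection on
     * the receiver; an illustration of the policy described above, not
     * code from the Homa implementation. */
    #include <stddef.h>

    struct inbound_msg {
        size_t        bytes_remaining;   /* scheduled bytes still to be granted */
        unsigned long start_time;        /* when the message began arriving */
    };

    /* Choose the message that gets the next grant.  Roughly one grant in
     * twenty (about 5% of the receiver's bandwidth) goes to the oldest
     * message instead of the shortest one, so nothing starves. */
    struct inbound_msg *next_grant(struct inbound_msg *msgs, int n,
                                   unsigned long *counter)
    {
        struct inbound_msg *shortest = NULL, *oldest = NULL;
        int i;

        for (i = 0; i < n; i++) {
            if (msgs[i].bytes_remaining == 0)
                continue;
            if (!shortest || msgs[i].bytes_remaining < shortest->bytes_remaining)
                shortest = &msgs[i];
            if (!oldest || msgs[i].start_time < oldest->start_time)
                oldest = &msgs[i];
        }

        if (++(*counter) % 20 == 0)
            return oldest;      /* reserved share: keep old messages moving */
        return shortest;        /* normal case: shortest remaining first */
    }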
Homa does not rely on in-order packet delivery; packets can arrive in any order and the receivers will sort them as needed. In practice, the packets arrive nearly in order anyway, he said, so it is not computationally expensive to do the reordering. He believes that Homa eliminates core-congestion problems in data centers, unless there is simply too much traffic overall, because packets can take different paths through the core fabric. That leads to better load balancing in the fabric as well as across CPU cores on the receiving hosts.
Replacing TCP?
It is hard to imagine a standard more entrenched than the TCP protocol is, Ousterhout said, so "I undertake this with full realization that I may be out of my mind". Based on the results he has seen from Homa, he has set a personal mission to figure out a way for Homa to take over a substantial portion of TCP's traffic in the data center. Either that, or learn why it is not possible; "I'm going to keep going until I hit a roadblock that I simply can't solve".
The first step to doing that is to have a production-quality implementation of Homa "that people can actually use". He is "maybe a little bit of a freak among academics", he said, because he loves to write code. A few years ago, he set out to write a Linux kernel driver for Homa; he "knew nothing about the Linux kernel", but now has a working driver for Homa.
The Homa module runs on Linux 5.17 and 5.18 and is not meant as a "research prototype", which is a term that he does not like. A research prototype is something that does not work, "but you can somehow write a paper about it and make claims about it, even though it doesn't actually really work". The Homa module is nearing production quality at this point, he said; the only way to find and fix the remaining bugs is to start running it in production.
In terms of performance, Homa "completely dominates" TCP and Data Center TCP (DCTCP) for all workloads and all message sizes, he said. He gave a few sample benchmarks that showed 3-7x improvements in short-message latency at the 50th percentile, and 19-72x improvements in the 99th percentile (the "tail latency"). More information about the kernel driver for Homa can be found in his paper from the 2021 USENIX annual technical conference.
Applications
The biggest problem he sees with his mission is the huge number of applications that directly use TCP via the socket interface. Homa's message-based API is different, and he has been unable to find a way to "slide it in under the existing TCP sockets interface"; nor is it realistic to plan to convert even a substantial fraction of those applications to a new API. Many, perhaps most, of the applications that directly use TCP sockets also need to work over the WAN, where Homa is not a good choice; others really do not need the performance boost that Homa brings. The applications that will really benefit from Homa are the newer data-center applications that mostly use one of a handful of RPC frameworks.
Adding Homa support to RPC frameworks would allow applications that use them to switch to Homa with a small configuration change, instead of a major code change; much like applications can change the server name they use, they would be able to choose Homa instead of TCP. He has work in progress on adding Homa support to gRPC. The C++ gRPC integration is working now, though without support for encryption, while the Java gRPC support for Homa is "embryonic" but currently underway.
Working with gRPC has made him "very sad", however, because it is "unimaginably slow". The round-trip latency for gRPC on TCP is 90µs, with two-thirds of that time spent in the gRPC layers on the client and server (30µs each). If the goal is to get to 5µs round trips, it is pretty clear it cannot be done using gRPC, he said. With Homa, gRPC is roughly twice as fast, even in the gRPC layers on the endpoints, which he does not really have an explanation for. But even that is far from the goal, so he believes a "super lightweight" framework needs to be developed eventually.
Even with Homa, though, performance is still an order of magnitude off from what the hardware is capable of, he said. "Network speeds are increasing at this amazing rate, CPU speeds aren't changing much." Software simply cannot keep up and he sees no solution to that problem. So software implementations of transport protocols no longer make sense; those protocols need to move into the NICs.
Moving to user space
Some people will say that you simply need to get "those terrible operating systems" out of the way and move everything to user space; that will help some, he said, but it is not enough. In the RAMCloud storage project, Stanford researchers implemented Homa in user space and were able to achieve 99th percentile round-trip latencies of 14µs, while the tail latency for Homa on Linux is 100µs.
The primary culprit for high tail latencies is software overhead and, in particular, load-balancing overhead. A single core cannot handle more than 10Gbps but as soon as you split the job over multiple cores, there is an inherent performance penalty just for doing the split. Beyond that, there is the problem of hot spots, where too much work ends up on certain cores while others are idle or nearly so; those can cause spikes of high latency.
He has not measured it, but his theory is that the inherent slowdown from splitting up the job to multiple cores is due to cache interference. Lots of cache-coherency traffic coming from the split reduces the performance when compared to doing the processing on a single core. It is not just his Homa implementation that sees a problem of this magnitude; he looked at Google's Snap/Pony system, which sees a 4-6x efficiency loss when moving from single to multiple cores. He said that the rule of thumb is that if you need to go beyond a single core, a second or third core is not enough; "you actually have to go to four cores before you actually get any performance improvement over one core".
If you look at some of the numbers from various papers, Ousterhout said, you might conclude that moving the protocol processing to user space is a good idea. But, all of those user-space implementations are "basically just research prototypes and they are way oversimplified"; they do not do a lot of the processing that any production system would require. They are also measured under the most ideal conditions, with either no load balancing or perfect by-hand partitioning, handling only short messages ("that's easy"), and so on.
But Snap is a production-quality user-space implementation that can be compared; it is roughly 2x better than Homa in the Linux kernel. If you look at the number of cores needed to drive 80Gbps (an 80% loaded 100Gbps network) bidirectionally, Snap requires 9-14 cores, while Linux Homa requires 17. So user space is better, but not on the scale needed to hit the goals.
To the NIC
The only alternative that he sees is to move the protocol processing of the transport layer into the NIC. Applications would access the NIC directly, bypassing the kernel, via a message-based interface; "the notion of a packet never gets to software". Other features like load balancing by choosing an idle application thread, virtualization, and encryption should be added to the NICs as well.
The architecture for these NICs is not going to be easy to figure out, he said; they need to process packets at line rate, while being programmable to support multiple protocols. It is also important that the programs for those protocols can still be open source, Ousterhout said. None of the existing "smart NIC" architectures is adequate, he thinks, so it makes for an interesting problem for computer architects to solve.
Homa is "not without its controversies"; there have been several papers that have claimed to find problems with Homa. There are also some alternatives being proposed in other papers, but "all of these papers have pretty significant flaws in them, unfortunately". They have used Homa in unrealistic configurations or hobbled it in some fashion in his opinion; while he does not "want to get into a food fight" about them, he has gathered some of his responses on the Homa wiki.
But there is a bigger "meta question" that needs to be answered: "do applications actually care about harnessing the full network performance?" Today, we are "stuck in a no-chicken-no-egg cycle": there are no applications that require the full hardware capabilities, because there would be no way to run them. So today's applications make do with the performance provided by today's network stacks, and there is no real incentive to make the infrastructure faster, because no one is "screaming for it".
So, he wondered, if we make the networking "as blindingly fast as I think we can do", will new applications appear that people have not even thought about today because the capability is not available? He would be interested in hearing about any applications that would be dramatically improved by these performance improvements. As an academic, he does not actually need to have a market; he can build something and hope that the applications come along later, but that "would be a terrible way to do a startup".
He concluded by reiterating the main points of what he thinks needs to be done if applications are going to take advantage of the amazing advances in networking speeds. A new transport protocol is needed; he is obviously a Homa advocate but would be happy to discuss other possibilities for that. Beyond that, a new lightweight RPC framework will be important, and, ultimately, whatever transport protocol is used will need to move down into the NIC. Ousterhout's keynote laid out a vision of a fairly radical reworking of the state of data-center networking today; it will be interesting to see how it plays out over the coming years.
Using certificates for SSH authentication
SSH is a well-known mechanism for accessing remote computers in a secure way; thanks to its use of cryptography, nobody can alter or eavesdrop on the communication. Unfortunately, SSH is somewhat cumbersome when connecting to a host for the first time; it's also tricky for a server administrator to provide time-limited access to the server. SSH certificates can solve these problems.
Verification
The OpenSSH project maintains what is, in practice, the reference implementation of the SSH protocol. In the default configuration, its SSH client asks the user to verify the server's identity (its "host key") when first connecting to it, by examining a key fingerprint, which is a secure checksum for the server's public key:
    The authenticity of host 'stamina (10.2.2.176)' can't be established.
    ED25519 key fingerprint is SHA256:cawPF9UX0DwRb5F1kr55JkQ9xNNXZEeqG8MtuV82dRU.
    Are you sure you want to continue connecting (yes/no/[fingerprint])?
The user is asked to verify the host key by comparing these long, effectively random strings of characters; in the example, the fingerprint is the base64-encoded hash following "SHA256:". This manual check is tedious and error-prone.
There are alternatives to checking the fingerprint manually. The server administrator can publish a list of host keys for their servers, in a format that the client can use in its "known hosts" file. The file could even be updated by a centralized IT department on the user's machine. If there is no centralized IT, the user could update the list themselves. There's also a way to publish the host keys over DNSSEC, to avoid having to update known-host lists on user machines.
But what usually happens is that the user answers "yes" by reflex.
Authentication
The user then needs to authenticate themselves to the server. Traditionally this has been done using passwords, but passwords are a security weakness. Humans just aren't good at remembering secure passwords. SSH allows key-based authentication, which can be much more secure. By default, this works by the user maintaining a file with authorized public keys in their home directory on the server.
At login time, if the user's client can prove to the server that the user has a private key corresponding to one of the keys listed, the user can log in without a password. Overall, this is a big win for security, except for the first time a user accesses the server. How can they add their public key to the list of authorized keys before they can log in?
For this, too, there are alternatives. An organization might centrally collect the public keys of their users. The server administrator could use this to maintain the list of authorized keys for each user. This solves the first-login situation. However, all of these alternative mechanisms need to be added on top of SSH; they also need to be developed and maintained.
In principle, a host's key never changes; in practice, it sometimes does. For example, a server might be replaced or reinstalled, or its domain name might be changed to point at a different machine. This happens often with virtual machines and in cloud environments. The server might also be a board used for embedded-system development, where the operating system gets replaced from time to time.
Likewise, a user may need to get a new key for various reasons. They might accidentally post their private key on social media, or just lose the file and not have backups. If nothing else, attacks on cryptography only get stronger and, over time, a strong key will become weaker, resulting in a need to generate a new key, with newer, stronger cryptography.
Certificates
OpenSSH supports an "SSH certificate", which includes an SSH public key along with some metadata related to the key, all cryptographically signed by a trusted party. The certificate is used instead of the SSH public key when connecting, though the corresponding private key is still needed in order to complete the connection; the included signature allows the key to be trusted without manual checking. The trusted party is typically the server system administrator, who controls access to their servers. They create an SSH key dedicated for use as a certificate authority (or CA), and use that key to create host and user certificates. A CA key is just a plain SSH key, created with ssh-keygen, like any other SSH key. The CA key is used to sign (ssh-keygen -s) host and user keys, creating certificates.
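For example (the file names, key IDs, and validity periods below are arbitrary illustrations, not requirements), a CA key could be created and then used to sign a host key and a user key as follows:

    $ ssh-keygen -t ed25519 -f ssh_ca -C "example SSH CA"
    $ ssh-keygen -s ssh_ca -I "certificate for host stamina" -h -n stamina \
          -V +12w /etc/ssh/ssh_host_ed25519_key.pub
    $ ssh-keygen -s ssh_ca -I "certificate for user alice" -n alice \
          -V +1d ~/.ssh/id_ed25519.pub

The -h flag marks the result as a host certificate; without it, a user certificate is produced. In each case, ssh-keygen writes the certificate next to the public key with a "-cert.pub" suffix.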
The system administrator configures their servers to provide host certificates and to trust user certificates made by the CA, while users configure their clients to trust host certificates signed by the CA. This allows a user to trust a host's identity without having to manually verify the host key. It also allows a server to authenticate a user without passwords, or without the user's public key being in a list of trusted keys for that user. Almost.
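Concretely (the paths below are examples), the server side points sshd at its host certificate and at the CA public key to trust for user certificates, while each client adds a @cert-authority entry for the CA to its known_hosts file:

    # On the server, in /etc/ssh/sshd_config:
    HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub
    TrustedUserCAKeys /etc/ssh/ssh_ca.pub

    # On each client, a single line in ~/.ssh/known_hosts:
    @cert-authority * ssh-ed25519 AAAAC3NzaC1lZDI1NTE5... example SSH CA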
There is still one step where some trust needs to be established in a way that isn't automated. How does a CA get a user's public key and how does the user get the CA's public key in a secure way? There's no generic, scalable solution for this, but the problem is somewhat simplified because it only needs to be done rarely.
A solution for this initial exchange of public keys will build on existing trust relationships. Perhaps the parties have OpenPGP keys and can use the Web of Trust for this? Maybe they both have access to the same organizational intranet? It might even be enough to rely on TLS and fetch the public keys over HTTPS.
Note that the SSH CA mechanism does not require or support a network of globally trusted certificate authorities the way that TLS does. This means it can be used without paying, or at least involving, a third party, but also means that obtaining the CA key is a manual process. An SSH client does not come with a list of implicitly trusted SSH certificate authorities.
The SSH certificate doesn't work on its own, just like a public key doesn't: the party presenting a certificate must prove (using cryptography) that they have the corresponding private key. The metadata in a certificate makes certificates more useful than plain public keys, however. An SSH public key does not specify which host or user it belongs to, for example, whereas a certificate has a list of "principals" that it vouches for. A principal is a host or user name. The metadata in a certificate can be examined with ssh-keygen -L.
    Type: ssh-ed25519-cert-v01@openssh.com host certificate
    Public key: ED25519-CERT SHA256:V/jJ9f+0sxWuK4Kmjr9x26OUmrQV+pPE8PmnKnFL650
    Signing CA: ED25519 SHA256:6qIHXzaIX8CJ0jO/Iv0rvhZIlwaWxmOVItuD+973HUI (using ssh-ed25519)
    Key ID: "certificate for host stamina"
    Serial: 0
    Valid: from 2022-11-03T10:02:00 to 2023-02-01T10:03:26
    Principals: stamina
    Critical Options: (none)
    Extensions: (none)
Certificates can have a validity period, which allows them to automatically expire without having to be explicitly revoked. OpenSSH does support revocations, but revocation information can be difficult to communicate to all parties in a fast, reliable way. A common way to sidestep the need for revocation lists is to use short-lived certificates (this is what Let's Encrypt does for TLS certificates). For example, a continuous-integration (CI) worker process might be given a certificate for deployment that only lives for 60 seconds. A student at a university might get one that is only valid for a semester, while an employee might get one that is only valid during working hours of that specific day.
Short-lived certificates are only useful, of course, if a user can get a new one when they need it. For example, a company might store the public keys of all its employees in a secure database and then automatically issue a new one-day certificate to employees every morning. The employees can even retrieve it from an unauthenticated web server. Certificates don't contain any secrets, and they only work with specific private keys, so the transport doesn't necessarily need to be protected strongly.
User certificates can also be tied to specific IP addresses or restrict the commands that the user can invoke. This further limits the damage an attacker can cause if they can steal someone's private key.
Certificate management
Managing the SSH CA infrastructure can be done using only the tools provided by OpenSSH, particularly ssh-keygen. The man page (and the certificates section in particular) has all of the necessary information, though in quite a dense form. There are various HOWTO documents around the Internet to explain things in more accessible ways; I maintain a HOWTO for SSH CA and the OpenSSH Cookbook has a section on SSH CA too.
The SSH CA mechanism first appeared in OpenSSH version 5.4 in 2010, so it has been around for quite some time at this point.
An organization (or even just a solitary system administrator) with many hosts or many users may want to create automation around ssh-keygen. This will allow securely managing a collection of host and user public keys, as well as CA private keys, in order to conveniently and quickly issue certificates on demand. SSH certificates can then be integrated into configuration management with ease. I am in the process of developing such tooling, creatively called sshca, and would appreciate feedback.
Conclusion
SSH certificates are a way to make the use of SSH more convenient and more secure at the same time. The technology is widely available, and implementing it for an organization is straightforward, at least from an SSH point of view. For anyone who is struggling with the key-management aspects of SSH, it is a mechanism that is worth investigating.
Two performance-oriented patches: epoll and NUMA balancing
The search for better performance from the kernel never ends. Recently there has been a stream of smaller patches that promise incremental performance gains, at least for some types of applications. Read on for an overview of two of those patches, which make changes to the epoll system calls and to NUMA balancing. This work shows where developers are looking for performance improvements — and that not everybody measures performance the same way.
An epoll optimization
The epoll family of system calls is aimed at event-loop applications that manage large numbers of file descriptors. Unlike poll() or select(), epoll allows the per-file-descriptor setup to be performed once (with epoll_ctl()); the result can then be used with multiple epoll_wait() calls to poll for new events. That reduces the overall polling overhead significantly, especially as the number of file descriptors being watched grows. The epoll system calls add some complexity, but for applications where per-event performance matters, it is worth the trouble.
Normally, epoll_wait() will block the calling process until at least one of the polled file descriptors is ready for I/O. There is a timeout parameter, though, that can be used to limit the time the application will remain blocked. What is lacking, however, is a way to specify a minimum time before the epoll_wait() call returns. That may not be surprising; as a general rule, nobody wants to increase an application's latency unnecessarily, so epoll_wait() is designed to return quickly when there is something to be done.
Even so, a patch set from Jens Axboe adds just such a minimum wait time. It creates a new epoll_ctl() operation, EPOLL_CTL_MIN_WAIT, to specify the shortest time that subsequent epoll_wait() calls should block before returning to user space. The reasoning behind this seemingly counterintuitive capability is to increase the number of events that can be returned by each epoll_wait() call. Even with much of the setup work taken out, each system call still has a cost. In situations where numerous events can be expected to arrive within a short time period, it can make sense to wait for a few of them to show up and pay the system-call cost only once.
In other words, an application may want to trade off a bit of latency for better throughput overall. This is seemingly a common use case; as Axboe put it:
For medium workload efficiencies, some production workloads inject artificial timers or sleeps before calling epoll_wait() to get better batching and higher efficiencies. While this does help, it's not as efficient as it could be.
Using this feature, he said, can reduce CPU usage by 6-7%. Axboe is seeking input on the API; specifically, whether the minimum timeout should be set once with epoll_ctl(), or whether it should instead be provided with each epoll_wait() call. This would be a good time for potential users to make their preferences known.
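For reference, the code in question has the shape of the minimal event loop below (error handling omitted); the comment inside the loop marks where some workloads insert an artificial sleep today, which the proposed EPOLL_CTL_MIN_WAIT operation is meant to move into the kernel. Exactly how the minimum wait would be expressed is the open API question, so it is not shown here:

    /* Minimal epoll event loop; error handling omitted for brevity. */
    #include <sys/epoll.h>

    void event_loop(int listen_fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        struct epoll_event events[64];
        int epfd = epoll_create1(0);

        /* Per-descriptor setup is done once... */
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            /* ...some workloads insert an artificial delay here (e.g. a
             * short usleep()) to let events accumulate before paying the
             * system-call cost; the proposed EPOLL_CTL_MIN_WAIT operation
             * would move that batching into the kernel instead. */
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++) {
                /* handle events[i] */
            }
        }
    }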
Control over NUMA balancing
Non-uniform memory access (NUMA) systems are characterized by variable RAM access times; memory that is attached to the node on which a given thread is running will be faster for that thread to access than memory on other nodes in the system. On such systems, applications will thus perform better if their memory is located on the nodes they are running on. To make that happen, the kernel performs NUMA balancing — moving pages within the system so that they are resident on the nodes where they are actually being used.
NUMA balancing, when done correctly, improves the throughput of the system by increasing memory speeds. But NUMA balancing can also cause short-term latency spikes for applications, especially if they incur page faults while the kernel is migrating pages across nodes. The culprit here, as is often the case, is contention for the process's mmap_lock. For some latency-sensitive applications, that can be a problem; there are, seemingly, applications where it is better to pay the cost of suboptimal memory placement to avoid being stalled during NUMA balancing.
For such applications, Gang Li has posted a patch set adding a new prctl() operation, PR_NUMA_BALANCING, that can control whether NUMA balancing is performed for the calling process. If that process disables NUMA balancing, pages will be left where they are even at the cost of longer access times. Benchmark results included in the cover letter show that the performance effects of disabling NUMA balancing vary considerably depending on the workload. This feature will not be useful for many applications, but there are seemingly some that will benefit.
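A sketch of how an application might use this follows; note that PR_NUMA_BALANCING is defined only by the patch set, not by mainline kernels, and the zero-means-disable argument convention shown here is an assumption for illustration:

    /* Sketch of using the proposed PR_NUMA_BALANCING prctl() operation to
     * disable NUMA balancing for the calling process.  PR_NUMA_BALANCING
     * comes from the (unmerged) patch set's headers, and the 0-means-
     * disable convention below is an assumption for illustration only. */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>   /* PR_NUMA_BALANCING, with the patch applied */

    int main(void)
    {
        /* Trade optimal memory placement for predictable latency by asking
         * the kernel not to migrate this process's pages between nodes. */
        if (prctl(PR_NUMA_BALANCING, 0 /* assumed: disable */, 0, 0, 0) != 0)
            perror("prctl(PR_NUMA_BALANCING)");
        return 0;
    }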
The kernel development community tries hard to minimize the number of tuning knobs that it adds to the kernel. Each of those knobs is a maintenance burden for the community but, more importantly, tuning knobs are a burden for application developers and system administrators as well. It can be difficult for those users to even discover all of the parameters that are available, much less set them for optimal performance. It is better for the kernel to tune itself for the best results whenever possible.
Patches like the above show that this self-tuning is not always possible, at least in the current state of the art. Achieving the best performance for all applications gets harder when different applications need to optimize different metrics. Thus, one of these patches allows developers to prioritize throughput over latency, while the other does the opposite. This diversity of requirements seemingly ensures that anybody wanting to get that last bit of performance out of their application will continue to need to play with tuning knobs.
Better CPU selection for timer expiration
On the surface, the kernel's internal timer mechanism would not appear to have changed much in a long time; the core API looks quite similar to the one present in the 1.0 release. Underneath the API, naturally, quite a bit of complexity has been added over the years. The implementation of this API looks to become even more complex — but faster — if and when this patch set from Anna-Maria Behnsen finds its way into the mainline.
Premature optimization
For context, it is worth remembering that the kernel actually has two core APIs for the management of internal timers. High-resolution timers (hrtimers) are, as their name would suggest, used for near-future events where precision is important; they are relatively expensive and only used when necessary. For everything else, there is the subsystem just known as "kernel timers" (or, sometimes, "timer-wheel timers"), centered around functions like add_timer(). Behnsen's patch set changes how these ordinary kernel timers work.
Arguably, most kernel timers exist to ensure that the kernel will respond properly if an expected event fails to happen. A driver may start an I/O operation on a device, confident in the knowledge that the device will raise an interrupt when the operation completes, but it will still set a timer so that, if the device goes out to lunch, the operation doesn't just hang indefinitely. Other parts of the kernel, such as networking, use timers in similar ways.
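In code, that pattern typically looks like the sketch below: a hypothetical driver using the ordinary kernel timer API, where the timer is armed when an operation starts and, in the common case, canceled before it ever fires:

    /* Sketch of the common "timeout that almost never expires" pattern,
     * using the ordinary kernel timer API in a hypothetical driver. */
    #include <linux/timer.h>
    #include <linux/jiffies.h>
    #include <linux/printk.h>

    struct my_device {
        struct timer_list timeout;
        /* ... */
    };

    static void my_timeout_handler(struct timer_list *t)
    {
        struct my_device *dev = from_timer(dev, t, timeout);

        /* Rare path: the expected interrupt never arrived; recover. */
        pr_warn("device %p: operation timed out\n", dev);
    }

    static void my_start_operation(struct my_device *dev)
    {
        /* Arm (or re-arm) the timeout before starting the I/O. */
        mod_timer(&dev->timeout, jiffies + 2 * HZ);
        /* ... submit the operation to the hardware ... */
    }

    static void my_irq_completion(struct my_device *dev)
    {
        /* Common path: the operation completed, so the timer is simply
         * canceled without ever expiring. */
        del_timer(&dev->timeout);
    }

    /* At device setup: timer_setup(&dev->timeout, my_timeout_handler, 0); */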
There are a couple of interesting implications that arise from this usage pattern. One is that such timers need not expire at exactly their nominal expiration time; if the kernel takes a little while to get around to handling an expired timer, nothing bad happens. That allows the kernel to batch timer expirations and to defer them to avoid waking an otherwise idle CPU. The implication that is relevant here, though, is that kernel timers rarely expire. When things are operating normally, the expected events will occur and the timer will either be canceled or reset to a new time further in the future. As a result, the timer subsystem should be optimized for the creation and cancellation of timer events. And, indeed, much effort has gone into that sort of optimization, as can be seen in LWN coverage from as far back as 2004 up to a significant reimplementation of timers in 2015.
Behnsen has identified a place where further optimization can be performed, though. When a timer is set in current kernels, the timer subsystem spends some CPU time trying to decide which CPU should handle the expiration of that timer. The intent is to push that work to a CPU that is already busy rather than waking an idle CPU just to expire a timer. So the timer subsystem scans through the system looking for a suitable CPU and adds the timer to the appropriate queue.
There are a couple of problems with this algorithm, though. One is that a CPU that is busy now may no longer be busy when the timer expires. So the choice of a CPU to handle expiration, if made when the timer is set, is really just a guess. Perhaps worse, though, is that this work is being done at the wrong time; since most timers never expire, any effort that is put into picking the expiration CPU ahead of time is likely to be wasted, even if the guess turns out to be a good one. It would be far better to not do any extra work when a timer is set, and have a CPU that is actually busy at expiration time take care of it then — on the relatively rare occasion when a timer actually expires.
The first part — not picking an expiration CPU at setting time — is easy to implement; a timer is just put into the queue of the CPU that is setting it. Having a suitable CPU actually handle expiration is harder, though. A naive implementation might just create a simple queue of timers that a CPU would check occasionally if it's running and able to handle expirations. That would create a great deal of locking contention and cache-line bouncing, though, slowing things down even when there were no timers to handle. So something more complex is called for.
Choosing the expiration CPU
The scheme chosen is to organize the system's CPUs into a hierarchy that resembles the hardware topology, but which is independent from it. At the lowest level, CPUs are assembled into groups of up to eight, with the constraint that all eight must be contained within the same NUMA node. The groups are, themselves, organized into groups; this process continues until all CPUs in the system have been arranged into a single tree.
Consider, for example, a simple, four-CPU system organized into two NUMA nodes: CPUs 1 and 2 in the first node, and CPUs 3 and 4 in the second. The first two CPUs are organized into Group 1; the other two, since they are in a different NUMA node, go into a separate group, Group 2. Those two groups, in turn, are placed together in Group 3. A larger and more complex system might require more levels of group hierarchy.
The timer API allows timers to be pinned to specific CPUs; that does not change in the reimplementation. Each CPU will have to handle expiration for its pinned timers, even if that means waking from an idle state. Most timers, though, can be executed anywhere in the system and are not pinned; the handling of these "global" timers will be different from before. A CPU that is busy will continue to handle global timers that are in its specific queue but, if that CPU goes idle, it will instead add its soonest-expiring global timer to the queue associated with its group.
Normally, each CPU group will designate one of its members as the "migrator". That CPU, which cannot be idle, will occasionally check the queue associated with its group for expiring global timers; if the CPU that queued a timer there is still idle, then the migrator will pull it over and handle the expiration instead, and the CPU that initially queued the timer can remain idle. So, for example, if CPU 1 in the example above is idle, it will have enqueued its earliest-expiring global timer in Group 1; if CPU 2 is running (and is thus the migrator), it will handle that timer when it expires.
If the migrator goes idle, then another CPU in the group has to be handed the baton and become the new migrator; that is just a matter of finding the next busy CPU in that group. If, instead, all of the other CPUs are also idle, then the group ends up without a migrator. In this case, the group is marked as idle in the next higher group in the hierarchy, and its first-expiring timer is queued at that next level. So, if CPU 2 also goes idle, it will take the earliest-expiring event in Group 1 and put it into the queue at Group 3.
The assignment of the migrator role happens in the higher-level groups as well. If a group contains other groups, one of those groups will be the migrator for that level. In the scenario described here, Group 2 will be the migrator for Group 3. The CPU running as the migrator within Group 2 (either CPU 3 or CPU 4) will thus have to handle timer events for Group 3 as well. In a system with a lot of idle CPUs, this migrator role can propagate all the way to the top of the hierarchy, at which point one CPU may be responsible for handling all expiring timers in the system.
If that CPU also goes idle, the system will be left without any migrator CPU at all. At that point, the last CPU standing will set a hardware timer to wake it at the expiration time for the earliest expiring timer. That ensures that timer events don't get dropped even in the absence of busy CPUs to handle them.
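As a purely conceptual illustration (the actual patch set's data structures differ), each node in that hierarchy can be pictured as a group that records its members, its parent, which member currently holds the migrator role, and the earliest timer queued there by idle members:

    /* Conceptual sketch of the timer-migration hierarchy described above;
     * this illustrates the idea and is not the patch set's actual data
     * structure. */
    #define MAX_GROUP_MEMBERS 8   /* groups hold at most eight members */

    struct tmigr_group_sketch {
        struct tmigr_group_sketch *parent;   /* next level up; NULL at the top */
        void *members[MAX_GROUP_MEMBERS];    /* CPUs, or child groups */
        int   migrator;                      /* index of the active member that
                                                handles expiries for idle ones;
                                                -1 if every member is idle */
        unsigned long first_expiry;          /* earliest timer queued here by
                                                idle members; propagated to the
                                                parent when migrator == -1 */
    };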
Implementing this machinery in the timer subsystem results in a patch set adding nearly 2,000 lines of code to a core kernel subsystem. The benefit that comes from this work is said to be an approximately 25% reduction in the time required to add and remove a timer. Since timers can be set (and changed) within performance-sensitive code, this improvement likely justifies the added complexity.
A side benefit of this work is that it should enable the removal of the deferrable timers mechanism. Deferrable timers are those for which expiration can be deferred without any ill effect; a CPU that is going idle will not wake up solely to handle a deferrable timer. Turning deferrable timers into global timers will have the same effect — they will no longer cause a sleeping CPU to wake — so there is no longer a need to handle them separately. The removal of deferrable timers, which is, according to Behnsen, coming soon, will counterbalance some of the complexity added by this work.
A report from the 2022 Image-Based Linux Summit
The first Image-Based Linux Summit was held in Berlin on October 5 and 6, 2022. The main goal of this summit was to agree on common concepts and tooling for how to build, deploy, and run modern, secure, image-based Linux distributions — a project that the organizers, Christian Brauner, Luca Boccassi, and Lennart Poettering, have been working on for some time. The result was a more refined vision of how Linux systems can be built and deployed securely.
One of the motivations for the summit was the simple fact that much of the wider ecosystem has been thinking about the same set of problems. For example, our employer, Microsoft, has made use of a lot of the concepts covered by the summit in the recently announced ARM64-based Azure offload SoC, which is running a custom, security-hardened Linux distribution. While we were thinking, tinkering, and writing about new ways to improve the current state of the art, it became obvious to us that many vendors are working, more or less, in the same space, doing similar work with varying degrees of overlap; however, little to no collaboration was happening. The summit was meant to identify and agree on common concepts and come up with a set of initial specifications, some of which already have reference implementations.
So we invited technical representatives from the engineering groups of various vendors and distributions that have been known to work on related topics. The summit was intentionally kept small, as it was meant to be a series of conversations and brainstorming sessions, with no fixed agenda or presentations — a BoF-style event. The 30 participants met in the Microsoft office in Berlin and discussed a range of topics from a list that the authors and participants had put together in advance. The topics covered were focused around the idea of shipping Linux via images and with enhanced security features.
Participants were affiliated with numerous companies or projects, including Canonical, Ubuntu Core, Debian, Gnome OS, Fedora CoreOS, Red Hat, Endless OS, Arch Linux, openSUSE, Flatcar, Microsoft, Amazon/AWS, Meta, System Transparency, systemd, image-builder/osbuild, mkosi, and rpm-ostree. As a result of this summit, the Linux Userspace API ("UAPI") Group was founded; it is a community for people with an interest in innovating how we build, deploy, and run modern Linux operating systems. It serves as a central gathering place for specs, documentation, and ideas. The associated uapi-group.org website contains the meeting minutes from the summit and links to current specifications and ideas.
Over the next couple of months, the group hopes to create various specifications in this repository in the form of technical deep dives. The first one, centered around Unified Kernel Images (UKIs), a concept introduced below, is already available on Lennart Poettering's blog.
Building blocks for images
Several concepts (and, of course, acronyms) that are supported by the systemd project were at the center of many discussions:
- A UKI (Unified Kernel Image) is a secure-boot-signed UEFI executable file that wraps a kernel image, an initrd image, a kernel command line, and more. A UKI can boot on EFI and integrates nicely with systemd-boot.
- Discoverable Disk Images (DDIs) are self-described filesystem images, heavily inspired by Canonical's Snaps, that have been enhanced to follow the Discoverable Partitions Specification (DPS). DDIs are wrapped in a GPT partition table that may contain root (or /usr/) filesystems, system extensions, system configurations, portable services, containers, and more, all of which are protected by dm-verity and combined into one image.
- Credentials pass secure bundles of data across components (hypervisor, firmware, system manager, container manager) and into system services.
- A system extension (sysext) is a DDI that can be overlaid on top of /usr/ (or /opt/) in a secure and atomic manner using read-only OverlayFS. A sysext can be used to extend a base filesystem to, for example, allow modular but pre-built initrds.
- A system configuration (syscfg) is a DDI that can be overlaid on top of /etc/ and provide a way to securely extend the configuration files of a system. This idea is still being developed, and will support the initrd, rootfs, system services, portable services, or nspawn images.
These concepts have all been supported by systemd and its collection of tools for some time, with the exception of syscfg, which is being developed now. They all support signed dm-verity for online and offline, kernel-enforced, integrity protection.
A proposal to use sysexts in Fedora 39's initrd logic is planned; it should close a major gap in the security story of Linux. So far, initrds have always been built locally and used in an unprotected manner, neither signed nor measured; thus they lack any verification and are at the mercy of any attacker who gains (online or offline) write access to the local disk. By switching to UKIs, we can provide a base, shared, vendor-built initrd containing common components, plus a series of sysext DDIs that are added when needed to support less commonly used hardware or storage subsystems such as iSCSI or NFS. A specification for UKIs is available. For a deeper look, readers should refer to the aforementioned blog post.
Configuration, building, and deployment
We discussed how to handle local configuration at length. Various distributions, such as openSUSE, are pushing developers to use libeconf, which follows the same configuration scheme used by systemd. With this mechanism, the vendor's default configuration resides in /usr/, the ephemeral override is in /run/, and the persistent override is in /etc/. Others, like those based on OSTree, ship defaults in /usr/etc/ and then copy them over to /etc/ on instantiation. At Microsoft, we are working on a different approach in the form of syscfg, which is similar to sysext, but for configuration. The same principles are followed: configuration overlays are protected with dm-verity, signed, stackable, and applicable to the whole system or individual services/containers.
A large variety of options came up when discussing how to build images. Every single distribution has its own build system (as expected). The systemd project provides the systemd-repart tool, which understands DDIs and can run on each system (for initial partitioning or provisioning, or for factory reset), so there is some hope that it will be adopted for other uses too. The mkosi image-building tool is gaining some traction as it supports building layered sysexts and UKIs. But, by and large, every vendor is on its own with its own custom solution, and this situation is unlikely to change. One area where we hope to standardize is the production of a software bill of materials (SBOM), with some tentative agreement among the participants that the SPDX standard is the way to go. Some distributions, including SUSE and Flatcar, already go beyond this and provide full SLSA provenance.
After building an image, the next obvious step is getting it onto a system. Again, each vendor has its own methods here. The systemd project recently introduced systemd-sysupdate as a run-time tool to pull down images from a configurable source. It is likely that the server side will remain unique to each vendor, but the hope is that, at least, the local client could be shared. Systemd-sysupdate can integrate nicely with systemd-boot and the rest of the ecosystem. Also discussed was resurrecting casync and integrating it with systemd-sysupdate to replace that tool's current use of curl.
Updates and rollback
On the topic of upgrades, participants briefly discussed how to minimize the associated disruption. This is of great interest to Amazon and Microsoft, as image upgrades cause service interruptions. There are two competing approaches used in production. One is to use CRIU, which works best for single-process containers. The other is to use persistent memory and teach individual services how to make use of it, allowing restarting from fast memory or even keeping state intact across a kexec reboot. A standard user-space solution is desirable for the latter but is currently missing; some work is ongoing in that direction at Microsoft, but it is still early days. A suggestion was made to implement an "exitrd" in systemd that would be similar to a kexec, allowing the system to skip the hardware/firmware stage of a reboot and saving some time when only the operating-system DDI is updated and the kernel doesn't change. In this case, the system could simply shut down user space to an "exitrd", swap the mount point of the DDI for the new one, and start user space again. This would allow skipping the kernel phases of shutdown and booting.
Almost all of the participants have implemented a form of operating-system rollback or factory-reset mechanism. This is highly desirable when distributing an image-based Linux system; one of the advantages is being able to roll back to a known-working state. There are important differences among the distributions though. Those using Btrfs (SUSE) rely on snapshots, while Ubuntu Core provides a more traditional recovery system that is stored in the EFI system partition.
But factory reset can mean two different things: either to restore the operating system to its original version, or to restore the entire system to the factory state, wiping all local data. Tools provided by systemd have long supported the latter, and will be enhanced to let the factory reset be triggered by a UEFI variable and/or a special UUID being set on the OS partition. A new target unit will be introduced as a synchronization point, so that any custom action that needs to happen on factory reset can be pulled in automatically. Image builders can then choose to provide a boot menu entry to let users trigger this functionality.
Other vendors, including Android, are using the well-known A/B pattern for rollbacks and updates. One of the next action items is to implement secure rollback protection in systemd components such as systemd-cryptsetup using TPM counters, as this is more scalable than relying on denylists that eventually become too big to be manageable. The response to the BootHole vulnerability alone, for example, used about one-third of the revocation space available on UEFI systems; see this page for details.
One of the areas where there was immediate agreement was boot assessment; when an image-based system uses an A/B scheme, a way to signal when a boot is "good" or "bad" is needed. The "boot-complete target" will fulfill this purpose. It is already available, and it sounded like more distributions will start using it; services doing any kind of local assessment will be able to use it as the synchronization point. Systemd will integrate this mechanism with a timer, so that if a good state is not reached within the configured time limit, a rollback and reboot will be triggered. Additionally, update success or failure can be reported when using advanced update protocols such as Omaha, for instance via the open-source Nebraska server.
Security
Enabling TPM-based security by default was also the subject of a lot of discussions. Right now, such security on generic Linux is pretty much opt-in. By switching to UKIs with embedded and signed PCR policies, we can make the TPM measurements predictable. This means that automatically enabling TPM-backed disk encryption by default becomes possible because there will no longer be a need to re-seal secrets every time a component changes. A document on signed TPM PCR policies will be available soon. It was agreed to set up a "registry" for PCRs, with the intention of helping developers and vendors avoid conflicts and overlaps.
There is a lot of interest in remote attestation in the context of confidential computing, and there are various solutions and competing standards. On a local node, some solutions rely on the IMA log, some on the TPM log, and some on completely custom registries. There is a lot of ongoing change at this stage and, while the work on image-based systems is largely a prerequisite for remote attestation, no clear action item or feature request came out of this discussion.
Conclusions and future work
The projects represented at the summit can be divided into three camps: image-based deployments, OSTree-based deployments, and Btrfs-based deployments. While it is natural that there was no complete overlap among all of the participating projects, there was consensus that, on a significant number of topics, collaboration and standardization are indeed possible.
Finally, given that participants felt that the summit was productive and useful, it was agreed to meet again in the future, perhaps co-located with another conference to facilitate travel. While there is no perfect agreement between all vendors on all topics, there are enough similarities that we came out of this with a series of action items and a plan. The UAPI Group, with all participants as members, will let us collect and collaborate on documentation and specifications. The initial specifications for DDIs, DPS, and a PCR registry are already in place and published on the UAPI website, with more to come.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: SystemTap 4.8; Rust 1.65; Texinfo 7; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.