
LWN.net Weekly Edition for November 19, 2020

Welcome to the LWN.net Weekly Edition for November 19, 2020

This edition contains the following feature content:

  • OpenWrt and self-signed certificates: bootstrapping trustworthy HTTPS access to a router's administrative interface.
  • A realtime developer's checklist: John Ogness's practical recipes for realtime application developers.
  • iproute2 and libbpf: vendoring on the small scale: should iproute2 bundle its own copy of the libbpf library?
  • Systemd catches up with bind events: a 2017 kernel change finally gets a proper fix in udev.
  • Changed-block tracking and differential backups in QEMU: the block-layer work behind efficient differential backups.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

OpenWrt and self-signed certificates

By Jake Edge
November 18, 2020

The move to secure most or all of web traffic using HTTPS is generally a good thing; lots of personal information is exchanged via web browsers, after all. Using HTTPS requires web sites to have TLS certificates, however, which has sometimes been an impediment, though Let's Encrypt has generally solved that problem for many. But there are systems out there that may need the HTTPS protection before their owners even have a chance to procure a certificate, IoT devices and home routers, for example. An October discussion among OpenWrt developers explored this problem a bit.

OpenWrt is a distribution that targets wireless routers for the home, and it provides a configuration interface, called LuCI, as a web application running on the device. Users can connect to the device over unencrypted HTTP, but that may be problematic in certain environments. By default, LuCI does not listen on the internet-facing side of the router, but is available via both wired and wireless access on the local network, though the wireless network is not enabled by default for OpenWrt. Since the router's authentication credentials could potentially be intercepted by malicious actors, many users will want to enable LuCI only over HTTPS.

The project has suggestions for securing access to LuCI, including enabling HTTPS-only access. But LuCI comes with a self-signed TLS certificate, so users will have to click through a browser security warning every time they access LuCI, which is not a great user experience. There are instructions for creating a new self-signed certificate and installing it on the device and on the client side so that the warnings are silenced. That mechanism has a number of downsides, not least that the new certificate needs to be installed on every system that will be used to access LuCI.

In theory, getting a certificate from Let's Encrypt would solve many of the problems, but that solution is not without hurdles either. For one thing, Let's Encrypt uses the Automated Certificate Management Environment (ACME) protocol, which requires that the system requesting the certificate be connected to the internet. Beyond that, a device will need a domain name that can be used by the issuing server to connect to it; obviously, "luci.openwrt" and similar sorts of names that are currently being used will not work for that purpose.

On the openwrt-devel mailing list, "abnoeh" proposed a somewhat complicated scheme that would provide a means for OpenWrt router devices to get working certificates. A "technically constrained subordinate CA" would be created that could sign certificates for sub-domains of "luci.openwrt.org", though it is not entirely clear which existing CA would be willing to sign subordinate CA keys—Let's Encrypt and DigiCert are mentioned as possibilities. The proposal calls the new CA the "OpenWrt CA".

A router that needs a certificate contacts the OpenWrt CA over an encrypted channel and sends its SSH host key; the CA hashes the key and reserves the "SSH-hash.luci.openwrt.org" host name for the device. The OpenWrt CA sends back a nonce to the router, which creates a certificate signing request (CSR), hashes it, and then signs the hash, nonce, and a timestamp using its SSH host key. The CSR and signed message are sent back to the OpenWrt CA, which verifies the information and signature, signs the certificate, and sends it back to the device. At that point, the router has a valid certificate; it can intercept connections to SSH-hash.luci.openwrt.org, route them to itself, and LuCI over HTTPS can be used without browser warnings, though the host name is perhaps not the friendliest.
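
The proposal does not specify which hash would be used; assuming SHA-256, a minimal sketch of the name-derivation step (with a placeholder key, built against OpenSSL) might look like this:

    /* Sketch: derive the reserved host name from an SSH host key.
       Build with: cc hashname.c -lcrypto */
    #include <stdio.h>
    #include <openssl/sha.h>

    int main(void)
    {
        /* placeholder; a real router would read its public host key
           from the SSH daemon's configuration */
        static const unsigned char host_key[] = "ssh-ed25519 AAAA...";
        unsigned char digest[SHA256_DIGEST_LENGTH];

        SHA256(host_key, sizeof(host_key) - 1, digest);

        /* print the name the OpenWrt CA would reserve */
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf(".luci.openwrt.org\n");
        return 0;
    }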

There are still some problems to be solved, of course, and routers that are not already connected to the internet will be unable to participate. In addition, abnoeh noted that the project would need to run the OpenWrt CA server and figure out a way to reasonably rate-limit requests from clients. There is also a question of what the openwrt.org DNS servers should return when SSH-hash.luci.openwrt.org requests are made.

Michael Richardson wondered if the project could find a CA that would be willing to set up the subordinate CA and whether the CA/Browser Forum (CABForum) rules will allow what is needed. If the subordinate CA violates those rules, the browser makers would remove the parent CA from the browsers unless it revoked the subordinate, which would be a horrendous mess. But abnoeh thinks those rules do not prevent the usage suggested.

Abnoeh did ask if it made sense to dispense with HTTPS entirely, though, and to wrap LuCI with SSH instead. Alberto Bursi suggested considering mobile apps on Android and iOS as a way of sidestepping the HTTPS certificate problems, but Sam Kuper cautioned that an app wouldn't necessarily solve the problems, even though it seems like an attractive approach. There are some users who lack a device to run the app, for one thing, but the app will also have to deal with certificate issues in one form or another. "Ultimate[ly], SSL/TLS on IoT is a hard problem: the two technologies are currently not *fully* mutually compatible without imposing some burden on the user."

Some were not sure of the need for CA-issued certificates, however, at least by default. Fernando Frediani said that most OpenWrt users would not find it strange to have to click through the certificate warning and, for the most part, LuCI is only accessed from trusted networks. But Richardson cautioned that those assumptions may lead to unexpected kinds of problems:

As a result, most devices are [accessible] only from internal networks, and usually never exposed to the Internet. Default passwords remain unchanged, and malware infected a vulnerable PC easily attacks the OpenWRT LuCI interface.

Bursi said that OpenWrt users already have to jump through a fair number of hoops to even get it running on their devices, so it is probably a mistake to think that they will not change the root password; by default, OpenWrt does not enable WiFi, for example, which is probably not what these users are looking for:

Then after they did that they decide to leave the device as-is with default config with LAN/Wan routing and no wifi, which is in most cases plain worse than what the stock firmware offers?

Many in the thread thought that abnoeh's idea was worth pursuing to make it easier for users who do want a real certificate, but Bas Mevissen suggested that the proposal might be "making things more difficult than they need to be". Owners should be doing the initial install on a trusted LAN and configuring the device before connecting it to the internet or allowing others to use it.

So I think it is reasonably safe to do the initial setup over HTTP (without the "S") at the first boot if there are no certificates available from a previous OpenWRT install. Then the user can setup the WAN side if needed and upload (from local PC), generate (self-signed) or acquire (e.g. Let's Encrypt) the certificates for Luci. After that, the connection is switched to HTTPS and HTTP switched off.

The main concern that Mevissen had was about how to transfer credentials to a new install. "Even though the user/admin should be alone on the connection, sending those unencrypted over the line is not desirable." Abnoeh suggested using a USB drive to transfer those kinds of credentials, at least for systems that have a USB port.

Richardson wondered about less technically savvy users, though. Mevissen said that OpenWrt should simply "aim to get the best possible security with the least effort (on user side) possible". Mevissen noted that other routers typically come with self-signed certificates for their administrative interfaces, so OpenWrt would not be out of step to do so; providing users with ways to replace that certificate is important, but the complexity of abnoeh's proposal does not seem needed.

OpenWrt is far from alone in having this problem; as noted, IoT devices have it as well. Other web-based administrative interfaces (e.g. Webmin, Cockpit) have similar certificate-bootstrap problems, though they may generally target folks who are more technical than your average home user. For good or ill, secure web access was tied to the CA system from the outset—with the self-signed option as an escape hatch. The CA model, however, does not work for all use cases by any means, and the escape hatch is slowly closing. It is not entirely clear what systems that want to provide secure HTTP access in restricted environments of various sorts (e.g. no domain name, no internet connection) can do—at least without compromising the existing HTTPS security story for other users, especially those of a less-technical bent.

Comments (55 posted)

A realtime developer's checklist

November 16, 2020

This article was contributed by Marta Rybczyńska


OSS EU

Realtime application development under Linux requires care to make sure that the critical realtime tasks do not suffer interference from other applications and the rest of the system. During the Embedded Linux Conference (ELC) 2020, John Ogness presented a checklist (slides [PDF]) for realtime developers, with practical recipes to follow. There are a lot of tools and features available for realtime developers, even on systems without the PREEMPT_RT patches applied.

What is realtime?

We want all applications to be correct and bug-free, Ogness began; in the realtime domain, correctness "means running at the correct time". The application must wake up within a bounded time limit when there is time-critical work to do. Ogness highlighted that, in realtime systems, the right timing of tasks is a requirement; things will go wrong if the constraints are not met. Developers need to define which tasks and applications are time-critical; he noted that a lot of people mistakenly think that all tasks in a realtime system are realtime, while most of them are not.

The good news for developers is that, under Linux, they can write realtime applications using only the POSIX API with the realtime extensions. The code will look familiar, and only three additional header files are required: sched.h (a member of the audience noted that the musl C library does not implement this one), time.h, and pthread.h.

There are three properties that a realtime operating system must have: deterministic scheduling behavior, interruptibility (the CPU is always running something, so there should be a way to interrupt a task), and a way to avoid priority inversion, which happens when a high-priority task must wait for a lower-priority one. The third property, which might be less familiar to non-realtime developers, was described with an example.

Consider three tasks (task1, task2, and task3 with high, medium, and low priority, respectively). Task3 is holding a lock when task1 comes along, takes the CPU and requests that lock. As task1 is high priority, the scheduler puts task3 back on the CPU so that it can finish its work and release the lock. Before that happens, though, task2 comes along. It has nothing to do with the lock, but it has a higher priority than task3. The scheduler then makes a good decision to give the CPU to task2, but this is indirectly blocking task1.

In this case we have a priority inversion between task1 and task2. A situation like that can come about easily in a complex system like Linux, Ogness noted. The way Linux handles priority inversion is to temporarily boost the priority of task3 to the priority level of task1. When task3 gives back the lock, it will be de-boosted and task1 can run with the lock.

Scheduling and affinity

Ogness started his realtime checklist with scheduling policies, which vary between the realtime and non-realtime domains. Non-realtime policies implement fixed time slices; if an application is running an infinite loop, it will still lose the CPU when its time slice runs out. On a realtime system, most tasks, including logging daemons, web servers, and databases, will still use non-realtime policies. "These processes should not be running as realtime", he said. In addition, the developer can configure those non-realtime tasks to limit how much CPU they can have using nice and control groups; for example, the web browser can be constrained to never take more than 20% of the available CPU time. He "strongly encourages" developers to evaluate all tasks, not only the realtime ones, for their resource requirements.

Realtime tasks, instead, normally run until they give up the CPU or are preempted by a higher-priority task. If there is a task with priority 30, it will run until a task with a higher priority comes in (like priority 31). For realtime tasks, one needs to be especially careful when writing the code, and to avoid infinite loops, which can render the system unusable.

The SCHED_FIFO realtime scheduling policy is what people typically use, Ogness said. Tasks running under this policy execute until they are blocked (waiting) or freely give up the CPU. Priorities range from one to 99 (highest). "Please never use 99", Ogness said, as kernel threads using this priority are "more important than your application". Another policy, SCHED_RR, is similar, but it uses time slices for tasks at the same priority level. The third scheduling policy is SCHED_DEADLINE, where the scheduler runs the task with the closest deadline. Ogness did not cover this policy, but noted that if it is used, SCHED_DEADLINE wins over the highest-priority tasks from other scheduling policies; mixing scheduling classes is "strange".

Ogness pointed out one important kernel feature with regard to realtime scheduling policies: by default, the scheduler limits the amount of CPU time allocated to all realtime tasks together to 95% of the available time per second. In the remaining 50ms of each second, no realtime task is allowed to run. This policy gives an administrator a chance to intervene if a realtime task goes out of control. This situation should be avoided, though, since it constitutes a sort of priority inversion. He noted that, if the kernel hits the limit, it prints a throttling message in the kernel log, but the realtime system is already broken at this point. This feature can be disabled by writing -1 into /proc/sys/kernel/sched_rt_runtime_us. The change is not permanent, so it needs to be added to boot scripts.
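
For example, to lift the limit until the next boot:

    # allow realtime tasks 100% of the CPU; reverts at reboot
    echo -1 > /proc/sys/kernel/sched_rt_runtime_us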

Developers can set priorities and scheduling policies using the chrt tool; the -p option sets the priority of a given task. The tool can also be used to start an application with a given priority. The same operations are possible in the code by using the sched_setscheduler() system call.
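
For example, chrt -f -p 30 <pid> switches an existing task to SCHED_FIFO priority 30; a minimal sketch of the same operation from within a program (the priority value is an arbitrary choice, and the call needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 30 };

        /* 0 means "the calling process" */
        if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
            perror("sched_setscheduler");
            return 1;
        }
        /* time-critical work runs here under SCHED_FIFO */
        return 0;
    }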

CPU affinity is another important element in a realtime system. It might be critical to isolate code onto some CPUs. Ogness gave an example of an eight-CPU system divided into six CPUs for non-realtime applications and two for realtime. In Linux, CPU affinity is defined for each task and takes the form of a bitmask with one bit set for each allowed CPU. In addition to user-space tasks, interrupts and kernel threads can also have their affinity set. This is important, as they may interfere with realtime code. Ogness noted that the internal architecture of a processor will influence the realtime configuration. If two CPUs are sharing L2 caches, then they should both be realtime, as cache sharing between realtime and non-realtime applications may have an effect on realtime latency.

The tool to set and query affinities is taskset, which can either start a task or modify an existing one. taskset also applies to threads. If run without a new mask, it will show what the current mask is. As with priorities, an associated system call exists: sched_setaffinity(). Ogness noted that it requires that _GNU_SOURCE be defined, since the sched_setaffinity() wrapper is a GNU C library (glibc) extension and not part of the POSIX API.
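
A minimal sketch of the system-call route, pinning the calling task to the two (assumed) realtime CPUs from the earlier eight-CPU example:

    #define _GNU_SOURCE     /* sched_setaffinity() is a glibc extension */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(6, &set);   /* allow only CPUs 6 and 7 */
        CPU_SET(7, &set);

        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }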

It is also possible to influence CPU affinity in a deeper way using the maxcpus and isolcpus boot parameters. maxcpus limits the number of CPUs the kernel can see. If set to four in an eight-CPU system, it means that the kernel will see only four processors and the other four CPUs can be used differently, such as by a bare-metal realtime application. On the other hand, isolcpus indicates to the kernel that it should not put kernel threads on the indicated processors. When using isolcpus, Linux is aware of all of the CPUs and can use the isolated CPUs when threads are explicitly set to run on them.
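
As an illustration with assumed CPU numbers, the two alternatives might appear on the kernel command line as:

    # alternative 1: kernel manages only CPUs 0-3; the rest are invisible
    maxcpus=4
    # alternative 2: kernel sees all CPUs but keeps its threads off 6 and 7
    isolcpus=6,7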

For interrupt affinities, the default settings for new interrupts can be viewed and changed in /proc/irq/default_smp_affinity. For an existing interrupt, intr, the file /proc/irq/intr/smp_affinity allows doing the same. Developers should be aware, Ogness said, that there might be hardware limitations when setting affinities; after setting an interrupt affinity they should consult /proc/irq/intr/effective_affinity to check what was set after taking all limitations into account.
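
For example, with a hypothetical interrupt number 42, steering it to the first four CPUs (the mask is in hexadecimal) and verifying the result would look like:

    echo f > /proc/irq/42/smp_affinity
    cat /proc/irq/42/effective_affinity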

Avoiding page faults

Memory management is "probably most important" in terms of latency, Ogness said. He explained that, when an application allocates memory, it is not really allocated in the kernel, but only marked in the page tables. The actual allocation takes place when that memory is first accessed and results in a page fault. Page faults are "quite expensive" as the kernel needs to find a physical page and assign it before the application can continue. Developers can verify this by observing that malloc() calls are fast, but the application slows down when it starts using the allocated memory for the first time. This applies not only to the heap, but to all memory, including code and the stack. Moving to the next stack page will cause a page fault just like the allocation of new memory.

Ogness described three steps to avoid page faults. The first is to tune glibc to use the heap only for new allocations. malloc() has two ways to allocate memory: the heap or mmap() for separate chunks. Realtime developers do not want separate chunks, as they may go back to the kernel after free(); then if they are reallocated they will fault again. The developers can direct glibc to use heap memory only by using mallopt():

    mallopt(M_MMAP_MAX, 0);

Another useful option is to disable the possibility that glibc gives unused memory back to the kernel:

    mallopt(M_TRIM_THRESHOLD, -1);

The second step is to lock down allocated pages so that they "never get back to the kernel". This can happen when the kernel reclaims memory or starts swapping. Embedded developers may think their system will not swap, but this is not true, Ogness explained. When low on memory, the kernel starts reclaiming any memory it can, including the text segments of applications. For realtime "this is horrible", because when the application runs again it will need to get the pages back from the disk. To lock all pages into memory, a realtime application should use mlockall():

    mlockall(MCL_CURRENT | MCL_FUTURE);

The third and final step is to perform pre-faulting, which means causing all page faults ahead of time and having all memory "ready to go". Heap pre-faulting is done by allocating all of the memory that might be needed, then writing a non-zero value to each page to make sure that the writes are not optimized away. Stack pre-faulting should be done in a similar way by creating and writing a big stack frame, something that developers have "learned never to do" in general.
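
A minimal sketch of both steps; the pool and stack sizes are application-specific guesses, not values from the talk:

    #include <stdlib.h>
    #include <unistd.h>

    #define POOL_SIZE  (8 * 1024 * 1024)   /* assumed worst-case heap use */
    #define STACK_SIZE (256 * 1024)        /* assumed worst-case stack depth */

    static void prefault_heap(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        char *pool = malloc(POOL_SIZE);

        if (pool == NULL)
            return;
        /* write a non-zero value into every page so each one is
           faulted in now rather than in the time-critical path */
        for (long i = 0; i < POOL_SIZE; i += page)
            pool[i] = 1;
        /* with M_TRIM_THRESHOLD set to -1, the memory stays with the
           process even if it is later free()d */
    }

    static void prefault_stack(void)
    {
        /* volatile keeps the compiler from optimizing the writes away */
        volatile char frame[STACK_SIZE];

        for (long i = 0; i < STACK_SIZE; i += 4096)
            frame[i] = 1;
    }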

Synchronization without priority inversion

In realtime systems, developers should use POSIX mutexes for locking, Ogness said. This is important because mutexes have owners, and only the owner can release the lock. This gives the kernel the information needed to boost a lock-holding task when a higher-priority task needs that lock. By default, however, priority inheritance is not activated (even with PREEMPT_RT, he pointed out), so the developers need to activate it to avoid priority inversion using:

    pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);

For signaling between threads, Ogness recommended condition variables, as they can be associated with mutexes. If an application needs to wait and then take a lock, pthread_cond_wait() is the tool for the job. Realtime developers should avoid signals because the environment that the signal handler runs in is hard to predict.
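
Filling in the boilerplate around the pthread_mutexattr_setprotocol() call shown above, a sketch of a priority-inheriting mutex paired with a condition variable might look like this (the data_ready flag is purely illustrative):

    #include <pthread.h>

    static pthread_mutex_t lock;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int data_ready;

    static void init_pi_lock(void)
    {
        pthread_mutexattr_t mattr;

        pthread_mutexattr_init(&mattr);
        /* without this, no priority boosting happens, even on PREEMPT_RT */
        pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &mattr);
        pthread_mutexattr_destroy(&mattr);
    }

    static void wait_for_work(void)
    {
        pthread_mutex_lock(&lock);
        while (!data_ready)
            pthread_cond_wait(&cond, &lock); /* drops and retakes the lock */
        data_ready = 0;
        pthread_mutex_unlock(&lock);
    }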

Clocks and measurements

"Use the monotonic clock", Ogness recommended. This is the clock that always moves forward, without regard to time zones and leap seconds. The absolute time it gives is ideal to calculate the time the task should sleep. The technique he presented consists of getting the current time once, at the beginning of the program, and then just incrementing the clock thereafter. This way the task will wake up in the right moment "even in 10 years", he said.

The last part of the talk covered tools that can be used to evaluate the realtime system. The first two tools come from the rt-tests package. cyclictest measures latencies at the given priority level. It repeatedly sleeps, then checks the clock to measure the difference between the expected and real wake-up times. This tool can prepare histograms with the resulting data. Developers will want to run it with a high load on the system to measure the latencies that the system may generate. The other tool, hackbench, can generate system load with packets, CPU load, and even out-of-memory events. A realtime system should run well even in those conditions. Ogness noted that developers should be doing stress tests for all components. For example, if the system uses Bluetooth, then it requires a Bluetooth test. He also reminded developers to include idle-mode testing, as the system latency might be different when going to low-power modes.

perf can also help developers, as it can show the number of page faults and cache misses. Ogness mentioned the kernel tracing infrastructure that can help to show not only what happens during an application's execution, but also when. A member of the audience asked if there is a way to calculate the worst response time of a task, and Ogness replied that the only solution is to test.

Ogness finished with a list of kernel configuration options to check. For example, CONFIG_PREEMPT_* activates more kernel realtime properties (more still if the PREEMPT_RT patches have been applied). CONFIG_LOCKUP_DETECTOR and CONFIG_DETECT_HUNG_TASK run at realtime priority 99, so they should be disabled if not explicitly needed. CONFIG_NO_HZ eliminates kernel clock ticks and can reduce power consumption, but it can also increase latency. He said that setting any particular option is not necessarily an error, but the options need to be analyzed and checked; there might be tradeoffs to weigh, for example, between power saving and realtime response.

At the end of the talk, he walked through the checklist again, highlighting the need to verify the realtime behavior. For the members of the audience who would like to learn more there is a realtime wiki. "Have fun with realtime Linux", he concluded.

Comments (19 posted)

iproute2 and libbpf: vendoring on the small scale

By Jonathan Corbet
November 12, 2020
LWN's recent article on Kubernetes in Debian discussed the challenges of packaging a massive project with hundreds of dependencies. Many of the issues that arose there, however, are not limited to such projects, as can be seen in the ongoing discussion about whether a copy of the relatively small libbpf library should be shipped with the iproute2 collection of networking tools. Fast-moving projects, it would seem, continue to feel limited by the restrictions imposed by the Linux distribution model.

Iproute2 is a collection of network-configuration tools found on almost any Linux system; it includes utilities like arpd, ip (the command old-timers guiltily think they should be using when they type ifconfig), ss, and tc. That last command, in particular, is used to configure the traffic-control subsystem, which allows administrators to manage and prioritize the flow of network traffic through their systems. This subsystem has, for some years, had the capability to load and run BPF programs to both classify packets and make decisions on how to queue them. This mechanism gives administrators a great deal of flexibility in the management of network traffic.
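
For example, a compiled classifier (a hypothetical prog.o built from a BPF program, with the section name depending on how the object was built) can be attached to a device's ingress path like this:

    # attach the clsact qdisc, then load the BPF object with tc
    tc qdisc add dev eth0 clsact
    tc filter add dev eth0 ingress bpf direct-action obj prog.o sec classifier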

The code for handling BPF programs within iproute2 is old, though, and lacks support for many of the features that have been added to BPF in the last few years. Since that code was written, the BPF community has developed libbpf (which lives in the kernel source tree) as the preferred way to work with BPF programs and load them into the kernel. This is not a small task; libbpf must interpret the instructions encoded as special ELF sections in an executable BPF program and make the necessary calls to the sprawling bpf() system call. This work can include creating maps, "relocating" structure offsets to match the configuration of the running kernel, loading programs, and attaching those programs to the appropriate hooks within the kernel. Libbpf has grown quickly, along with the rest of the BPF ecosystem.

The obvious way to add support for current BPF features to iproute2 is to dump the old code used there in favor of libbpf; patches to that effect have been posted by Hangbin Liu, starting in late October. Shortly thereafter, David Ahern, one of the iproute2 maintainers, pointed out that the posted patches fail to compile on an Ubuntu 20.10 system; Liu responded that Ahern needed to update to a newer version of libbpf, and libbpf developer Andrii Nakryiko suggested including libbpf as a submodule of iproute2. That is where the trouble began; Ahern asserted that libbpf is provided by distributions and that iproute2 needs to work with the version that is available. Nakryiko argued instead that libbpf is "a fast moving target right now" and that packaging it with iproute2 is "pragmatic". Needless to say, this feeling did not find universal agreement.

The arguments against "vendoring" libraries like libbpf in this way have been made many times, and they appeared here as well. Jesper Dangaard Brouer repeated the usual security argument: copies of libraries buried within other packages are difficult to find and update if a problem is found. Jiri Benc added that the end result of this process would be a massive bundling of dependencies, which "would be nightmare for both distros and users". Distributors have long been opposed to bundling dependencies in this way, and the iproute2 developers do not see any reason to go against this policy for libbpf.

The BPF developers see things differently and were not shy about expressing their feelings. Alexei Starovoitov asserted that Ahern is "deaf to technical arguments" and raised the idea of forking iproute2 in response. He later said that using the distribution-provided libbpf would be worse than not using libbpf at all, and that if he could remove the existing BPF support from iproute2, he would do so. Things reached a point where Toke Høiland-Jørgensen asked him to "stop with this 'my way or the highway' extortion" so that the discussion could make some progress.

The argument for bundling libbpf with iproute2 seems to come down to a distrust in the version of libbpf that distributors will ship. There are two aspects to this, one being that iproute2 could end up linking to a version of libbpf that introduces bugs; as Starovoitov put it:

When we release new version of libbpf it goes through rigorous testing. bpftool gets a lot of test coverage as well. iproute2 with shared libbpf will get nothing. It's the same random roll of dice. New libbpf may or may not break iproute2. That's awful user experience.

Benc disagreed:

"Random roll of dice" would be true only if libbpf did incredibly bad job in keeping backward compatibility. In my experience it is not the case.

In a separate branch of the discussion, Starovoitov extolled the compatibility of libbpf releases, saying that "users can upgrade and downgrade libbpf version at any time". Others agreed that libbpf releases are managed well and do not create compatibility problems. So it is not really clear what the concern is here.

The related issue, though, is the worry that using the distribution-provided libbpf will lead to inconsistency across systems and an inability for users to know which features will be supported just by looking at the version of iproute2 they are running. Beyond that, the BPF developers fear that distributors will stick with an old libbpf, depriving users of newer features even if they have a current version of iproute2. Nakryiko said that users do want the newer features supported by libbpf, but BPF maintainer Daniel Borkmann repeatedly claimed that distributors cannot be trusted to keep up with libbpf releases. That, it is argued, will lead to a poor experience for iproute2 users. Tying a specific version of libbpf to each iproute2 release, instead, would make life more predictable.

Ahern didn't entirely buy that line of reasoning, though:

User experience keeps getting brought up, but I also keep reading the stance that BPF users can not expect a consistent experience unless they are constantly chasing latest greatest versions of *ALL* S/W related to BPF. That is not a realistic expectation for users. Distributions exist for a reason. They solve real packaging problems.

That brought out a few other developers who complained about the need to keep multiple components all at the latest versions for things to work properly. Edward Cree, for example, complained about the lack of attention to interoperability in general:

The bpf developers seem to have taken the position that since they're in control of clang, libbpf and the kernel, they can make their changes across all three and not bother with the specs that would allow other toolchains to interoperate.

Starovoitov actually agreed that the situation needs to improve — someday. "BPF ecosystem eventually will get to a point of fixed file format, linker specification and 1000 page psABI document." For now, though, things are evolving too quickly for that, he added.

Toward the end, Nakryiko suggested that, if distributors make a point of updating libbpf and setting the dependencies for iproute2 to require the updated versions, things might work well enough. Benc replied that Red Hat, at least, works that way now, leading Nakryiko to reply that this "would be sufficient in practice". Starovoitov still pushed for iproute2 to require a minimum libbpf version that increases with each release so that there would be at least some correlation between the iproute2 and libbpf versions. Stephen Hemminger, who also maintains iproute2, seems unwilling to impose that sort of requirement, though.

In the end, it appears that iproute2 will be packaged like almost all other Linux utilities; it will use the version of libbpf shipped by the distribution it is running on rather than supplying its own. The distributors will just have to be sure that they keep libbpf current with other kernel-derived utilities; this is something they have long since learned to do.

The tensions around the quickly evolving BPF subsystem seem unlikely to go away, though; one group of users is already relying on BPF-based tools while the developers continue to evolve the whole ecosystem in a hurry. Ahern described the situation this way: "The trend that related s/w is outdated 2-3 months after a release can be taken as a sign that bpf is not stable and ready for distributions to take on and support". But that, too, should be a tractable problem; Linux as a whole has been supporting users while undergoing constant modification for nearly 30 years, after all.

Comments (68 posted)

Systemd catches up with bind events

By Jonathan Corbet
November 13, 2020
The kernel project has a strong focus on not breaking user-space applications; if something works with a given kernel release, it should continue to work with subsequent releases. So it may be discouraging to read the lengthy exposition on an apparent user-space API break in the announcement for the systemd 247-rc2 release. Changes to udev configuration files will be needed to keep systems working, but the systemd project claims that it "is not [the] fault of systemd or udev, but caused by an incompatible kernel change that happened back in Linux 4.12". It seems like an appropriate time to look at what happened, how administrators need to respond, and whether anything can be done to prevent this kind of thing from happening again.

Modern computers tend to be highly dynamic, with devices (of both the physical and virtual variety) appearing and disappearing while the system is running. The kernel handles the low-level details with regard to these device events, but it is up to user space to take care of the rest. For that to happen, user space needs to know when something has changed with the system's configuration.

To that end, events are emitted to user space from deep within the kernel's driver-core subsystem whenever something changes; for example, plugging in a USB device will result in the creation of one or more ADD events to tell user space that the new device is available. The udev daemon is charged with responding to these events according to a set of rules; it can create device nodes, set permissions, notify other user-space components, and more, all in response to properties attached to events by matching rules. The set of possible events is relatively small and does not change often.

Breaking systemd

In July 2017, though, Dmitry Torokhov added two new event types called BIND and UNBIND. They are meant to allow user space to handle devices that need help before they can become fully functional — those that need a firmware load, for example. For drivers that support the new mechanism, a BIND event for a device will follow the ADD event once the device is ready to operate. This change was a part of the 4.14 kernel release in November 2017 (not 4.12 as stated in the systemd announcement).

Later that same month, a bug report landed in the KDE bug tracker; this was perhaps the first case where somebody noticed a problem related to the new events. That report only made it to the kernel lists at the end of 2018, though — over one year later. By then, 4.14 had been made into a long-term support kernel and shipped by distributors, with relatively few complaints from users. Indeed, Greg Kroah-Hartman was mystified as to why problems were turning up a year later. That turned out to be a change to systemd that caused it to propagate the new events.

Specifically, the problem would appear to originate in the way that udev (which is a part of the systemd project) attaches tags to events. These tags, which are set and used by udev rules, control how user space will set up the new device. There is an assumption built in that there will only be a single event to announce the existence of a new device, so attaching tags to that event is sufficient. When the second event (the BIND event) shows up, the device state is reset and those tags are forgotten, leading to the associated device not being set up properly.

As a short-term "fix", systemd was patched to simply ignore the new events. That caused things to work as they did before, at the cost of hiding those events entirely. That was never a long-term solution; the new events were added for a reason and some devices need them for proper setup. So a better solution had to be found for the longer term; that solution has two aspects, one of which may be disruptive for users who have created their own udev rules.

Fixing systemd

The first piece is a reworking of the "tag" mechanism provided by udev. Tags are special properties that can be attached, then matched in subsequent rules or consumed by user space. Rather than attaching tags to events, as has been done until now, udev attaches them to devices, so tags added in response to an ADD event will still be there for the BIND event as well. For cases where rules need to respond only to tags added to the current event, a new CURRENT_TAGS property lists only those tags; it thus holds the value that the TAGS property held in previous releases.
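
As a hypothetical pair of rule fragments illustrating the difference:

    # matches if "mytag" has ever been attached to this device
    TAGS=="mytag", ENV{MY_MARKER}="1"
    # matches only if "mytag" was attached by the event being processed
    CURRENT_TAGS=="mytag", ENV{MY_MARKER}="1"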

The other part, though, is a change that must be applied to a number of udev rule sets. Consider, for example, this snippet taken from a randomly chosen rules file (10-dm-disk.rules in particular) on a Fedora 32 system:

    # "add" event is processed on coldplug only!
    ACTION!="add|change", GOTO="dm_end"

The ACTION line causes the entire file to be skipped for anything other than ADD or CHANGE events; in particular, that is what will happen with BIND events. That will cause properties associated with those events to be lost — and the device in question to be set up improperly (if at all). The fix is to change that line to read:

    ACTION=="remove", GOTO="dm_end"

That causes the rules to be skipped (and their associated state forgotten) only when the device is removed from the system.

The problem here is that these rules were written under the assumption that no new event types would be added, so anything that wasn't recognized as adding or modifying a device could be ignored. There is, evidently, a certain amount of code that runs in response to device events that has a similar problem. What this shows is, in effect, a sort of protocol ossification that has made it much harder to add event types to the API provided by the kernel. Indeed, in 2018, Torokhov remarked:

Well, it appears that we can no longer extend uevent interface with new types of uevents, at least not until we go and fix up all udev-derivatives and give some time for things to settle.

At the time, there was discussion of possibly reverting the change, causing the new events to disappear. But that approach had the potential to create regressions of its own, as some systems may well have depended on getting those events; the kernel release adding them was a year old by that point, after all. There was also discussion of adding some sort of knob to enable or disable the creation of BIND and UNBIND events, but that never came to pass. Instead, Torokhov described the work in the systemd project to make the changes described above, and Kroah-Hartman responded: "So all should be good".

A regression?

With luck, all will be good, but it has come at the cost of some work within the systemd community over the last two years; the systemd developers have made their displeasure known:

We are very sorry for this breakage and the requirement to update packages using these interfaces. We'd again like to underline that this is not caused by systemd/udev changes, but result of a kernel behaviour change.

Was this a violation of the kernel's "no regressions" rule? The answer must almost certainly be "yes"; code that worked with 4.13 no longer worked with 4.14. What should have been done about it is a bit less clear. Had the issue been reported to the kernel community more quickly, it might have been possible to revert and redesign the change; after it had been deployed for a year, though, that was not a simple option. One could argue that the kernel community should have found some other way to fix the regression; the systemd 247-rc2 announcement tries to make that case. But once Torokhov posted that the problem was being addressed on the systemd side, there was no longer any pressure to do that.

Perhaps the real lesson here is that the community would be better served by closer relations between the kernel project and projects managing low-level utilities like systemd. Those relations have been somewhat strained at times, and there are not a lot of places where cooperative, cross-project discussions can take place. The presence of systemd developers at events like the Linux Plumbers Conference is limited at best, and those developers — not without reason — do not find the kernel mailing lists to be an entirely welcoming place. We are all working on the same system, though, and we would probably have an easier time of it if we could talk things through a bit more.

Comments (83 posted)

Changed-block tracking and differential backups in QEMU

November 17, 2020

This article was contributed by Kashyap Chamarthy


KVM Forum

The block layer of QEMU, the open-source machine emulator and virtualizer, forms the backbone of many storage virtualization features: the QEMU Copy-On-Write (QCOW2) disk-image file format, disk image chains, point-in-time snapshots, backups, and more. At the recently concluded 2020 KVM Forum virtual event, Eric Blake gave a talk on the current work in QEMU and libvirt to make differential backups more powerful. As the name implies, "differential backups" address the efficiency problems of full disk backups: space usage and speed of backup creation.

There's also the similar-sounding term, "incremental backups". The difference is that differential tracks what has changed since any given backup, while incremental tracks changes only since the last backup. Incremental backups are a subset of differential backups, but both are often lumped under the "incremental backups" term. This article will stick to "differential" as the broader term.

With differential backups, one of the two endpoints when creating backups is always the current point in time. In other words, it is not like Git, where, if the latest version of a file is, say, v4, you can still diff between v2 and v3 — with differential backups, one of the two diff points is always v4, the current point in time.

QEMU has had block-layer primitives to support full backups for some time; these were most commonly used for live-migrating the entire storage of a virtual machine, or for point-in-time snapshots. But over the past couple of years, QEMU and libvirt have picked up steam toward the goal of making differential backups a first-class feature that is enabled by default.

QCOW2 "backing chains"

QCOW2 was designed with the notion of sparse image files coupled with other images that can be used as overlays; these are known as "backing chains". Data that is not present in an overlay can be accessed from another image, known as the "backing file". Each QCOW2 file is divided into clusters of 64KB (by default); if a cluster is not present in the overlay, it is retrieved from the backing file. Here is a simple backing chain:

    base.raw
        overlay1.qcow2 (live QEMU)

The base.raw image is the backing file for the overlay image, overlay1.qcow2. QEMU will read from the overlay if a cluster is allocated in it, otherwise it will read from the backing file. But all writes go only to the overlay; hence the name "COW" — use the copy (i.e. the overlay) on write. Note that the base image can be either raw or QCOW2 format; overlays must be in the QCOW2 format. It is also possible to create a chain of overlays using other overlays:

    base.raw
        overlay1.qcow2
            overlay2.qcow2
                overlay3.qcow2 (live QEMU)

In this case, all the guest writes happen in the last overlay, overlay3.qcow2. Some common use cases for backing chains are spinning up multiple copies of a virtual machine (VM) based on a single "golden" disk image, thin provisioning, point-in-time snapshots, and disk image backups.
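
A chain like the one above can be built with qemu-img; the -F option spells out the backing file's format, which recent QEMU releases expect to be given explicitly:

    qemu-img create -f qcow2 -b base.raw -F raw overlay1.qcow2
    qemu-img create -f qcow2 -b overlay1.qcow2 -F qcow2 overlay2.qcow2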

Bitmaps to track "dirty" blocks

The core of differential backups involves keeping track of modified blocks over the lifetime of a disk image; also referred to as changed-block tracking (CBT). Bitmaps, or bit arrays, allow tracking the modified (or "dirty") segments in a block device, such as a QCOW2 image. QEMU uses bitmaps to determine which portions of a virtual disk image to copy out during differential backups. It provides a set of APIs — QEMU Monitor Protocol (QMP) commands — to manage the entire bitmap lifecycle.

Blake noted that using in-memory bitmaps isn't new in QEMU: they were first used internally, back in 2012, to implement one of QEMU's long-running block operations, block-stream. That operation allows shortening a lengthy QCOW2 disk image chain by merging overlays at run time, while the guest is running, to improve I/O performance. QEMU also extended the QCOW2 format specification to describe "persistent bitmaps" that can live within a QCOW2 disk image. The in-memory bitmaps are flushed to disk when the guest shuts down, and reloaded on guest boot, so that the differential record of changes can persist beyond a single cycle of booted guest activity.
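
For example, a persistent bitmap can be added to a running guest's disk with a QMP command along these lines (the node and bitmap names here are hypothetical):

    { "execute": "block-dirty-bitmap-add",
      "arguments": { "node": "drive0", "name": "bitmap0",
                     "persistent": true } }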

While bitmaps provide a lot of flexibility, the libvirt developers realized that managing them can get unwieldy, especially in scenarios involving long backing chains. One has to carefully manage the bitmaps by enabling, disabling, or merging them when required. This complexity led the libvirt developers to change their mind a few times about how to manage bitmaps.

"Checkpoint" is libvirt's term for a point in time against which a differential backup can be created; whereas a bitmap tracks changes over a time period. Both are created at the same time, but the checkpoint is immutable, while the bitmap will be modified according to later actions by the guest. An initial libvirt implementation — involving multiple bitmaps — tried to reduce the write overhead due to bitmaps by having at most one active bitmap regardless of how many checkpoints existed. But the libvirt developers eventually settled on an approach where all checkpoints have their own active bitmap, which shifts the burden of optimizing parallel bitmap write complexity to QEMU but eases the management aspects needed in libvirt.

Revitalized NBD

The network block device (NBD) protocol was first introduced 23 years ago (Linux 2.1.55; April 1997) as a way to access block devices over a network. The protocol started with two simple commands — one for reading a block, the other for writing — but, in the last couple of years, the NBD protocol has gained momentum by virtue of new virtualization use cases. For example, the ability to query dirtied blocks via NBD allows for fine-grained differential backups, especially the "pull-based" model discussed further below. (For more on NBD, refer to the "Making the Most of NBD" presentation [YouTube] by Blake and Richard W.M. Jones at the 2019 KVM Forum.)

QEMU added client-side support for NBD back in 2008 to be able to connect to a standalone NBD server such as qemu-nbd (or the more powerful nbdkit server) and access a remote block device. Later on, QEMU gained a built-in NBD server, so that it can export disk images as NBD drives. One of the most common uses of QEMU's built-in NBD server is to allow live migrating a VM's disks in a non-shared storage setup: The source QEMU prepares to transfer disk contents while the VM is running. Meanwhile, the destination QEMU sets up an NBD server advertising an empty network-accessible "export"; the source QEMU will connect as an NBD client to this export to copy the guest disk image over to the destination. Once the source completes the transfer, the memory is migrated over, the destination QEMU tears down the NBD server, and the VM can be resumed on the destination.

Copying out the dirty blocks: "push" vs. "pull"

QEMU combines bitmaps and NBD to allow copying out modified data blocks. There are two approaches to it. In the first, "push mode", QEMU internally tracks the modified blocks (or "dirty clusters" in QCOW2 parlance), and when a user requests, it creates a differential or a full backup in an external location (i.e. QEMU "pushes" the data to the target). For some use cases, QEMU can be a bottleneck here, as it controls the entire backup creation mechanism.

In the other, "pull mode", QEMU exposes the data that needs to be written out — by serving the modified blocks via QEMU's built-in NBD server — and allows a third-party tool to copy them out reliably (i.e. the data is being "pulled" from QEMU). Thus, pull mode avoids the bottleneck by letting an external tool to fetch the modified bits as it sees fit, rather than waiting on QEMU to push a full backup to a target location.

Blake's talk involved a demo of several use cases. One such example showed how to track the "dirty" blocks of a QCOW2 image. It involved creating a bitmap (with the recently added qemu-img bitmap command) and adding it to a QCOW2 image attached to a Fedora guest; the bitmap then tracks any guest writes from that point on. He then did some writes to the disk image (using the guestfish utility from libguestfs) and served the bitmap via the standalone QEMU NBD server, qemu-nbd. Finally, he used nbdinfo --map (a new command that is part of the libnbd NBD client library) to examine the served dirty bitmap and see which areas of the QCOW2 image were dirty. This allows a client tool to selectively copy out only the modified blocks. Other demonstrations showed combining bitmaps with overlays, as well as both the push- and pull-based backup workflows.
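
The serving and inspection steps look roughly like this, with hypothetical image and bitmap names:

    # export the image read-only, along with its dirty bitmap
    qemu-nbd --read-only --persistent --bitmap=bitmap0 disk.qcow2
    # from a client, map out which areas the bitmap marks as dirty
    nbdinfo --map=qemu:dirty-bitmap:bitmap0 nbd://localhost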

"A full backup is always correct, as long as the guest data has not been corrupted. Changed-block tracking is merely an optimization," Blake emphasized early in the talk. If something ever goes wrong, (e.g. if you lose a bitmap) the fallback option is to take a full backup. With this, the guest data isn't lost, just the efficiency of handling that guest data is lost.

Push-based backups are suitable when one is willing to let QEMU decide how to perform the backup and to accept its preferred output format (QCOW2). But pull-based backups truly unlock full integration into any other backup framework, by allowing it to control how and when exactly to copy out the dirty blocks.

Conclusion

As of this writing, most of the important parts needed to support differential backups are in QEMU 5.2 (due in December 2020). One improvement targeted for QEMU 6.0 (due in early 2021) is better run-time control over reopening backing chains, which will allow libvirt to shorten long QCOW2 disk-image chains (called "block commit") while preserving the desired bitmap semantics across the chain.

Differential backups are an "opt-in" feature in libvirt now. Turning them on by default is waiting on a final QEMU interface to be marked as stable; upstream is working on it.

The initial design for differential backups was first presented back at KVM Forum 2015 ("Incremental Backups - Good things come in small packages" [PDF] by John Snow and Vladimir Sementsov-Ogievskiy), with follow-ups in 2016 (Backups (and snapshots) with QEMU [PDF] by Max Reitz) and 2018 ("Facilitating Incremental Backup" [PDF] by Blake). Since then much work has been done in QEMU and libvirt's block layer, including a major rework of how block devices are configured. All of this, combined with the recent work in improved bitmap handling and NBD support in QEMU, has opened up new backup-related use cases. That includes the ability for external backup clients to take pull-based backups and provide changed-block tracking solutions on top of the QEMU and libvirt stack. This now only requires an NBD client capable of reading the bitmap and then processing the data in any pattern it likes. Recent projects like libnbd (started in 2019) have made it easier to write such an NBD client.

[I would like to thank Eric Blake for a critical review of this article.]

Comments (5 posted)

No LWN.net Weekly Edition next week

Thursday, November 26 is the Thanksgiving holiday in the US. We traditionally take that week off to celebrate with our families and eat far too much food. The pandemic situation has turned family celebrations into a poor idea, but time off and food still seem like entirely appropriate ideas. Thus, there will be no LWN Weekly Edition on November 26.

We wish the best of the holiday for all, whether you celebrate it or not. Regular service will resume on December 3.

Comments (1 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Firefox 83; Adios Flash in Firefox; youtube-dl repo restored; Quotes; ...
  • Announcements: Newsletters; conferences; security updates; kernel patches; ...

Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds