Leading items
Welcome to the LWN.net Weekly Edition for July 20, 2023
This edition contains the following feature content:
- Rust for embedded: a conference talk on how to build a temperature-monitoring system using embedded Rust.
- Stabilizing per-VMA locking: a memory-management scalability change creates some trouble for 6.4-stable users.
- The proper time to split struct page: work on teasing apart the kernel's intimidating page structure shows where the memory-management subsystem is heading, but some developers think it may be getting there too quickly.
- A Q&A about the realtime patches: Thomas Gleixner talks about the current and future state of the realtime preemption work.
- Debian looks forward to 2038: it turns out that not all architectures are equal when it comes to plans for the year-2038 problem.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Rust for embedded
The advantages of the Rust programming language are generally well-known; memory safety is a feature that has attracted a lot of developer attention over the last few years. At the inaugural Embedded Open Source Summit (EOSS), which is an umbrella event for numerous embedded-related conferences, Martin Mosler presented on using Rust for an embedded project. In the talk, he showed how easy it is to get up and running with a Rust-based application on a RISC-V-based development board.
Mosler works for Zühlke Engineering in Switzerland; both he and his company see the potential of Rust for new development, particularly for embedded projects. That is why the company supports his community work on Rust, including holding Rust meet-ups and traveling to Prague to attend EOSS and give the talk. He is building up Rust knowledge within the company so that it is available for new projects as they arise.
Over the last few months, he has been working on his own project: an Internet of Things (IoT) temperature-measurement device. He based the device on the ESP32-C3-DevKitM-1, which is a development board with a RISC-V processor and onboard WiFi. He chose that board because it is well-supported by the community, with various articles, libraries, and the like, and because it has the WiFi support needed to send out the recorded measurements.
There are some additional benefits to the hardware he chose, including no need for a proprietary toolchain to build for RISC-V; he can simply install Clang and LLVM for what he needs. The board is also fairly inexpensive, around €10, and, importantly, is widely available, unlike Raspberry Pi devices over the last few years. Beyond that, it has 4MB of flash; he is currently only using around 1MB, but the extra space will allow him to do over-the-air updates and to store measurement data while the device is offline.
Hello Rustaceans
The traditional "hello world" program is used to introduce a new programming language, and Mosler followed that lead. Using his laptop and the development board, he did a live demo of creating, building, deploying, and running the program. He noted that Rust comes with a lot of different tools, including "one quite important tool called 'cargo'"; it is a build tool and dependency manager, but it can do even more than that. He had installed the cargo-generate extension that can be used to create a new project from a template; in his case, he used it to create a project using a template for the ESP IoT development framework (ESP-IDF).
He used cargo generate with a local path for the template file (though one could instead retrieve it from GitHub using different command-line parameters); it asked a couple of questions, then created a pre-populated directory for the project. The application code lives in src/main.rs, while the Cargo.toml file describes the application and its dependencies; the rest of the files configure the cross-compilation toolchain to build for his board.
The template project comes with code that initializes the device and will jump to the main() function in the binary that gets built. He looked inside his version of main.rs, where he had added calls to initialize the Rust logger and then to log messages at different severity levels (using info!(), warn!(), and error!()). In keeping with the theme, "Hello, Rustaceans!" was the info!() message.
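A minimal main.rs along the lines he described would look something like the sketch below; the exact calls are assumptions based on the standard ESP-IDF template, which pulls in the log, esp-idf-sys, and esp-idf-svc crates:

use log::{error, info, warn};

fn main() {
    // The template calls this first; it ties the Rust binary into the
    // ESP-IDF runtime support.
    esp_idf_sys::link_patches();
    // Route the log macros to the ESP-IDF logging facility.
    esp_idf_svc::log::EspLogger::initialize_default();

    info!("Hello, Rustaceans!");
    warn!("a sample warning");
    error!("a sample error");
}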
To compile, deploy, and run that code, all that he needed to do was issue the "cargo run" command. He used the --release option to cargo, which builds a somewhat smaller binary image that will take less time to send to the board. The code in his local directory had already been built, so it simply asked him to confirm which serial device to use and sent the binary to the development board. If the code had not yet been built, cargo would have downloaded and built all of the dependencies, which takes a fair amount of time, he said.
But just flashing the code to the device took a bit of time, so he went back to his slides to show how he had installed the toolchain. He noted that the rustup project provides an easy-to-use mechanism to get what is needed for Rust development. Beyond that, he needed to install a handful of packages for his Fedora virtual machine, add a udev rule to recognize the USB serial device, and then install a few crates via cargo install, including the espflash crate that provides a tool to flash the development board.
He switched back to the output of the download, which had completed; once that happened, the serial device switched to a console displaying the system log for the device. It showed that it took around 300ms to initialize and jump to his code, then it printed his messages, in three different colors based on the severity, to the serial console.
Temperature monitor
With those preliminaries out of the way, he moved on to the IoT temperature monitor that was his real project. He chose to use the DS18B20 temperature sensor in the monitor because it uses a one-wire communication protocol, thus it only uses one pin (plus power and ground). The board can be connected to an external power supply, but he uses three AA batteries since the monitor is installed in his garden. He can run for about three weeks before he has to replace the batteries, which is good, but "there's still room for improvement".
The device needs to measure the temperature using the sensor and send it to his web site where it can be displayed. An important optimization is to put the device into a deep sleep between operations; otherwise, the device consumes the batteries in less than a day. In addition, the device needs to periodically connect to a WiFi network in order to send the measurements out to the cloud.
The code for all of what he demonstrated lives in his GitHub repository; looking at the history shows the evolution from the simple logging example he showed earlier to the more complex—though not horrifically so—temperature monitor. After checking out the version he wanted to demonstrate, which is simpler than the full-featured monitor described above, and starting to build and download it in the background, he began describing the new code, starting with:
// Initialize one-wire-bus on GPIO3
let peripherals = Peripherals::take().unwrap();
let driver = PinDriver::input_output_od(peripherals.pins.gpio3).unwrap();
let mut one_wire_bus = OneWire::new(driver).unwrap();
The Peripherals singleton comes from the IoT framework and the take() method takes ownership of that singleton for his application; no other part of the application can access it once that happens. That is the first real Rust-specific concept that he presented; the ownership is something that the compiler will enforce for a Rust program. While Mosler did not mention it here, the unwrap() call extracts the payload from the returned value, and will panic the program if the result (of the take() in this case) is an error.
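The effect of unwrap() is easy to see in a tiny standalone program (not from the talk):

fn main() {
    let good: Result<i32, &str> = Ok(42);
    println!("{}", good.unwrap()); // prints 42

    let bad: Result<i32, &str> = Err("sensor unplugged");
    bad.unwrap(); // panics: called `Result::unwrap()` on an `Err` value: "sensor unplugged"
}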
He then uses the peripherals object to get to the specific GPIO pin (3) that he chose to use for the sensor; that gets turned into a driver object for a pin that is used as both an input and an output via the PinDriver::input_output_od() call. Once again, his code takes ownership of the GPIO 3 pin, so no other part of the application can use or redefine it. That driver is used to create a OneWire object that can be used to communicate with the sensor chip.
He then described the main (infinite) loop of the program, which consists of a match statement (followed by a ten-second sleep):
// Temperature measurement on one-wire-bus
match measure_temperature(&mut one_wire_bus) {
Ok(measurement) => send(&measurement).unwrap(),
Err(MeasurementError::NoDeviceFound) => warn!("No device found on one-wire-bus"),
Err(err) => error!("{:?}", err),
}
The measure_temperature() function follows the Rust idiom of returning a Result (an enum containing either the value of interest or an error indication); the match statement does pattern matching as in functional languages (and similar to the match recently added to Python). In this case, if the function did not return an error (i.e. Ok), then the returned value is bound to measurement and sent out using send(). If the result is the specific MeasurementError::NoDeviceFound error, the warning is issued; otherwise, all other error results get logged using error!().
The final, catch-all line for all errors in the match assigns the result to the err variable so that it can be emitted as part of the error message, but it has an additional role. In Rust, the match must have all of the possible outcomes covered by a pattern, he said, so removing that error catch-all line would cause a compilation error. But removing the NoDeviceFound line is just fine, since the error catch-all will cover that case.
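The exhaustiveness rule can be demonstrated with a standalone sketch; the two-variant error enum here is invented for illustration and is not the one from Mosler's code:

#[derive(Debug)]
enum MeasurementError {
    NoDeviceFound,
    CrcError,
}

fn report(result: Result<f32, MeasurementError>) {
    match result {
        Ok(temperature) => println!("{temperature} °C"),
        Err(MeasurementError::NoDeviceFound) => println!("no device found"),
        // Removing this catch-all arm leaves CrcError unhandled, so the
        // compiler rejects the match as non-exhaustive; removing the
        // NoDeviceFound arm above is fine, because this arm covers it.
        Err(err) => println!("error: {err:?}"),
    }
}

fn main() {
    report(Ok(21.5));
    report(Err(MeasurementError::NoDeviceFound));
    report(Err(MeasurementError::CrcError));
}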
Mosler then looked at his send() function and the data type that it uses:
#[derive(Serialize)]
struct Measurement {
device_id: String,
temperature: f32,
}
fn send(measurement: &Measurement) -> std::io::Result<()> {
let message = serde_json::to_string_pretty(&measurement)?;
info!("{}", message);
Ok(())
}
The Measurement type consists of a string and a floating-point value, but it has been declared to be serializable, which means that the serde library ("serde" for serialize-deserialize) can be used to serialize the type into a variety of formats. In send(), he uses serde_json to do JSON serialization (in a "pretty" string form with indentation using to_string_pretty()) before logging the value.
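Run on its own, with invented values, the serialization step behaves like this (a sketch depending only on the serde and serde_json crates):

use serde::Serialize;

#[derive(Serialize)]
struct Measurement {
    device_id: String,
    temperature: f32,
}

fn main() -> serde_json::Result<()> {
    let measurement = Measurement {
        // Placeholder ID; the real code formats the one-wire device address.
        device_id: String::from("sensor-1"),
        temperature: 28.1,
    };
    // Prints:
    // {
    //   "device_id": "sensor-1",
    //   "temperature": 28.1
    // }
    println!("{}", serde_json::to_string_pretty(&measurement)?);
    Ok(())
}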
He returned to the measure_temperature() call in the match:
match measure_temperature(&mut one_wire_bus) {
The one_wire_bus variable cannot be passed to the function, in an ownership sense, since the call is in a loop; subsequent calls would cause an error because the main function would no longer own one_wire_bus, he said. So the & operator is used to indicate that a reference to the variable is being passed (and the function is "borrowing" the value); Mosler did not note it, but send() is also borrowing its value in the match. Beyond that, though, one_wire_bus can be both read and written in the function, thus it is a mutable value, so the mut keyword is used in its declaration and in the call to and the definition of measure_temperature().
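A stripped-down illustration of that pattern, with a toy bus type standing in for the real OneWire object, shows why the loop works:

struct Bus {
    reads: u32,
}

// The function borrows the bus mutably; ownership stays with the caller.
fn measure(bus: &mut Bus) -> u32 {
    bus.reads += 1; // the mutable borrow allows both reading and writing
    bus.reads
}

fn main() {
    let mut bus = Bus { reads: 0 };
    for _ in 0..3 {
        // Had measure() taken the bus by value, the second iteration
        // would fail to compile because the bus would have been moved.
        println!("measurement number {}", measure(&mut bus));
    }
}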
The measure_temperature() function is somewhat complicated, in part because multiple sensor chips can be installed in parallel and the ds18b20 library he is using will try to retrieve data from all of them. One Rust lesson from his description is the use of the ? at the end of the line that starts the temperature-measurement process:
ds18b20::start_simultaneous_temp_measurement(one_wire_bus, &mut Ets)?;
The ? says that if the called function returns an error (start_simultaneous_temp_measurement() in this case), the current function should simply immediately return that error to the caller. It is a shortcut mechanism for an early error return from a function and it is used in several places in measure_temperature(). "All the error-handling code is hidden behind the '?'; that's pretty nice." He also showed how a function returns a non-error result, by using an Ok() expression:
return Ok(Measurement {
device_id: format!("{:?}", device_address),
temperature: sensor_data.temperature,
});
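The early-return behavior of ? can be reproduced in a few standalone lines (again with a made-up error type, not the real library's):

#[derive(Debug)]
enum MeasurementError {
    NoDeviceFound,
}

fn start_measurement() -> Result<(), MeasurementError> {
    Err(MeasurementError::NoDeviceFound)
}

fn measure_temperature() -> Result<f32, MeasurementError> {
    // If start_measurement() fails, '?' makes measure_temperature()
    // return that error immediately; the Ok() line below never runs.
    start_measurement()?;
    Ok(28.1)
}

fn main() {
    println!("{:?}", measure_temperature()); // prints Err(NoDeviceFound)
}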
He switched over to the serial console; his program had downloaded and started running on the board. Every ten seconds, it was printing the measurement (as pretty-printed JSON). The room was warm and the temperature was measured at around 28°C; he touched the sensor chip for ten seconds to see it rise ("hopefully not above 40°C"). The next reading was over 32°C.
More features
The infinite loop for measuring the temperature means that the board is always running; it draws around 20-30mA so his batteries would last around ten hours before he would need to recharge them. He made some small changes to his main program, removing the loop and calling a new function, deep_sleep(), for nine seconds after a measurement is done (and logged). That function simply calls into the framework:
fn deep_sleep(seconds: u64) -> ! {
info!("Powering down for {} seconds", seconds);
unsafe {
esp_deep_sleep(seconds * 1_000_000);
}
}
The return type of "!" tells the compiler that the function never returns. It simply logs a message, then calls esp_deep_sleep() in an unsafe block, which means that the compiler will not check the code in the block for undefined behavior; unsafe suspends other rules that apply to Rust code as well. That function is written in C, so the compiler cannot check it; if he removes the unsafe, he gets a compilation error. esp_deep_sleep() powers down the system, except for the realtime clock and a timer, for the specified time; when it wakes back up, it goes through the reset vector and reruns the program.
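Both concepts can be demonstrated without the ESP-IDF framework; in this sketch, the C library's exit() stands in for esp_deep_sleep():

// A function implemented in C; as with esp_deep_sleep(), the Rust
// compiler cannot check what it does, so calling it requires unsafe.
extern "C" {
    fn exit(status: i32) -> !;
}

// The '!' return type promises the compiler that this function never
// returns, so no return value is needed after the call.
fn power_down() -> ! {
    println!("powering down");
    unsafe { exit(0) }
}

fn main() {
    power_down();
}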
He flashed the program to the board and demonstrated it running. As before, every ten seconds it put out a log message with the JSON-formatted measurement. This version was somewhat noisier, with the log message from deep_sleep() and all of the startup messages each time the board came out of its sleep and reset. That change was enough to get the temperature monitor to run for three weeks before needing a battery change.
He then moved on to show the changes he made to support WiFi. Most of the changes are in a new wifi.rs file, but the main program needs to initialize the Wifi object before taking the measurement and then shut it down afterward. Each of those steps is accompanied by a hard-coded two-second delay, in order to ensure that a DHCP lease has been granted during initialization and to wait for the network queue to drain before shutting it down. Changing those to actually test for the conditions would be better, of course.
The send() function also needed to change; in addition to logging the measurement, it also sends it out over a socket. The new code is as follows:
let socket = UdpSocket::bind("0.0.0.0:1337")?;
...
socket.connect(format!("{}:1337", env!("SERVER_IP")))?;
socket.send(message.as_bytes())?;
Mosler noted that this code knows nothing about WiFi, it simply opens a UDP socket, connects to the server, and then sends the JSON-formatted message. The env!() call resolves the SERVER_IP environment variable at compile time, so there is no need to commit that address to GitHub. He does a similar thing with the WiFi SSID and password in wifi.rs.
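A tiny standalone program shows the compile-time nature of env!(); SERVER_IP here is the same variable name used in his code, but the rest is illustrative:

fn main() {
    // env!() is expanded when the compiler runs, not at runtime; if
    // SERVER_IP is not set in the build environment, compilation fails.
    // Build with something like: SERVER_IP=192.0.2.1 cargo run
    let server = concat!(env!("SERVER_IP"), ":1337");
    println!("would send to {server}");
}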
He then showed this final version of his code running on the board; it had log messages that indicated it had connected successfully to the conference WiFi network. The server address he was using was a cloud server that he then logged into. A simple netcat command was used to listen for the message; after a few seconds, it showed the temperature of the room on the server in the cloud. Mission accomplished in around 100 lines of code. He noted that the Rust compiler was quite helpful in the process of building the application; it takes some time to understand the error messages, but it is well worth doing.
He then demonstrated another cargo feature: searching. You can do a cargo search for a specific crate, then add it to your project with a cargo add. After that, cargo will download the library, build it, and link it into the project the next time it is built. It is a nice way to do rapid prototyping, he said; for production code, though, the library should be scrutinized for its stability, maintenance status, and so on.
With that, Mosler took questions from the audience. The first was about the delay function that he used, which was FreeRtos::delay_ms(); "is there a FreeRTOS running?" Mosler said that the ESP-IDF that he used has FreeRTOS as part of it; it is a high-level framework that comes with WiFi support, which is one of the reasons he chose it. Those who want to work at a lower level can install the ESP framework instead, but then they will need to install WiFi and other libraries, do additional initialization, directly handle dynamic memory allocation, and so on.
Another question was about whether an asynchronous runtime, such as Tokio, could be used instead of FreeRTOS. Mosler said that one of the things he wanted to look into was alternative runtimes; he mentioned Embassy, which is another Rust-based, asynchronous framework. He would like to evaluate some of those in the future.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]
Stabilizing per-VMA locking
The kernel-development process routinely absorbs large changes to fundamental subsystems and still produces stable releases every nine or ten weeks. On occasion, though, the development community's luck runs out. The per-VMA locking work that went into the 6.4 release is a case in point; it looked like a well-tested change that improved page-fault scalability. There turned out to be a few demons hiding in that code, though, that made life difficult for early adopters of the 6.4 kernel.

The mmap_lock controls access to a process's address space. If a process has a lot of things going on at the same time — a large number of threads, for example — contention for the mmap_lock can become a serious performance problem. Handling page faults, which can involve changing page-table entries or expanding a virtual memory area (VMA), has traditionally required the mmap_lock, meaning that many threads generating page faults concurrently can slow a process down.
Developers have been working for years to address this problem, often pursuing one form or another of speculative page-fault handling, where the work is done without the protection of the lock. If contention occurred during the handling of a fault, the kernel would drop the work it had done before making it visible and start over. This work never got to the point of being merged, though. Eventually, Suren Baghdasaryan's per-VMA locking work was merged instead. Since a page fault happens within a specific VMA, handling it should really only require serialized access to that one VMA rather than to the entire address space. Per-VMA locking is, thus, a finer-grained version of mmap_lock that allows for more parallelism in the execution of specific tasks.
After a few iterations, this work was pulled into the mainline during the 6.4 merge window. Entry into the mainline tends to increase the amount of testing a change is subject to, but none of that testing turned up any problems with per-VMA locking, which saw no fixup changes during the remainder of the 6.4 development cycle. Per-VMA locking, it seemed, was a done deal.
Trouble arises
A mainline kernel release gets even more testing, though. The 6.4 release happened on June 25; it took four days for Jiri Slaby to report problems with the Go language build. On July 2, Jacob Young reported "frequent but random crashes" and narrowed the problem down to a simple test case that could reproduce it. In both cases, investigation quickly pointed a finger at the per-VMA locking work. There were, it seems, cases where unexpected concurrent access was corrupting data structures within the kernel.
Random crashes and data corruption are not the sort of experience users are looking for when they upgrade to a new "stable" kernel release. For extra fun, the StackRot vulnerability led to the merging of some significant changes to how stack expansion is handled — changes that had not seen public review and which were quickly shipped in the 6.4.1 stable update. That work introduced another VMA-locking bug in 6.4 kernels, having added another case that needed explicit serialization.
Baghdasaryan responded quickly to all of these problems, posting a patch to simply disable the feature until the problems were worked out. That patch needed some tweaking of its own, but it was never applied to the mainline in any form. Memory-management maintainer Andrew Morton went into "wait-a-few-days-mode" in the hopes that a proper fix for the problem would emerge. Greg Kroah-Hartman, meanwhile, said that nothing could be done for the stable kernels until some patch landed in the mainline kernel. As a result, the 6.4 kernel lacked any sort of workaround for this serious bug.
On July 5, Thorsten Leemhuis (working in his regression-tracking role) wondered how long the wait would be, noting that the faster-moving community distributions would already be starting to ship the 6.4 kernel. Morton answered that he would send the fixes that existed at the time, but that did not actually happen for a few more days, leading Leemhuis to conclude that he needed to start sounding the alarm more quickly on serious regressions like this one. Had he done so in this case, he thought, the problem might have been addressed more quickly.
Morton sent a patch series to Linus Torvalds on July 8 that included, among other things, the disabling of per-VMA locking. Torvalds, though, undid that change when merging the set, because three other patches had gone straight into the mainline by then:
- The first one added some locking to the new stack-expansion code, undoing the problem that had been introduced by the security fixes.
- The next ensures that newly created VMAs are locked before they are made visible to the rest of the kernel. The problem it fixed did not affect any released kernels, but it was a latent bug that would have been exposed by the planned expansion of per-VMA locking in the future.
- Finally, this patch fixed the problem that was being reported. If a process called fork(), then incurred a page fault on a VMA while that VMA was being copied into the child process, the result could be the corruption of the VMA. The fix is to explicitly lock each VMA before beginning the copy, slowing down fork()-heavy workloads by as much as 5%.
It's worth noting that Torvalds was opposed to the idea of disabling per-VMA locking; instead he insisted that, if it could not be fixed, it should be removed entirely:
I seriously believe that if the per-vma locking is so broken that it needs to be disabled in a development kernel, we should just admit failure, and revert it all. And not in a "revert it for a later attempt" kind of way.
So it would be a "revert it because it added insurmountable problems that we couldn't figure out" thing that implies *not* trying it again in that form at all, and much soul-searching before somebody decides that they have a more maintainable model for it all.
By all appearances, the fixes (which were included in the 6.4.3 stable update on July 11) are sufficient, and per-VMA locking is now stable. There should, thus, be no need to revert and soul-search in this case. That said, it's worth keeping in mind that this work looked stable before the 6.4 release as well.
Closing thoughts
The per-VMA locking work made fundamental changes to a core kernel subsystem, moving some page-fault handling outside of a sprawling lock, the coverage of which is probably not fully understood by anybody. The fact that it turned out to have some subtle bugs should not be especially surprising. It is hard to make this kind of change to the kernel without running into trouble somewhere, but this is also precisely the sort of change that developers need to be able to make if Linux is to adapt to current needs.
In theory, avoiding subtle locking problems is one of the advantages of using a language like Rust. Whether, say, a VMA abstraction written in Rust could truly ensure that accesses use proper locking while maintaining performance has not yet been proven in the kernel context, though.
There is a case to be made that this regression could have been handled better. Perhaps there should be a way for an immediate "disable the offending feature" patch to ship in a stable release, even if that change has not gone into the mainline, and even if a better fix is expected in the near future. Kernel developers often make the point that the newest kernels are the best ones that the community knows how to make and that users should not hesitate to upgrade to them. Responding more quickly when an upgrade turns out to be a bad idea would help to build confidence in that advice.
Meanwhile, though, developers did respond quickly to the problem and proper fixes that allowed the feature to stay in place were found. The number of users exposed to the bug was, hopefully, relatively small. There is, in other words, a case to be made that this problem was handled reasonably well and that we will all get the benefits of faster memory management as a result of this work.
The proper time to split struct page
The page structure sits at the core of the kernel's memory-management subsystem; one such structure exists for every page of installed RAM. This structure is increasingly seen as a problem, though, and phasing it out is one of the many side projects associated with the folio conversion. One step in that direction is currently meeting some pushback from memory-management developers who think that some of these changes are coming too soon.

The purpose of struct page is to allow the kernel to keep track of the status of each page — how it is being used, its position in a least-recently-used list, how many references to it exist, and more. The information needed varies considerably depending on how a given page is being used; a page of user-space anonymous memory is managed differently from, say, memory used for a kernel-space DMA buffer. Since page structures must be kept as small as possible — there are millions of them in a modern system, so every byte hurts — data must be stored as efficiently as possible. As a result, struct page is declared as a maze of nested unions, allowing the data fields for each usage type to be overlaid.
This all leads to a structure that is too big; about 1.6% of the memory in a system is used just to track that memory at the lowest level. Many uses do not require all of the space that struct page provides, but the size of the structure cannot vary and the extra memory is wasted. At the same time, struct page is too small, requiring constant efforts to shoehorn another bit into it. The structure itself is nearly incomprehensible to human minds, even after efforts have been made to clean up its definition. Which fields are available for a given memory type is not always clear. This structure also exposes a lot of internal memory-management details that would be better hidden within the memory-management subsystem, making many changes harder than they should be.
One of the many goals of the current churn in that subsystem is to get rid of struct page in its current form. The system's memory map, which is currently an array of these structures, would be reduced to an array of pointers, each of which would point to a descriptor of a type suited to the current use of the page it represents. Those descriptors would be dynamically allocated and sized appropriately for the information they need to contain.
This is not a simple change to make; since this structure has been exposed to the entire kernel, there is code all over the place that deals with it directly. This includes a lot of device drivers. Changing all of that code will not be done in a day — or in a year, for that matter.
Thus, smaller steps need to be taken on the way toward this goal. One of those steps is for code to stop dealing with struct page directly and, instead, work with a usage-specific structure type. The 5.17 kernel saw the introduction of struct slab, which describes a page of memory managed by the slab allocator. This structure overlays struct page exactly and is carefully designed to avoid stepping on the fields of that structure that have other uses. The conversion does not change the fact that the information lives in the same page structure as before, but it makes the slab-specific parts explicit and hides the rest of struct page from the slab allocator.
The next step may be the struct ptdesc proposal from Vishal Moola. This structure describes the form that struct page takes when the memory it describes holds a page table:
struct ptdesc {
unsigned long __page_flags;
union {
struct rcu_head pt_rcu_head;
struct list_head pt_list;
struct {
unsigned long _pt_pad_1;
pgtable_t pmd_huge_pte;
};
};
unsigned long _pt_s390_gaddr;
union {
struct mm_struct *pt_mm;
atomic_t pt_frag_refcount;
};
union {
unsigned long _pt_pad_2;
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
unsigned int __page_type;
atomic_t _refcount;
#ifdef CONFIG_MEMCG
unsigned long pt_memcg_data;
#endif
};
As can be seen, even after this use case has been separated from the rest of struct page, a number of unions remain. Many of them represent architecture-specific usages; pt_mm is used on x86 systems, for example, while pt_frag_refcount is needed on PowerPC and s390. But this structure is still much simpler, and it makes the page-table-specific usage clearer and more explicit.
This work is in its sixth revision, and most of the concerns that have been raised about it would appear to have been addressed. This time around, though, Hugh Dickins complained, saying: "I don't see the point of this patchset: to me it is just obfuscation of the present-day tight relationship between page table and struct page." He went on to say that, "in a kindly mood", he would describe the work as being ahead of its time, but would be willing to live with it if need be. David Hildenbrand added that he is "not a friend of these 'overlays'", adding that they only make sense once the descriptors can be dynamically allocated. Both developers seem to see this work as churning the memory-management code without providing any immediate benefit.
Matthew Wilcox answered that one reason to do this work now is to better document how each usage type manages the page structure:
By creating specific types for each user of struct page, we can see what's actually going on. Before the ptdesc conversion started, I could not have told you which bits in struct page were used by the s390 code. I knew they were playing some fun games with the refcount (it's even documented in the s390 code!) but I didn't know they were using ... whatever it is; page->private to point to the kvm private data?
There are, he said, assertions being added to ensure that the usage-specific structures continue to line up properly with struct page on each architecture; these can be seen in the form of the TABLE_MATCH() macros toward the end of this patch from Moola's series.
While there seems to be a consensus among the memory-management developers regarding the replacement of struct page with dynamically allocated, usage-specific descriptors, there apparently has not been a conversation about the order in which those changes should take place. It might be possible to do the dynamic allocation first, but that, too, would be a lot of code churn without a huge immediate benefit. Both transformations are needed to get to where the developers are trying to go. This work has started by adding the new structure types first; chances are it will continue that way for the duration (with, perhaps, zsmalloc descriptors being the next step).
A Q&A about the realtime patches
In a session at the 2023 Real Time Linux Summit, Thomas Gleixner answered questions about the realtime feature of the kernel, its status, and the Real-Time Linux project's plans for the future. The talk was billed as a "Q&A about PREEMPT_RT" with a caveat: "anything except printk() and documentation". As might be guessed, the first two questions were on just those topics, but there were plenty of other questions (and answers) too. The summit was held in conjunction with the inaugural Embedded Open Source Summit in Prague, Czechia at the end of June.
Documentation and printk()
Right at the start of the session, Steven Rostedt could not resist asking: "what's wrong with documentation?" That was met with a big laugh from Gleixner; "lots", he said. The biggest problem with documentation "is that it mostly doesn't exist" for realtime Linux. His printk() caveat was because the usual question is "when will it be done?", but that is "subject to crystal balls". He would be happy to answer technical questions about "why printk() is a horror-show".
With that, he advanced to his second (and final) slide: "Questions?", which elicited a big laugh from the audience. Tim Bird asked the next question, inevitably on the second "off-limits" topic: "is printk() okay if you are not using a serial console?" Gleixner said "no ... I mean, kind of"; there were some problems in the printk() core, aside from using consoles, that made it unsuitable to use with the realtime patches. Those have been fixed, but there are still the problems with using the console driver. Those problems are not truly realtime-specific, but running the kernel with a realtime configuration makes them even more obvious.
The printk() code contains a large amount of duct tape, he said, which is a pattern that the realtime developers have encountered in multiple parts of the kernel along the way. For example, the CPU hotplug code was in a similar position; everyone knew that the code was broken from a design perspective. Instead of fixing the design problems, more and more duct tape and ... other stuff ... has been added in, to the point where it "slowly composts into concrete, but it doesn't work". Eventually, "you have to bite the bullet and rewrite it".
Bird said that he had looked at the realtime patches recently, noting that there are around 80 of them scattered around the kernel, though mostly related to the serial console, with only about 4000 lines of code. He has been telling people that most of it is now upstream and that they do not have to apply the patches; "is that correct?" Gleixner had an initial one-word answer for that: "no".
You still cannot enable realtime on the mainline kernel due to the lack of the "printk() bits and pieces"; the other patches in the set are for things that can be disabled, so those are not required. Once the threaded printk() patches hit the mainline, then it will be time to ask Linus Torvalds to enable realtime for x86 and arm64. The problems with printk() have been solved, he said, according to John Ogness, who has been working on the code, and printk() maintainer Petr Mladek. "I will believe it once it hits upstream", Gleixner said.
THP and networking
An attendee asked about transparent huge page (THP) support; currently, it is disabled when PREEMPT_RT is chosen. He wondered if there is something that can be done about that. Gleixner said that the problems with THP for realtime need to be fixed, "patches welcome". The realtime project has been focused on getting other things done, and has not tackled THP yet. There is no technical reason why the two cannot work together; they just do not right now. The page migration and coalescing that THP performs have unacceptable latencies for the realtime kernel.
The attendee mentioned the advantages of reduced translation lookaside buffer (TLB) pressure that come with THP, which Gleixner acknowledged, but said that the project needed to prioritize getting the core of the patch set into the mainline. There is nothing stopping others from doing that work (or hiring a consulting company to do so); the project will probably look into it at some point in the future, but it would be better if others who need the feature take it on now.
The priority of the software interrupts was the subject of a second question from the attendee; he wondered if their priority could be increased for the realtime kernel. Gleixner said that the priorities should be set from user space by the administrator, based on the needs of the system as a whole. The problem is that "software interrupts are semantically ill-defined", so the priority that might work for one application would be totally wrong for another. Those interrupts are "context stealing and not really controllable"; the network developers have defended using software interrupts "tooth and nail for a decade", but they have come around to the idea that they need to rethink that, he said.
The attendee said that currently networking is basically broken for realtime processes; but Gleixner said that it was a complaint "about facts that have been well-known for years". Once again, he wondered why people were simply complaining, rather than digging in and working on the problem.
Another attendee noted that you can switch the NAPI thread to a realtime priority using sysfs. Gleixner said that the networking developers are moving to a threaded NAPI right now, which solves a lot of the problem for realtime, but not all of it. There are still lots of bottom-half disables within the networking code, but those can be removed once networking fully switches to threaded NAPI.
He likened the local bottom-half disable (i.e. local_bh_disable()), which prevents software interrupts from running, to the big kernel lock (BKL). Though it is a per-CPU lock, it is completely unspecified what local_bh_disable() protects, just like the BKL. And, as with the BKL, removing those calls breaks things, "but you couldn't tell why".
The process of removing the BKL was useful, in that regard, because it allowed the kernel developers to figure out what it was protecting everywhere within the kernel, with one exception: the TTY layer. That brought up a question about the TTY layer and the realtime patches. It turned out that the attendee really wanted to be able to use all of the UART devices available in the kernel, but the only path to those devices right now is via the TTY layer.
Toast and TTYs
"If you go through TTY, you're toast", Gleixner said. "Good luck fixing the TTY layer", he continued; he would not be "touching that with a ten-foot pole, even if you pay me money". If there is a need for serial communication from realtime processes, then some other mechanism needs to be added because TTY "is unfixable" for realtime.
The attendee said that they were not sending much data via the serial device, but Gleixner said that did not matter; if the only code path to access the device is via the TTY layer, then "you have a problem". If there is a real use case for non-TTY access to these devices, then some other code path could be added; the attendee agreed that his use case has nothing to do with TTY.
In fact, the questioner said that he has been maintaining an out-of-tree UART driver since the Linux 1.2 days, but it relies on a particular chip, which may not continue to be available. Gleixner said that a problem known since 1.2, which was released nearly 30 years ago, seems like it should have been fixed long ago by working with the upstream kernel developers on a proper solution. There is already support for so many different kinds of oddball devices, adding another should not pose much of a problem, given that there is a use case for it and a reason why the TTY layer needs to be bypassed.
A question about a return of the i915 (Intel graphics) driver brought laughter from Gleixner and much of the room. It is apparently disabled for realtime kernels. The only way to get a driver for that hardware is to wait for the new driver that is under development, Gleixner said. The current driver is not fixable and the patches that are in the realtime tree "are extremely horrible"; perhaps they could eventually be merged, but it will have to come later, and he seemed skeptical about it even then.
Some of the locking code paths in the existing i915 driver "are completely homebrew and out of any rational locking scheme in the world". That is one of the reasons that the new driver is being developed; "the replacement driver stack is coming along, you just have to wait for it". It is another example of "train wrecks in the kernel" that the realtime developers have tried to fix along the way.
Things to avoid
An attendee asked if there was a list of things to avoid using with the realtime patches beyond the TTY layer and i915 driver that had already been mentioned. Gleixner said that i915 actually works with the "hacky patches" in the realtime tree, at least "by some definition of 'works'". But using the TTY layer from a realtime task should surely be avoided; you can call printf() from your highest priority task, but it probably is not the best idea. Doing I/O from a realtime task is not generally the right design.
The questioner wondered about filesystems; are they problematic when running the realtime kernel? Gleixner said that he has not seen any problems with filesystems for a long time. Daniel Bristot de Oliveira said that he had seen lengthy latencies due to Btrfs recently, but Gleixner was not aware of those reports. It is the case that other kernel developers can always "needlessly slap a preempt_disable() somewhere in their code"; those kinds of problems need to be tracked down and the developers have to be asked not to do that. It is part of why he is "urgently needing to find the cycles to complete the 'Kernel Developers Guide to RT'"; once that is done, the realtime developers can point other kernel hackers at it.
But filesystems largely stay out of the way, because they are not part of the realtime computation, he said; "if you write your logfile from your realtime task, fine, you asked for it ... if the disk stalls, you wait". What about an in-memory filesystem of some sort, he was asked. Gleixner said that might work, "but, seriously, don't do it". Doing so violates all of the principles of realtime, he said; that is not a Linux-specific problem as all of the different realtime operating systems will warn people away from write(), read(), and the like.
A realtime program should either read its data up front or, if it needs to continuously update the data, have another, non-realtime process with large buffers to do the reading, he said. If the realtime task needs to write data continuously, it should be written to a ring buffer that a separate non-realtime task writes out. That is basic realtime theory, which is not at all Linux-specific, he said.
There are systems that need to handle streaming data in realtime, from cameras for example, an attendee said; how should something like that be done? "That's a system-design problem", Gleixner said; you will need a dedicated network queue, for example, but there is no "general recipe how to make that work". The current networking code does not work all that well with realtime, but a system can be tuned to the point where it can handle high-speed, streaming data. There are also options for handling the network traffic in user space to avoid some of the problems with realtime and the current networking code.
Long journey
Bird asked about the Intel acquisition of Linutronix, which employs Gleixner and some of the other realtime developers; he wondered if Intel was now funding work on realtime Linux. Gleixner said that Intel had always helped fund the realtime work via the Real-Time Linux project; both Intel and Arm have an interest in realtime Linux, which is reflected in their project membership. Kate Stewart, Linux Foundation VP for Dependable Embedded Systems and the organizer of the project, said that Intel, Arm, TI, National Instruments, Red Hat, and others have all been part of the "long journey" to get the full realtime patch set into the mainline.
Rostedt noted that the long journey would be 20 years in 2024, but Gleixner said that was only the public part of the journey. It was first posted to the Linux kernel mailing list in 2004, but for him the journey started at the end of 1999. That means it will be 25 years for him since the beginning of the project; "it's a long journey and there are a lot of things we need to address and improve over time", though there is "only so much capacity". He has tried working day and night, but has found that "it doesn't make things more effective".
The final question was about the role of the cyclictest tool; is it a good reference application? Gleixner said that cyclictest is useful for testing, but that it does not "resemble any particular real-world application". The questioner wondered if there were any good examples of real-world applications that could be reviewed. Gleixner said that he did not know of one, but that cyclictest and other test/benchmarking applications do provide a kind of basic reference implementation; however, real-world applications have a wide variety of requirements and levels of complexity.
Part of the problem with coming up with a reference realtime application is the need for specialized hardware, Gleixner said; that makes testing and benchmarking realtime systems difficult, since the results are effectively not reproducible without access to the same hardware. It is particularly hard to integrate such a test into continuous-integration (CI) systems. There is infrastructure that allows people running CI or other tests on their hardware to report the results back to the realtime project, which can help detect and find regressions and other bugs. With that, Gleixner noted that he was no longer standing between attendees and the bar (or other evening activities), and the 2023 Real Time Linux Summit was complete.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]
Debian looks forward to 2038
On January 19, 2038, the time_t value used on many 32-bit Linux systems will overflow and wrap around, causing those systems to believe they have returned to 1901.

The time_t rollover is still over 14 years away, so solving it might not appear to be an urgent problem. The fact that it does not affect most 64-bit systems may also encourage complacency. But there are affected 32-bit systems on the market now, and some of them are likely to still be operating in 2038; that is especially true of embedded systems, which might prove harder to fix as the date approaches. It would be far better if these systems were prepared for 2038 at deployment time; that means solving the problem as quickly as possible.
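The arithmetic behind the rollover is simple to demonstrate; this sketch uses Rust's i32 to model a signed, 32-bit time_t:

fn main() {
    // Seconds since the Unix epoch at 03:14:07 UTC on January 19, 2038:
    let last_good: i32 = i32::MAX; // 2,147,483,647
    // One second later, a 32-bit signed counter wraps negative, which
    // decodes to a date in December 1901.
    let wrapped = last_good.wrapping_add(1);
    println!("{last_good} + 1 = {wrapped}"); // 2147483647 + 1 = -2147483648
}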
Work has been underway in the kernel for some years, and it is mostly free of year-2038 problems at this point. The GNU C Library (glibc) has also seen a considerable amount of work to allow it to handle both 32-bit and 64-bit time_t values, depending on how a program has been compiled. So the core of a Linux system is ready, but there is a lot more than that to the problem.
Steve Langasek recently posted a plan for completing the year-2038 transition in Debian, in time for the upcoming "Trixie" release, likely to arrive in 2025. Rejecting the idea of completely rebuilding the affected Debian ports with the new ABI ("this makes upgrades unreliable, and doesn't help any ports for which we don't do this work"), the plan calls for rebuilding every affected library to use 64-bit time_t, thus creating a breaking ABI change for each. Simply finding all of the affected libraries is a non-trivial task, and that is just the beginning:
Based on the analysis to date, we can say there is a lower bound of ~4900 source packages which will need to be rebuilt for the transition, and an upper bound of ~6200. I believe this is a manageable transition, and propose that we proceed with it at the start of the trixie release cycle.
The plan, in short, calls for rebuilding each library that includes time_t somewhere in its ABI, and renaming the library by adding t64 to its name (thus allowing the older library to stay in place initially). Once all packages have been rebuilt to use the new libraries, the old ones can be removed and the t64 suffix taken off again.
An LFS flashback
There is an interesting twist that shines a light on how hard these transitions can be; it's time for a brief historical digression. Originally on Linux, the off_t type used to express an offset within a file was a signed, 32-bit integer; on 32-bit systems, it might still be. As a result, no file on a Linux system could be larger than 2GB in size. Such files seemed reasonably large in the early 1990s; according to Wikipedia, 1990 saw the release of the gigantic IBM 0681 drive, with a capacity of 857MB. Now, of course, 2GB might hold a stripped-down "hello world" binary or a low-budget car-warranty spam email, but not much more.
Once money started pouring into Linux development, the file-size limit was recognized to be a problem and a focused effort was made to use 64-bit offsets ("loff_t") throughout the kernel; the result was part of the 2.4.0 kernel release at the beginning of 2001. As of that release, the kernel can handle file sizes so large that we will surely never run into limits again; problem solved.
Except, of course, for the little problem that user space needed to be updated to use 64-bit file offsets — a problem that should be looking familiar at this point. Glibc added an option to build with large file support (LFS), and applications could be built to use that support. Over time, distributions made the transition to LFS, and file-size limits have rarely been a problem.
Debian, it seems, has not completed that transition; LFS is still an open release goal. As Guillem Jover put it: "we are still not done with LFS, and I'm not seeing the end of that". LFS might not be relevant to the time transition, other than as a historical precedent, except for one little detail: the glibc developers, in an effort to reduce the number of combinations they must support, have decreed that enabling 64-bit time support will also enable large-file support. So any application that is to be made ready for post-2038 time_t values must also be made ready for post-2001 file sizes.
Langasek did not say how many packages would need to gain LFS support in this transition, but he did mention that there are about 80 packages that don't use time_t, but which would be broken by enabling LFS. To avoid forcing an LFS transition on those packages, a special build-system tweak would be added to allow them to remain in an era when drive sizes were measured in megabytes. In the discussion, Richard Laager suggested just forcing the LFS transition on those packages as a way to simplify the problem, but it's not clear that things will go that way.
Excepting i386
In general, there does not seem to be much disagreement over this plan, with one exception relating to an exception within the plan itself. The year-2038 transition is aimed at 32-bit Arm platforms; the i386 port, instead, would be passed over. "Because i386 as an architecture is primarily of interest for running legacy binaries which cannot be rebuilt against a new ABI, changing the ABI on i386 would be counterproductive". Helmut Grohne wrote that there was no consensus on that decision, and that making an exception of i386 would create a divergence within the Debian project that would make ongoing maintenance much harder:
Maintaining this kind of divergence has a non-trivial cost. Over time it becomes more and more difficult and less and less people are interested in doing it. As such, I see the addition of this kind of divergence as a way of killing i386.
He suggested that, perhaps, a general resolution to decide the fate of the i386 port would be the best way forward.
Simon McVittie responded that there were two fundamentally different use cases for the i386 port: as a fully featured architecture and as a way to run old binaries on current, 64-bit systems. The former case, he said, makes little sense at this point, while the latter remains important. People using the i386 port for old binaries are often running games, and those games will normally run just fine even if they do think it's 1901. They will not run, though, if libraries supporting 32-bit times are not available. Thus, he said, transitioning i386 to 64-bit time_t would actively harm its most important user base.
As a Steam developer, he could be expected to feel that way, especially since the Steam client itself is one of those legacy binaries. Others, though, agreed with him; Marco d'Itri said that "an i386 port which cannot run old i386 binaries would be almost useless". Gunnar Wolf also agreed, and suggested that a future i386 port could be pared down to the packages needed to run those binaries. Paul Wise asked whether the port could support both 32-bit and 64-bit times, noting that glibc has that support. McVittie answered that many other libraries, some of which are deprecated or "on life-support", would be difficult to support in this mode.
Langasek disagreed with Grohne, saying that the opposition to the i386 exception came mostly from people who are not Debian developers. Grohne eventually came around, revising his reading of the community's sentiment:
I now see significantly more voices for the opinion: i386 primarily exists for running legacy binaries and binary compatibility with legacy executables is more important than correct representation of time beyond 2038.
There was some disagreement with this summary — this is Debian, after all — with developers like Thorsten Glaser calling for "a long-term commitment against electronic waste" and ongoing support for i386. There was some discussion over whether it is better to replace old and inefficient hardware with modern systems or to simply keep them running, but nothing resembling a groundswell of support for transitioning i386 appeared. The fact that the project is not overrun with developers volunteering their time to keep a fully supported i386 port working also plays into this decision.
So it appears that the Debian project will proceed with the year-2038 transition plan as described by Langasek at the beginning of the discussion. Most 32-bit platforms will be made safe for operation after January 2038, while i386 will remain safe for game playing. The Debian project is often not the first to get a task like this done, but its developers do get there eventually.
Page editor: Jonathan Corbet
