|
|

14 years of systemd

By Joe Brockmeier
February 17, 2025

FOSDEM

It is a standard practice to use milestones to reflect on the achievements of a project, such as the anniversary of its first release or first commit. Usually, these are observed at five and ten‑year increments; the tenth anniversary of the 1.0 release, or 25 years since the first public announcement, and so on. Lennart Poettering, however, took a different approach at FOSDEM 2025 with a keynote commemorating 14 years of systemd, and a brief look ahead at his goals and systemd's challenges for the future.

He started the talk by reminding the audience what systemd is, to "bring everybody up to speed", using the definition straight from the systemd home page:

Systemd is a suite of basic building blocks for a Linux OS. It provides a system and service manager that runs as PID 1 and starts the rest of the system.

Prehistory

Linux has had several predecessors to systemd, beginning with System V init, whose design traces back to System V Unix in 1983. The System V init implementation that is still found in some Linux distributions traces back to 1992, Poettering said, and he described it as old and cryptic.

Then came Upstart in 2006. It was an event‑based init daemon developed for Ubuntu by Scott James Remnant, a Canonical employee at the time. Upstart was designed to take the place of the traditional System V init daemon. And it did, for a while. It was adopted by Ubuntu, Fedora, Red Hat Enterprise Linux (RHEL), and others. Ultimately, it fell by the wayside and was last updated in 2014, giving it a lifespan of just eight years.

The question of why distributions didn't stick with Upstart deserves an answer, he said. One reason is that Upstart didn't really solve the problem that a service manager, an init system, should solve. It required "an administrator/developer type" to figure out all of the things that should happen on the system "and then glue these events and these actions together". It was too manual, he said. Systemd, on the other hand, allowed users to simply specify the goal, "and the computer figures out the rest". There were other problems with Upstart, Poettering said. He cited slow development and political barriers due to Canonical's copyright assignment policy. (LWN covered Canonical's copyright assignment policy in 2009.)

Babykit

Poettering and Kay Sievers hashed out the basic ideas for what was initially called "Babykit" on a flight back from the Linux Plumbers Conference in 2009. He said the original name reflected the trend in that era of naming daemons something‑kit. "I think it's a thing that we imported from Apple." Most of those have died out, he said, though we still have Polkit and PackageKit. He and Sievers wanted to do a proper open‑source project with development in the open, no copyright assignment, using the LGPL.

From the start it was more than just an init system, more than just PID 1, he said. For example, part of the goal was to handle booting on modern Linux systems, which is "a series of processes that need to happen". Right from the beginning, it was more than one binary. People complain that systemd suffers from not‑invented‑here (NIH) syndrome, "and sure, to some degree, everyone is victim to this," he said. "But, I really try to do my homework."

By "homework", he meant studying what other projects do and what the status quo is before systemd implements something. "We want to have good reasons why we do it differently". System V init and Upstart were influences on systemd, because that was what Linux distributions were actually using. But Apple's launchd was much more interesting, he said. One feature that they loved, in particular, was its socket activation concept. A similar concept existed in the internet service daemon (inetd), which was a standard component of most Linux and Unix‑type systems at the time. But, he said, Apple pushed it to its limits, which got rid of "to some degree, explicit dependency configuration".

Solaris's service management facility (SMF) was another major influence "because it has all this enterprisey stuff, and we wanted to go for enterprise". Systemd also has some original thoughts—but just a couple of them.

I mean, I wouldn't claim that the concepts that Systemd is built from are purely ours. They are not. We looked at what's there and then tried to do maybe a little bit better at least.

Another major influence, of course, is Unix. But, Poettering said, he didn't really know what Unix is and went on a bit of a philosophical tangent about whether Linux and systemd were "Unix". Ultimately, he concluded, "maybe in some ways. In other ways, probably not".

"The world runs on it"

Fedora was the first major Linux distribution to switch from Upstart to systemd, in the Fedora 15 release back in 2011. Poettering said that it was a big win for systemd to be used by default, and that it was the goal from the beginning to make systemd something that would have mainstream use. Arch Linux and openSUSE followed Fedora's lead a year later, and then RHEL 7 included systemd when it was released in 2014. "So that was when it started, like the whole world started running on systemd". The whole world, excepting Debian and Ubuntu. Those distributions moved to systemd in 2015, which he said was "the most complex win" for the project.

[systemd logo]

For all its success, systemd did not have a logo until Tobias Bernard designed one and released it in 2019. Now systemd has its own brand page, and a color scheme that includes "systemd green". The project started its own conference in 2015, originally called systemd.conf, which expanded its focus beyond systemd and became All Systems Go! in 2017. Poettering put in a plug for this year's All Systems Go!, and suggested "you should totally go there if you're interested in low‑level operating system kind of stuff. User space only, though."

Poettering asked: so, where are we today? All major Linux distributions use systemd, "in particular, the commercial ones all default to it", and this basically means "the world runs on it".

The project has a vibrant community, he said. It has six core contributors, including Poettering, and 60 people with commit access. More than 2,600 people have contributed to systemd over the years. One thing that the project could do better, he said, is to release more often. Systemd has a six‑month release cycle, which he said is "actually not that great, we should be doing better", but release management for such a large project is difficult.

Systemd also has "a little bit of funding" through donations to Software in the Public Interest (SPI), and from grants by the Sovereign Tech Agency (formerly the Sovereign Tech Fund). Poettering said that the project has used the funding for things that don't interest the core developers, for example reworking the systemd web site.

How big is systemd?

Today, systemd is a suite of about 150 separate binaries. Poettering said that the project sometimes gets complaints that it is too monolithic, but he argued that the project is not monolithic and is in fact "quite modular". That doesn't mean that everything within the project is modular or can be used in any other context, but "it's a suite of several different things" with a central component that keeps everything together. "You can turn a lot of it off. Not all of it, but a lot." It is primarily a C project, Poettering said, with a few exceptions: some components are written in Python, and the project has experimented with Rust, but it remains, at heart, a C project.

The systemd core team doesn't share the exact same view of things, he said, but they do agree that systemd is "the common core of what a Linux‑based OS needs". It covers basic functionality: login, network name resolution, networking, time synchronization, and user-home-directory management. Every component has found adoption in some Linux distributions, though each distribution chooses different parts of systemd to adopt.

He also discussed the footprint of systemd in terms of lines of code and dependencies. He said that the project comprises 690,000 lines of code, whereas wpa_supplicant is about 460,000 lines of code, and the GNU C library (glibc) is more than 1.4 million lines. "Is that a lot, or a little? I don't know." He used to like to compare systemd to wpa_supplicant because it was roughly the same size as systemd, but in the past three years "we apparently accelerated" to outgrow it. But, systemd is still about half the size of glibc. As far as size on disk, Poettering said that a full‑blown install of systemd on Fedora was about 36MB, whereas GNU Bash is about 8MB. If the shell alone is 8MB, he said, "then it's not that bad" for systemd to be 36MB.

Poettering said that the project has always been conservative about dependencies, because if it pulls in a library as a dependency then it effectively impacts all Linux users. About two years ago, the project switched to using dlopen() for all of the dependencies that are not necessary to run the basic system. For those who don't know, he said, with dlopen() a shared library is not loaded until it is actually required, rather than at program startup.

He gave an example of using a FIDO key with full‑disk encryption. Many people use FIDO but many others don't. By using dlopen() users can still have full‑disk encryption bound to something else without having to have the FIDO stack installed. "We pushed the dlopen() thing so far that there are only three really required dependencies nowadays", Poettering said: glibc, libmount, and libcap. Systemd has three build‑time optional dependencies as well: libselinux, libaudit, and libseccomp.
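To illustrate the pattern (this is a minimal sketch, not systemd's actual code, and the wrapper function here is made up), a dlopen()-based consumer of an optional library such as libfido2 might look roughly like this:

    /* Minimal sketch of the dlopen() pattern described above; not systemd's
       actual code. The optional library is only loaded when the feature
       that needs it is actually used. */
    #include <dlfcn.h>
    #include <stdio.h>

    static void *fido_dl;

    static int ensure_fido(void)
    {
            if (fido_dl)
                    return 0;
            /* Loaded lazily: systems without the FIDO stack installed never
               pay for it, and the binary carries no hard DT_NEEDED entry. */
            fido_dl = dlopen("libfido2.so.1", RTLD_NOW);
            if (!fido_dl) {
                    fprintf(stderr, "FIDO support unavailable: %s\n", dlerror());
                    return -1;
            }
            return 0;
    }

    int main(void)
    {
            if (ensure_fido() < 0)
                    return 1;       /* fall back or fail gracefully */

            /* Look up a symbol by name; fido_init() is a real libfido2
               entry point, but treat the call here as illustrative only. */
            void (*init)(int) = (void (*)(int)) dlsym(fido_dl, "fido_init");
            if (init)
                    init(0);

            dlclose(fido_dl);
            return 0;
    }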

In the wake of the XZ backdoor, Poettering started pushing others to take the dlopen() approach too. Systemd was not the target for the backdoor, but because some distributions linked the SSH daemon against libsystemd, which then pulled in liblzma, it was used to prop open that door. "So, back then, this was not a dlopen() dependency. That's why this happened." Systemd was not at fault, he said, but "maybe we can do something about it." (LWN covered this topic in 2024.)

In summary, he said, systemd is not that big, and not tiny either. It is suitable for inclusion in initial ramdisks (initrds) and containers. "Systemd is not going to hurt you very much, size‑wise".

What belongs in systemd

Poettering admitted there is some scope creep in systemd, but said that the project does have requirements that make clear what belongs (and what does not) in systemd. The first of those is that it needs to solve a generic problem, not just a problem for one user. It needs to be foundational and solve a problem for a lot of users.

Another rule of thumb is that "it needs to have a future" and he said that the project is not going to add support for legacy technologies. The implementation needs to be clean and follow a common style, as well. "You can deliver products quickly if you cut a lot of corners, but that's not where we want to be with systemd." Even if something checks all the boxes, he said, it doesn't mean that it needs to be in systemd. It can still be maintained elsewhere. A technology also needs to fit into systemd's core concepts.

There is not a single list with the project's core concepts written down, Poettering said. It is possible to distill them from systemd's man pages and specifications, but "I can't give you the whole list". He did, however, provide a number of examples. The first of those is the clear separation of /etc, /run, and /usr; /etc is for configuration, /run is for "runtime stuff that is not persistent", and /usr is for "the stuff that comes from the vendor". This is a separation that was not traditionally followed so closely on Linux, he said.

Hermetic /usr is another concept that Poettering said that systemd is trying to push. In a nutshell, that means that /usr has a sufficient description of a system to allow it to boot, even without /etc, /var, and so forth. "It basically means that you can package up the whole of /usr, drop it on another machine, we'll boot up" and it will just work. He did not want to go into great detail on each concept, he said, but to give an example of how concepts "spill into everything else we do".

Another concept Poettering mentioned is that everything systemd does needs to have declarative behavior. "You just write down where you want to go. You don't write down code". For example, he said, boot should not involve running a shell script, because shell scripts are inherently imperative. That is not the way things should be configured. There are more concepts that systemd has, but the point is that having these concepts in place permits systemd to do things that were hard to do before, such as the options ProtectSystem= and ProtectHome= which provide "high‑level knobs" for sandboxing.
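As a small illustration of those declarative knobs (a hypothetical unit; the service name and binary path are made up), a service can request sandboxing in its unit file instead of scripting it:

    # example-daemon.service -- hypothetical unit, for illustration only
    [Unit]
    Description=Example daemon with declarative sandboxing

    [Service]
    ExecStart=/usr/bin/example-daemon
    # High-level sandboxing knobs: the OS tree is made read-only and home
    # directories become inaccessible, with no imperative scripting involved.
    ProtectSystem=strict
    ProtectHome=yes

    [Install]
    WantedBy=multi-user.target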

Systems also need standards, he said, and the project tries to set standards by writing down specifications of how it does things "in a generic way". And systemd consumes a lot of standards, too, such as /etc/os-release, which is now used by most Linux distributions and BSD‑based operating systems. Poettering said that the project has even created a web site for standards, the Linux Userspace API (UAPI) group, where systemd people "and people close to us" are invited to put specifications. The discoverable disk image (DDI) specification, which provides a method for self‑describing file system images that may contain root or /usr filesystems for operating-system images, system extensions, containers, and more, is one example.
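For reference, the /etc/os-release file mentioned above is just a small file of shell-compatible key/value assignments; the values in this sketch are invented:

    # Illustrative /etc/os-release contents; the values are made up.
    NAME="Example Linux"
    ID=example
    VERSION_ID=42
    PRETTY_NAME="Example Linux 42"
    HOME_URL="https://example.org/"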

The future

Poettering was running out of time when he reached the slides for systemd's goals and challenges for the future. He was careful to note that the goals and challenges he outlined were from his point of view and that others on the systemd team may have different priorities. For his part, he sees four goals and challenges for systemd.

The first goal is to implement boot and system integrity, so that it is harder to backdoor the system. Not impossible, but harder. Basically locking down the system to keep attackers out, and having a well-known state for a system to return to where "you know there's not going to be anyone inside there because you can prove it". If someone attacks a server, they can be removed from it because the server can be returned to a defined state and it can be updated in a safe way. "In other words, you don't have to always sleep with your laptop under your pillow" because someone might modify the bootloader.

Poettering said that "all big OSes" have boot and system integrity addressed in one way or another. But none of the "generic distributions" have adopted it by default. It is a sad situation that matters a lot, he said. One obstacle to implementing boot and system integrity is that "it makes things more complex because you need to think about cryptography and all these kinds of things".

Cultural issues are another obstacle, and those are in part founded on FUD, he said, such as the idea that the Trusted Platform Module (TPM) is all about digital rights management (DRM) that "takes away your computers". On the contrary, the way that TPMs are designed is "actually very much compatible with our goals". Package‑based systems also make things more complicated than they need to be, but "we live in a package‑based world" since all major Linux distributions use packages by default.

Goal number two is rethinking systemd's interprocess communication (IPC), specifically moving away from D-Bus toward varlink. (LWN covered varlink in systemd v257 in December.) While D-Bus is "never going away", Poettering said that varlink allows processing IPC requests with one service instance per connection, which makes it easier to use. Writing D-Bus daemons is hard, he said, but it is easy to bind a command to a Unix stream socket using systemd's socket activation and turn it into a varlink IPC service.
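The general shape of such a setup, sketched here with made-up unit names and paths, is a socket unit that accepts connections plus a templated service spawned per connection with the socket passed on stdin/stdout:

    # example.socket -- hypothetical; names and paths are illustrative.
    [Socket]
    ListenStream=/run/example.varlink
    # Accept=yes gives inetd-style activation: one service instance
    # per incoming connection.
    Accept=yes

    [Install]
    WantedBy=sockets.target

    # example@.service -- the per-connection service instance.
    [Service]
    ExecStart=/usr/bin/example-varlink-handler
    # The connection socket is passed as stdin/stdout.
    StandardInput=socket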

The third thing on Poettering's list was more of a challenge: Rust, sort of. He made the case that systemd is doing pretty well with C, at least vulnerability-wise, if one accepts CVEs as a metric. There were three CVEs against systemd in 2023, and none in 2024, and its CVEs were "mostly not memory-related". Even so, he said, "we do think that the future probably speaks Rust".

But, systemd has a complex build and many binaries and tests. It is not a fit for Cargo, and the Meson build system currently used by systemd did not like Rust, though it has gained some Rust functionality recently. And systemd is sensitive to footprint issues, which is why it is heavily reliant on shared libraries. Static linking for 150 binaries is not an option, but "dynamic libraries in Rust are not there". Shared libraries need to be a first-class citizen. Ultimately, Poettering said that he is "happy to play ball" but does not want systemd to be the ones to solve the problems that need to be solved in order to use Rust. He mused that there might be some competition between Rust and Zig to deliver a memory-safe language that can provide stable shared libraries, including dlopen(), support for hybrid code bases, and more.

Fourth and finally, but far too briefly, Poettering said that the last challenge for systemd was about image-based operating systems and "let's leave it at that". The slide for the presentation went slightly further and included a call for pushing the Linux ecosystem away from package-based deployments to image-based deployments. It also recommended mkosi for building bespoke images using package managers.
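For context, mkosi is driven by a declarative configuration file; a rough sketch of such a file follows, but section and option names vary between mkosi versions, so treat this purely as illustrative rather than authoritative:

    # mkosi.conf -- illustrative sketch only; check option names against
    # the mkosi version in use.
    [Distribution]
    Distribution=fedora
    Release=41

    [Output]
    Format=disk

    [Content]
    Bootable=yes
    Packages=systemd
             kernel-core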

Poettering had time for a few questions. The first question was whether systemd would eventually replace GRUB. Poettering said that he was "not a believer in GRUB, as you might guess", and that all the pieces were there to replace it. The remaining problems were political. GRUB tries to do too much, and most of those things are a mistake, he said. Most distributions could switch to systemd-boot if the focus is EFI only.

Another audience member asked what the plans were to "fix the issues with resolved", systemd's DNS resolution daemon. Poettering said that it works fine for him, but he was aware some people had problems with the "fancier features" such as DNSSEC "and that's flaky" because DNSSEC is "really hard in real life because servers are shit". He suggested that the audience member file a bug.

The video and slides for the keynote are now available from the talk page on the FOSDEM web site. The keynote was one of four talks Poettering gave during FOSDEM, and all of the talks are now available online for those who would like a deeper dive into specific systemd features.

[I was unable to attend FOSDEM in person this year, but watched the talk as it live‑streamed. Many thanks to the video team for their work in live‑streaming all the FOSDEM sessions and making the recordings available.]


Index entries for this article
Conference: FOSDEM/2025



Standardization of paths

Posted Feb 17, 2025 17:06 UTC (Mon) by josh (subscriber, #17465) [Link]

One thing this article doesn't mention, but which I've also really appreciated systemd for promoting: standardizing and unifying paths across distributions when they were gratuitously different.

Before systemd, different distributions put the hostname in different places. Now, they all use /etc/hostname. Likewise for various other things.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 17, 2025 17:11 UTC (Mon) by bluca (subscriber, #118303) [Link] (6 responses)

I was there for this talk and there were no protesters staging a sit-in. No banners proclaiming the resistance of The Unix Way (TM) being unfurled to the wind. There was only one heckler briefly shrieking for a moment, probably a dev*an neckbeard having a meltdown when realizing that systemd is more Unix-y than his distro due to the monorepo. Extremely disappointing, will ask for my money back.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 17, 2025 17:23 UTC (Mon) by jzb (editor, #7867) [Link] (4 responses)

Let's not throw terms like "neckbeard" around at people, or groups of people. C'mon. Read the bullets again, if you haven't: that does not fall into "polite, respectful, and informative" by a country mile.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 17, 2025 23:22 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (2 responses)

Mr Brockmeier, I think Mr Bocassini is making a joke. I don't know how offensive the term really is, so I hope to be 'informative' here.

To Mr Bocassini, I apologise, Sir, for neither attending nor protesting, it's my aim in 2025 to be more inconvenient -- as protest should be -- but for now I'm only commenting online.

K3n.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 18, 2025 0:27 UTC (Tue) by koverstreet (✭ supporter ✭, #4296) [Link] (1 responses)

It is possible to have a sense of humor that isn't at other people's expense.

Only one heckler?

Posted Feb 18, 2025 0:58 UTC (Tue) by jengelh (guest, #33263) [Link]

Well then I guess don't go to a Jimmy Carr show.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 19, 2025 10:29 UTC (Wed) by IanKelling (subscriber, #89418) [Link]

Thank you.

Only one heckler, no sit-in protesters, no banners, 2/10 would not attend again

Posted Feb 18, 2025 6:44 UTC (Tue) by zdzichu (subscriber, #17118) [Link]

For me, systemd was very interesting in the beginning, but ceased to be around a decade ago. By then, it was mainly "complete" and integrated into most distributions. The hardest opponents created their Debian derivative (not a fork, as Devuan mostly uses the original Debian packages). Systemd was just there, managed services, and worked solidly. I expect the protesters moved on to tilt at different windmills.

While during the early years I eagerly read each new paragraph of NEWS, after ~2015 I found myself caring less and less. None of the later features around images, TPMs, sealed systems, homed, etc. are relevant to me.

Coincidentally, a decade ago Kubernetes appeared, although it was still 2-3 years from being universally usable. I find the appearance of K8s a curious coincidence, but not related to systemd getting bland.

FTR, I'm one of the 2,600 crowd.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 17:19 UTC (Mon) by alogghe (subscriber, #6661) [Link] (69 responses)

This focus on whole system images seems to miss key lessons from systems like Guix, which demonstrates the strengths of provenanced and verifiable artifact trees.

Rebooting an entire system image because some minor set of lib* that 99% of the system isn't using has changed seems like a bad outcome and something the big vendors just do because reasons.

I realize systemd might not want to get into the "package" world, but squashing or interfering with the developments that are occurring, and need to occur, in this world by being whole-system-image focused seems like a poor path.

We could instead work toward a world of directed graphs of individually signed components, where updates could trigger targeted restarts or refresh signals only where needed.

Generally this image focus seems like it's solving 2012's problems.

Images are required for verified boot

Posted Feb 17, 2025 17:43 UTC (Mon) by DemiMarie (subscriber, #164188) [Link] (17 responses)

Images can be cryptographically verified using dm-verity and the root hash can be cryptographically signed. Nix and Guix have nothing comparable to this.

Rebooting needs to be cheap, because one needs to reboot weekly for kernel updates. If there is a service that must be kept running, the solution is a high-availability cluster, not a single machine with high uptime.

Images are required for verified boot

Posted Feb 17, 2025 18:02 UTC (Mon) by NightMonkey (subscriber, #23051) [Link] (7 responses)

I'm just curious about how "live updates" to kernels without reboots have panned out in practice. That was touted as an alternative, at least in so-called "Enterprise Kernels", to reboots for new kernels. I never got a chance to evaluate this. Does it really mean that you don't need to reboot the OS in all cases?

Also, just a +1 on creating a HA cluster rather than relying on single OS instances.

Images are required for verified boot

Posted Feb 17, 2025 18:08 UTC (Mon) by bluca (subscriber, #118303) [Link]

No, it doesn't cover all cases by any means, but it does work well for simple and small patches, which means it's good enough for a lot of security fixes. But it requires a dedicated team to handle, as you basically need to produce case-by-case specific artifacts to deploy. It is non-trivial work to do for production systems, but it is doable given enough resources.

Images are required for verified boot

Posted Feb 17, 2025 20:21 UTC (Mon) by ferringb (subscriber, #20752) [Link] (4 responses)

The part more interesting to me here isn't live patching, it's that systemd *is* pid 1; it can nuke the world, screw with root mounts, bring up the new image and pivot-mount into that, thus basically "fast boot" w/out having the risks of live patching, nor the hardware initialization quirks of kernel exec. Bare minimum, anything that gets a userland-level full reboot w/ integrity, and doesn't require paying the god-awfully-slow UEFI/BMC boot, has my interest.

Basically CoreOS underlying image updates, just done w/out a reboot, and the new FS should be able to have integrity controls applied based on keys already in the kernel.

Images are required for verified boot

Posted Feb 17, 2025 20:58 UTC (Mon) by bluca (subscriber, #118303) [Link] (3 responses)

Images are required for verified boot

Posted Feb 17, 2025 21:05 UTC (Mon) by ferringb (subscriber, #20752) [Link]

...well that figures. Nice.

Images are required for verified boot

Posted Feb 20, 2025 8:47 UTC (Thu) by MKesper (subscriber, #38539) [Link] (1 responses)

The docs say this does NOT change the running kernel:
The OS update remains incomplete, as the kernel is not reset and continues running.
Kernel settings (such as /proc/sys/ settings, a.k.a. "sysctl", or /sys/ settings) are not reset.

Images are required for verified boot

Posted Feb 20, 2025 9:57 UTC (Thu) by bluca (subscriber, #118303) [Link]

...yes? That's the point of a userspace reboot?

Images are required for verified boot

Posted Feb 19, 2025 16:53 UTC (Wed) by Mook (subscriber, #71173) [Link]

My understanding is that live updates are usually there to tide you over until a more convenient time to reboot. Say there's an important vulnerability; you apply the live patches to shut it down immediately. However you're still expected to reboot soon, say in the middle of the night or over the weekend when there are fewer users.

Images are required for verified boot

Posted Feb 17, 2025 18:05 UTC (Mon) by alogghe (subscriber, #6661) [Link] (3 responses)

You make my points for me.

So the cloud build systems go off and make a pile of stuff that you "image sign".

Can the user reproduce the contents, or is it just "yes, this is the same pile of stuff that some cloud something made"? The answer is no, the user cannot reproduce it.

So everyone needs to run a cluster to keep up with this image based rebooting idea? What a world.

We need systems at the fs-verity and overlayfs layer for this.

Images are required for verified boot

Posted Feb 17, 2025 18:51 UTC (Mon) by mbunkus (subscriber, #87248) [Link]

> So everyone needs to run a cluster to keep up with this image based rebooting idea? What a world.

If your business case can tolerate a 2min downtime for reboots each week, you do not need a cluster.

If, on the other hand, your business case cannot tolerate that small a downtime, then you do need a cluster, but not just for the reboots, but for regular operation 'cause applications & hardware can & do fail outside of scheduled maintenance windows, too.

Images are required for verified boot

Posted Feb 17, 2025 21:57 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (1 responses)

If you build your images properly of course they are reproducible.

Most "mkfs" tools nowadays accept file system trees as input that shall be placed into the fresh file system in a reproducible way.

"mkosi" built disk images are reproducible, by most definitions of the term (though not by all).

Lennart

Images are required for verified boot

Posted Feb 19, 2025 7:50 UTC (Wed) by marcH (subscriber, #57642) [Link]

> ... reproducible, by most definitions of the term (though not by all).

I've fixed many build reproducibility issues and I hate the wrong "boolean" impression made by the term "reproducible". Build reproducibility issues are exactly like regular bugs: they do not always trigger; it depends on what you do. It depends on what you're trying to build and how, which corner cases you hit, etc. So saying "building X is reproducible" makes as little sense as saying "software X has zero bug". Some users of software X will indeed observe zero bug while other people using it differently will find a lot more. It's the same with build reproducibility: some people trying to reproduce the same binary will succeed, while others trying to reproduce another binary from the very same sources configured and used differently will hit some reproducibility issue(s) and fail.

I don't think this is a matter of "definition". It's the misunderstanding that build reproducibility is either true or false when it's neither. It's "just" a type of bug.

Images are required for verified boot

Posted Feb 18, 2025 12:43 UTC (Tue) by spacefrogg (subscriber, #119608) [Link] (4 responses)

You are either misinformed or misleading in your judgement. Nix and Guix have much more than that.

While signing images sounds cool, it falls apart immediately when you realise that your trust in the contents of said image stems from mere hearsay. There is usually no cryptographic connection between the image signature and the package signatures that this image was made from. Just the belief that somebody might have used them, only them, and used them correctly, in the right order, etc.

With Nix and Guix you know exactly what inputs were used to construct a system X and you can verify it. You can have hourly updates if you want and can even predict if a reboot is necessary. And of course every package is cryptographically signed.

You don't have that in image world. You're not even close to it.

Images are required for verified boot

Posted Feb 18, 2025 17:27 UTC (Tue) by bluca (subscriber, #118303) [Link] (3 responses)

> You don't have that in image world.

Of course you can have that. Once again, runtime integrity and build supply chain security are orthogonal problems with different and independent solutions.

Just because all your packages are verified it doesn't stop an attacker with execution privileges from running their own code.

Images are required for verified boot

Posted Feb 19, 2025 9:07 UTC (Wed) by taladar (subscriber, #68407) [Link] (2 responses)

Wouldn't completely preventing that require the refusal to work on any and all data formats or network protocols that contain executable parts?

No PDF or Postscript, no browser with Javascript, no Office documents, no plugins or extensions,...

Images are required for verified boot

Posted Feb 19, 2025 9:49 UTC (Wed) by bluca (subscriber, #118303) [Link]

For user-facing machines, to an extent - for example mitigations like modern sandboxing browsers can offer a "good enough" protection in most cases. But really this is mostly aimed at headless nodes - servers, containers, VMs, compute nodes, infrastructure. You don't need to open office documents or view PDFs on your router, for example. Different use cases have different requirements and threat models.

Images are required for verified boot

Posted Feb 19, 2025 11:56 UTC (Wed) by pizza (subscriber, #46) [Link]

> No PDF or Postscript, no browser with Javascript, no Office documents, no plugins or extensions,...

...no vector fonts...

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 18:00 UTC (Mon) by bluca (subscriber, #118303) [Link] (23 responses)

Worrying about whether packages are signed or not is the 2012 problem. In today's world that is not enough. We need to ensure the _runtime_ integrity of the system, cryptographically enforced by the kernel, and that's what image-based systems are all about, and neither Guix nor things like ostree can do anything about this, simply because they were designed to solve different problems.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 18:16 UTC (Mon) by alogghe (subscriber, #6661) [Link] (11 responses)

No one produced a reproducible system in 2012, and Red Hat certainly hasn't in 2025?

This image based focus just punts these problems.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 18:19 UTC (Mon) by bluca (subscriber, #118303) [Link] (10 responses)

Those are different and unrelated issues. There's nothing stopping you from having reproducible builds when delivering images. Reproducible builds are a great idea, and a build-time thing. Runtime system integrity has nothing to do with reproducibility, they are completely unrelated.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 18:29 UTC (Mon) by alogghe (subscriber, #6661) [Link] (9 responses)

Why would I EVER want to run a binary that I can't reproduce with a local build on demand?

They aren't separate concerns whatsoever.

https://reproducible-builds.org/

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 19:20 UTC (Mon) by bluca (subscriber, #118303) [Link] (8 responses)

Because it's not about *you* running a binary, it's about an attacker doing so. I guarantee you they don't care if the arbitrary code execution is reproducible or not. Code integrity at runtime is not about reproducibility, that's a build time/supply chain problem. It's about stopping attackers that escaped sandboxing from being able to execute arbitrary payloads, and only allowing cryptographically verified ones.

Once again, these are separate things, solving different problems.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 5:47 UTC (Tue) by PeeWee (guest, #175777) [Link] (7 responses)

> It's about stopping attackers that escaped sandboxing from being able to execute arbitrary payloads, and only allowing cryptographically verified ones.

But doesn't this then come dangerously close to the DRM FUD? IIRC that is basically the main concern with the "trusted computing" paradigm and TPM: the vendor can prevent users from running their own non-signed code; who holds the signing key(s)? And if users can sign their own code, wouldn't that give an attacker the same capabilities? Or is this what Lennart meant by saying that TPM makes such attacks harder but not impossible?

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 7:48 UTC (Tue) by mjg59 (subscriber, #23239) [Link] (3 responses)

TPMs are in no position to control what code runs on a system - they're discrete cryptographic components that only know what they're explicitly told. If you don't boot an OS that talks to the TPM, the TPM does absolutely nothing.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 7:59 UTC (Tue) by PeeWee (guest, #175777) [Link] (2 responses)

But my point is that this seems to be headed toward an OS that does use TPM. Otherwise, why would Lennart have mentioned it in his talk then?

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 10:44 UTC (Tue) by ferringb (subscriber, #20752) [Link]

> But my point is that this seems to be headed toward an OS that does use TPM.

This is a good thing.

I think folks are confusing TPM with TEE. See https://docs.bunny.net/docs/widevine-drm-security-levels ; at the first two levels, the end user can fully control their OS since hardware decoding + TEE keeps the decrypted content out of the user's reach. Level 3 is software decode, meaning the host OS *could* grab content; that requires that the OS be something signed and trusted by the DRM vendor (Windows, for example). That requires secureboot (TPM) + boot chain validation. The user doesn't control the critical parts of the OS in that scenario since they don't control the UEFI keys.

Linux wise, end users control the UEFI keys (exempting crappy laptop vendors like razer, where you have to use the shim). If you control those keys, you control the OS/userland- in full. Your distro cuts a new kernel, you validate it; if you wish to boot it, then (for UKI) you sign the UKI using the UEFI keys you control, thus allowing the hardware to boot it. Again- the keys *you* control.

Avoiding TPM has no relevance to DRM for everything but level3, and level3 is pretty much never going to happen in the linux world in my view; no content owner would trust a distro to manage the OS and kernel to their satisfaction, IMO.

TPM usage for the OS is a good thing. Any boot chain that isn't secureboot validated means someone else can swap in their own bootloader putting the kernel and OS under their control. Disk unlock that isn't based on hardware and system measurement has similar problems; "enter a password to unlock" is only as safe as the attestation of the software leading up to that point. No validation, no safety against "evil maid" type crap.

Note, hardware level attacks are a whole other beast, but if a 3 letter agency is after you, they're going to get in if they consider the effort worth it. :)

I could be wrong on a few particulars, but the broad strokes, that should be accurate.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 16:27 UTC (Tue) by mjg59 (subscriber, #23239) [Link]

And if you don't want the properties that provides you can boot a modified one that doesn't - no DRM can be enforced here

Cryptographically verified OSes and DRM

Posted Feb 18, 2025 10:17 UTC (Tue) by gioele (subscriber, #61675) [Link]

>> It's about stopping attackers that escaped sandboxing from being able to execute arbitrary payloads, and only allowing cryptographically verified ones.
>
> But doesn't this then come dangerously close the DRM FUD?

It is, unfortunately. Especially things like remote attestation. I love the idea of having the cryptographic-baked certainty that I'm SSHing into my own box, running my own software, unmodified. At the same time a browser could use remote attestation to let sites know that I'm not running ublock.

It's a situation similar to the HTTPS/TLS one: protects your data, but makes it hard to spot malware and data exfiltration.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 11:19 UTC (Tue) by bluca (subscriber, #118303) [Link]

> But doesn't this then come dangerously close the DRM FUD?

Very much not. The _owner_ of the machine is in control of what software runs, as the chain of trust of the integrity checks goes back to the DB and MOK UEFI lists of certs. So the owner is in control.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 14:02 UTC (Tue) by mezcalero (subscriber, #45103) [Link]

I like to say that TPMs are somewhat "democratic". They make no restrictions on what you can run, and they do not come with vendor-controlled certificate lists like UEFI SecureBoot does. The only thing they control is this: if you give a TPM a secret for safekeeping, it will reveal it to you (or execute operations on it) only under circumstances you define, and those can include prior measurements of the OS, but also providing a PIN or so. This means: you control your data, you define the policies, and the TPM only acts on your behalf. And the worst that happens if you lose your PIN is that access to that key is lost; the system remains fully accessible and bootable all the time. If you bind FDE to your TPM, however, then if you are not running the software you declared good beforehand, you will not be able to access your encrypted data.

So, it's really about policies you the user defines, with keys you the user owns, and it prohibits nothing.

The DRM thing the FSF keeps mentioning is made up; no one would bother with the TPM for that (there are better ways for people who care about DRM to enforce DRM [e.g. your video card], really no need to involve a TPM with that).

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 19:11 UTC (Mon) by champtar (subscriber, #128673) [Link] (10 responses)

ostree now supports composefs, allowing the use of fs-verity (off, enabled but not signed, or signed)

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 19:18 UTC (Mon) by bluca (subscriber, #118303) [Link] (9 responses)

I am well aware, but that doesn't solve the problem either. The filesystem itself is still unverified, and the runtime integrity can be compromised just as well, as the entire security model hinges on doing the cryptographic verification _once_ at boot in the initrd. After that, it's again wide open.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 23:37 UTC (Mon) by champtar (subscriber, #128673) [Link] (8 responses)

Regarding filesystem verification, with dm-verity based systems you have the same FS verification issue for the data partition, so you need measured boot / full disk encryption in any case.

Regarding the verification, my understanding of composefs is that you have a file with an erofs partition containing all the metadata, and this file is also fs-verity protected, so once you have verified it and mounted it you should be good.

If you have time to detail why it's wide open, or have extra reading for me I would really appreciate (I'm a simple user but I enjoy learning on those subjects)

The big advantage of the file based approach is that you don't have to right size your A/B partitions, and you can keep more than 2 versions for cheap.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 0:57 UTC (Tue) by bluca (subscriber, #118303) [Link] (7 responses)

> Regarding filesystem verification, with dm-verity based systems, you have the same FS verification issue for the data partition, so you need measured boot / full disk encryption in any cases.

Sure but that's orthogonal - any writable data storage needs encryption at rest anyway as that's just table stakes, and LUKS2 can cover that just fine

> Regarding the verification, my understanding of composefs is that you have a file with an erofs partition containing all the metadata, and this file is also fs-verity protected, so once you have verified it and mounted it you should be good.

The first problem with that is that it's already too late once the partition is opened, as filesystem drivers are not resistant to malicious superblocks. The second problem is that, again, that's just a one-time setup integrity check. There's nothing stopping anything with privileges in the same mount namespace from replacing that mount with something else entirely, and then it's game over. ostree is entirely implemented in userspace, so it's defenceless against such attacks, because it's designed with different use cases and different threat models in mind (which is fine - nothing can do everything at once). dm-verity on the other hand has its cryptographic integrity enforced by the kernel, and when combined with the IPE LSM that was added in v6.12 it can enforce that this doesn't happen, and that everything that is executed really comes from a signed dm-verity volume.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 4:43 UTC (Tue) by champtar (subscriber, #128673) [Link] (6 responses)

If you protect everything with LUKS (OS and DATA) wouldn't that mitigate attacks on the FS ?

I don't see a difference between having the dm-verity signature checked by the kernel, and having the composefs erofs file checked from user space in the initramfs (assuming the FS is LUKS protected), it's just different pieces of code in a UKI no ?

Using LUKS + fs-verity you might have more overhead than just dm-verity but that's a trade-off to have more flexibility around partition sizing / file deduplication I guess.

I see you can use IPE with fs-verity, don't know with composefs. As I use containers not sure I'll be able to use IPE anytime soon anyway.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 10:55 UTC (Tue) by bluca (subscriber, #118303) [Link] (5 responses)

> If you protect everything with LUKS (OS and DATA) wouldn't that mitigate attacks on the FS ?

That's for offline protection. LUKS doesn't help you against online attacks, which apply to images downloaded and executed. This is of course not a problem for the data partition, as you don't download that from the internet; it's created locally on provisioning. Every image that gets downloaded from the internet and used for payloads needs to have its integrity checked and enforced before being used, as drivers are not resilient against intentionally malformed superblocks.

> I don't see a difference between having the dm-verity signature checked by the kernel, and having the composefs erofs file checked from user space in the initramfs (assuming the FS is LUKS protected), it's just different pieces of code in a UKI no ?

The difference is huge: in the former case integrity is enforced at _all times_ by the kernel, at runtime, so it's resilient against online attacks. In the latter case, integrity is only checked _once_ during boot, and never again. Again, the use cases are different: in the first case security checks are done always, in the latter case they are there for offline protection of data at rest. The second model is strictly weaker. Which might be fine, mind you - as always it depends on the threat model.
But if your threat model does include online attacks by malicious privileged processes (and for our case in Azure it very much does), the first model can mitigate it, the second one cannot.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 20, 2025 15:37 UTC (Thu) by alexl (subscriber, #19068) [Link] (4 responses)

>> I don't see a difference between having the dm-verity signature checked by the kernel, and having the composefs erofs file checked from user space in the initramfs (assuming the FS is LUKS protected), it's just different pieces of code in a UKI no ?

>The difference is huge: in the former case integrity is enforced at _all times_ by the kernel, at runtime, so it's resilient against online attacks. In the latter case, integrity is only checked _once_ during boot, and never again. Again, the use cases are different: in the first case security checks are done always, in the latter case they are there for offline protection of data at rest. The second model is strictly weaker. Which might be fine, mind you - as always it depends on the threat model.
But if your threat model does include online attacks by malicious privileged processes (and for our case in Azure it very much does), the first model can mitigate it, the second one cannot.

This is completely wrong. Composefs checks at initrd that the fs-verity digest of the composefs image is the expected value, similar to how you would verify the root dm-verity digest of the rootfs block device.

After that, each file access is validated by fs-verity at runtime. This includes reads of the EROFS image that has the metadata, as well as the backing files. And the expected fs-verity digests for the backing files are recorded in the EROFS image, and validated each time a backing file is opened.

The only difference between dm-verity and composefs is that dm-verity validates at the block level, and composefs verifies at the file level. This means that an attacker could modify the filesystem at the block level to "attack" the kernel filesystem parser. However, this difference is very small in practice. Only in a system using dm-verity for a read-only root, and *no* other filesystems does it make a difference. The moment you mount a filesystem for e.g. /var, then an attacker could just as well attack that. Any "solution" to that, such as dm-crypt would also apply to the composefs usecase.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 20, 2025 16:03 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

> This is completely wrong. Composefs checks at initrd that the fs-verity digest of the composefs image is the expected value, similar to how you would verify the root dm-verity digest of the rootfs block device.

No, that is completely wrong. With signed dm-verity + IPE the enforcement is done by the kernel on every single binary or library being executed, at any time, at runtime, forever. That's the point of code integrity.

You cannot do that with composefs, because the security model is different and fully trusts userspace. So a userspace process that has escalated privileges can simply replace the composefs mounts with anything of their choosing, and run whatever they want. A userspace attacker that has escalated privileges on a dm-verity system cannot do that, as they would need a signed volume and they do not have access to a private key trusted by the kernel keyring, so taking control of userspace is not enough, you also need to take control of the kernel, which is much harder.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 21, 2025 15:17 UTC (Fri) by alexl (subscriber, #19068) [Link] (2 responses)

Theoretically there is nothing that prohibits adding kernel support for IPE to, e.g., only allow executing binaries from a signed composefs mount, although you are right that this doesn't currently exist.

However, I think IPE is only useful for setups that are extremely locked down. For example, the second you have some kind of interpreter (like bash or python) available you can run essentially arbitrary code anyway. For any kind of more generic system that can run code more dynamically it will not be applicable. For example, you could never use it on a personal laptop or a generic server.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 21, 2025 16:16 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

> Theoretically there is nothing that prohibits adding kernel support for IPE to e.g only allow executing binaries from a signed composefs mount, although you are right that this doesn’t currently exist.

No, that is not possible, not even theoretically, because the chain of trust MUST go up into the kernel for this to work at all. composefs mounts do not do this, by design, trust is verified in userspace. It's not a matter of implementing it, it needs to be redesigned, and mutability and integrity are essentially at opposite ends of the spectrum. Pick one or the other according to the use case, but you can't have both.

> However, I think IPE is only useful for setups that are extremely locked down.

Yes, for sure, code integrity is for cases where security requirements are stringent, and not for a generic laptop or so.

> For example, the second you have some kind of interpreter (like bash or python) available you can run essentially arbitrary code anyway.

This is being worked on, the brand new AT_EXECVE_CHECK option in 6.14 is exactly intended to allow interpreters to plug those use cases. It's absolutely true that interpreters are not covered right now, but (fingers crossed) there should be something usable for at least a couple of interesting interpreters this year if all goes according to plans.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 24, 2025 12:57 UTC (Mon) by alexl (subscriber, #19068) [Link]

>No, that is not possible, not even theoretically, because the chain of trust MUST go up into the kernel for this to work at all. composefs mounts do not do this, by design, trust is verified in userspace. It's not a matter of implementing it, it needs to be redesigned, and mutability and integrity are essentially at opposite ends of the spectrum. Pick one or the other according to the use case, but you can't have both.

I don't see exactly how this would be impossible. I mean, obviously it would require some design and implementation work in the kernel to do the kernel-side signature validation, but nothing fundamentally impossible. The main blocker is that I don't currently consider this a critical feature.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 18:38 UTC (Mon) by epa (subscriber, #39769) [Link] (21 responses)

Rebooting the entire system because a library has changed seems like overkill. But I have not seen an alternative proposed that matches it: certainly not for simplicity, but neither for being able to guarantee that only the new version is running.

Suppose the new version of libfoo fixes an important security hole. It is not enough to install it and make sure any newly started processes get the new code. Somehow you need to track down and restart existing processes using the vulnerable library, or, even trickier, arrange for them to unload the old library and dynamically link in the new one. I am sure such a scheme is possible in principle. I just don’t think it exists right now.

We scoff at the Windows users with their reboots, when a Linux system can keep running smoothly even after upgrading something as fundamental as libc. But when you stop to consider what the upgrade is for, perhaps the stupid approach is the right one after all.

Detected updated shared libraries

Posted Feb 17, 2025 19:30 UTC (Mon) by mussell (subscriber, #170320) [Link] (7 responses)

When a shared library (or any mmaped file) gets updated on disk, the related entry in /proc/$PID/maps will indicate that it is deleted. For example:

7f34474e0000-7f344763a000 r-xp 00024000 08:02 52661479 /usr/lib64/libc.so.6 (deleted)

Recent versions of htop will highlight the process in yellow if there is such a deleted mapping. It is possible to have a program that scans every maps file and restarts the associated systemd unit when its shared libraries are updated.
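A minimal sketch of such a scanner (illustrative only, and without the unit-restarting part) could walk /proc and flag any process that still maps a deleted shared object:

    /* Illustrative sketch: list PIDs that still map a deleted *.so. */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            DIR *proc = opendir("/proc");
            struct dirent *de;
            char path[64], line[4096];

            if (!proc)
                    return 1;
            while ((de = readdir(proc))) {
                    if (!isdigit((unsigned char) de->d_name[0]))
                            continue;       /* only PID directories */
                    snprintf(path, sizeof(path), "/proc/%s/maps", de->d_name);
                    FILE *f = fopen(path, "re");
                    if (!f)
                            continue;       /* process may have exited */
                    while (fgets(line, sizeof(line), f)) {
                            if (strstr(line, ".so") && strstr(line, "(deleted)")) {
                                    printf("%s\n", de->d_name);
                                    break;
                            }
                    }
                    fclose(f);
            }
            closedir(proc);
            return 0;
    }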

Detected updated shared libraries

Posted Feb 17, 2025 19:32 UTC (Mon) by bluca (subscriber, #118303) [Link] (4 responses)

'needrestart' already implements pretty much that functionality. The trouble is that there are certain services that _cannot_ be restarted without fundamentally breaking the system, like the D-Bus daemon. And of course if there's a graphical session, restarting it is not too different from rebooting - it's going to be just as fast as a softreboot anyway.

Detected updated shared libraries

Posted Feb 18, 2025 5:19 UTC (Tue) by mirabilos (subscriber, #84359) [Link]

Maybe use an init system and GUI environment that does *not* use dbus, then you can easily restart it ;-)

Detected updated shared libraries

Posted Feb 18, 2025 8:33 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

> And of course if there's a graphical session, restarting it is not too different from rebooting - it's going to be just as fast as a softreboot anyway.

So if I have a problem with my graphical session, it's okay to boot my wife or daughter unexpectedly off the system? I thought a graphical session was user-space, certainly conceptually, and there's no reason why there shouldn't be multiple real people with multiple different graphical sessions all using the same computer ???

Cheers,
Wol

Detected updated shared libraries

Posted Feb 18, 2025 11:00 UTC (Tue) by bluca (subscriber, #118303) [Link] (1 responses)

Just don't do that if you are one of the 3 people left in the world who use a single shared machine with GUI and multiple concurrent users, don't see what the problem is :-)

Detected updated shared libraries

Posted Feb 18, 2025 11:05 UTC (Tue) by geert (subscriber, #98403) [Link]

Number two is chiming in ;-)

Detected updated shared libraries

Posted Feb 17, 2025 19:36 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

Thanks. You could have something to restart system processes. But that doesn’t take care of user processes. The vulnerable version of libfoo might be linked into your web browser or mail client. I guess you could tell user sessions to log out. But then you might as well reboot.

Detected updated shared libraries

Posted Feb 18, 2025 15:52 UTC (Tue) by raven667 (subscriber, #5198) [Link]

I think there are a bunch of corner cases like this, especially when you add locally installed software which exists outside the best practices of the distro environment, so ultimately the juice isn't worth the squeeze. People have written tools like needrestart, but eventually you need to reload something disruptive, so you can either spend a lot of effort trying to be selective (which only works in certain scoped instances, and you might miss something and fail), or do the absolute simplest thing and have guaranteed success.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 20:01 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (9 responses)

> Rebooting the entire system because a library has changed seems like overkill.

Yes and no at the same time. For many users, the cost of a (quick) reboot is nothing compared to the cost of uncertainty. Images allow you to deliver a version of a complete system, and users update from one version to another. This also means fewer variations for support teams, and regressions are easier to track.

We've been doing this at haproxy in our appliances for the last 20 years now, and it's much appreciated by users, support, and developers, because everyone knows what you're running.

Those who don't like it are experienced admins who would like to tweak the system a bit more, install their own tools, etc. Then it becomes more complicated, even if there are R/W locations making this possible. For example, installing a tool that depends on libraries not in the base system is harder (you need to play with LD_LIBRARY_PATH etc.). There are possible hacks, such as using mount --bind to replace some binaries on the fly, so you're not completely locked down by default, but clearly for the vast majority of end users, knowing that they're only running the versions they're expected to run is quite satisfying. You also end up with an extremely stable system that never fails to boot, because you don't install random stuff on it, nor can you break package dependencies. It even permits smooth upgrades, because you can enumerate everything that has to be handled and can afford to automatically migrate configurations. I tend to like this approach; I'm even using it for infrastructure components at home (firewall, proxies, etc.), because there's no degraded situation: it's either the good previous image or the good next one.

The important thing is to have fast-booting hardware, and with heavy UEFI BIOSes that decompress an entire operating system into RAM these days, we're slowly moving away from that. Ten years ago, our system was fully up in 12s from power-up. Nowadays it's more like 30-40s. In any case, that only matters in the case of a double power outage (extremely rare), because normally each device is supposed to be backed up by another one that instantly takes over. And with some mainstream servers it can take minutes, and then I can see how it can be too long for many users for just a library update!

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 20:15 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

Out of interest, is kexec at a point where it saves you time? Or is it not yet good enough for your needs when swapping an image out?

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 22:24 UTC (Tue) by bearstech (subscriber, #160755) [Link] (1 responses)

In my case kexec is a huge win: the latest Dell servers have horrible BIOSes which POST in 3 to 4 minutes (with pretty regular storage and network hardware). I update kernels far more often than BIOSes (and many BIOS updates are CPU microcode updates, which Linux handles fine, thanks). Kexec brings a kernel update down to a 30s downtime, where a large chunk of the boot time is still the hardware's fault (at least 10-15s to wait for the RAID controller to boot, and at least 10-15s to wait for the network cards' firmware to settle).

Linux+systemd boots (and most importantly shuts down) faster and faster, while shitty firmware boots slower and slower. Those kexec and soft-reboot features are sorely needed.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 19, 2025 16:31 UTC (Wed) by raven667 (subscriber, #5198) [Link]

Thank goodness I don't have to manage hardware very often anymore, but the time it takes for a server to fully reboot can be a stressful, white-knuckle event, the longest five minutes in the world, because you can never be 100% sure it will come back online, and I'm not a big fan of being responsible for creating an outage because of forgetting some stupid simple thing.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 20:24 UTC (Mon) by alogghe (subscriber, #6661) [Link]

You make my point that this is a solution not fit for a general purpose computing device.

Real users have complex workflows and program needs.

They have large trees of libraries and binaries not in the default image and they deserve full support and working systems that support that use.

This systemd image idea is rooted in appliance-based thinking and isn't appropriate for large numbers of real users.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 22:34 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (4 responses)

Image-based doesn't mean you really have to place the whole OS in a single image with every app and so on. It just means your basic OS is a cryptographically signed image, and everything else is too. The OS can consist of a couple of distinct images that are combined, and each payload is an image too. So in systemd you can have quite a bunch of different images: for the underlying OS, for "extensions" of the OS (sysext), for your configuration (confext), for individual services (portable services), for containers (nspawn). Of course, how to split things up among these different concepts requires some care, but if you do it properly, you don't need to reboot every time you update *any* library, though you still do for some.

Moreover, note that there are things like soft-reboot these days, which are really fast. And there's room for even more cool stuff: for example, I think we should make systemd-nspawn + portable services systematically ready to just stay running over a soft-reboot. Since they run off their own disk images anyway, this should be relatively straightforward to implement.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 0:12 UTC (Tue) by fraetor (subscriber, #161147) [Link] (3 responses)

There are two use cases I don't quite see how to achieve with image-based OSes: runtime introspection and local admin modifications.

Runtime introspection, the ability of an admin to poke at the state of the system while it is running to gain a better understanding of its behaviour, meshes nicely with being able to reset an image-based OS back to a known good state, but I'm not sure how one would go about it.

Local admin modifications are changes to the vendor supplied files that fix site-specific issues, such as adjusting hard coded timeouts, including additional codecs, that kind of thing. I think sysexts are meant to be the solution here, but I'm not clear how easy they are for a local admin to actually create on the fly.

How are these use cases meant to be addressed in a systemd image-based OS? Am I missing something from these various extension/portable services?

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 14:09 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (2 responses)

So for bigger installations/servers (i.e. "cattle") you probably simply don't want to allow local admin modifications (which is more of a "pet" thing). If you want to do local modifications, sysext is what we propose: you can either build a DDI (i.e. erofs+verity in a GPT image) and drop that on the target, properly signed. Or, if you want to yolo it more, you can use "systemd-sysext --mutable=" to make /etc/ mutable by adding in a local overlay dir at a writable location. But this is a break-glass thing of course.

And regarding introspection: if you want a debug shell to debug things, then nothing is stopping you from getting one with the usual mechanisms. Just because your /usr/ is immutable doesn't mean you cannot get a root shell after you have authenticated.
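
For a rough idea of what the unsigned, experiment-only variant looks like, here is a hedged sketch of a directory-based sysext (names and paths are invented for illustration):

# A throwaway system extension overlaid onto /usr; ID=_any skips OS-release matching.
mkdir -p /var/lib/extensions/mytool/usr/bin
cp ./mytool /var/lib/extensions/mytool/usr/bin/
mkdir -p /var/lib/extensions/mytool/usr/lib/extension-release.d
echo 'ID=_any' > /var/lib/extensions/mytool/usr/lib/extension-release.d/extension-release.mytool
systemd-sysext merge      # overlay the extension onto the running system
systemd-sysext unmerge    # and take it away again when done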

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 17:44 UTC (Tue) by fraetor (subscriber, #161147) [Link] (1 responses)

The mutable sysext sounds like just the thing for experimenting. Then once you settle on a good change you can package that as a signed and read-only sysext. I'll have to dig into sysexts a bit more.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 17:54 UTC (Tue) by bluca (subscriber, #118303) [Link]

The GNOME OS and Flatcar distro developers use exactly those tools for development; from what we hear, they are quite happy with them.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 6:20 UTC (Tue) by PeeWee (guest, #175777) [Link]

> Suppose the new version of libfoo fixes an important security hole. It is not enough to install it and make sure any newly started processes get the new code. Somehow you need to track down and restart existing processes using the vulnerable library, or, even trickier, arrange for them to unload the old library and dynamically link in the new one. I am sure such a scheme is possible in principle. I just don’t think it exists right now.

I believe what you are looking for are tools like `needrestart` and `checkrestart`. Both exist in the Debian-derived distros, e.g. Ubuntu. IIRC `checkrestart` is the predecessor of `needrestart`. The former lives in the debian-goodies package, which is optional, at least in Ubuntu, and the latter comes in its own package, which should be installed automatically, if I am not mistaken. As the last step of an upgrade run (`apt upgrade`), `needrestart` gets executed, finds everything that is still using old, now-deleted libs, and can even identify the services that need to be restarted; it does so for most but not all of them, as there are some that would have surprising consequences when restarted, e.g. systemd-logind. Any programs that do not belong to a service will also be identified, but restarting them must be done manually. It will also inform you if a reboot is necessary, e.g. for a newer kernel version. And if there are services that cannot easily be restarted, a `systemctl soft-reboot` may be the quicker option, provided no kernel reboot is necessary anyway.

I believe all this is done by leveraging `lsof` and checking which of the libs cannot be found in the filesystem. IIRC that was the approach when `checkrestart` was still in its infancy. And I cannot imagine that this kind of tooling is only available in the Debian world, even though I think it was pioneered there.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 11:03 UTC (Tue) by pabs (subscriber, #43278) [Link]

needrestart can do all that sort of thing, it even tells you if you updated microcode and forgot to reboot.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 19, 2025 10:43 UTC (Wed) by IanKelling (subscriber, #89418) [Link]

> but neither for being able to guarantee that only the new version is running.

But it isn't a guarantee in the face of malicious code, so, to be more accurate, it prevents a certain kind of bug. And if you've been updating your Debian-based system without being aware of this kind of bug for the last 15+ years, why exactly is it worth rebooting a whole lot more? I'm sympathetic to the parent poster.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 17, 2025 21:53 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (4 responses)

To be frank, I find systems such as nix, guix, and ostree security-wise really uninteresting, because anything they do is inherently mutable, yet also executable, and there's no clean way to correctly, cryptographically reset the system into a known good state again. I think both of these concepts are what a good system should provide these days:

1. I am pretty sure it's essential to learn from memory management that W^X is a *good* thing, and to apply it to file systems as a whole: a file system should *either* be writable, *or* allow executable inodes on it. The combination is always risky business. An image-based system can provide that: you cryptographically guarantee the integrity of the whole FS with everything on it via dm-verity and signing, which systems such as IPE can then hook into. A system such as nix/guix/ostree inherently does not; they maintain repositories of random inodes, both writable and (often) executable.

2. There must always be a cryptographically enforced way to return to a known good system state, i.e. flushing out local state, returning to vendor state. nix/guix/ostree generally don't support that: they cannot put the fs back into a known good state, because they modify it at the fs level, not the block level, and they cannot just flush out the local fs modifications wholesale.

So, I think we can do better (and must do better) than nix/guix/ostree for building a secure OS, in particular in a networked world, where you know that sooner or later any software will be exploited, and hence it's so important to make it impossible to sneak executable code in, and, if it happens anyway, to know how to get it out of there again.

Now, of course, you can have another security model in mind than I do, but you know, the magic "reproducibility" thing is just *one* facet of security, and of course DDIs (i.e. the dm-verity disk images we like to think in) are as reproducible as anything done by guix/nix/ostree if done properly. It's not an exclusive property of guix/nix/ostree, not at all.
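
As a hedged sketch of the block-level approach being contrasted here (generic tool invocations, not a description of any particular build pipeline):

mkfs.erofs os.erofs ./os-tree/           # immutable, read-only filesystem image
veritysetup format os.erofs os.verity    # build the dm-verity hash tree; prints the root hash
# The root hash (or a signature over it) is what gets pinned as "known good"; resetting the
# system then amounts to discarding everything that lives outside such verified images.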

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 13:01 UTC (Tue) by aragilar (subscriber, #122569) [Link] (3 responses)

How would this interact with $HOME/bin (or $HOME/.local/bin)?

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 18, 2025 14:13 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (2 responses)

If having something like that shall be permissible in the image-based OS you define, then you need to allow it in your policies, of course. But of course, the question is whether it should always be permissible. Writing your own programs is of course entirely fine, but I am pretty sure it's fine to say that this is only really a thing if you have a certain tech-savvy kind of human user for your system. Many (most?) devices won't require this. I mean, I am pretty sure there are not that many people who write and run their own shell scripts on their phones, for example.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 20, 2025 8:18 UTC (Thu) by aragilar (subscriber, #122569) [Link] (1 responses)

I don't see why phones can't have shell scripts (writing them would be a pain, but that's more of an input problem than a fundamental thing). One could easily imagine having a pipeline where you click a button, the camera opens up (with specific settings chosen), there's some automatic editing to put the photo in the right orientation (which could be previewed), and then the photo is sent off to some system. I know of fieldwork apps which are basically this, and have a scripting language built in for this use case.

The issue is who makes the decision as to whether users should be allowed to do this task. If this is imposed on users (e.g. their phones), then we have a problem, but if this is effectively an appliance that is provided for a specific task (and even then, the question of whether things should be automatable should be asked), then it does make sense.

systemd's Image Obsession Misses the Forest for the Files

Posted Feb 20, 2025 10:40 UTC (Thu) by Rigrig (subscriber, #105346) [Link]

> If this is imposed on users (e.g. their phones), then we have a problem

I run shell (and other) scripts on my phone using Termux, and W^X is indeed a problem (apparently also for BOINC):
  1. "Recent" Android API versions enforce W^X, so Termux is either […]
  2. Downloading+running executable code seems to be against Play Store policies, so it might get randomly removed at any time anyway.

DLL hell, version 2

Posted Feb 17, 2025 18:06 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (21 responses)

> In the wake of the XZ backdoor, Poettering started pushing others to take the dlopen() approach too.

This is absolutely a short-sighted decision that will cause huge problems down the line. In particular, for static binaries and for the future lockdown features that are going to be hamstrung by not having the ability to load the whole dependency graph.

DLL hell, version 2

Posted Feb 17, 2025 18:13 UTC (Mon) by bluca (subscriber, #118303) [Link] (1 responses)

It's a fantastic idea that is working out exceedingly well in practice, and we'll do it more and more and more, and across other projects too. Seeing Rust zealots, fanatics of the "let's statically link the entire universe, what could possibly go wrong" whinge about it just further confirms it's 100% the right solution.

DLL hell, version 2

Posted Feb 17, 2025 18:18 UTC (Mon) by corbet (editor, #1) [Link]

If it is working well for you, please say so. That is fine. But please stop insulting the people you might disagree with on unrelated topics. Seriously.

DLL hell, version 2

Posted Feb 17, 2025 18:37 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (17 responses)

We do export the "whole dependency" graph of weak deps in an ELF note, afaik all major package managers have been updated to read and process this information and turn it into automatic package dependencies. Hence, I dont think your criticism in this regard is valid, sorry.

Lennart

DLL hell, version 2

Posted Feb 17, 2025 19:18 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (16 responses)

For example, if you want to seal the address space of a process (like OpenBSD does) then you need to have the information about all the dependencies beforehand.

With dlopen() optionals, the only way is to load all of them, in case the application needs them later. Then there are issues of disk access, since optionals are resolved at an arbitrary point.

This all will have consequences in the future.

DLL hell, version 2

Posted Feb 17, 2025 19:26 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (15 responses)

Nah, we live in a world where things like glibc nss are a thing and basically involved in any hostname or user resolution request, i.e. in basically any non-trivial program. Glibc nss is a dlopen() based system, hence sealing things off entirely during early init of a process is just not realistic on Linux in general, and systemd's use of dlopen() is not making it worse.

In fact the way we do it makes it a lot *easier* to handle things like this, since you could parse the mentioned ELF note within your program early on and load all modules listed therein, gracefully handling those which cannot be fulfilled because they are not installed, and *then* seal off memory. After all, the data is there, programmatically accessible from inside the ELF programs and from outside too.

DLL hell, version 2

Posted Feb 17, 2025 22:19 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> Nah, we live in a world where things like glibc nss are a thing and basically involved in any hostname or user resolution request
> Glibc nss is a dlopen() based system, hence sealing things off entirely during early init of a process is just not realistic on Linux in general, and systemd's use of dlopen() is not making it worse.

musl exists, and it can be fully static. With nscd we don't need NSS modules, and the only remaining bogosity is in PAM.

We're actually pretty close to a fully static system, that can be sealed in RAM.

> In fact the way we do it makes it a lot *easier* to handle things like this, since you could parse the mentioned ELF note within your program early on and load all modules listed therein, gracefully handling those which cannot be fulfilled because they are not installed, and *then* seal off memory.

"Mandatory optionals, it has a nice ring to it!"

DLL hell, version 2

Posted Feb 17, 2025 22:46 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (12 responses)

> musl exists, and it can be fully static. With nscd we don't need NSS modules, and the only remaining bogosity is in PAM.

Oh christ, static linking. and musl. not sure where to start. I mean, sure whatever, I think we have very different pov on operating systems. Good luck!

> "Mandatory optionals, it has a nice ring to it!"

Hmm? what's mandatory? not grokking what you are saying? even if you load dlopen() weak deps during process initialization early on they don't become mandatory?

DLL hell, version 2

Posted Feb 17, 2025 23:54 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> Oh christ, static linking. and musl. not sure where to start. I mean, sure whatever, I think we have very different pov on operating systems. Good luck!

Yep. I prefer systems to be as static as possible, with cryptographic verification from the image down to individual pages in RAM. Adding _more_ dlopens is not helping.

> Hmm? what's mandatory? not grokking what you are saying?

I believe so?

> In fact the way we do it makes it a lot *easier* to handle things like this, since you could parse the mentioned ELF note within your program early on and load all modules listed therein, gracefully handling those which cannot be fulfilled because they are not installed, and *then* seal off memory.

This will result in xz loaded unconditionally everywhere, since there are no primitives that can generically say "/bin/true needs libsystemd but _only_ for the readiness protocol".

DLL hell, version 2

Posted Feb 19, 2025 15:15 UTC (Wed) by surajm (subscriber, #135863) [Link] (4 responses)

dlopen if done entirely during the initial boot is basically equivalent to normal dynamic linking, but just provides some extra control to the user to handle faults like the dependency not being available. That seems like a good thing.

If you really want to go the static route you might end up with more binaries where you try and split out what could have been an optional shared lib dependency into its own process which you talk to over ipc. In some cases that might be okay but certainly not all.

DLL hell, version 2

Posted Feb 19, 2025 23:17 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Systemd switched to the dlopen() approach because they realized that they had screwed up libsystemd. It had become too large, because of the inclusion of multiple compressors for journald functionality. So they split the compressors out into optional dependencies to keep the size of libsystemd down. Undoubtedly, they'll keep adding more optional dependencies in the future because it's now "free".

Eagerly loading everything will undo that.

The _correct_ way to handle the screwup with compressors was to split libsystemd into libmeaculpa that only has the readiness protocol and other simple compact functionality, and into libjournald that has all the heavy journald-related stuff. Then projects like ssh can slowly migrate from the full libsystemd to libmeaculpa.

DLL hell, version 2

Posted Feb 19, 2025 23:51 UTC (Wed) by bluca (subscriber, #118303) [Link] (2 responses)

> dlopen if done entirely during the initial boot is basically equivalent to normal dynamic linking, but just provides some extra control to the user to handle faults like the dependency not being available. That seems like a good thing.

Yeah, exactly. The purpose of it is to allow using the exact same build in multiple contexts, from a fully-featured large system to a minimal one, without having to do bespoke recompilations, which are a pain in the backside. This way the exact same package can be used for all purposes, and whoever assembles it decides how much functionality to take in or leave out, simply by changing which packages get pulled in. It works really well, and we can now build tiny initrds and large desktop images from the same set of packages.
Hopefully more projects that form part of the core OS will switch to this model, as it's a win-win with no downsides whatsoever. I've already seen a bunch of people interested in the ELF notes spec to annotate dlopen deps, so this is very promising.

DLL hell, version 2

Posted Feb 20, 2025 9:31 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

There is one downside to the behavior changing depending on whether a library happens to be found or not: debugging problems that only occur in the presence (or absence) of the library without indication of the runtime state is a royal pain. "Why is machine X compressing and Y not? They both have the required libraries…oh wait they were installed this boot on Y and missing when it started up." kind of stuff.

DLL hell, version 2

Posted Feb 20, 2025 9:58 UTC (Thu) by bluca (subscriber, #118303) [Link]

Sure but that is all programmatically discoverable. What is needed is encoded both at the ELF level, with the dlopen note, and at the package level, with the recommends/suggests or equivalent for RPM. So the bug reporting tool(s) should query for such information when assembling the report.
You can't have optional dependencies if they are not optional...

DLL hell, version 2

Posted Feb 19, 2025 10:11 UTC (Wed) by tlamp (subscriber, #108540) [Link] (5 responses)

> static linking. and musl. not sure where to start.

Do you perchance have some of those thoughts spelled out somewhere already? I'd honestly be interested to read them, especially w.r.t. static linking in general.

One thing I'm also curious about is you being in favor (FWICT) of image-based distros, which IMO are basically a sort of higher-level static linking across multiple binaries, while not liking statically linked binaries (distributed at the package level) so much.

With some eye-squinting those do not seem _that_ different in practice to me – e.g., things like A/B updates might come slightly more naturally with image-based distros, but are certainly possible without them.
That said, while I do have some experience with packaging both dynamically and statically linked executables, and have not (yet) been bitten by either kind of linking – albeit I have felt the disadvantages sometimes – I certainly did not spend enough time on image-based distros to have a strong opinion here, and I might easily overlook something – potentially even trivial stuff.

DLL hell, version 2

Posted Feb 19, 2025 10:59 UTC (Wed) by mezcalero (subscriber, #45103) [Link] (4 responses)

Watch the video this article is about, or read the article, maybe. In particular, in the Rust part I discuss why dynamic linking is so essential to us.

DLL hell, version 2

Posted Feb 19, 2025 19:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

What is the problem with dynamic linking? You just need to replicate the C-based ABI at the .so boundaries.

DLL hell, version 2

Posted Feb 19, 2025 20:09 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (2 responses)

Presumably one would like to leverage Rust APIs in the binaries as well. There's an internal shared library to help with code sharing; this isn't necessarily just about `libsystemd.so`.

DLL hell, version 2

Posted Feb 19, 2025 20:27 UTC (Wed) by raven667 (subscriber, #5198) [Link] (1 responses)

Right, it seems the Rust way to deal with a suite of CLI utilities that share code, aside from including all the shared functions in each binary at compile time, is to create _one_ CLI binary and use ARGV[0] to figure out what personality it should have, busybox and uutils style, which is a major reorganization; that, or providing libsystemd.so or some other shared object with a C API. Rust may support shared Rust objects, but it's not idiomatic and moves against the general flow of how code sharing works in the Rust ecosystem, so maybe it doesn't work as well in practice. I dunno, I don't have Rust experience, so am guessing a bit.

Rust and dynamic linking

Posted Feb 20, 2025 10:23 UTC (Thu) by farnz (subscriber, #17727) [Link]

There are three ways to deal with dynamic linking in Rust:
  1. As an implementation detail of a coherent whole. In this case, you don't care that libmyapp.so has an unstable ABI, because it's built with the same compiler as myapp-bin1 and myapp-bin2, so an unstable ABI is A-OK with you.
  2. Put a psABI shim around the Rust functions, and have a hand-maintained ABI, the way glibc does things in C land. This restricts the ABI you can use to the psABI (no generics, for example), but means that you have complete control and can make your ABI stable.
  3. Help out with the experimental crABI work, which aims to produce an external ABI that's more capable than current psABIs, then put a crABI shim around the Rust functions, and maintain a stable ABI that way.

Which one is right for you is a decision that depends on the project goals.

DLL hell, version 2

Posted Feb 18, 2025 19:49 UTC (Tue) by ferringb (subscriber, #20752) [Link]

Adding another one in there: PAM is basically built around dlopen().

DLL hell, version 2

Posted Feb 17, 2025 18:50 UTC (Mon) by brenns10 (subscriber, #112114) [Link]

For what it's worth, using dlopen() does not (necessarily) mean that dependency information is missing. Systemd goes to the trouble of declaring it in an ELF note which can easily be used to determine the set of dependencies that may be loaded at runtime.

https://systemd.io/ELF_DLOPEN_METADATA/
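
For example, the note can be inspected with ordinary binutils; the library path below is only an illustration and varies by distribution:

readelf -p .note.dlopen /usr/lib/x86_64-linux-gnu/libsystemd.so.0
# dumps the strings in the note, including the JSON entries that describe which
# sonames may be dlopen()ed, what feature they provide, and how optional they are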

systemd - no rhyme or reason

Posted Feb 17, 2025 18:31 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (65 responses)

My problem with systemd is the lack of rhyme or reason for its features. It seems like a huge agglomeration of random features and modules, without a governing vision.

This permeates even the core of systemd. For example, systemd supports restarts and watchdogs for regular services, but not for mount units. So you can't do the most trivial thing: wait until the NFS server becomes accessible during system startup. Why? No idea; there's really no technical justification for it.

Some other components are also at least "strange", like the DNS resolver (resolved). It can, apparently, change the host name. This is a behavior borrowed from macOS, but it's totally unexpected in Linux.

systemd - no rhyme or reason

Posted Feb 17, 2025 19:51 UTC (Mon) by ringerc (subscriber, #3071) [Link] (1 responses)

When I find myself confronting these seeming contradictions and weird, asymmetrical or incomplete features... it's often just a people thing.

People scratch itches and don't want to implement a 70%-similar feature for the other parts they don't use or care about. Forcing them to often means nobody gets any of it and progress halts; allowing them to do the 70% often means someone else, who wouldn't have been able to do all 100%, can tackle the remaining 30%. I've been on all the different sides of this many times.

Sometimes others who are deep in the weeds of a focus area don't look for or see the parallel in the first place.

Sometimes what looks from the outside like a closely related feature turns out to be almost entirely dissimilar in use and implementation. So it wasn't done, for good but non-obvious reasons.

The Kubernetes project is packed full of weird gaps, quirks, and omissions like this, largely because of its organic, user-driven growth: kubectl port-forward's lack of a JSON output for the port mapping, the behavior of statefulset scaling when the pods are failing, etc.

systemd - no rhyme or reason

Posted Feb 17, 2025 22:20 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

That's the thing. A better design is to have a basic "thing" with a lifecycle, and then enforce the lifecycle on the concrete implementations of the "thing".

systemd - no rhyme or reason

Posted Feb 17, 2025 21:08 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (61 responses)

What might appear random to you is maybe not that random to others.

I am not sure I see the connection between automatic service restarts and watchdogs on one hand and NFS on the other.

We do have quite nice infrastructure for waiting for network online state though, see systemd-networkd-wait-online.service. It's quite configurable, since what you think is "trivial" is actually mind-bogglingly complex. Deciding when a network is "online" is different for everyone. It could mean an IP address is configured, and/or link beat is seen, or DNS works, or a server can be reached, and it then has multiple axes because of multiple NICs, IPv4 and IPv6, as well as DHCP vs. IPv4LL and so on. What's right heavily depends on your use case.

So, I think we actually cover what you are asking for really nicely; I am not sure what you are missing. Is it possible that you are not actually using systemd's tools for networking, though? Maybe you are barking up the wrong tree, then? That said, we do provide hook points so that other stacks can plug their own wait logic in too. Maybe yours didn't? We are hardly to blame for that, though.

Or are you suggesting we should retry establishing NFS mounts if that doesn't work for the first time because networking is borked? Sorry, but that's really something the NFS folks need to implement and deal with. Only they understand what is actually going wrong and whether it's worth retrying. Frankly, any network protocol should retry a couple of times before giving up, and DNS and TCP both do. Maybe your NFS stack lacks that feature, but why do you think that'd be systemd's problem?

And now, if you ask me why I made automatic service restart + watchdog logic our problem and refuse to make NFS ours too: well, I think service management is inherently our thing, but some specific network protocol that is only used by a relatively restricted subset of people really is not.

(Also, please understand that NFS is really not at the center of attention for any of the core systemd developers; it's not really where we focus our primary love… It appears you simply have a different focus than us, but that doesn't make our stuff an "agglomeration of random features", it just means we have a different focus.)

Lennart
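
As one concrete illustration of that configurability, a hedged sketch of a drop-in narrowing what "online" means (interface name, state, and binary path are placeholders that vary by distribution):

mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
cat > /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --interface=eth0 --operational-state=routable
EOF
systemctl daemon-reload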

systemd - no rhyme or reason

Posted Feb 17, 2025 21:46 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (60 responses)

> We do have quite nice infrastructure for waiting for network online state though

It's not a network-online-state problem. A very simple example: a compute server and its NFS storage server start at the same time after a power cycle. The compute server boots quickly, but the storage server takes a bit longer.

The compute server gets to the network-online target and tries to mount the NFS filesystem, and immediately receives an error. The error is definite, because the NFS server is not yet running and the server returns ECONNREFUSED. Mount fails. There are no retries, and there's no way to express this with mount units.

Ironically enough, it's easy to do that with _regular_ units that just wrap the `mount` utility.

> Maybe your NFS stack lacks that feature, but why do you think that'd be systemd's problem?

It's bog-standard Linux. The same problem can happen with SMB and other protocols.

> Or are you suggesting we should retry establishing NFS mounts if that doesn't work for the first time because networking is borked?

Yep. For the same reason regular units have retries. Why is mounting special?

systemd - no rhyme or reason

Posted Feb 17, 2025 22:50 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (19 responses)

This really sounds like something to address in NFS or SMB, not in systemd. We really have no understanding of the reasons why things fail, i.e. local misconfiguration or network connectivity. Hence, we cannot make a reasonable decision on whether to attempt things again; no one would tell us. Also, why even bubble this issue up to systemd? This *really* sounds like something NFS and SMB should just cover natively; only they know the reason why something failed, they *must* understand that networking is unreliable and give you the right knob to retry under the right conditions, and specifically also decide *what* to restart...

Have you requested support for this from the NFS folks?

systemd - no rhyme or reason

Posted Feb 17, 2025 23:28 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> This really sounds like something to address in NFS or SMB, not in systemd.

Let's remove the retry logic from other units, then. And ask their developers to make sure they retry on failure themselves.

> This *really* sounds like something NFS and SMB should just cover natively; only they know the reason why something failed

In this case the failure is fully expected. It's positively indicated by the NFS utilities: the server is not available.

You _can_ add retry logic to NFS, SMB, Ceph, and other filesystem mounting utilities. But why?!? This is literally what service supervision should do.

systemd - no rhyme or reason

Posted Feb 18, 2025 1:11 UTC (Tue) by tim-day-387 (subscriber, #171751) [Link] (1 responses)

Working on Lustre (and encountering this same issue, i.e. mounts sometimes failing because of ephemeral problems), I don't think systemd ought to blindly retry. Keeping the logic in the mount utilities allows for a lot more intelligence, e.g. differentiating between retryable and non-retryable errors, HA awareness, enhanced logging, etc. NFS and Lustre, at least, have retry logic in their mount utilities.

systemd - no rhyme or reason

Posted Feb 18, 2025 5:04 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

This sort of policy should be in the unit. In some cases, not being able to connect to the NFS server is normal, and in some cases it should permanently fail.

But that's not an option in case of systemd.

systemd - no rhyme or reason

Posted Feb 18, 2025 6:11 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (14 responses)

A mount is not a service.

Also I think nfs has exactly what I suggested already with the retry= option? So why can't you just use that?

(Let's also not forget: there are about 2 relevant network file systems around, and if you count the more exotic ones maybe 7, and they tend to be well maintained and are implemented in a layer *below* systemd (i.e. the kernel). That is systematically different from service management, where we have a bazillion services, most of them terrible, which hence really need to be supervised, and which are conceptually implemented in a stack above systemd. Supervising stuff that is conceptually below you is kinda conceptual nonsense.)

Frankly, just let it rest, we are not going to make nfs our problem. We are not going to go full static binaries either. I am sorry that your priorities are not ours but I nonetheless hope you can accept that.

systemd - no rhyme or reason

Posted Feb 18, 2025 6:43 UTC (Tue) by mb (subscriber, #50428) [Link] (4 responses)

> implemented in a layer *below* systemd (i.e. the kernel).

That is just an implementation design detail rather than an architectural or hierarchical necessity.
Network services are typically implemented in user space. (So could NFS be.) And if these "normal" network services panic, you can retry with systemd units.

systemd - no rhyme or reason

Posted Feb 18, 2025 14:00 UTC (Tue) by bertschingert (subscriber, #160729) [Link] (3 responses)

> > implemented in a layer *below* systemd (i.e. the kernel).

> That is just an implementation design detail rather than an architectural or hierarchical necessity.

I believe he's referring to the client rather than the server, in which case it being in the kernel is more than an implementation detail. An NFS client could be in userspace, too, but I imagine it would require an application to be specifically coded to use a library that implements the protocol, rather than going through the mounted filesystem.

systemd - no rhyme or reason

Posted Feb 18, 2025 17:27 UTC (Tue) by mb (subscriber, #50428) [Link] (2 responses)

No. Things like FUSE exist.

There's no fundamental reason why the NFS client's mounting part must be in the kernel. The NFS client can be thought of as a *local* service that sits between the network and whatever gives it access to the filesystem mounting mechanism.
And if it would be implemented in that way, it could use systemd's restart mechanisms directly.

For me, NFS has never been properly integrated into the system. At least on Debian, and also in pre-systemd days.
A problem that hits me daily (if I don't manually avoid it) is that a machine with an NFS-mounted volume hangs during shutdown if I forget to manually unmount all NFS shares first.
That happens because systemd tears down the network before it unmounts the NFS shares.
That leads to possible data loss and long shutdown phases, because it has to time out.

If I hit the system shutdown button, it should first unmount NFS and then tear down networking.

systemd - no rhyme or reason

Posted Feb 18, 2025 19:48 UTC (Tue) by bertschingert (subscriber, #160729) [Link] (1 responses)

How would FUSE help with the problems that you mentioned around the network stopping before unmounting the FS? It seems like that issue is orthogonal to whether the FS client implementation is in kernel or userspace?

I considered a FUSE NFS client but dismissed the idea because I didn't see what benefits it would provide over the kernel client. If I'm missing something though, I'd love to learn what the use case is.

systemd - no rhyme or reason

Posted Feb 18, 2025 20:32 UTC (Tue) by mb (subscriber, #50428) [Link]

>How would FUSE help with the problems that you mentioned around the network stopping before unmounting the FS?

I didn't say that.

>It seems like that issue is orthogonal to whether the FS client implementation is in kernel or userspace?

Exactly.
This was just another real world example where "the system" as-is falls apart. It's not clear to me which part of "the system" is at fault, but the behavior clearly is bad. But as systemd wants to be "the system manager", I tend to assume that systemd is at fault.

>If I'm missing something though

The root of this discussion was that it has been said that systemd was not supposed to manage NFS mount retries, because they are "below" systemd in architecture.
https://lwn.net/Articles/1010520/

That's only an implementation detail, though. If there's a way to easily implement NFS above systemd, one can hardly argue that it is architecturally below by its nature.

It doesn't matter where a service is implemented. If the service fails to do what it is supposed to do, the service manager should retry.

systemd - no rhyme or reason

Posted Feb 18, 2025 6:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> A mount is not a service.

But it is. It literally is a service, just with an additional filesystem interface. There is no logical reason why socket and automount activation should be treated any differently.

> Also I think nfs has exactly what I suggested already with the retry= option? So why can't you just use that?

NFS retries are too basic, CIFS doesn't have them (https://linux.die.net/man/8/mount.cifs), and FUSE-based filesystems are similar.

> Frankly, just let it rest, we are not going to make nfs our problem. We are not going to go full static binaries either. I am sorry that your priorities are not ours but I nonetheless hope you can accept that.

I mean, yes. I'm accepting that systemd is a poorly-run project that produces inconsistent software. This inconsistency is not limited to mount units; nspawn units are similarly special. And they are _literal_ services.

After seeing the development flow and the lack of coherency in it, I'll be steering away from using it more deeply than for the basic process supervision.

There are far too many interactions in systemd to make it reliable as a _system_.

systemd - no rhyme or reason

Posted Feb 18, 2025 11:31 UTC (Tue) by beagnach (guest, #32987) [Link] (2 responses)

> I'm accepting that systemd is a poorly-run project

OK, now you're stooping to ad-hominem attacks thinly disguised as technical critiques. It's starting to look like some personal grievance underlies the poorly-thought-out arguments. You're just ruining your own credibility here.

> I'll be steering away from using it more deeply than for the basic process supervision.

Please do.

systemd - no rhyme or reason

Posted Feb 18, 2025 18:49 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> OK, now you're stooping to ad-hominem attacks thinly disguised as technical critiques.

I'm sorry, what? I have no problems whatsoever with systemd's authors, and I like the idea of systemd itself. I'm saying that systemd as a project is poorly run and lacks focus and consistency.

systemd - no rhyme or reason

Posted Feb 18, 2025 19:22 UTC (Tue) by corbet (editor, #1) [Link]

Such a statement can certainly be seen as an attack on the people who are running the project. It also doesn't really help the discussion, so maybe we don't need to do that?

Thanks.

systemd - no rhyme or reason

Posted Feb 18, 2025 7:49 UTC (Tue) by PeeWee (guest, #175777) [Link] (4 responses)

Just spitballing here, but it does sound like this could be useful in some cases. I think what's missing is a way to define which exit codes `Restart=on-failure` would act on (default as it is now: any non-zero exit code; if set: only listed exit codes, and maybe a negative match, i.e. ~255 = any exit code except 255) and to make that available in mount units as well. Some differentiation between when restarts make sense and when they don't may be useful in general. In a way, mount units do have an implicit `ExecStart=mount ...`, don't they? Or is there something inherently different in them that this would bend them totally out of shape? Just curious.

systemd - no rhyme or reason

Posted Feb 18, 2025 11:10 UTC (Tue) by bluca (subscriber, #118303) [Link]

> Or is there something inherently different in them that this would bend them totally out of shape? Just curious.

Yes, mount units are different, because they are not just statically defined, but come from the kernel too. And the /proc/mountinfo interface was so gnarly and full of races that handling the lifecycle of these is _really_ hard to do right, with billions of corner cases. Granted, we now have new syscalls that should make it better, but we haven't got around to using them yet.
But in general this area is so fraught with risk that none of us are going to spend any time to add workarounds for corner cases that should really be handled by the kernel implementation or its userspace tools at best, as none of us has any use case involving networked file systems, and we have other things to do.

But as always it's much easier to DEMAND that open source projects implement the one thing you really care about and nobody else does, and loudly whinge that they are "poorly run" if they don't do that pronto and for free and with a ribbon on top, I guess (not referring to you).

systemd - no rhyme or reason

Posted Feb 18, 2025 11:45 UTC (Tue) by taladar (subscriber, #68407) [Link]

Service units have SuccessExitStatus, RestartPreventExitStatus, and RestartForceExitStatus, which control that.

systemd - no rhyme or reason

Posted Feb 18, 2025 14:17 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (1 responses)

BTW, RestartForceExitStatus= + RestartPreventExitStatus= do already exist for services.

(They don't for mount units, but as mentioned I am pretty sure that is not our job to add; fs developers should add retries, if desired, to their mount tools/file system drivers.)
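
For anyone hunting for them in the docs, a hedged sketch of how those service-level knobs are typically wired up (the unit name and exit codes are invented for illustration):

mkdir -p /etc/systemd/system/myservice.service.d
cat > /etc/systemd/system/myservice.service.d/restart.conf <<'EOF'
[Service]
Restart=on-failure
RestartPreventExitStatus=2    # treat exit code 2 as a permanent failure, do not restart
RestartForceExitStatus=75     # force a restart on exit code 75 regardless of the Restart= setting
EOF
systemctl daemon-reload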

systemd - no rhyme or reason

Posted Feb 18, 2025 14:47 UTC (Tue) by PeeWee (guest, #175777) [Link]

Also @taladar

So I stand corrected: *I* have missed them in the docs. Thanks, both of you, for clearing things up.

systemd - no rhyme or reason

Posted Feb 18, 2025 0:07 UTC (Tue) by Wol (subscriber, #4433) [Link]

> This really sounds like something to address in NFS or SMB, not in systemd. We really have no understanding of the reasons why things fail, i.e. local misconfiguration or network connectivity. Hence, we cannot make a reasonable decision on whether to attempt things again; no one would tell us.

I'm not going to try and place blame - I don't have a clue how Pr1menet did it, but I was used to putting "ADDDISK DISKNAME ON COMPUTERNAME" into my Pr1me's bootscript back in the 80s. The state of COMPUTERNAME when I booted my system was irrelevant - when that computer came up, that resource (disk) appeared on my system.

It's always irked me that *nix'en can't declare resources that just appear once they are available. Although systemd does have automount units that mount when you try to access the resource - so I wonder why the GP doesn't try that? Or doesn't that work with NFS? (It does with CIFS; I use it.)

Cheers,
Wol

systemd - no rhyme or reason

Posted Feb 17, 2025 23:13 UTC (Mon) by MrWim (subscriber, #47432) [Link] (3 responses)

You could make your mount unit depend on another unit that waits until your nfs server is responding.

systemd - no rhyme or reason

Posted Feb 17, 2025 23:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

You can't really detect this except by trying to mount a filesystem.

systemd - no rhyme or reason

Posted Feb 18, 2025 8:40 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (1 responses)

What I usually do is make the mount an automount & then have a service unit do something like 'while ! test -f /expected/file/on/nfs ; do sleep 1 ; done'

Yes, this is ugly.

systemd - no rhyme or reason

Posted Feb 18, 2025 14:50 UTC (Tue) by hmh (subscriber, #3838) [Link]

This "hack" pretty much illustrates what was asked for, why it is useful, and how and why it would be used.

I like it.

And to me, what it describes does feel like a higher-level functionality that makes sense to have on mount units, yes.

systemd - no rhyme or reason

Posted Feb 17, 2025 23:45 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (5 responses)

When you're dealing with multiple hosts, you're not doing single-node configuration anymore. You're doing distributed system orchestration. That implies using a tool like k8s, not (just) systemd. While I'm sure there are ways of convincing systemd or other single-node components to make it work, there will always be friction, because you've put yourself in the business of configuring one node at a time, despite wanting to establish a shared state (NFS node running, compute node has NFS mounted) that transcends any one of them.

(Yes, I know, k8s is overkill for this particular use case, so much so that its documentation will outright tell you to throw away NFS and replace it with their bespoke solution. Unfortunately, pets are harder than cattle. Sometimes, you just have to choose between using the massively overengineered solution anyway, or putting up with the friction of not using it.)

systemd - no rhyme or reason

Posted Feb 17, 2025 23:50 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> so much so that its documentation will outright tell you to throw away NFS and replace it with their bespoke solution

Correction: This is wrong. They will tell you that NFS is one of many, many things that can be slotted into their bespoke solution. So yes, this is fully supported under k8s. That doesn't change the fact that k8s is probably overkill for a two-node setup.

See for example https://github.com/kubernetes/examples/tree/master/stagin...

systemd - no rhyme or reason

Posted Feb 18, 2025 0:38 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

It's a one-node setup. The NFS server is (for all intents and purposes) an external stand-alone resource that is not managed via the orchestration system.

To add more spice, the compute server uses iSCSI to mount the volume that has the configuration for the containers it runs. So K8s needs first to have the iSCSI mounted, which also has the same "retry" problem, btw.

systemd - no rhyme or reason

Posted Feb 19, 2025 0:32 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (2 responses)

> The NFS server is (for all intents and purposes) an external stand-alone resource that is not managed via the orchestration system.

Strictly speaking, I believe k8s can accommodate that as a one-node system. Which is even more absurdly overkill, but at least it should work.

I would still call this a form of orchestration, because you care about the state of more than one node even if you only control one of them. But I suppose that's a matter of semantics.

> To add more spice, the compute server uses iSCSI to mount the volume that has the configuration for the containers it runs. So K8s needs first to have the iSCSI mounted, which also has the same "retry" problem, btw.

k8s has an API because you are meant to write code that calls into it (or use code that others have written). Said code can, itself, be run by k8s - there is nothing preventing you from having a Pod that spawns new Pods or updates existing ones. This pattern is normally used for release automation (and in that use case, it preferably updates Deployments or StatefulSets rather than directly manipulating individual Pods), but it can also solve the "I need this volume mounted before I can figure out what I want to run" use case, and it can even be self-hosting once you start it for the first time (assuming it is smart enough, and of course you will want to have a reasonable story for doing a black start if needed).

systemd - no rhyme or reason

Posted Feb 19, 2025 4:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I wonder if it's possible to make it even more complicated? Perhaps by adding a workflow engine?

systemd - no rhyme or reason

Posted Feb 19, 2025 9:06 UTC (Wed) by stijn (subscriber, #570) [Link]

I read the V.R. piece linked elsewhere. It says, regarding systemd:

> Unit startup is executed as a non-idempotent parallel dataflow with weak ordering guarantees on the job level, mostly independent of the active state of dependent units.

Which superficially sounds exactly like Nextflow's 'functional dataflow model', where 'Instead of running in a fixed sequence, a process starts executing when all its input requirements are fulfilled' (lazily copied from Wikipedia - I'm well acquainted with Nextflow, not so with systemd). In both cases there is a 'singleton object called Manager, responsible for launching jobs' (copied from the V.R. piece).

So .... does systemd resemble a workflow engine that dispatches jobs via a dataflow 'requirements fulfilled' model?

systemd - no rhyme or reason

Posted Feb 18, 2025 9:07 UTC (Tue) by neilbrown (subscriber, #359) [Link] (26 responses)

If the NFS server gives you ECONNREFUSED during boot then its startup is wrong. It should start the NFS server before the network comes up.

systemd - no rhyme or reason

Posted Feb 18, 2025 9:41 UTC (Tue) by Wol (subscriber, #4433) [Link]

Is there a hard dependency saying the network can't start until after the NFS server is up? Or is there no dependency and hence no guaranteed order?

Being naive here, but I would have expected any dependency to say "NFS after network" (although I'm probably thinking of the client rather than the server...)

Cheers,
Wol

systemd - no rhyme or reason

Posted Feb 18, 2025 18:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (24 responses)

The NFS server in question is a Synology NAS, which needs to boot up, retrieve the disk encryption keys from KMIP, unlock the drives, run fsck, boot up a container that runs Kerberos, and only then can it start NFS.

systemd - no rhyme or reason

Posted Feb 18, 2025 20:48 UTC (Tue) by ferringb (subscriber, #20752) [Link] (2 responses)

If you're not telling systemd "the boot is ungodly long, wait till eternity, and it also signals that the mount isn't possible because the other end is down", I'm not sure what you can do there. Just pulling up man systemd.mount, offhand some of the configurables here sound like what you're after:

The NFS mount option bg for NFS background mounts as documented in nfs(5) is detected by systemd-fstab-generator and the options are transformed so that systemd fulfills the job-control implications of that option. Specifically systemd-fstab-generator acts as though "x-systemd.mount-timeout=infinity,retry=10000" was prepended to the option list, and "fg,nofail" was appended. Depending on specific requirements, it may be appropriate to provide some of these options explicitly, or to make use of the "x-systemd.automount" option described below instead of using "bg".

I mean, hanging for a stupidly long time on a mount/unit trying to bring it up seems... not great to me... but if that's the setup you've got, and the behavior you want, that's what you've got. More specifically, that's what you've got with the non-systemd components available, and what systemd has abstracted around it. If mount cuts out all retries due to an ECONNREFUSED, that's a kernel/mount complaint. You could try asking the systemd folks to add an option like "MaxEconnRefused=int", but that sounds very much like a hack around something outside their scope that should be fixed.

I strongly suspect you could sidestep this anyway, by adding to the mount an explicit Requires= targeting a oneshot unit that confirms the NFS server is up. It's not like folks don't have to occasionally tweak the depgraph for weird setups, after all.

Also, it's not like NFS mounts and init haven't been a colossal pain in the ass for decades. I'm not suffering from your issue, but at least with this I can see a way to hack around it if I can't configure mount options to suppress it.

systemd - no rhyme or reason

Posted Feb 18, 2025 21:10 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

This is for fstab (which I don't want). There are also no retry options for SMB, which I also use.

systemd - no rhyme or reason

Posted Feb 18, 2025 21:26 UTC (Tue) by ferringb (subscriber, #20752) [Link]

...and the require hack I mentioned? I can't see any reason that can't work.

I reiterate: network mounts + init have never been anything but "stab my eyes out" in my experience. That said, the Requires= injection looks a helluva lot simpler than the old route of writing custom shell for mounting and then trying to sequence init levels (thus delaying the entire boot).

systemd - no rhyme or reason

Posted Feb 19, 2025 5:46 UTC (Wed) by neilbrown (subscriber, #359) [Link] (20 responses)

> The NFS server in question is Synology NAS, that needs to boot up, retrieve the disk encryption keys from KMIP, unlock drives, run fsck, boot up a container that runs Kerberos, and only then it can start the NFS.

So the nfs server could be run in a container that provides a network namespace with an IP address which isn't configured until nfsd has started.
Or if that is too hard you could use iptables to add a rule to drop any packets to port 2049 until the NFS server is up.
mount.nfs will certainly keep trying while it gets no response - not even ICMP - from the server.

However a quick look at the mount.nfs code suggests that ECONNREFUSED is treated by nfs_is_permanent_error() as not being a permanent error.
So a foreground mount will by default time out after 2 minutes (maybe your Synology takes longer than that to boot). You can change this with "retry=10000".

As you say, cifs doesn't have a retry option. "automount" might be the correct tool. Or you could report a bug to cifs-utils.

Can you use systemd to overcome this weakness in cifs? Probably. Local mounts often have a dependency on a device. cifs seems to have a dependency on a remote TCP port being responsive. Is there a tool that can test if port 445 is open on a given IP address? nmap can do it but doesn't return an exit status. "nc -N IP 445 < /dev/null" returns 0 when the port accepts a connection and 1 otherwise. You could create a service which execs this command and restarts on failure. Then make the mount depend on the service.
Remember: systemd provides a programming language for describing system management - a declarative language. You can provide your own primitives and program systemd to do whatever you want. You shouldn't expect it to already be able to do everything you could possibly want without needing to do any programming yourself.
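
To sketch that idea concretely (unit names, the address, and the timeout here are invented; the nc check is the one above, and the retry loop lives in the script rather than in Restart=):

wait-smb.service:

[Unit]
Description=Wait for the SMB server to accept connections on port 445
Before=mnt-storage.mount

[Service]
Type=oneshot
TimeoutStartSec=30min
ExecStart=/bin/sh -c 'until nc -N 192.0.2.10 445 < /dev/null; do sleep 10; done'

/etc/systemd/system/mnt-storage.mount.d/wait.conf:

[Unit]
Requires=wait-smb.service
After=wait-smb.service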

systemd - no rhyme or reason

Posted Feb 19, 2025 23:11 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (19 responses)

I have worked around that problem via regular systemd units. It's not really much better than custom init scripts from YeOldeDays, but it does its job.

However, having systemd stay consistent and offer retries would have helped to remove that bogosity.

And another meta-observation: systemd docs are a freaking mess. There are tons of options, and some are very powerful, but they are completely undiscoverable. E.g. the ability to transfer secrets between daemon restarts.

I believe this stems from the very same problem: the lack of an overall vision. Systemd is being developed like a giant ball of Lego components that slowly accretes additional pieces as it rolls through the landscape.

If I were designing something like systemd now, I would define a central "Service" entity with its own lifecycle. Then I would implement this "Service" for runnable processes, mount units, devices, etc. Some of them might not have all the state transitions, but that can be documented explicitly.

systemd - no rhyme or reason

Posted Feb 19, 2025 23:42 UTC (Wed) by bluca (subscriber, #118303) [Link] (3 responses)

> And another meta-observation: systemd docs are a freaking mess.

And what have you done to make it better, given it's open source? Oh that's right, the square root of fuck all. Guess whining about stuff you get for free is easier than doing something useful.

systemd - no rhyme or reason

Posted Feb 19, 2025 23:45 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

For me? I'm working on making it worse, by creating a tool that merges together systemctl and journalctl. So I can do "scl start -f blah.service" and get its logs tailed, instead of remembering cryptic `journalctl` invocations.

As for systemd, I don't think the docs can be fixed. They are a symptom, not the cause.

Second request

Posted Feb 19, 2025 23:52 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

Enough. Honestly, if you don't like this project, maybe you should be running something else, but slinging insults here will not help anybody. Please stop.

How about if everybody in this subthread calms down, please?

Second request

Posted Feb 22, 2025 17:54 UTC (Sat) by dmv (subscriber, #168800) [Link]

Seems like you picked the wrong day to stop sniffing glue. :)

systemd - no rhyme or reason

Posted Feb 20, 2025 7:52 UTC (Thu) by donald.buczek (subscriber, #112892) [Link] (14 responses)

> And another meta-observation: systemd docs are a freaking mess.

There are other ways to look at it. In my opinion, the documentation of systemd is exemplary and far above the usual level. The man pages are a complete and correct reference: they cover all levels of abstraction, from basic working models to the smallest formatting details, completely, concisely, and precisely. They are well structured and refer to each other in a meaningful way.

systemd - no rhyme or reason

Posted Feb 20, 2025 9:32 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (11 responses)

I feel they are similar to the CMake docs: detailed in the what. Not all that clear on the how or why. I *can* piece together how to set up a "reload service X on a timer or when $path changes", but an example would go much farther.
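
For what it's worth, a minimal sketch of that particular setup (unit names, the watched path, and the target service are placeholders, not anything systemd ships) might be:

reload-x.service:

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl reload X.service

reload-x.path:

[Path]
PathModified=/etc/X/config
Unit=reload-x.service

[Install]
WantedBy=multi-user.target

reload-x.timer:

[Timer]
OnCalendar=daily
Unit=reload-x.service

[Install]
WantedBy=timers.target

Both triggers would then be enabled with something like "systemctl enable --now reload-x.path reload-x.timer".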

systemd - no rhyme or reason

Posted Feb 20, 2025 9:56 UTC (Thu) by bluca (subscriber, #118303) [Link] (10 responses)

That sort of documentation is provided in systemd.io rather than in manpages, and for even higher level stuff there are excellent admin/user guides provided by RedHat

systemd - no rhyme or reason

Posted Feb 20, 2025 23:14 UTC (Thu) by Klaasjan (subscriber, #4951) [Link] (4 responses)

Aha!
Now, what if I run systemd on a Debian system?
(Honest question, since indeed I do)

systemd - no rhyme or reason

Posted Feb 20, 2025 23:18 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

systemd works the same everywhere - it's one of its main selling points vs what came before. 99.99% of what you find on RH's documentation will apply to Debian or any other distribution running systemd too.

systemd - no rhyme or reason

Posted Feb 20, 2025 23:36 UTC (Thu) by Klaasjan (subscriber, #4951) [Link] (2 responses)

Understood, thanks.
It would be nice if RedHat’s documentation was available under a free enough license so that it could be hosted on Debian.org as well.
Cheers

systemd - no rhyme or reason

Posted Feb 23, 2025 4:09 UTC (Sun) by pabs (subscriber, #43278) [Link] (1 responses)

I believe these are the public freely licensed sources for the RedHat documentation. They do contain a lot of trademarks and other things that would not be correct on Debian though.

https://gitlab.com/redhat/centos-stream/docs/enterprise-d...

systemd - no rhyme or reason

Posted Feb 24, 2025 6:58 UTC (Mon) by Klaasjan (subscriber, #4951) [Link]

Thanks, that is helpful.

systemd - no rhyme or reason

Posted Feb 20, 2025 23:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Sorry. systemd.io is not any less messy.

To give an example of a good doc: https://upload.wikimedia.org/wikipedia/commons/3/37/Netfi... from https://wiki.linuxfoundation.org/networking/kernel_flow A very clear but detailed overview.

I actually looked, and I can't find anything similar for systemd. I've been following its development for years, but I can't outright tell how symlinks, overrides, and other machinery work together.

systemd - no rhyme or reason

Posted Feb 20, 2025 23:54 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

"This article is based on the 2.6.20 kernel" and that's THE BEST example you could find of what in your head constitutes good documentation? Nearly 20 years out of date? lol, lmao

systemd - no rhyme or reason

Posted Feb 21, 2025 0:23 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Yes, it's pretty old, but I love it as an example. I had its A0-sized printout on a wall at one point around 15 years ago. And it has been updated a bunch of times, of course. The updated version that I linked is still pretty accurate.

And speaking of old, systemd.io links to the series of Lennart's blog posts from 2010-2012. They're also still mostly accurate, fwiw, but do miss stuff like practical uses of journalctl.

systemd - no rhyme or reason

Posted Feb 21, 2025 6:56 UTC (Fri) by zdzichu (subscriber, #17118) [Link] (1 responses)

systemd - no rhyme or reason

Posted Feb 21, 2025 7:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Most of Lennart's blog posts linked on systemd.io predate journalctl. So they miss a lot of functionality that is useful for people (e.g. checking logs only from the current invocation of a service).

FWIW, I have a personal wrapper around systemd (`scl`). It started as an alias for `systemctl` because I didn't want to keep breaking my fingers typing it all the time. Then I added some obviously missing functionality, like starting a service and viewing its logs in one command (like `docker compose up` does, for example). I think a lot of hostility to systemd could have been avoided if something like this existed.
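
A minimal sketch of that kind of wrapper (a few lines of shell, not the actual scl tool) might be:

#!/bin/sh
# Start a unit, then follow only the logs from this invocation of it.
unit="$1"
systemctl start "$unit" || exit 1
inv="$(systemctl show --value -p InvocationID "$unit")"
exec journalctl -f "_SYSTEMD_INVOCATION_ID=$inv"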

systemd - no rhyme or reason

Posted Feb 20, 2025 9:34 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

The problem is (a) what does the reader want the docs for, and (b) why did the writer write the docs.

Most developer documentation is great for reminding you of what you already know. It is absolutely hopeless at teaching you how to use the system. That's why beginners make the same mistakes over and over again - from their point of view Double Dutch would probably make more sense.

I've had exactly that with systemd - I don't know where to start looking, so I can't find anything, and what I do find I can't make sense of. That is where so much stuff (not just computing) is severely lacking nowadays. We've just bought a new modern tv, which thinks that the main purpose of the user manual is to tell you where to plug the leads in. Of course, we don't have a clue how to use it, everything is trial and error, and it's so damn complicated that we can't find anything! We just want a tv we can turn on and watch!!!

And of course, most documentation FOR beginners is written BY beginners, so it's full of beginner grade errors :-(

Cheers,
Wol

systemd - no rhyme or reason

Posted Feb 21, 2025 7:38 UTC (Fri) by donald.buczek (subscriber, #112892) [Link]

Yes, you are right. The documentation is good for our needs, but not necessarily for others. Since we run our own in-house distribution, which is not only very different from other distributions but also from the typical target system that the systemd developers primarily have in mind, we have to integrate everything ourselves anyway and need to understand the details to do so. We don't need high-level user guides. Before making changes that affect systemd configuration, I often re-read the relevant systemd man pages from front to back. My positive experience is that the documentation for systemd is almost always sufficient. My experience with many other products is that you end up analyzing the sources after a very short time because the documentation is unclear, incomplete, contradictory or wrong.

Occasionally reading Lennart's "pid eins" blog is more of a leisure activity, but it sometimes gives you ideas about whether one or another idea or new systemd feature could be useful for us as well.

systemd - no rhyme or reason

Posted Feb 18, 2025 11:51 UTC (Tue) by WolfWings (subscriber, #56790) [Link] (2 responses)

The correct solution here is to add an additional oneshot unit. Roughly:

[Unit]
Requires=systemd-networkd-wait-online@NFS-interface.service
Before=mnt-storage-server.mount
[Service]
Type=oneshot
ExecStart=/usr/local/bin/monitoring.sh

And monitoring.sh would be roughly (using libnfs-utils for simplicity):

#!/bin/bash
ATTEMPTS=0
while true
do
  nfs-ls -D nfs://storage-server 2> /dev/null && exit 0
  ATTEMPTS=$((ATTEMPTS+1))
  if [ "${ATTEMPTS}" -ge 60 ]
  then
    exit 255
  fi
  sleep 60
done

This would wait for up to approximately one hour, testing once a minute, for the storage server to be online before allowing things to continue, and properly error out with a failure state after that hour. This isn't fully fleshed out as a .unit file, etc., but it's the gist: you just need to insert your extra dependency (wait for the storage server) as exactly that - a dependency in the chain.

systemd - no rhyme or reason

Posted Feb 18, 2025 18:47 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I actually prefer to just represent the mount unit as a service, with hooks for startup/shutdown. It does not recapture all the lifecycle events from the mountinfo, but it's far more reliable than systemd's mount units.
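
For illustration, a bare-bones version of that pattern (names and paths are invented; the startup/shutdown hooks hang off ExecStart=/ExecStop=) might be:

[Unit]
Description=Mount /mnt/storage from the NAS
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/mount -t nfs storage-server:/export /mnt/storage
ExecStop=/usr/bin/umount /mnt/storage

[Install]
WantedBy=multi-user.target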

systemd - no rhyme or reason

Posted Feb 19, 2025 22:59 UTC (Wed) by neggles (subscriber, #153254) [Link]

Entirely agree on this. Even with the explanations from Luca and Lennart above I still cannot understand why they absolutely will not allow a Restart=on-failure or similar option for mount units, instead forcing users to hack around the arbitrary limitation with a service unit that runs or has a mount dependency or triggers an automount by trying to read a file.

It's a completely valid use case; for example, wekafs mounts require the LXC container in which the DPDK-enabled frontend service runs to be started and operational before the mount will succeed; but you may have just started a converged cluster and not have a quorum of nodes up and running yet, so you want to retry a few times over (say) 10 minutes before giving up. On shutdown, I also need that mount unit to be unmounted before the service units that back it are shut down, and before the network goes down.

This is stuff a service manager should be handling, right? Service and mount dependencies and ordering? As a bonus, this is probably the filesystem that /home lives on, so only your break-glass accounts can sign in if it's not mounted.

What is so harmful about allowing a failed mount to retry just like a failed service startup?

systemd often doesn't know why a given service failed to start any more than it knows why a mount failed to mount, so why is it OK to blindly restart services but not retry mounts?

Why are automount units (which retry every time someone tries to open() a path beneath them) acceptable when an auto-retrying mount isn't?

The functionality is utterly trivial for systemd to add, it's logic that already exists for other units (and IIRC it would Just Work if mounts weren't explicitly excluded), and this is a significant pain point for users. Refusing to allow it on what seems to be ideological grounds ("you shouldn't need this" / "we can't be perfect so we won't try") is just shitty; I can't think of a better way to put that.

NFS and systemd automount units

Posted Feb 20, 2025 10:20 UTC (Thu) by farnz (subscriber, #17727) [Link]

Out of interest, why can't you use an automount unit for your NFS mount? This has the properties you would want for a network FS, of remounting automatically if the NFS server is missing, and of not blocking anything until the first access to a file on the mount point.
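
For reference, a minimal automount/mount pair for that (server, export, and path are placeholders) might look like:

mnt-storage.automount:

[Automount]
Where=/mnt/storage
TimeoutIdleSec=300

[Install]
WantedBy=multi-user.target

mnt-storage.mount:

[Mount]
What=storage-server:/export
Where=/mnt/storage
Type=nfs

With the automount enabled instead of the mount itself, nothing blocks at boot, and the real mount is only attempted (and re-attempted) when something first touches /mnt/storage.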

systemd, 10 years later

Posted Feb 18, 2025 13:23 UTC (Tue) by dankamongmen (subscriber, #35141) [Link] (8 responses)

i am personally a tremendous fan of systemd, but still, no retrospective of systemd is complete without mention of "V.R."'s essay, systemd, 10 years later (yes, the https is broken). this remains one of the finest pieces of technical writing i've ever come across.

systemd, 10 years later

Posted Feb 18, 2025 17:42 UTC (Tue) by Tobu (subscriber, #24111) [Link] (3 responses)

Oh this is very good, it goes into the historical context and the social dynamics!

The technical dive that follows didn't seem relevant immediately, but it does explain the issue with some unit types being more ad-hoc (and might help with finding workarounds for the NFS mount unit people were discussing). And it does make a relevant point about some dubious job engine stuff getting frozen after a period of fast growth in adoption. And about the non-stop expansion of the focus, with new features being more tractable than clean-ups that will break things for someone.

systemd, 10 years later

Posted Feb 18, 2025 17:55 UTC (Tue) by bluca (subscriber, #118303) [Link] (2 responses)

> Oh this is very good

It's really not though, the chip on the shoulder is extremely evident, and it's very inaccurate. For example there have been tons of changes to the engine, the recent PIDFD-ization is one example of sweeping across-the-board changes affecting process management.

systemd, 10 years later

Posted Feb 18, 2025 18:52 UTC (Tue) by Tobu (subscriber, #24111) [Link]

I can see the author's position and I don't mind it. Because of systemd finding success, outside viewpoints are useful. As is the retrospective showing how things happened and might have happened differently. As far as components being more or less coupled as convenient for forcing adoption early on, it's also close to how I remember it.

Now that systemd is well established, it could in fact stand to decouple its components more, and the adoption of varlink as an alternative to dbus that doesn't require global bus instances and tricky bootstrapping is a good move in that direction.

systemd, 10 years later

Posted Feb 19, 2025 16:14 UTC (Wed) by raven667 (subscriber, #5198) [Link]

I too thought it was a very useful perspective, and not being a systemd developer I didn't take the obvious bias of the author personally - like when they used passive voice to obscure independent decisions to make them sound more sinister, or their conclusions, which make some big leaps to get to their preferred opinion, which isn't demonstrated to follow from the evidence they have so painstakingly collected.

I greatly appreciate systemd and the improvements to process supervision and log collection, as well as organization using cgroups - much better than daemontools/multilog, which is what I cut my teeth on. But I too don't understand and haven't internalized the event/dependency/job model, so I have to work out how to use anything but the most basic config elements through close reading of the man pages and experimentation, rather than being confident I know what I'm doing; so there is valid criticism as well. Sometimes the criticism is just a downstream effect of the kernel APIs being insufficient for what is being attempted, such as the discussion of how parsing/diffing mountinfo doesn't scale well.

Maybe this can be improved by creative writing to make documentation that leaves the reader with a more accurate mental model and fewer surprises. Maybe at some point in the future the kernel will improve its management APIs so that a much simpler implementation can replace systemd while maintaining compatibility with existing unit definitions and management tools, much like how the designers of PipeWire were able to replace and unify a number of overlapping systems while maintaining compatibility with the clients that depend on them. This kind of refactoring is always easier when you have a working example to start from than an initial design where there are more unknowns. But I'm not expecting such an effort to materialize soon: the current system, while imperfect and sometimes inconsistent, running on an imperfect kernel and in an imperfect world, is probably good enough for the foreseeable future, until someone has a better idea and does the work to implement it.

systemd, 10 years later

Posted Feb 19, 2025 4:02 UTC (Wed) by motk (guest, #51120) [Link] (3 responses)

Some wild Sepiroth-posting there.

systemd, 10 years later

Posted Feb 19, 2025 16:07 UTC (Wed) by intelfx (subscriber, #130118) [Link] (2 responses)

> Some wild Sepiroth-posting there.

What's that? (in words suitable for those unfamiliar with...presumably, Final Fantasy lore?)

systemd, 10 years later

Posted Feb 19, 2025 16:13 UTC (Wed) by bluca (subscriber, #118303) [Link] (1 responses)

I for one am familiar with Final Fantasy lore and still have no clue what it means, lol

systemd, 10 years later

Posted Feb 19, 2025 16:38 UTC (Wed) by raven667 (subscriber, #5198) [Link]

I'm going to take a wild guess that this refers to the tone being somewhat grandiose and pretentious by using archaic sentence structure and wording in the more opinion based parts, so they end up sounding like a character rather than using a more conversational and modern style of language.

Can systemd safely replace GRUB?

Posted Feb 18, 2025 20:08 UTC (Tue) by dmoulding (subscriber, #95171) [Link] (5 responses)

GRUB can already today encrypt the boot volume, can systemd-boot? If not, are there plans to enable it to?

I know, I can already hear the cries of, "But the kernel code isn't secret, why do you want to encrypt your kernel binary?!" Well, there's more than just the kernel binary on my boot volume. It also contains an initramfs image which may contain all manner of things that I might not want easily divulged to a determined adversary.

It also contains the kernel command line arguments which I also don't necessarily want to be openly visible to nefarious eyes (things like file system labels or UUIDs, and other bits of system configuration information may be found therein).

Can systemd safely replace GRUB?

Posted Feb 18, 2025 20:48 UTC (Tue) by bluca (subscriber, #118303) [Link]

No, absolutely not - it's one of the major selling points of sd-boot that there is not even a whiff of crypto or filesystem code in it, and that will never change.
Nothing of what you mentioned needs to be encrypted. Not the kernel, nor the initrd, nor the kernel command line.

If for any reason you have data that _must_ be stored in the ESP and _must_ be encrypted, you can use credentials and seal them against the local TPM: https://systemd.io/CREDENTIALS/

Can systemd safely replace GRUB?

Posted Feb 18, 2025 21:04 UTC (Tue) by ferringb (subscriber, #20752) [Link] (3 responses)

What sort of things are in your initrd that you consider sensitive? I mean yeah, there is stuff that at first glance makes me go "ehhh"... but that's me being anal and paranoid. About my only real concern w/ initrd/ramfs is using tooling like dracut, and the potential that something gets hoovered out of /etc that shouldn't be. That's not a knock on that tooling, it's more "I don't control it and can't easily validate it hasn't made a mistake".

This is absolutely talking out my backside, but if you really have stuff in the initrd - that must be there but that you wish to protect - I suspect you *could* use an intermediate initrd. Basically a layer that holds the decryption tooling for mounting, and pivots across to that - which is what a normal/proper setup should be doing (keeping the 'secret' crap behind encryption). For the 'decrypt', TPM or secure-boot validation of the UKI, and then require some external validation (ick).

https://lwn.net/Articles/1010466/ (systemd-soft-reboot) also seems like a hacky way to shove the same in, just mounting the decrypted inner initrd at /run/nextroot. That seems like pain, but you do you. :)

Still, I'm curious exactly what you're concerned about, and if it's any form of 'secrets', why that isn't either buried in the encrypted drives, or utilizing systemd-creds?

Can systemd safely replace GRUB?

Posted Feb 18, 2025 22:59 UTC (Tue) by dmoulding (subscriber, #95171) [Link]

>there are stuff I at first glance go "ehhh"... but that's me being anal and paranoid.

Where I'm from, being anal and paranoid is what encryption is all about! :)

>What sort of things are in your initrd that you consider sensitive?

Well, just taking a quick look at one from one of my machines, without looking too hard, I see it's got /etc/machine-id in it. According to the docs, "This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments".

>About my only real concern w/ initrd/ramfs is using tooling like dracut

And yes, that is exactly the issue. Nobody I know builds their initramfs by hand. There's no easy way to verify on every single machine that every initramfs is being assembled (by whatever tool happens to be in use to generate it) without anything that is confidential, or that could be used by an attacker to glean information enabling further attacks down the line (even just seeing what kernel version is in use might aid an attack). And even if I were to go through the trouble of validating all of them on all of my machines today, that doesn't mean two weeks from now that will still be the case. Not to mention, nobody I know has time for checking that.

The simplest solution is to just encrypt the boot volume and not worry about it.

> you *could* use an intermediate initrd. Basically a layer that holds decryption tooling for mounting

But why should I bother cooking up some rickety workaround like that, when there's already a perfectly good solution that does exactly what I need, today?

I think maybe it's a mistake to think that nothing on the boot volume requires encrypting. I hope that the systemd project isn't baking that mistake into their systemd-boot architecture. If they do, it's likely a big step backwards from what we already have with GRUB.

Can systemd safely replace GRUB?

Posted Feb 19, 2025 11:42 UTC (Wed) by taladar (subscriber, #68407) [Link] (1 responses)

If you use unlocking of the disk encryption remotely you might have SSH host keys or credentials to some other remote credential supplying system in there. However for those you can't really encrypt the initrd because you need it to decrypt the rest.

Can systemd safely replace GRUB?

Posted Feb 19, 2025 13:36 UTC (Wed) by ferringb (subscriber, #20752) [Link]

> If you use unlocking of the disk encryption remotely you might have SSH host keys or credentials to some other remote credential supplying system in there.

I'm both curious and pre-emptively a bit terrified about that setup; you're trying to do this as a way to centralize encryption keys? Specifically the ability to get into a disk if it gets pulled? Or is this intended as a form of remote attestation to get the keys?

If the former, is there any reason you're not just using a scheme that either adds a common key across all disks (ick), or alternatively mixes hardware identification into a central key, so that if you have access to the 'central' privkey and know the particulars of the hardware, you can recreate the per-host disk keys?

> However for those you can't really encrypt the initrd because you need it to decrypt the rest.

I'd argue you should be using https://www.freedesktop.org/software/systemd/man/latest/s... , specifically read the encrypt section for an explanation of the two modes. It's basically the same sort of trick as automatic unlocking of a disk.

Specifically, I'm proposing you store the systemd-creds encrypted key into the initrd, and during init you use creds to decrypt the ssh key. That key can only be recovered if the system measurements match. Change the bootloader, measurements change; etc. See https://wiki.archlinux.org/title/Trusted_Platform_Module#... for the various measurements you can make it sensitive to; if you're particularly aggressive, you can include firmware measurements.
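
A rough sketch of that flow (file names, the credential name, and the consuming command are invented; the consuming unit would be part of the custom units mentioned below):

# At image-build time: seal the key against the local TPM.
systemd-creds encrypt --with-key=tpm2 --name=ssh-host-key ssh_host_ed25519_key ssh-host-key.cred

# In the unit that needs it: the credential is only recoverable while the measurements match.
[Service]
LoadCredentialEncrypted=ssh-host-key:/etc/credstore.encrypted/ssh-host-key.cred
ExecStart=/usr/local/bin/start-unlock --host-key=${CREDENTIALS_DIRECTORY}/ssh-host-key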

You'd have to do some custom units on that, but I assume you're already doing that for how you're pulling the luks keys.

If you've already explored this and rejected it, I'd be curious of your reasons.

Lazy linking -> Lazy loading

Posted Feb 18, 2025 21:19 UTC (Tue) by wahern (subscriber, #37304) [Link] (1 responses)

> In the wake of the XZ backdoor, Poettering started pushing others to take the dlopen() approach too.

Lazy linking is typical on Linux systems, but the library is still loaded at start. Have there been any experiments to extend ELF and/or the linker (e.g. glibc's rtld) to lazily load also? It would require the compile time linker to associate a shared library name with each symbol, which I don't think is possible today.

Today with lazy linking you can approximate this by not declaring the library dependency, and dlopen'ing the library before invoking any functions. But you still have to explicitly dlopen before invocation, otherwise the linker throws a fatal error (the symbol has no definition). I've used this in an OpenSSL test harness to be able to test different OpenSSL builds--you pass on the command line the desired library path(s) to load, which is processed before any OpenSSL API is invoked. It actually works for both ELF and Mach-O. And it proved more convenient and less brittle than LD_PRELOAD--it was too easy to accidentally mismatch libssl and libcrypto, or otherwise quietly use (without any apparent error) the wrong library, for example.

AFAICT systemd uses a macro, DLSYM_PROTOTYPE, to generate and declare pointers to used symbols. And before dereferencing any of those symbols a utility routine is run, like dlopen_many_sym_or_warn, that takes a list of pointers to those pointers to update. Though I suppose one could make a more sophisticated macro that generated wrapper routines that called dlopen. But you still have to, effectively, redeclare every symbol. Maybe some projects, like systemd, would prefer to keep things more explicit, but for widespread adoption[1] I could easily see a more automatic mechanism preferred so you don't need to superfluously declare things that are already declared in the library's header, and otherwise leverage the existing infrastructure (e.g. cross dependencies, lazy loading of recursive dependencies, etc).

[1] I'm not sure I'd want to adopt this in the general case, but from an engineering standpoint it's an enticing problem.

Lazy linking -> Lazy loading

Posted Feb 22, 2025 5:57 UTC (Sat) by pabs (subscriber, #43278) [Link]

The Solaris libc had optional dynamic linking IIRC, that never got adopted by glibc though.

Governance

Posted Feb 18, 2025 23:52 UTC (Tue) by MrWim (subscriber, #47432) [Link]

One thing I don't see mentioned in the fine article is the organisational aspect that I believe has been so central to systemd's success: centralisation. Much like the Linux kernel, it's a project sufficiently large in scope that if you want something done it's easier to fix it in systemd than to replace systemd with something else. It's big enough and improves at a fast enough rate that it would be too much effort to maintain a fork. This means that systemd matures and improves and becomes even harder to replace with time.

Centralised projects naturally have internal conflict, priorities to be managed, etc. - something that has to be managed in any project of sufficient scale. In other fields of endeavour this is known as politics. Many open-source advocates and developers deny its existence or necessity, and as a result are unable to scale sufficiently to solve problems that require that scale[^1]. Not so systemd (nor Linux), where the project is still able to function without complete consensus.

By being organised and centralised the project opens itself to criticism, because there is a clear target to be criticised. systemd claims to solve a problem, rather than being just another developer scratching a personal itch, where complaints can be dismissed with "you can't tell me what to do". Systemd is sometimes accused of buck passing - dismissing problems as needing to be solved elsewhere - but I think that demonstrates a strength: people disagree over the scope of the project, and that disagreement is met head on rather than dismissed with excuses.

Another aspect: systemd went out of its way to be adopted by the major distributions, doing the hard work of advocating for adoption. The act of advocating is difficult - it opens you up to criticism, and involves putting yourself in a vulnerable position where you risk rejection. At the same time it requires empathy, and a willingness to adapt the project to another project's needs. Not easy.

Funnily enough I rarely see these aspects of the project discussed - except by systemd's detractors, where they are all brought up as negatives. Sometimes complaints are dressed up in technical terms like "monolithic", when really it's about people organisation, not code organisation. Advocacy is treated as cheating, when it's harder than writing code. Project organisation is hard, thankless work that few have the skills to do, and many don't even recognise that it exists at all, let alone are thankful towards those who actually step up and do it. systemd is all the more impressive for having achieved what it has, how it has.

[^1]: See also https://www.bassi.io/articles/2019/01/17/history-of-gnome... and the discussion of Conway's law. As an aside, I wonder if this is one reason that micro-kernels have not been successful at any scale - Conway's law makes it possible to keep avoiding conflict, splintering teams all the way until the project fails.

Lack of understanding of fundamentals even after 14 years

Posted Feb 19, 2025 10:26 UTC (Wed) by walex (guest, #69836) [Link] (3 responses)

«Then came Upstart in 2006. It was an event‑based init daemon [...] The question of why distributions didn't stick with Upstart deserves an answer, he said. [...] It required "an administrator/developer type" to figure out all of the things that should happen on the system "and then glue these events and these actions together".»

This is just handwaving as both Upstart and systemd are event based and both require a systems engineer to write a lot of configuration.

The reason was that Upstart was "push" (eager) based and that is less convenient than a "pull" (lazy) based one like systemd.

I am not surprised that Poettering does not understand even the positive side of systemd, as he seems to me a very intelligent person who nonetheless does stupid things out of shallowness. In particular, after 14 years systemd still has two fundamental and related issues:

  • systemd does not have or propose a model of service states, as Poettering and many others confuse service states with process states. The reason the "socket activation concept" was loved is that socket state usually matches service state better than process state does: when a socket is ready, the service behind it is usually ready or soon will be.
  • Since systemd aims to manage system states, the UNIX architecture does not have good facilities for connecting processes unrelated by forking, and there is no model of service states, systemd has become a huge and effectively monolithic (despite being technically split into 150 closely related executables) general service-state multiplexer. It must eventually manage all system aspects itself, using a wild and complex variety of service-state wrappers; for example, it must itself manage package installs, network interfaces, filesystem mounts, etc. to ensure that their state is well defined before starting the services that depend on them.

If some thought had been given to those fundamental issues, instead of piecemeal hacking of new complicated features and wrappers over the past 14 years, a much smaller and simpler structure might have been the result.

Lack of understanding of fundamentals even after 14 years

Posted Feb 19, 2025 14:36 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

If somebody has a different understanding of "fundamentals" than you do, perhaps we can talk about that. But leave the personal insults out of it; they degrade both LWN and your argument.

Lack of understanding of fundamentals even after 14 years

Posted Feb 20, 2025 11:39 UTC (Thu) by walex (guest, #69836) [Link]

«leave the personal insults»

I will strive never again to call Poettering "a very intelligent person" on LWN. That is the only comment I made as to his person.

Lack of understanding of fundamentals even after 14 years

Posted Mar 5, 2025 12:53 UTC (Wed) by dtardon (subscriber, #53317) [Link]

> systemd does not have or propose a model of service states as Poettering and many others confuse service states with process states. The reason why the "socket activation concept" was loved is that usually socket state matches service state better than process, as when a socket is ready usually the service behind it is ready or soon to be so.

So maybe you could enlighten those among us who don't know what service states are? And why they are better than the status quo?

> Since systemd aims to manage system states and the UNIX architecture does not have good facilities for connecting unrelated (by forking) processes

UNIX might not, but Linux does. It's called cgroups and systemd makes heavy use of it.

> and there is no model of service states systemd has become a huge and effectively monolithic (despite being technically split into 150 closely related executables) general service state multiplexer

IOW, because systemd doesn't have a model of service states, it's become a service state multiplexer... Sorry, but this sentence makes no sense. (And the bit about the split to separate executables being just technical is pure bullshit. It just shows that you've no idea what you're talking about.)

> where it must eventually manage all system aspects itself using a wild and complex variety of service state wrappers,
> so for example it must manage package installs,

It doesn't.

> network interfaces,

It doesn't either. networkd does, but that's completely unrelated to PID1 and there's no information sharing between them.

> filesystem mounts itself etc. to ensure that their state is well defined before starting services that depend on those.

I'm eager to learn how service states help to manage dependencies between services and mounts without tracking mounts.

Still don't handle remote boot correctly

Posted Feb 25, 2025 22:12 UTC (Tue) by lee_duncan (subscriber, #84128) [Link] (8 responses)

I work on iSCSI, and remote iSCSI volumes used for booting have always been problematic on Linux.

Systemd had a chance to fix that, but they didn't. There's still no way to start an iSCSI session in the initrd, then pivot, and have the full root understand that. But since it's not a preferred use case, it doesn't matter.

Still don't handle remote boot correctly

Posted Feb 25, 2025 22:36 UTC (Tue) by bluca (subscriber, #118303) [Link] (7 responses)

That's not something that systemd can implement, it's up to whoever builds the initrd.

Still don't handle remote boot correctly

Posted Feb 25, 2025 23:49 UTC (Tue) by lee_duncan (subscriber, #84128) [Link] (6 responses)

No. The problem is that we have a root connection, managed by the kernel, and a daemon that handles error conditions, retries, configuration, etc.

We get connected in the initrd, thus establishing the root disc. But then when it's time to switch root, the daemon is killed. A new daemon is started in user space. In the time between the pre-pivot daemon stopping and the post-pivot daemon starting, the connection cannot handle any errors, such as network hiccups or target hiccups. This also means that the post-pivot daemon needs to rediscover everything about the existing connection.

I would like the ability to have the daemon continue post-pivot. Evidently this is possible, but then the daemon can't see the post-pivot filesystem, so it can't read config files, create or read database entries, etc. That means it can't really be interacted with. So that's not a solution.

BTW, the daemon ran uninterrupted, from boot through multi-user, pre-systemd, for historical reference.

I believe the systemd solution to this is to redesign our system so the daemon isn't needed during pivot, which also will not happen.

I'm not aware of any remote-boot solution that works well with systemd, but I'm not sure how nvme handles this.

Still don't handle remote boot correctly

Posted Feb 26, 2025 0:28 UTC (Wed) by bluca (subscriber, #118303) [Link] (5 responses)

> We get connected in initrd, thus establishing the root disc. But then when it's time to switch root, the daemon is killed.

Then the daemon is not implemented correctly. This problem has been solved for a decade at least. See:

https://systemd.io/ROOT_STORAGE_DAEMONS/

Still don't handle remote boot correctly

Posted Feb 28, 2025 21:56 UTC (Fri) by lee_duncan (subscriber, #84128) [Link] (4 responses)

Yes, sadly, that's the same response I have gotten in the past. An architecture document that, to my knowledge, nobody has followed. That document has a lot of issues that I'm pretty sure the systemd folks "don't understand".

I have a daemon that handles connection errors and needs to be running for the root disc to work correctly. And it does great at that. But it's somewhat complicated.

So to be able to use it the "systemd way", I would need to redesign it so that it can (1) run two at a time (not currently possible), (2) provide a protocol for the initrd version to migrate its state to the post-pivot daemon, and (3) not miss a beat if the root-disc connection has an issue (or, even worse, have two daemons trying to fix the problem).

Or perhaps I'm supposed to keep my daemon from being killed? But then it's stuck in pre-pivot initrd root forever, and can't see any local filesystems. So calling it "solved for a decade" is off by about 10 years in my opinion.

Perhaps there's a working implementation of this "root storage daemon" policy? I'd love to see it if so, since none is referenced in the link you supplied. Perhaps I can learn something.

Still don't handle remote boot correctly

Posted Mar 1, 2025 12:25 UTC (Sat) by bluca (subscriber, #118303) [Link] (3 responses)

> Yes, sadly, that's the same response I have gotten in the past. An architecture document that, to my knowledge, nobody has followed.

This is not true, and even a cursory search on GH shows plenty of real world examples:

https://github.com/search?q=%22argv%5B0%5D%5B0%5D+%3D+%27...

Still don't handle remote boot correctly

Posted Mar 5, 2025 19:43 UTC (Wed) by lee_duncan (subscriber, #84128) [Link] (2 responses)

Excellent grepping skills on github, but only one of those projects is actually a filesystem daemon! Seems funny that so many other projects (like a web server serving up git as a filesystem) use the "ROOT STORAGE DAEMONS" loophole to keep their daemons running.

Open-iscsi is unlike any of these implementations, in that its daemon needs to access post-pivot sysfs, a post-pivot database, and post-pivot configuration files. The pre-pivot daemon has its own copies of these things.

This kind of push-back is why, in my experience, some see systemd as less than cooperative. I just see it as an immovable object I have to go around.

Still don't handle remote boot correctly

Posted Mar 5, 2025 21:40 UTC (Wed) by bluca (subscriber, #118303) [Link] (1 responses)

> Open-iscsi is unlike any of these implementations, in that it daemon needs to access post-pivot sysfs, a post-pivot database, and post-pivot configuration files. The pre-pivot daemon has their own copies of these things.

And? That's trivial to get

Still don't handle remote boot correctly

Posted Mar 6, 2025 10:22 UTC (Thu) by farnz (subscriber, #17727) [Link]

Can you describe how? It's clearly not obvious to lee_duncan how to get at the post-pivot filesystem contents from a daemon started pre-pivot, and documentation on how to do that might clear up their confusion.

"All major Linux distributions..."

Posted Feb 27, 2025 22:36 UTC (Thu) by sombragris (guest, #65430) [Link] (1 responses)

> All major Linux distributions use systemd

Not all of them. Slackware (the oldest continuously maintained Linux distribution, which arguably fits the criteria for a "major" distro) does not use it, never has, and has no plans to include it in the foreseeable future.

The article was very informative and educational. Thanks for the reporting.

"All major Linux distributions..."

Posted Feb 27, 2025 23:18 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

> Not all of them. Slackware (the oldest continuously-maintained Linux distribution, which arguably fits the criteria for a "major" distro) does not use it, never has, and has no plan of including it in the foreseeable future.

As with many who started off with Linux early on, I too was a Slackware user at some point, but I haven't seen it considered a major distro in a really long time. It has a special place as one of the oldest Linux distros still actively maintained, with a small but passionate user base, but over the years it has become a pretty niche distro. YMMV.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds