Systemd catches up with bind events

By Jonathan Corbet
November 13, 2020

The kernel project has a strong focus on not breaking user-space applications; if something works with a given kernel release, it should continue to work with subsequent releases. So it may be discouraging to read the lengthy exposition on an apparent user-space API break in the announcement for the systemd 247-rc2 release. Changes to udev configuration files will be needed to keep systems working, but the systemd project claims that it "is not [the] fault of systemd or udev, but caused by an incompatible kernel change that happened back in Linux 4.12". It seems like an appropriate time to look at what happened, how administrators need to respond, and whether anything can be done to keep this kind of thing from happening again.

Modern computers tend to be highly dynamic, with devices (of both the physical and virtual variety) appearing and disappearing while the system is running. The kernel handles the low-level details with regard to these device events, but it is up to user space to take care of the rest. For that to happen, user space needs to know when something has changed with the system's configuration.

To that end, events are emitted to user space from deep within the kernel's driver-core subsystem whenever something changes; for example, plugging in a USB device will result in the creation of one or more ADD events to tell user space that the new device is available. The udev daemon is charged with responding to these events according to a set of rules; it can create device nodes, set permissions, notify other user-space components, and more, all in response to properties attached to events by matching rules. The set of possible events is relatively small and does not change often.
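As a rough illustration of what such an event looks like by the time it reaches user space, the sketch below parses one raw message. It assumes the commonly described layout of a kernel uevent: an "action@devpath" header followed by NUL-separated KEY=VALUE pairs. The function name parse_uevent and the sample device path are made up for this example; this is not how udev itself is implemented.

```python
def parse_uevent(payload: bytes) -> dict:
    """Parse one raw uevent message into a dict of properties.

    Assumed layout: b"action@devpath\\0KEY=VALUE\\0KEY=VALUE\\0..."
    """
    fields = [f for f in payload.split(b"\0") if f]
    # The first field is the "action@devpath" header.
    action, _, devpath = fields[0].decode().partition("@")
    props = {"ACTION": action, "DEVPATH": devpath}
    # Remaining fields are KEY=VALUE pairs; skip anything malformed.
    for field in fields[1:]:
        key, sep, value = field.decode().partition("=")
        if sep:
            props[key] = value
    return props


# Hypothetical sample message; the device path and sequence number are invented.
sample = (
    b"add@/devices/pci0000:00/usb1/1-1\0"
    b"ACTION=add\0"
    b"DEVPATH=/devices/pci0000:00/usb1/1-1\0"
    b"SUBSYSTEM=usb\0"
    b"SEQNUM=4242\0"
)
event = parse_uevent(sample)
print(event["ACTION"], event["SUBSYSTEM"], event["DEVPATH"])
```

The ACTION property is the one that udev rules most often match against, which is why adding new values for it matters so much in the rest of this story.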

Breaking systemd

In July 2017, though, Dmitry Torokhov added two new event types called BIND and UNBIND. They are meant to allow user space to handle devices that need help before they can become fully functional — those that need a firmware load, for example. For drivers that support the new mechanism, a BIND event for a device will follow the ADD event once the device is ready to operate. This change was a part of the 4.14 kernel release in November 2017 (not 4.12 as stated in the systemd announcement).

Later that same month, a bug report landed in the KDE bug tracker; this was perhaps the first case where somebody noticed a problem related to the new events. That report only made it to the kernel lists at the end of 2018, though, more than a year later. By then, 4.14 had been made into a long-term support kernel and shipped by distributors, with relatively few complaints from users. Indeed, Greg Kroah-Hartman was mystified as to why problems were turning up a year later. The trigger turned out to be a change to systemd that caused it to propagate the new events.

Specifically, the problem would appear to originate in the way that udev (which is a part of the systemd project) attaches tags to events. These tags, which are set and used by udev rules, control how user space will set up the new device. There is an assumption built in that there will only be a single event to announce the existence of a new device, so attaching tags to that event is sufficient. When the second event (the BIND event) shows up, the device state is reset and those tags are forgotten, leading to the associated device not being set up properly.
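The failure mode can be shown with a toy model. This is only a sketch of the behavior described above, not udev's actual code: the class PerEventTagDb, the rule function, and the device path are all hypothetical. It assumes that the stored record for a device is rewritten from scratch on every event.

```python
class PerEventTagDb:
    """Toy model: each event overwrites the stored record for its device."""

    def __init__(self):
        self.records = {}  # devpath -> set of tags from the latest event

    def process(self, action, devpath, rules):
        tags = set()
        for rule in rules:
            rule(action, tags)
        if action == "remove":
            self.records.pop(devpath, None)
        else:
            # Whatever an earlier event stored is discarded here.
            self.records[devpath] = tags
        return tags


def tag_on_add(action, tags):
    """Hypothetical rule that only reacts to 'add', like many existing rules."""
    if action == "add":
        tags.add("uaccess")


db = PerEventTagDb()
db.process("add", "/devices/example", [tag_on_add])
after_add = set(db.records["/devices/example"])
db.process("bind", "/devices/example", [tag_on_add])
after_bind = set(db.records["/devices/example"])
print(after_add, after_bind)
```

In this model the tag is present after the ADD event and gone after the BIND event, which is the symptom that user-space consumers of those tags ran into.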

As a short-term "fix", systemd was patched to simply ignore the new events. That caused things to work as they did before, at the cost of hiding those events entirely. That was never a long-term solution; the new events were added for a reason and some devices need them for proper setup. So a better solution had to be found for the longer term; that solution has two aspects, one of which may be disruptive for users who have created their own udev rules.

Fixing systemd

The first piece is a reworking of the "tag" mechanism provided by udev. Tags are special properties that can be attached, then matched in subsequent rules or consumed by user space. Rather than attaching tags to events, as was done before, udev now attaches them to devices, so tags added in response to an ADD event will still be there for the BIND event as well. For cases where rules need to respond only to tags added by the current event, a new CURRENT_TAGS property lists only those tags; it thus holds the value that the TAGS property held in previous releases.
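A minimal sketch of the reworked scheme, again as a toy model rather than udev's implementation: tags accumulate on the device until it is removed, while a separate per-event set plays the role that the text gives to CURRENT_TAGS. The class StickyTagDb and the rule function are hypothetical names.

```python
class StickyTagDb:
    """Toy model: tags stick to the device; current tags are per event."""

    def __init__(self):
        self.all_tags = {}      # devpath -> every tag set since the device appeared
        self.current_tags = {}  # devpath -> tags set by the most recent event

    def process(self, action, devpath, rules):
        current = set()
        for rule in rules:
            rule(action, current)
        if action == "remove":
            # Tags are only forgotten when the device goes away.
            self.all_tags.pop(devpath, None)
            self.current_tags.pop(devpath, None)
            return
        self.all_tags.setdefault(devpath, set()).update(current)
        self.current_tags[devpath] = current


def tag_on_add(action, tags):
    """Hypothetical rule that only reacts to 'add'."""
    if action == "add":
        tags.add("uaccess")


db = StickyTagDb()
db.process("add", "/devices/example", [tag_on_add])
db.process("bind", "/devices/example", [tag_on_add])
sticky = set(db.all_tags["/devices/example"])
current = set(db.current_tags["/devices/example"])
print(sticky, current)
```

After the BIND event the device still carries the tag set during ADD, while the per-event set is empty, so a consumer that wants the old semantics can look at the per-event set instead.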

The other part, though, is a change that must be applied to a number of udev rule sets. Consider, for example, this snippet taken from a randomly chosen rules file (10-dm-disk.rules in particular) on a Fedora 32 system:

    # "add" event is processed on coldplug only!
    ACTION!="add|change", GOTO="dm_end"

The ACTION line causes the entire file to be skipped for anything other than ADD or CHANGE events; in particular, that is what will happen with BIND events. That will cause properties associated with those events to be lost — and the device in question to be set up improperly (if at all). The fix is to change that line to read:

    ACTION=="remove", GOTO="dm_end"

That causes the rules to be skipped (and their associated state forgotten) only when the device is removed from the system.
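The difference between the two forms can be checked with a small simulation. It assumes that a udev match pattern is a set of shell-style globs separated by "|", and it models only the ACTION test of each rule; the helper names are invented for this sketch.

```python
from fnmatch import fnmatchcase


def matches(value, pattern):
    """Return True if value matches any '|'-separated glob alternative."""
    return any(fnmatchcase(value, alt) for alt in pattern.split("|"))


def old_rule_skips(action):
    # Models: ACTION!="add|change", GOTO="dm_end"
    return not matches(action, "add|change")


def new_rule_skips(action):
    # Models: ACTION=="remove", GOTO="dm_end"
    return matches(action, "remove")


for action in ("add", "change", "bind", "unbind", "remove"):
    print(action, "old skips:", old_rule_skips(action),
          "new skips:", new_rule_skips(action))
```

The old form skips the rest of the file for "bind" and "unbind" (and for any event type added in the future), while the new form skips it only for "remove". The negated match is what turns a closed list of known actions into an implicit assumption about the kernel's behavior.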

The problem here is that these rules were written under the assumption that no new event types would be added, so anything that wasn't recognized as adding or modifying a device could be ignored. There is, evidently, a certain amount of code that runs in response to device events that has a similar problem. What this shows is, in effect, a sort of protocol ossification effect that has made it much harder to add event types to the API provided by the kernel. Indeed, in 2018, Torokhov remarked:

Well, it appears that we can no longer extend uevent interface with new types of uevents, at least not until we go and fix up all udev-derivatives and give some time for things to settle.

At the time, there was discussion of possibly reverting the change, causing the new events to disappear. But that approach had the potential to create regressions of its own, as some systems may well have depended on getting those events; the kernel release adding them was a year old by that point, after all. There was also discussion of adding some sort of knob to enable or disable the creation of BIND and UNBIND events, but that never came to pass. Instead, Torokhov described the work in the systemd project to make the changes described above, and Kroah-Hartman responded: "So all should be good".

A regression?

With luck, all will be good, but it has come at the cost of some work within the systemd community over the last two years; the systemd developers have made their displeasure known:

We are very sorry for this breakage and the requirement to update packages using these interfaces. We'd again like to underline that this is not caused by systemd/udev changes, but result of a kernel behaviour change.

Was this a violation of the kernel's "no regressions" rule? The answer must almost certainly be "yes"; code that worked with 4.13 no longer worked with 4.14. What should have been done about it is a bit less clear. Had the issue been reported to the kernel community more quickly, it might have been possible to revert and redesign the change; after it had been deployed for a year, though, that was not a simple option. One could argue that the kernel community should have found some other way to fix the regression; the systemd 247-rc2 announcement tries to make that case. But once Torokhov posted that the problem was being addressed on the systemd side, there was no longer any pressure to do that.

Perhaps the real lesson here is that the community would be better served by closer relations between the kernel project and projects managing low-level utilities like systemd. Those relations have been somewhat strained at times, and there are not a lot of places where cooperative, cross-project discussions can take place. The presence of systemd developers at events like the Linux Plumbers Conference is limited at best, and those developers — not without reason — do not find the kernel mailing lists to be an entirely welcoming place. We are all working on the same system, though, and we would probably have an easier time of it if we could talk things through a bit more.


Systemd catches up with bind events

Posted Nov 13, 2020 21:10 UTC (Fri) by MatejLach (guest, #84942) [Link] (13 responses)

I am not all that surprised that systemd wants to point out this is a kernel regression, as I seem to remember some in the kernel community having choice words for systemd in the past.

I do wish for the relations between the most deployed init system and the kernel to improve.

Systemd catches up with bind events

Posted Nov 13, 2020 21:15 UTC (Fri) by willy (subscriber, #9762) [Link] (12 responses)

Said choice words being caused by systemd coopting a kernel boot parameter for its own purposes. Again, systemd not playing well with others.

Systemd catches up with bind events

Posted Nov 13, 2020 22:20 UTC (Fri) by MatejLach (guest, #84942) [Link] (5 responses)

In this case it seems to be the kernel not playing well with others. Or rather not adhering to its own rules.

Which is precisely the point, cooperation rather than pointing the fingers would help here.

systemd has proven itself useful enough where it should be consulted by kernel developers and vice versa.

Systemd catches up with bind events

Posted Nov 14, 2020 0:49 UTC (Sat) by gerdesj (subscriber, #5446) [Link] (4 responses)

It doesn't matter who is right or wrong. As my parents used to say "six of one and half a dozen of the other".

The kernel by definition is used on all Linux boxes and systemd is probably by now the most widely used init system at least on systems that the sysadmin/user actually cares what is happening.

systemd has become the de-facto Linux init system or PID1 or whatever the hell you want to call it. I still recall coming across M van S's comments in init scripts for the first time rather a long time ago and suddenly feeling that a real person actually cared about me and my little system. I "got" open source about then - I kept on finding notes in man pages and readme files and so on that indicated I was dealing with people who give a shit. Every now and then I still find something to make me smile in a readme or a help menu. I don't get that feeling when I'm fiddling with Windows or Macs. Linux is properly corporate these days and quite rightly so - we've grown up but it is still nice to see a human touch sometimes.

Please remember why we do this stuff.

Systemd catches up with bind events

Posted Nov 21, 2020 6:59 UTC (Sat) by ras (subscriber, #33059) [Link] (1 responses)

> systemd has become the de-facto Linux init system or PID1 or whatever the hell you want to call it.

Only for the desktop distros. It's too heavyweight for the smaller ones like Alpine, OpenWRT or Android, and containers like Docker don't use an init system at all. And that could well cover the bulk of deployed Linux instances.

Systemd catches up with bind events

Posted Nov 21, 2020 12:32 UTC (Sat) by rahulsundaram (subscriber, #21946) [Link]

> Only for the desktop distros. It's too heavyweight for the smaller ones like Alpine, OpenWRT or Android, and containers like Docker don't use an init system at all.

The vast majority of Linux distros, including RHEL, SLES, Debian, etc. (not just the desktop ones), use systemd by default. Docker containers are not comparable to distros, but some of them do run an init system, and that is popular enough that several distros include a systemd-container package specifically for this purpose. Systemd is also not limited to being an init system, so other parts of it get routinely used in containers as well.

Systemd catches up with bind events

Posted Nov 23, 2020 13:29 UTC (Mon) by flussence (guest, #85566) [Link] (1 responses)

> Please remember why we do this stuff.

It's hard to tell corporate types to remember FOSS had a human element when they came from outside that culture entirely, their salary depends on gentrifying it out of existence, and in their off-time their hobby is talking over everyone to proclaim “Well Actually everyone uses our software because our software is great because everyone uses it”.

Most of the interesting people seem to be using BSD these days.

Systemd catches up with bind events

Posted Nov 23, 2020 16:18 UTC (Mon) by anselm (subscriber, #2796) [Link]

> Most of the interesting people seem to be using BSD these days.

It used to be that using Linux was the way to stand out from the crowd, to be nerdy and interesting and metaphorically show the finger to those stodgy Windows and Mac users.

Now Linux has been mainstream for a while and is no longer good for nerd cred. People's elderly relations can (and do) use it. This means that the people who were using Linux 20+ years ago, when it meant not being able to do certain things (that when pointed out, one would adamantly insist weren't worth doing, anyway), spending three days to get a new video card/monitor working, etc., are being forced into BSD if they still want to impress their peers. But that's not because of BSD's versatility, wide compatibility with popular hardware and peripherals, and technical excellence – it's because few other people want to use it. It's the IT equivalent of an Indian fakir's bed of nails; very comfortable and just the thing if you're a fakir, but an item of morbid fascination for others.

Systemd catches up with bind events

Posted Nov 14, 2020 0:03 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (2 responses)

No, the parent comment is right. The issue was caused by: 1) a distro backport to systemd that had a bug, which caused it to spit too much debugging output, which caused it to timeout because the framebuffer console is insanely slow; 2) kernel developers spending more time crafting flaming emails than looking 1 inch further than their nose.

Come on, even Linus said that it was perfectly fine for systemd to use the command line that way[1] and, after having laced some emails with remarks about Kay, later admitted that it was just a bug and people were overreacting[2].

[1] http://lkml.iu.edu/hypermail/linux/kernel/1404.0/01488.html

[2] http://lkml.iu.edu/hypermail/linux/kernel/1404.0/02712.html

Systemd catches up with bind events

Posted Nov 16, 2020 0:20 UTC (Mon) by nevets (subscriber, #11875) [Link] (1 responses)

It wasn't really the command line usage that was the problem. The real problem was that the command line triggered systemd to spam the kernel printk buffer so much that we lost all debug messages from the kernel. When reporting this as a problem, instead of saying there was a bug in systemd, we were told it was not a bug and systemd is perfectly fine using the debug command line to trigger writing debug messages in the kernel printk buffer. I thought the real solution was to separate kernel writes into printk from userspace writes, preventing userspace from overwriting what the kernel produces.

This would not have escalated the way it did if we were told from the beginning, "oh there's a bug in systemd that causes it to spam the buffer, please upgrade to a fixed version". But instead told to bugger off. Yes, it really is a lack of communication and good faith between the two communities and I hope we can work better in the future.

Systemd catches up with bind events

Posted Nov 16, 2020 7:55 UTC (Mon) by pbonzini (subscriber, #60935) [Link]

> oh there's a bug in systemd that causes it to spam the buffer, please upgrade to a fixed version

Technically the upstream people couldn't have known, since the bug was introduced by an incorrect distro backport. And if a buggy systemd, one that spews assertion failures all the time, will slow boot down to a crawl, the systemd people might even consider that to be a feature. It can and will happen for kernel WARNs as well, and a buggy PID 1 is not much better than a buggy kernel. But these are details, and in general I think we agree.

What this shows to me, is that Linux is sorely lacking postmortems. Whenever Linus screams at me, I try to figure out what went wrong in my workflow and how I can improve it to avoid being screamed at in the future. On the other hand, if 5 years later people still believe that "debug" is a sacred part of the kernel command line (and not the more nuanced explanation that you gave), something went wrong on the kernel side in figuring out what happened.

Systemd catches up with bind events

Posted Nov 14, 2020 0:39 UTC (Sat) by foom (subscriber, #14868) [Link] (2 responses)

"You can't use the boot command line option 'debug', it's ALL MINE!"...Seriously? "Coopting"?

As a user, I'm sure I don't really care which part of the system boot is implemented by code in the kernel and which is implemented in systemd/udev/etc. I just want them to work together to boot the system properly. I mean, it makes sense to me that if I want to debug an issue, that everyone would respond to the one debug flag...

And same for "This incompatibility is all their fault!" -- again...who cares? It's nonsense.

Systemd catches up with bind events

Posted Nov 14, 2020 13:30 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

Uh, yes, "coopting".

"divert to or use in a role different from the usual or original one"

What word would you use to describe using something for your own purposes that somebody else was already using? It's not like I said "stealing".

Systemd catches up with bind events

Posted Nov 14, 2020 16:26 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

When you say "the original purpose" of /proc/cmdline and the debug flag, do you mean "System services are *supposed* to parse it, because it gives a unified way for people to pass in various flags. The kernel doesn't complain about flags it doesn't recognize, exactly because the kernel realizes that "hey, maybe this flag is for something else". The classic example of this is things like "charset" markers, but also options to modules that modprobe parses etc etc. And yes, that does include "quiet" and "debug"."?

Systemd catches up with bind events

Posted Nov 13, 2020 21:24 UTC (Fri) by jkingweb (subscriber, #113039) [Link] (5 responses)

I'm not sure how this counts as a regression. Code that relies on undefined behaviour (viz. the set of event types which are not "add" or "change" that the kernel will emit) should not be surprised when things change.

Systemd catches up with bind events

Posted Nov 13, 2020 22:16 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (4 responses)

The article already addresses this. To recap: in the past, Linus has argued that it doesn't matter what the correct behaviour is; if something in userspace worked on one kernel version and the next kernel version broke it, that should be considered a regression and reverted. Note that the breakage is not limited to systemd here. The rule has not always been 100% followed, but that is the general guidance. It looks like the delay in reporting upstream is part of the problem.

Systemd catches up with bind events

Posted Nov 14, 2020 2:23 UTC (Sat) by koh (subscriber, #101482) [Link] (3 responses)

The kernel has lots of interfaces where stuff keeps getting added and therefore has to be considered as an open set: syscalls, flags to syscalls, the sysfs entries, filesystems, modules, etc. How is this set of event types any different? If someone was to create a userspace program relying on a particular syscall, flag, whatever, not being implemented - until it finally is - would that be a regression?

Systemd catches up with bind events

Posted Nov 14, 2020 3:12 UTC (Sat) by khim (subscriber, #9252) [Link]

> If someone was to create a userspace program relying on a particular syscall, flag, whatever, not being implemented - until it finally is - would that be a regression?

Absolutely. LWN even has an article which explains how and why that case should be handled.

But there is also the rule that if nobody notices, it's not broken.

Now… we have a very weird corner case: somebody noticed… a year after the change was made. That's… rather unusual, to say the least.

Systemd catches up with bind events

Posted Nov 14, 2020 8:42 UTC (Sat) by abo (subscriber, #77288) [Link] (1 responses)

Try booting RHEL8 (or derivatives) on the latest upstream kernels. It will initially appear to work, but certain systemd operations will fail due to a change in capabilities. (https://bugzilla.redhat.com/show_bug.cgi?id=1853736)
The fix in that case is a lot simpler, and backporting it to various distro systemd versions isn't a big deal, but it's still a regression.

Perhaps it is reasonable to consider systemd exempted from the kernel's ABI/API stability promise, because it is sometimes almost the only user of certain interfaces?

Systemd catches up with bind events

Posted Nov 15, 2020 15:19 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

> Perhaps it is reasonable to consider systemd exempted from the kernel's ABI/API stability promise, because it is sometimes almost the only user of certain interfaces?

That's complicated. With more and more people using containers—including OS containers running a full-blown init system—it's not that rare to see very new userspace on old kernels or vice versa. This also means that it will be much harder to remove features in distro kernels: for example, even if your distro ships with an nftables-based iptables(8), there could be containers using the older iptables API.

Systemd catches up with bind events

Posted Nov 13, 2020 22:42 UTC (Fri) by GhePeU (subscriber, #56133) [Link] (4 responses)

So, to make it short, a project makes a breaking change, its downstream developers/users are initially blindsided by it, then they need to scramble to work around it, then they try to convince the upstream developers that the change is problematic and could they please reconsider but they’re told to adapt to the brave new world they now live in and fix their software, and finally they have to do just that and publish a new fixed release with a big disclaimer tacked on hoping that at least a part of the annoyed users that will be bitten by the change will read it and not complain about it to them?

The only news in this story is that, maybe for the first time, the systemd people are not the upstream project, and I think there’s a German word for what I’m feeling right now :)

Systemd catches up with bind events

Posted Nov 13, 2020 23:33 UTC (Fri) by ubhofmann (subscriber, #47368) [Link] (1 responses)

Schadenfreude? https://en.wikipedia.org/wiki/Schadenfreude

Systemd catches up with bind events

Posted Nov 14, 2020 7:03 UTC (Sat) by jonas.bonn (subscriber, #47561) [Link]

Besserwisser... ;)

Systemd catches up with bind events

Posted Nov 15, 2020 15:11 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> maybe for the first time, the systemd people are not the upstream project

Well, I think Lennart is well used to being downstream, and he likes to rely on upstream doing what they claim.

This seems a classic case of upstream not sticking to its promises, which is the whole problem with the unixy philosophy of being liberal with what you accept, and strict in what you emit. systemd (and pulseaudio, etc etc) is strict in expecting upstream to do what they promised.

Cheers,
Wol

Systemd catches up with bind events

Posted Nov 15, 2020 18:44 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> which is the whole problem with the unixy philosophy of being liberal with what you accept, and strict in what you emit.

That's not Unix, that's Postel's Law, which IIRC originates from TCP/IP (where it is *also* an unholy mess, but of a different kind).

Systemd catches up with bind events

Posted Nov 13, 2020 23:01 UTC (Fri) by syrjala (subscriber, #47399) [Link] (19 responses)

Maybe someone will finally merge my bluez fix for this same regression: https://lkml.org/lkml/2018/12/4/1167

Systemd catches up with bind events

Posted Nov 14, 2020 16:43 UTC (Sat) by IanKelling (subscriber, #89418) [Link] (18 responses)

Looks like the maintainers need more info from you.

Systemd catches up with bind events

Posted Nov 15, 2020 13:55 UTC (Sun) by syrjala (subscriber, #47399) [Link] (17 responses)

I pretty much gave up after being trolled with "what makes my hardware different from yours?"

Systemd catches up with bind events

Posted Nov 15, 2020 18:47 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (16 responses)

That is not trolling, that is an entirely fair response to "you don't see it because you don't have the right hardware."

If you don't tell them about the hardware that reproduces the bug, they cannot reproduce it. If they are unable to reproduce a bug, how are they supposed to evaluate a patch for that bug?

Systemd catches up with bind events

Posted Nov 16, 2020 10:56 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (15 responses)

>That is not trolling, that is an entirely fair response to "you don't see it because you don't have the right hardware."
Many years of past experience have given this a name: 'works for me'. Developer doesn't experience the problem and can't conceive of the imagined version of the code in their head not running as they think it will.

This is dull, unhelpful pushback -- what would be better if not actually helpful is to call no-op ACTION=="remove", GOTO='end_stanza' an antipattern making you think that add/change/remove are the only legitimate udev action types and core to this 4.12-changed-userspace issue.

(There's a further issue at the level of our civilisation and society where 'works for me' gives people with power -- to fix bugs raised by users -- a habit of denying the lived experience of users and the struggles that users have with our software, which can become a life-long denial of the lived experience and struggle of other human people. I get that, in software, unanticipated complexity means that fixes have to not also break other things and that makes an apparently-simple change expensive and unpredictable, easier to push back and not make changes. Here's the question from this rhetoric: Are we not the wizards and masters of these systems that we should be able to change them to work more correctly for more people?)

K3n.

Systemd catches up with bind events

Posted Nov 16, 2020 12:15 UTC (Mon) by magfr (subscriber, #16052) [Link]

This is interesting and further reinforces my half-formed thought that matching on inverse patterns is bad, since it assumes that the set of values is fixed and unchanging.

The example in the article was matching
~add|change when what is needed is remove.

For this poster the problem sounds like the reverse: he needs to match ~add|change and someone has "optimized" that to remove.

This proves that one need to know what one is doing and that crap can be written in any language, in this case the udev config rule language.

One way to fix this is to document an ERROR event.
Any rule that mentions an ERROR event is broken.
Any action that happens when ERROR is issued is a bug.

Systemd catches up with bind events

Posted Nov 16, 2020 12:21 UTC (Mon) by hkario (subscriber, #94864) [Link] (8 responses)

put yourself in the developer's boots for a minute:

you get a bug report, you look at the experienced behaviour, you haven't encountered it before; you try it with your hardware, it's not reproducible; you look at code that *may* be related, it doesn't seem possible to trigger this kind of behaviour

now, what on earth can you do more than to ask for more information?

Developers aren't omniscient and omnipotent entities that exist beyond confines of space and time, entities that fix bugs based only on a fickle. They're human, and they need to understand the bug before they can fix it.

Systemd catches up with bind events

Posted Nov 16, 2020 17:59 UTC (Mon) by jezuch (subscriber, #52988) [Link] (7 responses)

You create a unit/regression test which mocks hardware to behave the way described by the bug reporter? Consult the spec to confirm that this is a valid use case? Just saying "works for me" is being a horrible maintainer.

Systemd catches up with bind events

Posted Nov 16, 2020 18:30 UTC (Mon) by rahulsundaram (subscriber, #21946) [Link] (6 responses)

> Just saying "works for me" is being a horrible maintainer.

That wasn't what was said however. There was a question back on what makes the hardware different which seems to have gone unanswered. Given the wide variations in hardware, this is a reasonable question.

Systemd catches up with bind events

Posted Nov 20, 2020 15:16 UTC (Fri) by k3ninho (subscriber, #50375) [Link] (5 responses)

>> Just saying "works for me" is being a horrible maintainer.

>That wasn't what was said however.

It wasn't *exactly* what was said but it was the spirit of what was said.

K3n.

Systemd catches up with bind events

Posted Nov 20, 2020 15:45 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (4 responses)

> It wasn't *exactly* what was said but it was the spirit of what was said.

I don't agree but even assuming that, works for me is a fine thing to say if you don't stop at that point. There was a query for more information. It's up to the reporter to pursue that further

Systemd catches up with bind events

Posted Nov 20, 2020 20:08 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

The key is to coax out the difference in the setup from the reporter. I find it hard to get relevant details from reporters sometimes. I know what I'm looking for, but reporters will sometimes trim output to what they think is important, missing the actual details that are relevant to diagnosing the problem. Screenshots instead of copy/pasted text are also a thing.

It's about communication. I certainly have more to learn on this front, but part of it is realizing the differences in knowledge and expectations on either side of the wire.

Systemd catches up with bind events - works for me

Posted Nov 21, 2020 19:14 UTC (Sat) by giraffedata (guest, #1954) [Link] (2 responses)

This thread is either about someone's serious misinterpretation of the "works for me" response as, "this is your problem; go away" or a misnaming of that actual response.

"Works for me" is a request for more information or diagnostic work.

But I've also been the recipient of the response, "What you're doing is too unusual for me to care about. Do what I do, and it will work." Many times. I'm creative. I suppose someone might characterize that as "works for me."

Systemd catches up with bind events - works for me

Posted Nov 28, 2020 9:21 UTC (Sat) by jezuch (subscriber, #52988) [Link] (1 responses)

I guess there's a subtle difference between "works for me" and "can't reproduce". I, as the bug report responder, would never say the former as it sounds kind of dismissive. As in, "I reproduced your exact environment and it works for me". The latter admits that you're probably missing some context that makes the reproduction impossible.

But other people will feel differently about this.

Systemd catches up with bind events - works for me

Posted Nov 28, 2020 19:29 UTC (Sat) by Wol (subscriber, #4433) [Link]

How you say things can be so important ...

"Can't reproduce" implies you have tried to replicate the error, you've put in a bit of effort to help the person with the problem.

"Works for me", on the other hand, *could* mean the same thing. It could also mean "I don't suffer that problem, so I can't be bothered to look for it".

And then there's the language problem. I'm probably known for being a bit prickly about language and how, even when you may think you're speaking the "same" language, the identical word may mean different things based on the speaker's background.

Cheers,
Wol

Systemd catches up with bind events

Posted Nov 16, 2020 21:14 UTC (Mon) by pebolle (guest, #35204) [Link] (4 responses)

> There's a further issue at the level of our civilisation and society where 'works for me' gives people with power -- to fix bugs raised by users -- a habit of denying the lived experience of users and the struggles that users have with our software, which can become a life-long denial of the lived experience and struggle of other human people.

Poe's law works both ways: one is never sure whether someone is sarcastic or sincere on the internet.

Systemd catches up with bind events

Posted Nov 18, 2020 14:37 UTC (Wed) by k3ninho (subscriber, #50375) [Link] (3 responses)

>>> There's a further issue at the level of our civilisation and society where 'works for me' gives people with power -- to fix bugs raised by users -- a habit of denying the lived experience of users and the struggles that users have with our software, which can become a life-long denial of the lived experience and struggle of other human people.

>Poe's law works both ways: one is never sure whether someone is sarcastic or sincere on the internet.

You have to live your life, I can't make this statement a positive for you if you've taken it on bad faith. Plus, I hope that you can overcome whatever made it difficult to trust words from a random internet person. Maybe the world also needs to change to allow you this.

Life's going to be miserable for everyone if we presume bad faith.

K3n.

Systemd catches up with bind events

Posted Nov 18, 2020 19:32 UTC (Wed) by pebolle (guest, #35204) [Link] (2 responses)

> Life's going to be miserable for everyone if we presume bad faith.

The point here is that the statement I quoted is entirely over the top but it's still impossible to be sure whether it was made sincerely or not.

Look: developers that say 'works for me' are simply stating a mundane fact. If things didn't work for them they could start working on a fix. (If they have the time and the motivation to do that, of course.) But as long as it's unclear what triggers the bug that's been reported to them they are about as clueless as any random person using their software. I'd guess that all of this should be obvious to the kind of people reading lwn.net.

So if I read a comment containing little treasures like "a further issue at the level of our civilisation and society" and "[something] gives people with power [...] a habit of denying the lived experience of users and the struggles that users have with our software" and "life-long denial of the lived experience and struggle of other human people" (human people!) then, yes, Poe's law kicks in once again.

Systemd catches up with bind events

Posted Nov 20, 2020 15:45 UTC (Fri) by k3ninho (subscriber, #50375) [Link] (1 responses)

I've blocked your comments. I considered the consequences of my words in this forum and audience and I think what I said was a cogent, reasonable and meaningful contribution. With respect to Poe's original comment and the way that Wikipedia explains Poe's Law, you were already looking to parody my comment, a rhetorical tactic to dismiss as ridiculous something you can't dismiss as untrue.

Beyond that, I don't have to care how you respond.

K3n.

Systemd catches up with bind events

Posted Nov 20, 2020 23:09 UTC (Fri) by pebolle (guest, #35204) [Link]

> With respect to Poe's original comment and the way that Wikipedia explains Poe's Law, you were already looking to parody my comment, a rhetorical tactic to dismiss as ridiculous something you can't dismiss as untrue.

You might never read this but I only quoted your hyperbole verbatim. How is that parody?

Systemd catches up with bind events

Posted Nov 13, 2020 23:33 UTC (Fri) by walters (subscriber, #7396) [Link] (1 responses)

> ... fix up all udev-derivatives and give some time for things to settle.

Not sure if it was intentional, but this was funny.

Systemd catches up with bind events

Posted Nov 15, 2020 13:45 UTC (Sun) by Conan_Kudo (subscriber, #103240) [Link]

I certainly laughed when I saw that comment. 😂

Systemd catches up with bind events

Posted Nov 14, 2020 0:11 UTC (Sat) by sbaugh (guest, #103291) [Link] (6 responses)

>The presence of systemd developers at events like the Linux Plumbers Conference is limited at best

Why is this? Certainly lots of userspace stuff happens at LPC. And systemd has its own conference, All Systems Go, and it's very Linux-specific and focused on kernel features - one might also ask why there's not more kernel presence at ASG.

Is there some kind of dramatic reason that this isn't a single conference, or at least two conferences with roughly the same set of attendees? If there really is no overlap, that seems pretty strange.

Systemd catches up with bind events

Posted Nov 14, 2020 0:37 UTC (Sat) by Paf (subscriber, #91811) [Link]

Well, a few intertwined reasons come immediately to mind.

systemd is big, and big enough to justify a conference, but it doesn’t cover/touch all aspects of Linux plumbing, by any means. So there are things at LPC that would be weird to have at a systemd conference. Secondly, there are competitors/alternatives for many of the services provided by systemd components, and while perhaps some of those developers would attend a systemd conference... yeah.

More crossover sounds good. Just one conference, though...?

Systemd catches up with bind events

Posted Nov 14, 2020 14:18 UTC (Sat) by mezcalero (subscriber, #45103) [Link] (4 responses)

I used to be involved in the LPC program committee and ran multiple MCs there for years. Thing though is: LPC back then was different from now, in my eyes: back then lots of userspace people attended so the plumbing layer was very well covered, i.e. the actual interface where userspace and kernelspace touched was at the center of the conf. I got the impression though that over the years things shifted and focus is a lot more on the kernelspace side of things with not much left from userspace. I.e. there are so many talks about memory management and scheduling and whatnot that are primarily things that (while userspace benefits from it) are mostly kernel internal, opaque stuff that userspace doesn't have to think much about, and thus are of almost no interest to userspace plumbing people, and that made it less and less interesting for userspace people to attend. It does make a difference whether you travel to some (increasingly remote) place in the US from Europe where you find 60% of talks interesting in one way or another, or where it's just 5% you find interesting. So I stopped going. And then I came back one time a few years ago, just to give it another chance, but found it didn't get better, and I haven't been since.

I am not complaining about this though. I honestly believe that talks about MM and scheduling are highly relevant and should be held, and there needs to be a conf for that -- but also that it might not be the most interesting place for me personally to be, and I think a number of other userspace plumbing people think similarly. In particular as AllSystemsGo! exists these days, with a focus much much closer to what I am interested in: userspace plumbing stuff only. I have been one of the organizers of that conf, and I love it. Hence: no complaints from me, what LPC isn't for me anymore ASG now is, and hence I am happy.

If LPC wanted to be more attractive to userspace people again I think they'd have to cut down heavily on those kernel-internals-focussed talks so that userspace people don't come back feeling pushed to the side as much. I doubt though that doing so is that clearly desirable, given that those MM/scheduling talks are after all heavily relevant to many people, just not to many userspace folks like me. I mean, LPC attracts so so many attendees, so it's doing a lot of stuff right apparently, even if it's not the same as it was initially.

So, no hard feelings, but I hope this does explain a bit why you don't see me at LPC. (And I think I am not the only one thinking that way)

Lennart

LPC

Posted Nov 14, 2020 16:32 UTC (Sat) by corbet (editor, #1) [Link] (2 responses)

Hmm... LPC 2019 was held in Lisbon — not a remote location in the US last I looked. Microconferences included BPF, distribution kernels, containers and checkpoint/restart, IoT, printing, toolchains, databases, Android, and system boot and security. That seems like there should be material to interest somebody who isn't looking for memory-management talks.

LPC 2020 was only as remote as your keyboard — also presumably not located in an obscure corner of North America. Microconferences included containers and checkpoint restart, Android, LLVM, testing and fuzzing, IoT, system boot and security, printing, application ecosystems, and the GNU toolchain.

Perhaps it's time to take another look? Or even help with LPC organization and drive it in the direction you would like to see?

LPC

Posted Nov 15, 2020 12:34 UTC (Sun) by mezcalero (subscriber, #45103) [Link] (1 responses)

I wanted to go to the Lisbon one, have another look, as an attendee, but there was some scheduling conflict with something else. And online confs are really not my thing, I must say, I'll wait those out.

Last time I did a talk at LPC (in Santa Fe), I didn't have the impression too many people cared, the room was the opposite of crowded. Which is totally fine, but it did suggest that the lack of interest is actually mutual in a way.

I am sure I'll check out LPC again one day, no doubt. And others from the communities that are involved in ASG have been attending LPC off and on over the years too. I just wanted to explain a bit the lack of enthusiasm from my person, and I think others from the same communities.

Lennart

LPC

Posted Nov 16, 2020 0:38 UTC (Mon) by nevets (subscriber, #11875) [Link]

I didn't like the direction LPC was heading and joined the committee to fix that. I happened to join *after* Santa Fe (as I had issues with that one). My biggest focus was to get away from being kernel centric and I believe I (with help from others on the committee) was successful.

That's why I asked you to come back and give it another try ;-)

-- Steve

Systemd catches up with bind events

Posted Nov 15, 2020 0:02 UTC (Sun) by josh (subscriber, #17465) [Link]

> If LPC wanted to be more attractive to userspace people again I think they'd have to cut down heavily on those kernel-internals-focussed talks so that userspace people don't come back feeling pushed to the side as much.

I think it makes sense to have kernel *tracks* at LPC, and to also have kernel/userspace interface tracks. LPC has some great kernel content, and that kernel content helps attract core kernel developers. It also has kernel/userspace interface content, which benefits from having kernel people around who might not have come solely for the kernel/userspace interface content. It sounds like the balance needs some tuning, but I don't think it makes sense to have 100% kernel/userspace interface content with no kernel internals at all, or you end up with a conference for which many kernel folks will encounter the same issue you're describing.

Systemd catches up with bind events

Posted Nov 14, 2020 1:03 UTC (Sat) by dxin (guest, #136611) [Link] (8 responses)

Probably it's a good idea to end every enum definition with a RESERVED so that users are forced to handle future extensions.

Systemd catches up with bind events

Posted Nov 14, 2020 1:40 UTC (Sat) by Paf (subscriber, #91811) [Link] (7 responses)

How does that force them to handle extensions? (Legitimate question, not criticism)

Systemd catches up with bind events

Posted Nov 14, 2020 8:32 UTC (Sat) by TheGopher (subscriber, #59256) [Link] (1 responses)

Agreed - RESERVED_FOR_FUTURE_EXPANSION is a clearer indicator - but I do understand OP's point - it's the difference between a closed set and a potentially open set.

Systemd catches up with bind events

Posted Nov 15, 2020 20:41 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

Eh, I'm unconvinced. IPv4 did that with Class E addresses, back when classful networking was still a thing. Nowadays, classes are otherwise dead, we're running short of addresses, and Class E (now known as 240.0.0.0/4) is still "reserved for future expansion" - because everyone hard-coded it as invalid, sometimes even at the hardware level. So you can't use 240/4 without losing backwards compatibility. But if you're going to do a compatibility break anyway, you're far better off switching to IPv6 altogether, as it gives you far more addresses to play with and is generally less of a hassle to administer (for example, it lacks NAT). So those ~268 million IPv4 addresses will likely never be publicly routable.

Perhaps the RFCs of the time could have been written with greater care, but the trouble is that at the time (see RFC 988), they were in the process of designating class D (224.0.0.0/4) as multicast. They didn't know what class E would be used for, so they couldn't just say "treat class E as if it were unicast, unless a later standard says otherwise." For all they knew at the time, they would later want to use class E for some even weirder thing, and unicast processing would have been inappropriate or even harmful. So they just left it as "reserved," and the people who had to actually make the silicon and software decided that "reserved" meant "invalid." IMHO, they didn't really have much of a choice.

In short: From userspace's perspective, "reserved for future expansion" means "I don't know what this value represents, so if the kernel hands it to me, the only not-wrong thing I can possibly do is crash." In some contexts, ignoring the value *might* be not-wrong, but it's hard for userspace to predict that in advance. Regardless, the kernel cannot rely on userspace taking any particular interpretation, because as Linus has previously said, they don't break userspace, even where userspace is wrong.
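
For anyone who wants to check the "~268 million" figure for 240/4 quoted above, here is a minimal sketch using Python's standard `ipaddress` module. The variable names are only illustrative; the point is that a /4 leaves 28 host bits, and that a mainstream standard library still flags the block as reserved, which is the hard-coding problem in miniature.

```python
import ipaddress

# Class E, the block that was "reserved for future expansion".
class_e = ipaddress.ip_network("240.0.0.0/4")

# A /4 prefix leaves 32 - 4 = 28 host bits, so the block holds 2**28 addresses.
class_e_size = class_e.num_addresses

# The standard library itself classifies addresses in this block as
# IETF-reserved, one small example of "reserved" being baked into software.
sample_is_reserved = ipaddress.ip_address("240.0.0.1").is_reserved
```

Running it gives 268,435,456 addresses for `class_e_size`, and `sample_is_reserved` comes back `True`.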

Systemd catches up with bind events

Posted Nov 14, 2020 9:53 UTC (Sat) by ballombe (subscriber, #9523) [Link] (2 responses)

because in some case, gcc will say
"warning: enum RESERVED not handled in switch".

Systemd catches up with bind events

Posted Nov 14, 2020 10:57 UTC (Sat) by embe (subscriber, #46489) [Link] (1 responses)

And of course then you add default: abort(); to make the warning go away and everything is fine ;)

Systemd catches up with bind events

Posted Nov 15, 2020 11:15 UTC (Sun) by tinko92 (guest, #102129) [Link]

Well, at least that might cause a visible crash at a line of code that points to the cause of the issue rather than a, maybe initially invisible, corruption of the program's state.
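
To make the closed-set versus open-set distinction in this subthread concrete, here is a rough Python analogue (not the C/gcc case discussed above, which relies on a compiler warning). It is only a sketch: the `Action` enum, its `RESERVED` member and the `handle()` function are hypothetical names, and `_missing_` is the standard `enum` hook that is called when a lookup finds no member.

```python
from enum import Enum


class Action(Enum):
    ADD = "add"
    CHANGE = "change"
    REMOVE = "remove"
    # Catch-all for values this code was not written to know about.
    RESERVED = "reserved"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when Action(value) matches no member:
        # map every unknown value onto the catch-all instead of raising.
        return cls.RESERVED


def handle(raw_action: str) -> str:
    action = Action(raw_action)
    if action is Action.ADD:
        return "set up device"
    if action is Action.CHANGE:
        return "refresh device"
    if action is Action.REMOVE:
        return "tear down device"
    # Explicit decision point for future extensions: ignore, but visibly.
    return f"ignored unknown action {raw_action!r}"
```

With this shape, `handle("bind")` ends up in the last branch rather than being mistaken for one of the known actions. Whether ignoring is the right reaction is still a per-consumer decision, as the replies above point out.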

Systemd catches up with bind events

Posted Nov 18, 2020 9:30 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Be VERY evil and send nonsensical events with random names?

This is actually a thing in TLS. At least several TLS implementations send deliberately non-existing cipher suite names during negotiation to make sure that middleboxes don't encode stuff like "use the first cipher".

Systemd catches up with bind events

Posted Nov 19, 2020 14:49 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

So much this. A kernel config option which causes random, undocumented, events to appear, both for existing devices and for 'nonsense' devices. Correctly formatted messages (e.g. not fuzzing), but with types that are never used for real devices.

Chaos engineering can be quite useful.
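
A user-space version of that idea might look like the sketch below: feed a consumer well-formed events whose action names are deliberately nonsensical, and check that it acts on none of them. Everything here is hypothetical (the function names, the `x-` prefix, the device path), and the set of known actions is just the ones mentioned in this discussion, not a complete list.

```python
import random
import string

# Actions named in this discussion; illustrative, not exhaustive.
KNOWN_ACTIONS = {"add", "change", "remove", "bind", "unbind"}


def nonsense_action(rng: random.Random) -> str:
    """Return a well-formed action name that no real device would use."""
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return "x-" + suffix  # the prefix keeps it out of KNOWN_ACTIONS


def consume(event: dict, seen: list) -> bool:
    """Hypothetical consumer: act on known actions, skip everything else."""
    if event["ACTION"] not in KNOWN_ACTIONS:
        return False  # unknown type: skip without guessing at its meaning
    seen.append((event["ACTION"], event["DEVPATH"]))
    return True


def chaos_run(consumer, rounds: int = 100, seed: int = 0) -> list:
    """Feed the consumer nonsense events; return whatever it acted on."""
    rng = random.Random(seed)
    seen = []
    for _ in range(rounds):
        event = {"ACTION": nonsense_action(rng),
                 "DEVPATH": "/devices/chaos/test0"}
        consumer(event, seen)
    return seen
```

A consumer that passes should leave `chaos_run(consume)` empty. One written as "anything that is not a remove must be an add" would act on every nonsense event, which is exactly the class of bug being discussed here.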

Systemd catches up with bind events

Posted Nov 14, 2020 6:23 UTC (Sat) by marcH (subscriber, #57642) [Link] (4 responses)

> Breaking systemd
> In July 2017, though, Dmitry Torokhov added two new event types called BIND and UNBIND.
> [...]
> Later that same month, a bug report landed in the KDE bug tracker; this was perhaps the first case where somebody noticed a problem related to the new events.

How can someone create new uevents in 2017 and not test them on a system running systemd?

I must have missed something...

Systemd catches up with bind events

Posted Nov 14, 2020 7:17 UTC (Sat) by geuder (subscriber, #62854) [Link]

Test on *a* system running systemd might not be enough.

I haven't looked at what the 10-dm-disk rule really does, nor at how many other rules (in other distros?) are fatally affected by the same problem. While developer machines are not unlikely to use some basic LVM, I doubt they typically have very fancy disk setups. I have had failing udev rules before that did not show up until later in some use case, without immediately preventing boot or all usage of the system.

Systemd catches up with bind events

Posted Nov 14, 2020 14:24 UTC (Sat) by mezcalero (subscriber, #45103) [Link] (1 responses)

What this article isn't mentioning is that the bind/unbind uevents weren't used that much initially by subsystems, but that changed over time. Just having these new event types is one thing, actually firing them another. One is an architectural change without immediate effect on userspace. The other is a subsystem-specific "trickle-down" thing that breaks things slowly and piece-meal, and that actually affects userspace.

Systemd catches up with bind events

Posted Nov 14, 2020 18:10 UTC (Sat) by marcH (subscriber, #57642) [Link]

> What this article isn't mentioning is that the bind/unbind uevents weren't used that much initially by subsystems, but that changed over time

Indeed, this is exactly my question: how much were these new uevents used and tested before submission? Again I might have missed something but I find this sentence (and patch) amazing:

> As a short-term "fix", systemd was patched to simply ignore the new events

I thought I read somewhere that every new kernel feature must come with "real" use cases where "real" means running code. Yet the first thing the main and... surprised (!) consumer of this new feature did was... discarding it! How come this was not immediately reverted for obvious lack of testing?

Many kernel patches get months and even sometimes years of out-of-tree test coverage before making it to the main line, some even ship on millions of commercial products before getting merged, so why/how was this patch expedited? Just because it had a small number of lines? I can write a one-line kernel patch with very bad consequences any day :-)

> Perhaps the real lesson here is that the community would be better served by closer relations between the kernel project and projects managing low-level utilities like systemd.

Better relationships can only help but in this particular case it seems like the much more basic, focused and technical question: "What tested this and how?" would have been at least as effective.

Systemd catches up with bind events

Posted Nov 14, 2020 22:09 UTC (Sat) by bnorris (subscriber, #92090) [Link]

> How can someone create new uevents in 2017 and not test them on a system running systemd?

For one, I'm pretty sure the author's day job involves a distribution that does not run systemd (the init system). But since this article is really about systemd-udevd (which said distribution does use), I guess that's beside the point ;)

But in a similar vein: the set of udev rules running on a given distribution may vary wildly, so just because a rule ships on certain systems (e.g., the Fedora 32 example in the article) doesn't mean the distribution the author was developing has problematic rules of a similar type.

Your question sounds more like, "how can someone not test with the udev rules provided by libmtp [1]?" To me, it sounds like an honest oversight, and not a lack of legitimate use case or testing. But I could be wrong.

[1] https://bugs.kde.org/show_bug.cgi?id=387454#c20

udev "can create device nodes" - not really

Posted Nov 14, 2020 15:31 UTC (Sat) by zdzichu (subscriber, #17118) [Link]

The bit about udev creating device nodes is wrong. This ability was removed in commit 220893b3cbdbf8932f95c44811b169a8f0d33939, around systemd-176 from 2012.

Systemd catches up with bind events

Posted Nov 15, 2020 0:18 UTC (Sun) by flussence (guest, #85566) [Link] (4 responses)

It's a good thing we have eudev as an alternative willing to actually fix software bugs instead of sneering. Patches should be sent there.

Systemd catches up with bind events

Posted Nov 15, 2020 13:40 UTC (Sun) by shalem (subscriber, #4062) [Link] (1 responses)

This comment is really well below the average comment quality on LWN. Next time please actually read and understand the article.

Or in your own writing style:

It is a good thing that you actually read and understood the article before commenting.

Systemd catches up with bind events

Posted Nov 15, 2020 13:41 UTC (Sun) by shalem (subscriber, #4062) [Link]

p.s. In case it was not clear the last line of my previous comment was sarcasm.

Systemd catches up with bind events

Posted Nov 16, 2020 18:33 UTC (Mon) by mbiebl (subscriber, #41876) [Link] (1 responses)

In case this wasn't meant sarcastically:

https://github.com/gentoo/eudev/commits/master looks like this project is pretty much dormant.
I see no signs that eudev intends to address this issue.

Systemd catches up with bind events

Posted Nov 19, 2020 23:40 UTC (Thu) by nix (subscriber, #2304) [Link]

It updates in bursts. Frankly I'm fairly happy this sort of breaks-boot-if-it-goes-wrong stuff is mature enough that it doesn't *need* constant updates...

Systemd catches up with bind events

Posted Nov 15, 2020 20:09 UTC (Sun) by jthill (subscriber, #56558) [Link]

If the add event had been for fully-ready devices, shouldn't BIND have been delivered *first*, and possibly spelled "CONNECT", for devices that aren't fully ready? Or -- oh, I guess the check for whether the device needs prep is done in userspace, in response to the ADD? Still, seems to me there's now no "device ready" event, because you can't tell what an ADD means, and that just feels wrong.

Systemd catches up with bind events

Posted Nov 16, 2020 14:14 UTC (Mon) by Fowl (subscriber, #65667) [Link] (1 responses)

Couldn't some sort of "compatibility mode" be implemented for individual rules, where udev pretends the new events don't exist?

Systemd catches up with bind events

Posted Nov 17, 2020 20:43 UTC (Tue) by zuki (subscriber, #41808) [Link]

It's very hard to figure out which rules would need this legacy mode... I looked at various rules installed in Fedora when working on this and for many I couldn't say what the effect of the change will be.

If a file mentions BIND events, then it's pretty clear that it has been adapted for the new events. But for other files, it's hard to say anything without knowing if the kernel drivers for that type of hardware ever emit BIND|UNBIND events. If they don't, a rule that only seems to care about ADD|CHANGE|REMOVE and hasn't been modified in 10 years might still be fully adequate. In other cases the driver might emit BIND|UNBIND events, but the rule just doesn't need to do anything for them, and translating BIND to ADD would actively break things.

Overall, I think a mode like this would be extremely brittle.
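
To illustrate the easy half of that classification, here is a minimal sketch of a scanner that reports whether a rules file names bind or unbind in any ACTION match. The function name and the regular expression are assumptions made for illustration, and the parsing is much cruder than what udev itself does (no line continuations, no glob handling).

```python
import re

# Matches ACTION=="..." or ACTION!="..." and captures the quoted pattern.
ACTION_MATCH = re.compile(r'ACTION\s*[!=]=\s*"([^"]*)"')


def mentions_bind(rules_text: str) -> bool:
    """True if any ACTION match in the rules text names bind or unbind."""
    for line in rules_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        for pattern in ACTION_MATCH.findall(line):
            # udev separates alternatives within a match value with '|'.
            if {"bind", "unbind"} & set(pattern.split("|")):
                return True
    return False
```

A `True` result suggests the file has been adapted. A `False` result says nothing either way, since it depends on whether the relevant drivers ever emit the new events, which is the part that cannot be decided from the rules file alone.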

Systemd catches up with bind events

Posted Nov 18, 2020 8:52 UTC (Wed) by zurdo (guest, #137849) [Link] (2 responses)

I suspect I'm grossly misunderstanding something, or missing an elementary computing class here.

Given n possibilities, wouldn't the correct expression have looked more like `ACTION==remove` in the first place? If that's what it meant back when that was written, surely it was an option to specify the ACTION you want to run on instead of every ACTION you don't want to act on?

Systemd catches up with bind events

Posted Nov 18, 2020 15:45 UTC (Wed) by cladisch (✭ supporter ✭, #50193) [Link]

When you want to apply the same filter to multiple rules, the simplest way to write this is the equivalent of "if ACTION != remove goto past_the_rules". Writing the filter non-inverted would require another goto to jump over the second goto.

Systemd catches up with bind events

Posted Nov 25, 2020 9:12 UTC (Wed) by AdamW (subscriber, #48457) [Link]

I'm not sure that *is* what was meant, to be honest - I'm not convinced the "fix" really is a fix.

What the condition is "trying to mean" is basically: "skip this whole script if we're not in some sort of scenario where a device mapper device has appeared or changed". "dm_end" is literally the end of the file: `GOTO="dm_end"` means "just don't do anything else at all".

This is how the comments look on my version of the file:

# Device created, major and minor number assigned - "add" event generated.
# Table loaded - no event generated.
# Device resumed (or renamed) - "change" event generated.
# Device removed - "remove" event generated.
#
# The dm-X nodes are always created, even on "add" event, we can't suppress
# that (the node is created even earlier with devtmpfs). All the symlinks
# (e.g. /dev/mapper) are created in right time after a device has its table
# loaded and is properly resumed. For this reason, direct use of dm-X nodes
# is not recommended.
ACTION!="add|change", GOTO="dm_end"

this makes it pretty clear that what we're really trying to do here is "do stuff if a device is being added or changed, don't do anything if it isn't". So I don't think changing the condition to `ACTION=="remove"` is necessarily a correct fix at all. After all, one thing that means is that we'll go ahead with the script if the action is "unbind", the counterpart to "bind". Is that what we want? Are we sure it isn't going to do anything wrong? I'm pretty sure the script doesn't expect it, though hopefully it'll wind up bailing on a later check and not do anything disruptive...
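
The difference between the two conditions can be worked through with a small simulation. This is only a sketch: it approximates udev's matching as `|`-separated glob alternatives, the helper names are made up, and the action list is limited to the events discussed here.

```python
from fnmatch import fnmatchcase

ACTIONS = ["add", "change", "remove", "bind", "unbind"]


def matches(action: str, pattern: str) -> bool:
    """Approximate a udev match value: '|'-separated glob alternatives."""
    return any(fnmatchcase(action, alt) for alt in pattern.split("|"))


def reaches_rules(action: str, op: str, pattern: str) -> bool:
    """True if the filter does NOT send this event to GOTO="dm_end"."""
    hit = matches(action, pattern)
    # With "==" the GOTO fires on a match; with "!=" it fires on a non-match.
    skipped = hit if op == "==" else not hit
    return not skipped


# Original filter: ACTION!="add|change", GOTO="dm_end"
old_filter = [a for a in ACTIONS if reaches_rules(a, "!=", "add|change")]

# Changed filter: ACTION=="remove", GOTO="dm_end"
new_filter = [a for a in ACTIONS if reaches_rules(a, "==", "remove")]
```

Under these assumptions the original filter lets only add and change through, while the changed one lets add, change, bind and unbind through, so unbind now runs the rest of the file.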

Systemd catches up with bind events

Posted Nov 23, 2020 21:11 UTC (Mon) by gswoods (subscriber, #37) [Link]

As one who had a 35 year career as a system administrator but who knows only a little about kernel and systemd internals, all of this looks a lot like US politics, where the Republicrats and Demopublicans are more interested in blaming the other side for all our problems than they are in solving them.

Systemd catches up with bind events

Posted Dec 2, 2020 16:29 UTC (Wed) by joey (guest, #328) [Link]

The horribleness of the udev rule file format surely has a lot to do with this. I mean, this is a config file that uses goto for control flow! Take a udev rules file and imagine it implemented in your programming language of choice, and consider how this problem might have been avoided or at least made much more tractable to fix across the code base.

udev in the wrong basket?

Posted Dec 4, 2020 13:43 UTC (Fri) by oldtomas (guest, #72579) [Link]

Watching this I get the impression that udev should be part of the kernel project.