Systemd catches up with bind events
is not [the] fault of systemd or udev, but caused by an incompatible kernel change that happened back in Linux 4.12". It seems like an appropriate time to look at what happened, how administrators need to respond, and whether anything can be done to keep this kind of thing from happening again.
Modern computers tend to be highly dynamic, with devices (of both the physical and virtual variety) appearing and disappearing while the system is running. The kernel handles the low-level details with regard to these device events, but it is up to user space to take care of the rest. For that to happen, user space needs to know when something has changed with the system's configuration.
To that end, events are emitted to user space from deep within the kernel's driver-core subsystem whenever something changes; for example, plugging in a USB device will result in the creation of one or more ADD events to tell user space that the new device is available. The udev daemon is charged with responding to these events according to a set of rules; it can create device nodes, set permissions, notify other user-space components, and more, all in response to properties attached to events by matching rules. The set of possible events is relatively small and does not change often.
Breaking systemd
In July 2017, though, Dmitry Torokhov added two new event types called BIND and UNBIND. They are meant to allow user space to handle devices that need help before they can become fully functional — those that need a firmware load, for example. For drivers that support the new mechanism, a BIND event for a device will follow the ADD event once the device is ready to operate. This change was a part of the 4.14 kernel release in November 2017 (not 4.12 as stated in the systemd announcement).
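To make the event ordering concrete, here is a minimal Python sketch that parses kernel uevent messages in their raw netlink form: an `action@devpath` header followed by NUL-separated `KEY=VALUE` pairs. The sample payloads, the device path, and the sequence numbers are invented for illustration, and `parse_uevent` is a hypothetical helper rather than part of any library.

```python
def parse_uevent(raw: bytes) -> dict:
    """Parse one raw kernel uevent message into a dict of properties."""
    # Fields are NUL-separated; drop the empty field after the trailing NUL.
    fields = [f for f in raw.split(b"\0") if f]
    # The first field is the "action@devpath" header.
    action, _, devpath = fields[0].decode().partition("@")
    props = {"ACTION": action, "DEVPATH": devpath}
    # The remaining fields are KEY=VALUE pairs.
    for item in fields[1:]:
        key, sep, value = item.decode().partition("=")
        if sep:
            props[key] = value
    return props


# Invented sample messages: an ADD followed by a BIND for the same device.
SAMPLE_EVENTS = [
    b"add@/devices/example/usb1/1-1\0ACTION=add\0"
    b"DEVPATH=/devices/example/usb1/1-1\0SUBSYSTEM=usb\0SEQNUM=100\0",
    b"bind@/devices/example/usb1/1-1\0ACTION=bind\0"
    b"DEVPATH=/devices/example/usb1/1-1\0SUBSYSTEM=usb\0DRIVER=usb\0SEQNUM=101\0",
]

for raw in SAMPLE_EVENTS:
    event = parse_uevent(raw)
    print(event["SEQNUM"], event["ACTION"], event["DEVPATH"])
```

The point of the sketch is only that user space now sees two events for one device path where it used to see one.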
Later that same month, a bug report landed in the KDE bug tracker; this was perhaps the first case where somebody noticed a problem related to the new events. That report only made it to the kernel lists at the end of 2018, though, over one year later. By then, 4.14 had been made into a long-term support kernel and shipped by distributors, with relatively few complaints from users. Indeed, Greg Kroah-Hartman was mystified as to why problems were turning up a year later. The cause turned out to be a change to systemd that made it propagate the new events.
Specifically, the problem would appear to originate in the way that udev (which is a part of the systemd project) attaches tags to events. These tags, which are set and used by udev rules, control how user space will set up the new device. There is an assumption built in that there will only be a single event to announce the existence of a new device, so attaching tags to that event is sufficient. When the second event (the BIND event) shows up, the device state is reset and those tags are forgotten, leading to the associated device not being set up properly.
As a short-term "fix", systemd was patched to simply ignore the new events. That caused things to work as they did before, at the cost of hiding those events entirely. That was never a long-term solution; the new events were added for a reason and some devices need them for proper setup. So a better solution had to be found for the longer term; that solution has two aspects, one of which may be disruptive for users who have created their own udev rules.
Fixing systemd
The first piece is a reworking of the "tag" mechanism provided by udev. Tags are special properties that can be attached, then matched in subsequent rules or consumed by user space. Rather than attaching tags to events, as has been done until now, udev attaches them to devices, so tags added in response to an ADD event will still be there for the BIND event as well. For cases where rules need to respond only to tags added to the current event, a new CURRENT_TAGS property lists only those tags; it thus holds the value that the TAGS property held in previous releases.
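The difference between event-scoped and device-scoped tags can be modeled in a few lines of Python. This is a toy model, not udev's implementation: the `Device` class, the `process_event` function, and the single "tag on ADD" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy stand-in for a udev device record; not a real udev API."""
    devpath: str
    tags: set = field(default_factory=set)          # models TAGS
    current_tags: set = field(default_factory=set)  # models CURRENT_TAGS


def process_event(device, action, sticky):
    """Apply one toy rule: tag the device "systemd" on ADD events.

    sticky=False models the old behavior described above, where tags are
    reset for every event; sticky=True models tags that stay on the device.
    """
    device.current_tags = set()
    if not sticky:
        device.tags = set()
    if action == "add":
        device.current_tags.add("systemd")
    device.tags |= device.current_tags
    return sorted(device.tags), sorted(device.current_tags)


for sticky in (False, True):
    dev = Device("/devices/example/usb1/1-1")
    for action in ("add", "bind"):
        tags, current = process_event(dev, action, sticky)
        print(f"sticky={sticky} {action}: TAGS={tags} CURRENT_TAGS={current}")
```

In the non-sticky run, the tag set is empty by the time the BIND event is processed; in the sticky run, the tag survives on the device while the per-event set is empty for BIND.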
The other part, though, is a change that must be applied to a number of udev rule sets. Consider, for example, this snippet taken from a randomly chosen rules file (10-dm-disk.rules in particular) on a Fedora 32 system:
# "add" event is processed on coldplug only!
ACTION!="add|change", GOTO="dm_end"
The ACTION line causes the entire file to be skipped for anything other than ADD or CHANGE events; in particular, that is what will happen with BIND events. That will cause properties associated with those events to be lost — and the device in question to be set up improperly (if at all). The fix is to change that line to read:
ACTION=="remove", GOTO="dm_end"
That causes the rules to be skipped (and their associated state forgotten) only when the device is removed from the system.
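The effect of the two forms of the rule on each event type can be checked with a small Python model. The two functions below only mimic the match logic of the rules quoted above; they are hypothetical helpers and do not invoke udev.

```python
def skipped_by_old_rule(action):
    """Models ACTION!="add|change", GOTO="dm_end": skip unless add or change."""
    return action not in ("add", "change")


def skipped_by_new_rule(action):
    """Models ACTION=="remove", GOTO="dm_end": skip only on remove."""
    return action == "remove"


ACTIONS = ["add", "change", "bind", "unbind", "remove"]

for action in ACTIONS:
    old = "skip" if skipped_by_old_rule(action) else "run"
    new = "skip" if skipped_by_new_rule(action) else "run"
    print(f"{action:7} old rule: {old:5} new rule: {new}")
```

The two rules agree on ADD, CHANGE, and REMOVE; they differ only for the event types that did not exist when the old rule was written, which is exactly the case the allow-list form could not anticipate.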
The problem here is that these rules were written under the assumption that no new event types would be added, so anything that wasn't recognized as adding or modifying a device could be ignored. There is, evidently, a certain amount of code that runs in response to device events that has a similar problem. What this shows is, in effect, a sort of protocol ossification effect that has made it much harder to add event types to the API provided by the kernel. Indeed, in 2018, Torokhov remarked:
At the time, there was discussion of possibly reverting the change, causing the new events to disappear. But that approach had the potential to create regressions of its own, as some systems may well have depended on getting those events; the kernel release adding them was a year old by that point, after all. There was also discussion of adding some sort of knob to enable or disable the creation of BIND and UNBIND events, but that never came to pass. Instead, Torokhov described the work in the systemd project to make the changes described above, and Kroah-Hartman responded: "So all should be good".
A regression?
With luck, all will be good, but it has come at the cost of some work within the systemd community over the last two years; the systemd developers have made their displeasure known:
Was this a violation of the kernel's "no regressions" rule? The answer must almost certainly be "yes"; code that worked with 4.13 no longer worked with 4.14. What should have been done about it is a bit less clear. Had the issue been reported to the kernel community more quickly, it might have been possible to revert and redesign the change; after it had been deployed for a year, though, that was not a simple option. One could argue that the kernel community should have found some other way to fix the regression; the systemd 247-rc2 announcement tries to make that case. But once Torokhov posted that the problem was being addressed on the systemd side, there was no longer any pressure to do that.
Perhaps the real lesson here is that the community would be better served by closer relations between the kernel project and projects managing low-level utilities like systemd. Those relations have been somewhat strained at times, and there are not a lot of places where cooperative, cross-project discussions can take place. The presence of systemd developers at events like the Linux Plumbers Conference is limited at best, and those developers, not without reason, do not find the kernel mailing lists to be an entirely welcoming place. We are all working on the same system, though, and we would probably have an easier time of it if we could talk things through a bit more.
Posted Nov 13, 2020 21:10 UTC (Fri)
by MatejLach (guest, #84942)
[Link] (13 responses)
I do wish for the relations between the most deployed init system and the kernel to improve.
Posted Nov 13, 2020 21:15 UTC (Fri)
by willy (subscriber, #9762)
[Link] (12 responses)
Posted Nov 13, 2020 22:20 UTC (Fri)
by MatejLach (guest, #84942)
[Link] (5 responses)
Which is precisely the point: cooperation rather than pointing fingers would help here.
systemd has proven itself useful enough where it should be consulted by kernel developers and vice versa.
Posted Nov 14, 2020 0:49 UTC (Sat)
by gerdesj (subscriber, #5446)
[Link] (4 responses)
The kernel by definition is used on all Linux boxes and systemd is probably by now the most widely used init system, at least on systems where the sysadmin/user actually cares what is happening.
systemd has become the de-facto Linux init system or PID1 or whatever the hell you want to call it. I still recall coming across M van S's comments in init scripts for the first time rather a long time ago and suddenly feeling that a real person actually cared about me and my little system. I "got" open source about then - I kept on finding notes in man pages and readme files and so on that indicated I was dealing with people who give a shit. Every now and then I still find something to make me smile in a readme or a help menu. I don't get that feeling when I'm fiddling with Windows or Macs. Linux is properly corporate these days and quite rightly so - we've grown up but it is still nice to see a human touch sometimes.
Please remember why we do this stuff.
Posted Nov 21, 2020 6:59 UTC (Sat)
by ras (subscriber, #33059)
[Link] (1 responses)
Only for the desktop distros. It's too heavyweight for the smaller ones like Alpine, OpenWRT or Android, and containers like Docker don't use an init system at all. And that could well cover the bulk of deployed Linux instances.
Posted Nov 21, 2020 12:32 UTC (Sat)
by rahulsundaram (subscriber, #21946)
[Link]
The vast majority of Linux distros including RHEL, SLES, Debian etc (not just the desktop ones) use systemd by default. Docker containers are not comparable to distros, but some of them do run an init system, and it is popular enough that several distros include a systemd-container package specifically for this purpose. Also, systemd is not limited to an init system, so other parts get routinely used in containers as well.
Posted Nov 23, 2020 13:29 UTC (Mon)
by flussence (guest, #85566)
[Link] (1 responses)
It's hard to tell corporate types to remember FOSS had a human element when they came from outside that culture entirely, their salary depends on gentrifying it out of existence, and in their off-time their hobby is talking over everyone to proclaim “Well Actually everyone uses our software because our software is great because everyone uses it”.
Most of the interesting people seem to be using BSD these days.
Posted Nov 23, 2020 16:18 UTC (Mon)
by anselm (subscriber, #2796)
[Link]
It used to be that using Linux was the way to stand out from the crowd, to be nerdy and interesting and metaphorically show the finger to those stodgy Windows and Mac users.
Now Linux has been mainstream for a while and is no longer good for nerd cred. People's elderly relations can (and do) use it. This means that the people who were using Linux 20+ years ago, when it meant not being able to do certain things (that when pointed out, one would adamantly insist weren't worth doing, anyway), spending three days to get a new video card/monitor working, etc., are being forced into BSD if they still want to impress their peers. But that's not because of BSD's versatility, wide compatibility with popular hardware and peripherals, and technical excellence – it's because few other people want to use it. It's the IT equivalent of an Indian fakir's bed of nails; very comfortable and just the thing if you're a fakir, but an item of morbid fascination for others.
Posted Nov 14, 2020 0:03 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Come on, even Linus said that it was perfectly fine for systemd to use the command line that way[1] and, after having laced some emails with remarks about Kay, later admitted that it was just a bug and people were overreacting[2].
[1] http://lkml.iu.edu/hypermail/linux/kernel/1404.0/01488.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1404.0/02712.html
Posted Nov 16, 2020 0:20 UTC (Mon)
by nevets (subscriber, #11875)
[Link] (1 responses)
This would not have escalated the way it did if we were told from the beginning, "oh there's a bug in systemd that causes it to spam the buffer, please upgrade to a fixed version". But instead we were told to bugger off. Yes, it really is a lack of communication and good faith between the two communities and I hope we can work better in the future.
Posted Nov 16, 2020 7:55 UTC (Mon)
by pbonzini (subscriber, #60935)
[Link]
Technically the upstream people couldn't have known, since the bug was introduced by an incorrect distro backport. And if a buggy systemd, one that spews assertion failures all the time, will slow boot down to a crawl, the systemd people might even consider that to be a feature. It can and will happen for kernel WARNs as well, and a buggy PID 1 is not much better than a buggy kernel. But these are details, and in general I think we agree.
What this shows to me, is that Linux is sorely lacking postmortems. Whenever Linus screams at me, I try to figure out what went wrong in my workflow and how I can improve it to avoid being screamed at in the future. On the other hand, if 5 years later people still believe that "debug" is a sacred part of the kernel command line (and not the more nuanced explanation that you gave), something went wrong on the kernel side in figuring out what happened.
Posted Nov 14, 2020 0:39 UTC (Sat)
by foom (subscriber, #14868)
[Link] (2 responses)
As a user, I'm sure I don't really care which part of the system boot is implemented by code in the kernel and which is implemented in systemd/udev/etc. I just want them to work together to boot the system properly. I mean, it makes sense to me that if I want to debug an issue, that everyone would respond to the one debug flag...
And same for "This incompatibility is all their fault!" -- again...who cares? It's nonsense.
Posted Nov 14, 2020 13:30 UTC (Sat)
by willy (subscriber, #9762)
[Link] (1 responses)
"divert to or use in a role different from the usual or original one"
What word would you use to describe using something for your own purposes that somebody else was already using? It's not like I said "stealing".
Posted Nov 14, 2020 16:26 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link]
Posted Nov 13, 2020 21:24 UTC (Fri)
by jkingweb (subscriber, #113039)
[Link] (5 responses)
Posted Nov 13, 2020 22:16 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
Posted Nov 14, 2020 2:23 UTC (Sat)
by koh (subscriber, #101482)
[Link] (3 responses)
Posted Nov 14, 2020 3:12 UTC (Sat)
by khim (subscriber, #9252)
[Link]
Absolutely. LWN even has an article which explains how and why that case should be handled. But there is also the rule that if nobody notices, it's not broken.
Now… we have a very weird corner case: somebody noticed… a year after the change was made. That's… rather unusual, to say the least.
Posted Nov 14, 2020 8:42 UTC (Sat)
by abo (subscriber, #77288)
[Link] (1 responses)
Perhaps it is reasonable to consider systemd exempted from the kernel's ABI/API stability promise, because it is sometimes almost the only user of certain interfaces?
Posted Nov 15, 2020 15:19 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link]
> Perhaps it is reasonable to consider systemd exempted from the kernel's ABI/API stability promise, because it is sometimes almost the only user of certain interfaces?
That's complicated. With more and more people using containers—including OS containers running a full-blown init system—it's not that rare to see very new userspace on old kernels or vice versa. This also means that it will be much harder to remove features in distro kernels: for example, even if your distro ships with an nftables-based iptables(8), there could be containers using the older iptables API.
Posted Nov 13, 2020 22:42 UTC (Fri)
by GhePeU (subscriber, #56133)
[Link] (4 responses)
The only news in this story is that, maybe for the first time, the systemd people are not the upstream project, and I think there’s a German word for what I’m feeling right now :)
Posted Nov 13, 2020 23:33 UTC (Fri)
by ubhofmann (subscriber, #47368)
[Link] (1 responses)
Posted Nov 14, 2020 7:03 UTC (Sat)
by jonas.bonn (subscriber, #47561)
[Link]
Posted Nov 15, 2020 15:11 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
Well, I think Lennart is well used to being downstream, and he likes to rely on upstream doing what they claim.
This seems a classic case of upstream not sticking to its promises, which is the whole problem with the unixy philosophy of being liberal with what you accept, and strict in what you emit. systemd (and pulseaudio, etc etc) is strict in expecting upstream to do what they promised.
Cheers,
Posted Nov 15, 2020 18:44 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
That's not Unix, that's Postel's Law, which IIRC originates from TCP/IP (where it is *also* an unholy mess, but of a different kind).
Posted Nov 13, 2020 23:01 UTC (Fri)
by syrjala (subscriber, #47399)
[Link] (19 responses)
Posted Nov 14, 2020 16:43 UTC (Sat)
by IanKelling (subscriber, #89418)
[Link] (18 responses)
Posted Nov 15, 2020 13:55 UTC (Sun)
by syrjala (subscriber, #47399)
[Link] (17 responses)
Posted Nov 15, 2020 18:47 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (16 responses)
If you don't tell them about the hardware that reproduces the bug, they cannot reproduce it. If they are unable to reproduce a bug, how are they supposed to evaluate a patch for that bug?
Posted Nov 16, 2020 10:56 UTC (Mon)
by k3ninho (subscriber, #50375)
[Link] (15 responses)
This is dull, unhelpful pushback -- what would be better if not actually helpful is to call no-op ACTION=="remove", GOTO='end_stanza' an antipattern making you think that add/change/remove are the only legitimate udev action types and core to this 4.12-changed-userspace issue.
(There's a further issue at the level of our civilisation and society where 'works for me' gives people with power -- to fix bugs raised by users -- a habit of denying the lived experience of users and the struggles that users have with our software, which can become a life-long denial of the lived experience and struggle of other human people. I get that, in software, unanticipated complexity means that fixes have to not also break other things and that makes an apparently-simple change expensive and unpredictable, easier to push back and not make changes. Here's the question from this rhetoric: Are we not the wizards and masters of these systems that we should be able to change them to work more correctly for more people?)
K3n.
Posted Nov 16, 2020 12:15 UTC (Mon)
by magfr (subscriber, #16052)
[Link]
The example in the article was matching
For this poster the problem sounds like the reverse: he needs to match ~add|change and someone has "optimized" that to remove.
This proves that one needs to know what one is doing and that crap can be written in any language, in this case the udev config rule language.
One way to fix this is to document an ERROR event.
Posted Nov 16, 2020 12:21 UTC (Mon)
by hkario (subscriber, #94864)
[Link] (8 responses)
you get a bug report, you look at the experienced behaviour, you haven't encountered it before; you try it with your hardware, it's not reproducible; you look at code that *may* be related, it doesn't seem possible to trigger this kind of behaviour
now, what on earth can you do more than to ask for more information?
Developers aren't omniscient and omnipotent entities that exist beyond confines of space and time, entities that fix bugs based only on a fickle. They're human, and they need to understand the bug before they can fix it.
Posted Nov 16, 2020 17:59 UTC (Mon)
by jezuch (subscriber, #52988)
[Link] (7 responses)
Posted Nov 16, 2020 18:30 UTC (Mon)
by rahulsundaram (subscriber, #21946)
[Link] (6 responses)
That wasn't what was said however. There was a question back on what makes the hardware different which seems to have gone unanswered. Given the wide variations in hardware, this is a reasonable question.
Posted Nov 20, 2020 15:16 UTC (Fri)
by k3ninho (subscriber, #50375)
[Link] (5 responses)
>That wasn't what was said however.
It wasn't *exactly* what was said but it was the spirit of what was said.
K3n.
Posted Nov 20, 2020 15:45 UTC (Fri)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
I don't agree but even assuming that, works for me is a fine thing to say if you don't stop at that point. There was a query for more information. It's up to the reporter to pursue that further
Posted Nov 20, 2020 20:08 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
It's about communication. I certainly have more to learn on this front, but part of it is realizing the differences in knowledge and expectations on either side of the wire.
Posted Nov 21, 2020 19:14 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (2 responses)
"Works for me" is a request for more information or diagnostic work.
But I've also been the recipient of the response, "What you're doing is too unusual for me to care about. Do what I do, and it will work." Many times. I'm creative. I suppose someone might characterize that as "works for me."
Posted Nov 28, 2020 9:21 UTC (Sat)
by jezuch (subscriber, #52988)
[Link] (1 responses)
But other people will feel differently about this.
Posted Nov 28, 2020 19:29 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
"Can't reproduce" implies you have tried to replicate the error, you've put in a bit of effort to help the person with the problem.
"Works for me", on the other hand, *could* mean the same thing. It could also mean "I don't suffer that problem, so I can't be bothered to look for it".
And then there's the language problem. I'm probably known for being a bit prickly about language and how, even when you may think you're speaking the "same" language, the identical word may mean different things based on the speaker's background.
Cheers,
Posted Nov 16, 2020 21:14 UTC (Mon)
by pebolle (guest, #35204)
[Link] (4 responses)
Poe's law works both ways: one is never sure whether someone is sarcastic or sincere on the internet.
Posted Nov 18, 2020 14:37 UTC (Wed)
by k3ninho (subscriber, #50375)
[Link] (3 responses)
>Poe's law works both ways: one is never sure whether someone is sarcastic or sincere on the internet.
You have to live your life, I can't make this statement a positive for you if you've taken it on bad faith. Plus, I hope that you can overcome whatever made it difficult to trust words from a random internet person. Maybe the world also needs to change to allow you this.
Life's going to be miserable for everyone if we presume bad faith.
K3n.
Posted Nov 18, 2020 19:32 UTC (Wed)
by pebolle (guest, #35204)
[Link] (2 responses)
The point here is that the statement I quoted is entirely over the top but it's still impossible to be sure whether it was made sincerely or not.
Look: developers that say 'works for me' are simply stating a mundane fact. If things didn't work for them they could start working on a fix. (If they have the time and the motivation to do that, of course.) But as long as it's unclear what triggers the bug that's been reported to them they are about as clueless as any random person using their software. I'd guess that all of this should be obvious to the kind of people reading lwn.net.
So if I read a comment containing little treasures like "a further issue at the level of our civilisation and society" and "[something] gives people with power [...] a habit of denying the lived experience of users and the struggles that users have with our software" and "life-long denial of the lived experience and struggle of other human people" (human people!) then, yes, Poe's law kicks in once again.
Posted Nov 20, 2020 15:45 UTC (Fri)
by k3ninho (subscriber, #50375)
[Link] (1 responses)
Beyond that, I don't have to care how you respond.
K3n.
Posted Nov 20, 2020 23:09 UTC (Fri)
by pebolle (guest, #35204)
[Link]
You might never read this but I only quoted your hyperbole verbatim. How is that parody?
Posted Nov 13, 2020 23:33 UTC (Fri)
by walters (subscriber, #7396)
[Link] (1 responses)
Not sure if it was intentional, but this was funny.
Posted Nov 15, 2020 13:45 UTC (Sun)
by Conan_Kudo (subscriber, #103240)
[Link]
Posted Nov 14, 2020 0:11 UTC (Sat)
by sbaugh (guest, #103291)
[Link] (6 responses)
Why is this? Certainly lots of userspace stuff happens at LPC. And systemd has its own conference, All Systems Go, and it's very Linux-specific and focused on kernel features - one might also ask why there's not more kernel presence at ASG.
Is there some kind of dramatic reason that this isn't a single conference, or at least two conferences with roughly the same set of attendees? If there really is no overlap, that seems pretty strange.
Posted Nov 14, 2020 0:37 UTC (Sat)
by Paf (subscriber, #91811)
[Link]
systemd is big, and big enough to justify a conference, but it doesn’t cover/touch all aspects of Linux plumbing, by any means. So there are things at LPC that would be weird to have at a systemd conference. Secondly, there are competitors/alternatives for many of the services provided by systemd components, and while perhaps some of those developers would attend a systemd conference... yeah.
More crossover sounds good. Just one conference, though...?
Posted Nov 14, 2020 14:18 UTC (Sat)
by mezcalero (subscriber, #45103)
[Link] (4 responses)
I am not complaining about this though. I honestly believe that talks about MM and scheduling are highly relevant and should be held, and there needs to be a conf for that -- but also that it might not be the most interesting place for me personally to be, and I think a number of other userspace plumbing people think similarly. In particular as AllSystemsGo! exists these days, with a focus much much closer to what I am interested in: userspace plumbing stuff only. I have been one of the organizers of that conf, and I love it. Hence: no complaints from me; what LPC isn't for me anymore, ASG now is, and hence I am happy.
If LPC wanted to be more attractive to userspace people again I think they'd have to cut down heavily on those kernel-internals-focussed talks so that userspace people don't come back feeling pushed to the side as much. I doubt that doing so is clearly desirable though, given that those MM/scheduling talks are after all heavily relevant to many people, just not to many userspace folks like me. I mean, LPC attracts so so many attendees, so it's doing a lot of stuff right apparently, even if it's not the same as it was initially.
So, no hard feelings, but I hope this does explain a bit why you don't see me at LPC. (And I think I am not the only one thinking that way)
Lennart
Posted Nov 14, 2020 16:32 UTC (Sat)
by corbet (editor, #1)
[Link] (2 responses)
LPC 2020 was only as remote as your keyboard — also presumably not located in an obscure corner of North America. Microconferences included containers and checkpoint restart, Android, LLVM, testing and fuzzing, IoT, system boot and security, printing, application ecosystems, and the GNU toolchain.
Perhaps it's time to take another look? Or even help with LPC organization and drive it in the direction you would like to see?
Posted Nov 15, 2020 12:34 UTC (Sun)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Last time I did a talk at LPC (in Santa Fe), I didn't have the impression too many people cared; the room was the opposite of crowded. Which is totally fine, but it did suggest that the lack of interest is actually mutual in a way.
I am sure I'll check out LPC again one day, no doubt. And others from the communities that are involved in ASG have been attending LPC off and on over the years too. I just wanted to explain a bit the lack of enthusiasm from my person, and I think others from the same communities.
Lennart
Posted Nov 16, 2020 0:38 UTC (Mon)
by nevets (subscriber, #11875)
[Link]
That's why I asked you to come back and give it another try ;-)
-- Steve
Posted Nov 15, 2020 0:02 UTC (Sun)
by josh (subscriber, #17465)
[Link]
I think it makes sense to have kernel *tracks* at LPC, and to also have kernel/userspace interface tracks. LPC has some great kernel content, and that kernel content helps attract core kernel developers. It also has kernel/userspace interface content, which benefits from having kernel people around who might not have come solely for the kernel/userspace interface content. It sounds like the balance needs some tuning, but I don't think it makes sense to have 100% kernel/userspace interface content with no kernel internals at all, or you end up with a conference for which many kernel folks will encounter the same issue you're describing.
Posted Nov 14, 2020 1:03 UTC (Sat)
by dxin (guest, #136611)
[Link] (8 responses)
Posted Nov 14, 2020 1:40 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (7 responses)
Posted Nov 14, 2020 8:32 UTC (Sat)
by TheGopher (subscriber, #59256)
[Link] (1 responses)
Posted Nov 15, 2020 20:41 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Perhaps the RFCs of the time could have been written with greater care, but the trouble is that at the time (see RFC 988), they were in the process of designating class D (224.0.0.0/4) as multicast. They didn't know what class E would be used for, so they couldn't just say "treat class E as if it were unicast, unless a later standard says otherwise." For all they knew at the time, they would later want to use class E for some even weirder thing, and unicast processing would have been inappropriate or even harmful. So they just left it as "reserved," and the people who had to actually make the silicon and software decided that "reserved" meant "invalid." IMHO, they didn't really have much of a choice.
In short: From userspace's perspective, "reserved for future expansion" means "I don't know what this value represents, so if the kernel hands it to me, the only not-wrong thing I can possibly do is crash." In some contexts, ignoring the value *might* be not-wrong, but it's hard for userspace to predict that in advance. Regardless, the kernel cannot rely on userspace taking any particular interpretation, because as Linus has previously said, they don't break userspace, even where userspace is wrong.
Posted Nov 14, 2020 9:53 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (2 responses)
Posted Nov 14, 2020 10:57 UTC (Sat)
by embe (subscriber, #46489)
[Link] (1 responses)
Posted Nov 15, 2020 11:15 UTC (Sun)
by tinko92 (guest, #102129)
[Link]
Posted Nov 18, 2020 9:30 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
This is actually a thing in TLS. At least several TLS implementations send deliberately non-existing cipher suite names during negotiation to make sure that middleboxes don't encode stuff like "use the first cipher".
Posted Nov 19, 2020 14:49 UTC (Thu)
by kpfleming (subscriber, #23250)
[Link]
Chaos engineering can be quite useful.
Posted Nov 14, 2020 6:23 UTC (Sat)
by marcH (subscriber, #57642)
[Link] (4 responses)
How can someone create new uevents in 2017 and not test them on a system running systemd?
I must have missed something...
Posted Nov 14, 2020 7:17 UTC (Sat)
by geuder (subscriber, #62854)
[Link]
Haven't looked what 10-dm-disk rule really does nor how many other rules (in other distros?) are fatally affected by the same problem. While developer machines are not unlikely to use some basic LVM I doubt they typically have very fancy disk setups. I had failing udev rules before and it did not show up until later in some use case, not immediately preventing boot or all usage of the system.
Posted Nov 14, 2020 14:24 UTC (Sat)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Posted Nov 14, 2020 18:10 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
Indeed, this is exactly my question: how much were these new uevents used and tested before submission? Again I might have missed something but I find this sentence (and patch) amazing:
> As a short-term "fix", systemd was patched to simply ignore the new events
I thought I read somewhere that every new kernel feature must come with "real" use cases where "real" means running code. Yet the first thing the main and... surprised (!) consumer of this new feature did was... discarding it! How come this was not immediately reverted for obvious lack of testing?
Many kernel patches get months and even sometimes years of out-of-tree test coverage before making it to the main line, some even ship on millions of commercial products before getting merged, so why/how was this patch expedited? Just because it had a small number of lines? I can write a one-line kernel patch with very bad consequences any day :-)
> Perhaps the real lesson here is that the community would be better served by closer relations between the kernel project and projects managing low-level utilities like systemd.
Better relationships can only help but in this particular case it seems like the much more basic, focused and technical question: "What tested this and how?" would have been at least as effective.
Posted Nov 14, 2020 22:09 UTC (Sat)
by bnorris (subscriber, #92090)
[Link]
For one, I'm pretty sure the author's day job involves a distribution that does not run systemd (the init system). But since this article is really about systemd-udevd (which said distribution does use), I guess that's beside the point ;)
But in a similar vein: the set of udev rules running on a given distribution may vary wildly, so just because a rule ships on certain systems (e.g., the Fedora 32 example in the article) doesn't mean the distribution the author was developing has problematic rules of a similar type.
Your question sounds more like, "how can someone not test with the udev rules provided by libmtp [1]?" To me, it sounds like an honest oversight, and not a lack of legitimate use case or testing. But I could be wrong.
Posted Nov 14, 2020 15:31 UTC (Sat)
by zdzichu (subscriber, #17118)
[Link]
Posted Nov 15, 2020 0:18 UTC (Sun)
by flussence (guest, #85566)
[Link] (4 responses)
Posted Nov 15, 2020 13:40 UTC (Sun)
by shalem (subscriber, #4062)
[Link] (1 responses)
Or in your own writing style:
It is a good thing that you actually read and understood the article before commenting.
Posted Nov 15, 2020 13:41 UTC (Sun)
by shalem (subscriber, #4062)
[Link]
Posted Nov 16, 2020 18:33 UTC (Mon)
by mbiebl (subscriber, #41876)
[Link] (1 responses)
https://github.com/gentoo/eudev/commits/master looks like this project is pretty much dormant.
Posted Nov 19, 2020 23:40 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Nov 15, 2020 20:09 UTC (Sun)
by jthill (subscriber, #56558)
[Link]
Posted Nov 16, 2020 14:14 UTC (Mon)
by Fowl (subscriber, #65667)
[Link] (1 responses)
Posted Nov 17, 2020 20:43 UTC (Tue)
by zuki (subscriber, #41808)
[Link]
If a file mentions BIND events, then it's pretty clear that it has been adapted for the new events. But for other files, it's hard to say anything without knowing if the kernel drivers for that type of hardware ever emit BIND|UNBIND events. If they don't, a rule that only seems to care about ADD|CHANGE|REMOVE and hasn't been modified in 10 years might still be fully adequate. In other cases the driver might emit BIND|UNBIND events, but the rule just doesn't need to do anything for them, and translating BIND to ADD would actively break things.
Overall, I don't think a mode like this would be extremely brittle.
Posted Nov 18, 2020 8:52 UTC (Wed)
by zurdo (guest, #137849)
[Link] (2 responses)
Given n possibilities, wouldn't the correct expression have looked more like `ACTION=="remove"` in the first place? If that's what it meant back when that was written, surely it was an option to specify the one ACTION you want to act on instead of every ACTION you don't want to act on?
Posted Nov 18, 2020 15:45 UTC (Wed)
by cladisch (✭ supporter ✭, #50193)
[Link]
Posted Nov 25, 2020 9:12 UTC (Wed)
by AdamW (subscriber, #48457)
[Link]
What the condition is "trying to mean" is basically: "skip this whole script if we're not in some sort of scenario where a device mapper device has appeared or changed". "dm_end" is literally the end of the file: `GOTO="dm_end"` means "just don't do anything else at all".
This is how the comments look on my version of the file:
# Device created, major and minor number assigned - "add" event generated.
this makes it pretty clear that what we're really trying to do here is "do stuff if a device is being added or changed, don't do anything if it isn't". So I don't think changing the condition to `ACTION=="remove"` is necessarily a correct fix at all. After all, one thing that means is that we'll go ahead with the script if the action is "unbind", the counterpart to "bind". Is that what we want? Are we sure it isn't going to do anything wrong? I'm pretty sure the script doesn't expect it, though hopefully it'll wind up bailing on a later check and not do anything disruptive...
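To make the difference between the two conditions concrete, here is a rough Python model of how each one treats the five event types. It assumes a simplified view of udev matching (alternatives separated by `|`, each treated as a shell-style glob) and is not udev's actual implementation; `matches` and `skips_to_end` are hypothetical helpers.

```python
from fnmatch import fnmatchcase


def matches(action, pattern):
    """True if action matches any '|'-separated glob alternative in pattern."""
    return any(fnmatchcase(action, alt) for alt in pattern.split("|"))


def skips_to_end(action, op, pattern):
    """Model a rule of the form ACTION<op>"<pattern>", GOTO="dm_end"."""
    hit = matches(action, pattern)
    return hit if op == "==" else not hit


ACTIONS = ["add", "change", "remove", "bind", "unbind"]

# Condition as it appears in the file: ACTION!="add|change", GOTO="dm_end"
original = {a: skips_to_end(a, "!=", "add|change") for a in ACTIONS}
# Alternative suggested upthread: ACTION=="remove", GOTO="dm_end"
proposed = {a: skips_to_end(a, "==", "remove") for a in ACTIONS}

for a in ACTIONS:
    print(f"{a:7} original skips: {original[a]!s:5}  proposed skips: {proposed[a]}")
```

Under this model the existing condition skips the rest of the file for "bind" and "unbind", while the `ACTION=="remove"` variant lets both fall through into rules that were written with only "add" and "change" in mind.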
Posted Nov 23, 2020 21:11 UTC (Mon)
by gswoods (subscriber, #37)
[Link]
Posted Dec 2, 2020 16:29 UTC (Wed)
by joey (guest, #328)
[Link]
Posted Dec 4, 2020 13:43 UTC (Fri)
by oldtomas (guest, #72579)
[Link]
Systemd catches up with bind events
Most of the interesting people seem to be using BSD these days.
> If someone was to create a userspace program relying on a particular syscall, flag, whatever, not being implemented - until it finally is - would that be a regression?
The fix in that case is a lot simpler, and backporting it to various distro systemd versions isn't a big deal, but it's still a regression.
Wol
Maybe someone will finally merge my bluez fix for this same regression: https://lkml.org/lkml/2018/12/4/1167
Many years of past experience have given this a name: 'works for me'. Developer doesn't experience the problem and can't conceive of the imagined version of the code in their head not running as they think it will.
~add|change when what is needed is remove.
Any rule that mentions an ERROR event is broken.
Any action that happens when ERROR is issued is a bug.
This thread is either about someone's serious misinterpretation of the "works for me" response as, "this is your problem; go away" or a misnaming of that actual response.
Systemd catches up with bind events - works for me
Wol
I certainly laughed when I saw that comment. 😂
Hmm... LPC 2019 was held in Lisbon — not a remote location in the US last I looked. Microconferences included BPF, distribution kernels, containers and checkpoint/restart, IoT, printing, toolchains, databases, Android, and system boot and security. That seems like there should be material to interest somebody who isn't looking for memory-management talks.
LPC
"warning: enum RESERVED not handled in switch".
And of course then you add
default: abort();
to make the warning go away and everything is fine ;)
> In July 2017, though, Dmitry Torokhov added two new event types called BIND and UNBIND.
> [...]
> Later that same month, a bug report landed in the KDE bug tracker; this was perhaps the first case where somebody noticed a problem related to the new events.
udev "can create device nodes" - not really
I see no signs that eudev intends to address this issue.
If the ADD event had been for fully-ready devices, shouldn't BIND have been delivered *first*, and possibly spelled "CONNECT", for devices that aren't fully ready? Or, oh, I guess the check for whether the device needs prep is done in userspace, in response to the ADD? Still, it seems to me there's now no "device ready" event, because you can't tell what an ADD means, and that just feels wrong.
# Table loaded - no event generated.
# Device resumed (or renamed) - "change" event generated.
# Device removed - "remove" event generated.
#
# The dm-X nodes are always created, even on "add" event, we can't suppress
# that (the node is created even earlier with devtmpfs). All the symlinks
# (e.g. /dev/mapper) are created in right time after a device has its table
# loaded and is properly resumed. For this reason, direct use of dm-X nodes
# is not recommended.
ACTION!="add|change", GOTO="dm_end"
udev in the wrong basket?