
Changing Fedora's shutdown timeouts

By Jake Edge
January 18, 2023

On today's Fedora systems, a reboot cycle (for a kernel update, say) is normally a fairly quick affair, but that is not always true. The system waits for services to shut down cleanly, and it will wait up to two minutes before killing a service and moving on. A recent proposal to change the default timeout to 15 seconds, while still allowing some services to require more time, ran into more opposition than was perhaps anticipated. Not everyone was comfortable with shortening the timeout period; the decision has now been made to reduce it, though not as far as was proposed.

Change proposal

The proposal to shorten the timeout for Fedora 38, which is due in late April, was posted to the devel mailing list on December 22. The feature is owned by Michael Catanzaro and Allan Day; it would reduce the "extremely frustrating" delays that can occur when shutting down a Fedora system. The Fedora workstation working group has had an open bug for two years targeting the problem and has made efforts to change the upstream systemd default timeout, but to no avail. Thus, they are proposing that Fedora make a change to benefit its users:

The primary benefit of the change will be to mitigate a very annoying and - frankly - embarrassing bug. Our users shouldn't have to randomly sit waiting for their machine to shutdown.

An informal proposal to change the timeout was made to the Fedora Engineering Steering Committee (FESCo) late in the Fedora 37 cycle, but it was closed because more information (in the form of a Fedora change proposal) was needed. In that discussion and the one on the current proposal, the problem of simply hiding underlying bugs, where services should be shutting down cleanly but are not, was raised. The change proposed this time around—also available on the Fedora wiki—notes that concern:

Although this change will "paper over" bugs in services without fixing them, we emphasize that reducing the timeout is not merely a workaround for buggy services, but also the desired permanent design. Of course it is desirable to fix the underlying bugs as well, but it doesn't make sense to require this before fixing the service timeout to match our needs.

There are mechanisms to inhibit system shutdown when that is needed by a given service. In addition, packages can set a different timeout in their systemd unit files if that is required. But those timeouts can also stack up if multiple hanging service shutdowns are serialized, so the cumulative effect can be more than just one timeout period. The proposal would lower the current default timeouts (for services that do not set their own) to 15 seconds from either two minutes or 90 seconds currently, depending on the type of service.
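To make the stacking effect concrete, here is a small worked calculation. The count of three serialized hung services is a made-up figure for illustration; only the 90-second and 15-second timeouts come from the proposal.

```shell
# Worst-case extra shutdown delay when hung services are stopped one after
# another. Arguments: number of serialized hung services, per-service
# timeout in seconds.
worst_case_delay() {
    echo $(( $1 * $2 ))
}

# Three hung services is a hypothetical count, not a figure from the proposal.
echo "current (90s each):  $(worst_case_delay 3 90)s"
echo "proposed (15s each): $(worst_case_delay 3 15)s"
```

Under that assumption, the worst case drops from 270 seconds to 45 seconds; services stopped in parallel would not add up this way.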

Reaction

Adam Williamson was concerned that the proposal was too aggressive; there may be situations where the system needs to cleanly shut down multiple virtual machines (VMs), which could take longer, so he thought that 30 seconds might be a more reasonable choice. "Going all the way from 90/120 down to 15 seems pretty radical." Chris Murphy wondered if it made sense to make the shorter timeouts opt-in or to provide a way for servers and other types of installations to opt out of the change. A concrete reason to wait longer was provided by "allan2016": "15 seconds will for sure kill the modem on the Pinephones for good." Removing the power without waiting the 20-30 seconds its modem needs to shut down will apparently brick the modem.

Peter Boy was adamant that the timeout remain unchanged, at least for the Fedora server edition. Servers may have a lot of work to do before they can cleanly shut down (e.g. terminate VMs with their own delays, complete in-progress database transactions) and there is no available data on how long that might all take. The current values are generally working for servers; "this proposal brings no advantage at all for servers, only potential problems".

But Neal Gompa sees things differently; if the administrator is shutting the system down, they are doing so for a reason and, if the timeout is hit, it's likely because the service is hung. He suggested that either 15 or 30 seconds would be reasonable, especially in light of how systemd handles the timeout: "It's per service being shut down, rather than a global timeout." Boy disagreed, arguing that the current values "are empirically obviously a safe solution", but Gompa said: "If the end result is the same, it doesn't matter whether it's 30 seconds or 2 minutes."

Debugging

Trying to figure out what is causing a shutdown to time out is another part of the problem. The proposal notes that PackageKit is the most common offender, which is going to be difficult to fix, according to Gompa in the workstation bug entry, but there are others. Steve Grubb thought there should be a way to easily find out which service is holding things up, but Tomasz Torcz said that a message like that already exists. Debugging is still a problem though:

The problem is: at this points it is hardly debuggable. One cannot start a new shell, sshd is off already, journalctl too. No way to gather any information what's wrong with the process holding up shutdown. We only get a name. And usually you cannot reproduce the problem easy on next shutdown.

Grubb was unaware of the "trick" needed to access that information. Typing "Esc" at the stalled graphical console (which only shows "a black screen and a spinning circle") will show the textual messages, but Grubb thought that option was completely hidden by the interface. Fabio Valentini concurred with that:

Even if systemd prints nice diagnostic messages, they're useless if nobody is going to see them. And I doubt that many people know that pressing the Esc key makes plymouth go away.

Would it be possible to print an informative message in Plymouth instead? Something like "Shutdown is taking longer than expected, please do not force off the computer".

In another part of the thread, Catanzaro noted that killing the services with a SIGKILL after the timeout did not really leave any information behind to figure out what went wrong: "Killing things silently makes it real hard to report bugs." He thought it would make sense to change FinalKillSignal for systemd to SIGQUIT so that a core dump would be created. Lennart Poettering suggested a different solution:

Don't use FinalKillSignal=SIGQUIT.

Use TimeoutStopFailureMode=abort instead. (which covers more ground, and sends SIGABRT rather than SIGQUIT on failure, which has the same effect: coredumping).

He also cautioned that dumping core is not without costs, including time to write the core file. "You might end delaying things more than you hope shortening them." But Zbigniew Jędrzejewski-Szmek was not concerned about that particular problem; it would ultimately make the problems more visible:

It'll obviously delay the shutdown, making the whole thing even more painful. I assume that we would treat any such cases as bugs. If we get the coredumps reported though abrt, it'd indeed make it easier to diagnose those cases.

Catanzaro amended the proposal to follow Poettering's advice, but Kevin Fenzi wondered if it made more sense to selectively add shorter timeouts to services that are known to take too long, but that can be safely killed. Jędrzejewski-Szmek said that approach would mean that thousands of packages would need to be updated to get lower timeouts, which is not something that is realistically going to happen.

Instead, the idea is to attack the problem from the other end: reduce the timeout for everyone. Once this happens, we should start getting feedback about what services where this doesn't work. Some services legitimately need a long timeout (databases, etc), and for those the maintainers would usually have a good idea and can extend the timeout easily. Some services are just buggy, and with the additional visibility and tracebacks, it should be much easier to diagnose why they are slow.

Approaching the problem from this side is much more feasible. We'll probably have to touch a dozen files instead of thousands.
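As a rough sketch of what "extending the timeout" could look like for a single service, the following writes a per-unit drop-in that sets TimeoutStopSec and the TimeoutStopFailureMode=abort setting that Poettering recommended. The unit name exampledb.service, the drop-in file name, and the five-minute value are all hypothetical, and the files go into a scratch directory rather than /etc so the sketch can be tried safely.

```shell
# Write a per-unit drop-in that extends the stop timeout for one service.
# "exampledb.service" is a hypothetical unit name; SYSTEMD_DIR is a scratch
# directory so nothing under /etc is touched.
SYSTEMD_DIR="$(mktemp -d)"
UNIT="exampledb.service"
DROPIN_DIR="$SYSTEMD_DIR/$UNIT.d"

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/10-stop-timeout.conf" <<'EOF'
[Service]
# Allow up to five minutes for a clean stop (illustrative value).
TimeoutStopSec=5min
# On timeout, send SIGABRT so that a core dump can be collected.
TimeoutStopFailureMode=abort
EOF

echo "wrote $DROPIN_DIR/10-stop-timeout.conf"
```

On a real system the drop-in would normally live under /etc/systemd/system/ and take effect after "systemctl daemon-reload"; a package maintainer would more likely set the same options directly in the shipped unit file.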

The existing timeout values were chosen arbitrarily when they were originally added to systemd, Poettering said. System V init had no timeouts at all, so the systemd developers chose "a conservative (i.e. overly long) value to not upset things too badly", though there were still some who were unhappy that there were timeouts. He is in favor of the change: "lowering the time-outs by default would make sense to me, but of course, people will be upset".

The FESCo issue for the change has more comments along the lines of those in the mailing-list discussion. The committee took up the question at its January 17 meeting. After a lengthy discussion, FESCo approved the proposal with two changes: the new default timeout would be 45 seconds and various Fedora editions (e.g. server) must be able to override the change. The timeout could potentially be lowered again in some future Fedora release.
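One way a distribution-wide default and an edition-level override could coexist is through systemd's system.conf.d drop-in files, where the file that sorts last by name wins for a single-valued option. The sketch below is not necessarily how Fedora implements the change: the file names and the 90-second server value are hypothetical, and the files are written to a scratch directory. The effective_timeout helper only emulates the precedence rule; it is not a systemd tool.

```shell
# Sketch: a distribution default plus an edition override expressed as
# system.conf.d drop-ins. File names and the 90s value are hypothetical.
BASE_DIR="$(mktemp -d)"
CONF_DIR="$BASE_DIR/system.conf.d"
mkdir -p "$CONF_DIR"

# Distribution default (45 seconds, the value FESCo approved).
printf '[Manager]\nDefaultTimeoutStopSec=45s\n' > "$CONF_DIR/50-default-timeout.conf"

# Edition override; it sorts later by file name, so its value wins.
printf '[Manager]\nDefaultTimeoutStopSec=90s\n' > "$CONF_DIR/60-server-timeout.conf"

# Emulate "last assignment in lexicographic file order wins".
effective_timeout() {
    cat "$CONF_DIR"/*.conf | sed -n 's/^DefaultTimeoutStopSec=//p' | tail -n 1
}

echo "effective DefaultTimeoutStopSec: $(effective_timeout)"
```

On a running system, the value the manager actually uses can be inspected with "systemctl show --property=DefaultTimeoutStopUSec".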

There are few things more infuriating than waiting for one's computer to finally decide to give up and reboot, so it is nice to see a reduction in just how long that wait might be. Server administrators may have different needs and/or expectations, but even there, an infinite wait is not particularly tenable. Obviously, it would be even better if the services themselves got fixed so that they did not unnecessarily delay the inevitable, but it looks like this change will bring some more tools toward making that a reality.




Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:10 UTC (Wed) by Sesse (subscriber, #53779) [Link] (7 responses)

My most infuriating thing about shutdown isn't the 90-second timeout. It's a competition between these two:

1. You want to force a shutdown, so you press Ctrl-Alt-Del 7 times quickly, causing a message “User insisted too much, rebooting immediately”… and it keeps on waiting for services!
2. Shutdown timeouts going from “1 second / 1 min 30 sec” slowly to “1 min 29 sec / 1 min 30 sec” and then, when the timeout happens, the limit is doubled! So you go from that last message to “1 min 30 sec / 3 min” instead of killing the service. Why on Earth. Extra fun when coupled with #1.

Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:19 UTC (Wed) by logang (subscriber, #127618) [Link] (6 responses)

Yes!

I've also seen that after pressing ctrl-alt-del a million times to get through all that, it will actually say it's going to reboot but somehow screws it up and the computer just hangs without rebooting. The ctrl-alt-del feature is so broken that I don't even bother, I just hold down the power button to restart the machine. Debugging it is hard, and when I try and think it's fixed something else seems to pop up to cause a new delay. Quite annoying.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 17:17 UTC (Thu) by frostsnow (subscriber, #114957) [Link]

I'll third this. The worst part is that the hung service is usually something stupid like forgetting to unmount a drive before unplugging it and I have no remedy now but to twiddle my thumbs for 3 minutes (fantastic when I'm in a rush) or force-reboot. All I need is a method to tell systemd to force-kill the hung service NOW, but either that doesn't exist or I don't know about it. It's frustrating.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 17:53 UTC (Thu) by mgedmin (subscriber, #34497) [Link] (3 responses)

Alt+SysRq+S,U,B should be a little bit safer -- it gives the kernel the chance to sync and unmount the filesystems before rebooting.

Changing Fedora's shutdown timeouts

Posted Jan 20, 2023 10:41 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

U and B are disabled in Fedora.

Changing Fedora's shutdown timeouts

Posted Jan 23, 2023 17:23 UTC (Mon) by Tet (guest, #5433) [Link]

> U and B are disabled in Fedora.

Maybe by default. But they're enabled on my Fedora system.

Changing Fedora's shutdown timeouts

Posted Apr 19, 2023 20:02 UTC (Wed) by mchehab (subscriber, #41156) [Link]

Yeah, this is quite an irritating bug. It is almost certain to happen if an oops happens on Fedora.

What I do here, on systems that have trouble rebooting, is to run this:

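# Mount points of /dev-backed filesystems, skipping mount points whose path contains /dev
PARTS="$(grep /dev /proc/mounts | cut -d' ' -f 2 | grep -v /dev)"

history -a    # flush shell history before the forced reboot
sync
for i in $PARTS; do
	echo "Remounting $i as read-only"
	sudo mount -o remount,ro "$i"
done
echo "rebooting..."
# Trigger an immediate reboot through SysRq, without a clean shutdown
sync && sudo su -c "echo b > /proc/sysrq-trigger"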

Changing Fedora's shutdown timeouts

Posted Jan 20, 2023 22:04 UTC (Fri) by jccleaver (guest, #127418) [Link]

Every time I have to boot up a RHEL6 box to check on something it's like breathing a sigh of relief and not having to fight an init system that thinks it knows better and can't be scripted or configured without delving three layers deep in man pages and six layers deep in file systems.

Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:36 UTC (Wed) by xecycle (subscriber, #140261) [Link]

More often than bugs in services I run into bugs in kernels (e.g. recently amdgpu, or nfs some years ago), or sometimes "integration" bugs in some program X's interactions with systemd (e.g. leftover scope units from podman [1]), or my config of systemd [2]. Among these, kernel bugs usually cause infinite waits; I hope they actually configure that timeout into the hardware watchdog. I'd be disappointed if the outcome of such a long discussion does not reliably work in all cases.

[1] say podman run rust:1 sleep infinity, bang this would stay forever. It can be killed when systemd finally gives up waiting, but before that, the signal never made it to the sleep process, thus my waiting for a "shutdown" that did not even start feels stupid. I believe I should not "fix" the image; either podman should fix its use of scopes, or systemd should detect a PID 1 and behave differently.

[2] hit this once, when pvscan service decided to start a dmeventd directly, because dmeventd.socket was not started (my bad, did not enable it), and boom a leftover process in pvscan service after the main process exited. Like above, this is also never started to shutdown because no signal was delivered.

Experienced these cases, I feel that maybe we could be more clever in sending out the initial kills, but well, out of scope of the original discussion.

Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> TimeoutStopFailureMode=abort

This is what I _really_ hate in systemd. Its options are not easily discoverable, and there is a lack of proper organization for them. But I also don't know a good way to fix it.

Perhaps have them namespaced? Something like: Shutdown.Timeout.Behavior?

Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:49 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

> This is what I _really_ hate in systemd. Its options are not easily discoverable, and there is a lack of proper organization for them. But I also don't know a good way to fix it.

It's just listed in alphabetic order in the man page. https://www.freedesktop.org/software/systemd/man/systemd....

While this is comprehensive, it's more reference style as most man pages are and not as readily approachable if you are trying to figure out what to consider at all. Something to bring up to the systemd team if you are interested in contributing guides.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 14:54 UTC (Thu) by lobachevsky (subscriber, #121871) [Link]

Usually, if I look whether something exists I have a look at the systemd.directives manpage, which is basically the index to all the man pages.

Changing Fedora's shutdown timeouts

Posted Jan 18, 2023 23:48 UTC (Wed) by flussence (guest, #85566) [Link] (19 responses)

This smells like premature optimization based on reading tea leaves and vibes, and a great way to make anyone running a large RDBMS side-eye any distro based on Fedora. And I say that as a runit user where the default is 7 seconds!

There is one way to do it right: *profile* the shutdown timing of each service at runtime, persist that data somewhere, and once you have enough of it you can reduce the timeout for each service to a safe value, like the average clean exit + 6σ. It's never safe to pick "nice-looking" default numbers without knowing how long it actually takes in situ.
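As a rough sketch of the calculation described above, the following takes observed clean-stop durations for one service and suggests a timeout of the mean plus six standard deviations. The sample durations are invented, the function name suggest_timeout is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption; how the measurements would be collected and persisted is left out entirely.

```shell
# Suggest a stop timeout as mean + 6 standard deviations of observed
# clean-stop durations (seconds, one per line on stdin).
# Uses the population standard deviation; that choice is an assumption.
suggest_timeout() {
    awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n == 0) { print "no samples" > "/dev/stderr"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0          # guard against rounding error
            printf "%.1f\n", mean + 6 * sqrt(var)
        }
    '
}

# Hypothetical measurements for one service, in seconds.
printf '%s\n' 2 4 4 4 5 5 7 9 | suggest_timeout
```

With these made-up samples the mean is 5 seconds and the standard deviation is 2 seconds, so the suggested timeout is 17 seconds.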

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 0:35 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link] (17 responses)

> This smells like premature optimization based on reading tea leaves and vibes, and a great way to make anyone running a large RDBMS side-eye any distro based on Fedora.

Don't see why. Things like PostgreSQL and libvirtd already override the default so there is no timeout for these services in Fedora.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 9:56 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (16 responses)

Those services really ought to be SIGKILLable at any time, or else ACID is a complete lie.

(For performance reasons, it is *preferable* to cleanly shut them down, so they do not have to reconcile their journals etc. on startup. But this is not, or should not be, a correctness problem, and imposing an arbitrary timeout might well be a completely reasonable tradeoff for a sysadmin to make. It's less clear to me whether the distro is in a good position to provide a useful default.)

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 13:02 UTC (Thu) by kleptog (subscriber, #1183) [Link] (7 responses)

When I see this issue it's mainly caused by container-like constructions where the parent process in the "container" gains PID1 signal protections. Namely, a PID1 process may not be killed by a default signal handler. Since most processes don't install a signal handler for SIGKILL, such processes become immune to SIGKILL. Of course, many programs don't install a SIGTERM handler either, but that's somewhat more common.

This is the reason many container guides suggest using something like systemd, supervisord, dumb-init, tini, etc as the root process because those *do* install a SIGKILL handler which properly exits. The symptom is pretty clear: killing your container takes ages.

Of course, if SIGKILL fails, the entire cgroup gets nuked eventually, and there's no immunity from that.

Changing Fedora's shutdown timeouts

Posted Jan 27, 2023 11:19 UTC (Fri) by intelfx (subscriber, #130118) [Link] (6 responses)

> Since most processes don't install a signal handler for SIGKILL, such processes become immune to SIGKILL. Of course, many programs don't install a SIGTERM handler either, but that's somewhat more common.
>
> This is the reason many container guides suggest using something like systemd, supervisord, dumb-init, tini, etc as the root process because those *do* install a SIGKILL handler which properly exits. The symptom is pretty clear: killing your container takes ages.

There is something wrong. signal(7) says (in the Standard signals) that “the signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored”.

That’s the whole point of SIGKILL.

Changing Fedora's shutdown timeouts

Posted Jan 27, 2023 12:20 UTC (Fri) by Wol (subscriber, #4433) [Link]

Forgive me if I'm wrong, but - just as access rights do not apply to user 0 - signals do not apply to pid 1?

SIGKILL etc may be untrappable, but it's down to pid 1 to enforce that. Ergo, you cannot send pid 1 a SIGKILL and expect it to work.

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 28, 2023 3:38 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (4 responses)

PID 1 only receives signals for which it installs handlers. It's a bit special in that way. When "random executable X" is launched directly in a container with a PID namespace, it becomes that namespace's PID 1 and since SIGKILL handlers are generally only done in programs *expecting* to be PID 1, these processes therefore become unkillable.

Changing Fedora's shutdown timeouts

Posted Jan 28, 2023 12:49 UTC (Sat) by izbyshev (guest, #107996) [Link] (3 responses)

There are no "SIGKILL handlers" (and SIGSTOP handlers). It's impossible to change signal disposition for SIGKILL and SIGSTOP[1], as intelfx pointed out, so no init process (whether global or from a pid namespace) can change the effects of these signals. They can't be sent to the global init because the kernel explicitly disallows that[2]. For pid namespace inits, SIGKILL and SIGSTOP are allowed if sent from an ancestor pid namespace[3].

[1] https://elixir.bootlin.com/linux/v6.1.8/source/kernel/sig...
[2] https://elixir.bootlin.com/linux/v6.1.8/source/kernel/sig...
[3] https://elixir.bootlin.com/linux/v6.1.8/source/kernel/sig...

Changing Fedora's shutdown timeouts

Posted Jan 28, 2023 16:30 UTC (Sat) by kleptog (subscriber, #1183) [Link] (2 responses)

Thanks. I'd just like to say I find the PID1 signal handling somewhat underdocumented. The clearest thing I could find was this LKML message[1] from 2008.

[1] https://lwn.net/Articles/312721/

So it's true that init processes cannot be killed by unhandled signals, and that's usually what prevents container inits from dying on SIGTERM. But that's not what protects them from SIGKILL; that's handled on the other end. What happens if a container init receives a SIGSEGV due to a memory fault but has no handler installed wasn't immediately clear to me (I think the force mechanism will cause it to die anyway, not sure).

Changing Fedora's shutdown timeouts

Posted Feb 5, 2023 19:45 UTC (Sun) by flussence (guest, #85566) [Link] (1 responses)

IIRC in a namespace, if PID1 is SIGKILLed (or just exits some other way), what happens is every *other* PID in there gets SIGKILLed.

And that's part of the contract which "Container-init must behave like global-init to processes within the container" from that link entails - when you kill PID1 on bare metal the system stops.

Changing Fedora's shutdown timeouts

Posted Feb 26, 2023 16:30 UTC (Sun) by nix (subscriber, #2304) [Link]

> And that's part of the contract which "Container-init must behave like global-init to processes within the container" from that link entails

This seems to conflict badly with the reality which has random programs running as PID 1 in single-service containers. Hardly any of these programs are expecting ever to suddenly become an init...

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 13:46 UTC (Thu) by dskoll (subscriber, #1630) [Link] (5 responses)

How can you implement something that keeps the ACID guarantee in the face of SIGKILL? I believe it's impossible.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 13:48 UTC (Thu) by dskoll (subscriber, #1630) [Link] (1 responses)

Ugh, never mind. It has to be possible or else you couldn't have ACID guarantees in the case of sudden loss of power, I guess.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 14:58 UTC (Thu) by madscientist (subscriber, #16861) [Link]

Yes. It's a matter of whether you want to spend time at shutdown, or time at startup. On the product I worked on we used to drain our journal at shutdown but customers hated that. They wanted the thing to go down when they told it to: draining the journal can take many minutes, sometimes, depending on the situation.

Since we have to support ACID / poweroff / kill -9 / etc. anyway, we changed our shutdown to basically call _exit() and now it's very fast :). But it does mean that when they next start the service it may take longer to start up because all that work of reconciling the journal has to be done then.

For people who are concerned about it there is a standalone tool that can be used to reconcile the journal in the background while the service is offline, rather than doing it at startup.

ACID guarantees in the face of SIGKILL

Posted Jan 19, 2023 14:10 UTC (Thu) by matthias (subscriber, #94967) [Link] (1 responses)

In the same way as journaling filesystems work. First write the transaction into a journal (usually called a log for databases); after the transaction is confirmed to be written, the actual database can be changed.

Usually you want to keep the log with all transactions even after they have been written to the database. Using the log, you can recreate the current database state from a backup. So your backup scheme is as follows: do a full backup once in a while and keep a backup of all log files since the last full backup.

ACID guarantees in the face of SIGKILL

Posted Jan 19, 2023 14:26 UTC (Thu) by Wol (subscriber, #4433) [Link]

And try (not always possible) to make sure the log is on different hardware, even if only a different disk ...

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 19:21 UTC (Thu) by kreijack (guest, #43513) [Link]

> How can you implement something that keeps the ACID guarantee in the face of SIGKILL? I believe it's impossible.

The ACID guarantee has to be kept even in case of a power failure.

I remember that in the past I read that because the software has to be secure against a power failure, we could switch off the kernel without worrying about a clean shutdown or sync/flush....

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 14:34 UTC (Thu) by MarcB (guest, #101804) [Link] (1 responses)

They are SIGKILLable, but this doesn't mean doing so is free. Usually you pay for it on the next start-up in the form of a potentially lengthy and IO intensive recovery.

Changing Fedora's shutdown timeouts

Posted Jan 20, 2023 4:40 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Yes, I wrote an entire parenthetical specifically addressing that.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 6:51 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

> This smells like premature optimization based on reading tea leaves and vibes

It seems it has never happened to you, while it has happened to me several times.

Good for you that your daemons aren't buggy and get stuck when stopped.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 3:43 UTC (Thu) by foom (subscriber, #14868) [Link] (1 responses)

> One cannot start a new shell, sshd is off already,

This is one thing that kinda bums me out about systemd.

Now, my ssh connection is just about the _first_ thing to be shut down. IIRC, under sysvinit/rc.d, an open ssh connection would live throughout the entire shutdown process until the last "sending sigterm/sigkill to all remaining processes" step.

Is there some way to tell systemd to just NOT shut down sshd and getty (and ethernet link?) until literally _everything_ else is successfully shut down?

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 4:50 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 4:43 UTC (Thu) by marcH (subscriber, #57642) [Link]

> Servers may have a lot of work to do before they can cleanly shut down (e.g. terminate VMs with their own delays, complete in-progress database transactions)

Ironically, you must pretend to be a Virtual Machine to get the... power button back:
https://unix.stackexchange.com/questions/242129/how-to-se...

> Would it be possible to print an informative message in Plymouth instead? Something like "Shutdown is taking longer than expected, please do not force off the computer".

Yes more informative messages please and also add "..., please press the ESC key" to them. Windows-like silly logos make Linux look "cool" except when things go wrong.

> Don't use FinalKillSignal=SIGQUIT. Use TimeoutStopFailureMode=abort instead.
> ...
> Once this happens, we should start getting feedback about what services where this doesn't work

Besides configuration and code changes, what seems really needed is a new "Slow shutdown FAQ" and in general more _communication_ with the user. Otherwise Fedora will get more anger than feedback.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 5:13 UTC (Thu) by adobriyan (subscriber, #30858) [Link]

They also print stars indicating progress in red so it looks like an error while things are just a little too slow.

On boot too!

But with new Nvidia driver I don't see anything from Grub kernel selection to KDE login prompt so I stopped worrying!

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 6:50 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

I'm in favor of this change. I hate when systemd gets stuck waiting for some service and then I have to sit there and stare at the screen.

I normally end up doing a hard reset the second I see that is happening.

> 15 seconds will for sure kill the modem on the Pinephones for good.

I have "rebooted" mine multiple times by means of pulling out the battery and it still works. But it might just be luck? It certainly isn't bricked, so for my personal experience this looks false. But perhaps it might actually happen?

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 8:39 UTC (Thu) by taladar (subscriber, #68407) [Link]

I see the VM shutdowns as a non-issue since whatever timeout you choose as the default, a VM running the same distro as the host with a comparable number of services will always take as long as the host would if it strictly enforced that timeout and so will always be killed prematurely by such a timeout. There literally isn't a value you can choose as a default timeout for both host and VM that will always let the VM shut down cleanly unless you specifically exempt the VM shutdown service on the host from that default timeout.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 9:36 UTC (Thu) by tchernobog (subscriber, #73595) [Link] (2 responses)

> Kevin Fenzi wondered if it made more sense to selectively add shorter timeouts to services that are known to take too long, but that can be safely killed.

What I would like then is a way to selectively make certain services blocking the shutdown. My use case is the backup service I use (btrbk), which might be in the middle of a send/receive operation between disks.

I don't know if shutdown is still prevented if Btrfs is committing a snapshot deletion (basically an fsync operation). If the kernel will still prevent shutdown automatically in that case, overriding the init system wishes, I haven't really tested. If it does, at least for me it will affect shutdown time much more than the typical use case of the user session hanging for two minutes.

A message to make the user aware that shutdown/restart is taking longer than expected would be welcome either way.

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 12:05 UTC (Thu) by smcv (subscriber, #53363) [Link] (1 responses)

> What I would like then is a way to selectively make certain services blocking the shutdown

This already exists and is mentioned in the article: it's the inhibitor lock mechanism, which follows the frequent systemd pattern of having a D-Bus API that services can call into, and a CLI that wraps it for use in scripts, in this case systemd-inhibit(1).

Changing Fedora's shutdown timeouts

Posted Jan 19, 2023 12:22 UTC (Thu) by mbiebl (subscriber, #41876) [Link]

The inhibitor mechanism is one way; another is for units to specify an explicit timeout via TimeoutStopSec.

https://www.freedesktop.org/software/systemd/man/systemd....

Changing Fedora's shutdown timeouts

Posted Jan 20, 2023 12:35 UTC (Fri) by jezuch (subscriber, #52988) [Link] (1 responses)

> "15 seconds will for sure kill the modem on the Pinephones for good." Removing the power without waiting the 20-30 seconds its modem needs to shut down will apparently brick the modem.

Huh. So that would tell me that there are plenty of Pinephones with bricked modems out there? I would say that we need an equivalent of the list of fallacies of distributed computing, but for power management: power is always reliable, you can control when the power gets switched on or off, etc. But that's something the hardware engineers would have to read, and I know how it looks like in practice...

Anyway...

Changing Fedora's shutdown timeouts

Posted Jan 21, 2023 7:07 UTC (Sat) by marcin (subscriber, #159076) [Link]

It is actually a software bug. We have a replacement open source firmware [1] for that modem (not the baseband) which has fixed the issue.

[1] https://github.com/the-modem-distro

Changing Fedora's shutdown timeouts

Posted Jan 20, 2023 13:17 UTC (Fri) by poc (subscriber, #47038) [Link] (1 responses)

Not a general solution, but works for me on my desktop:

$ cat /etc/systemd/user.conf.d/99-stop-fast.conf
[Manager]
DefaultTimeoutStopSec=5s
$

To show the current value:
systemctl --user show --property=DefaultTimeoutStopUSec

[USec instead of Sec as documented in org.freedesktop.systemd1(5)]

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 13:31 UTC (Tue) by jec (subscriber, #5803) [Link]

What about Ubuntu?
Is the timeout the same?
Is it user configurable?

Changing Fedora's shutdown timeouts

Posted Jan 23, 2023 17:27 UTC (Mon) by Tet (guest, #5433) [Link] (22 responses)

> reduce the timeout for everyone. Once this happens, we should start getting feedback about what services where this doesn't work

Yes, you'll start getting feedback. You're also, by the very nature of what you're doing, likely going to corrupt people's databases and cause other non-trivial failures. That's not a price worth paying for many people, and certainly not a decision that a distribution that cares about its users would make. I've been in the Red Hat ecosystem for over a quarter of a century now. But it's moronic decisions like this that make me wonder whether I should be looking elsewhere. Particularly when there's no obvious downside to living with a longer timeout, other than slightly slower reboots, which the vast majority simply don't care about anyway.

Changing Fedora's shutdown timeouts

Posted Jan 23, 2023 19:20 UTC (Mon) by madscientist (subscriber, #16861) [Link] (21 responses)

Any "database" which is corrupted by a hard stop cannot be called a "database". What happens when someone kicks the power cord out from your system? Or your electricity goes out? Or something runs away with your memory and the kernel OOM killer decides to "kill -9" your process? This is just not a real problem.

There may be services which are corrupted by a hard shutdown but if so they are buggy and should not be used in any system that is intended to be reliable or recoverable.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 2:13 UTC (Tue) by Wol (subscriber, #4433) [Link] (14 responses)

> Any "database" which is corrupted by a hard stop cannot be called a "database".

You're making the exact same mistake the filesystem guys did at the transition from ext3 to ext4. The users don't give a damn about the filesystem, or the database. THEY CARE ABOUT THEIR DATA.

By journalling the filesystem metadata, ext4 reduced the need for a fsck and sped up boot times after a crash. The problem was, ext3 saved the user's data in the event of a crash, ext4 corrupted it. The user doesn't give a monkeys if the system recovers 10 minutes quicker, if the result is waiting ten hours for your corrupted files to be recovered from a backup.

Likewise, users don't give a monkeys that the database recovers quickly by replaying a journal, if the database has been corrupted by losing "data in flight". If the system crashes with a power outage, then maybe it's acceptable, it's hard to protect 100% against it. But (as we are discussing here) for data in flight to get lost because the system can't be bothered to wait for the database to flush its buffers? THAT IS TOTALLY UNACCEPTABLE. My employer is a supermarket. If customers commit a transaction, the LAST thing we want is for the system to "crash" and lose those transactions. But if the system does not give the database time to flush those transactions to the log, that is exactly what will happen.

It all depends what you mean by "a corrupted database". To you it means a database where the file STRUCTURE is damaged and needs to be repaired. To me it means a database where the file CONTENTS have been damaged, and may be irretrievably lost.

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 2:23 UTC (Tue) by pizza (subscriber, #46) [Link]

> THAT IS TOTALLY UNACCEPTABLE.

Fortunately (and it seems to be repeatedly lost in the "discussion") daemons that can result in data loss if shut down uncleanly already can (and do) override the force-kill timeouts (or actively set shutdown inhibitors).

But this change isn't intended to affect "well-behaved" stuff like the above, but rather stuff that's well and truly hung. Part of this change is to change the signal being sent so that killing the process triggers a core dump, which in turn will get reported out through abrt/etc if the user/admin so chooses, which in turn will allow these misbehaving things to be seen and properly dealt with.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 9:16 UTC (Tue) by brunowolff (guest, #71160) [Link] (4 responses)

If the application is doing this correctly, then committed transactions won't get lost. The application shouldn't report a transaction as committed until after the database says it is.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 9:35 UTC (Tue) by Wol (subscriber, #4433) [Link] (3 responses)

And you're missing the point COMPLETELY. By the USERS' DEFINITION, the database is corrupt.

The database STRUCTURE may be perfectly okay. That's no use to me if I've lost a thousand or so customer transactions because the user interface didn't have time to flush to the transaction log ...

The database is protecting ITSELF. It is not successfully protecting the user's data. THAT is the problem.

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 11:31 UTC (Tue) by mjg59 (subscriber, #23239) [Link] (2 responses)

Why is the database informing clients that data is committed if the data is not actually committed?

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 11:37 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Because the layer below lied to it? Because hardware lies?

At the end of the day, I DON'T KNOW. But I've seen enough stories about lying for benchmarks, to make me very wary of trusting an acknowledgement further than I can throw it.

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 25, 2023 4:34 UTC (Wed) by mjg59 (subscriber, #23239) [Link]

If hardware lies, how do you know any form of shutdown will result in the data ending up on disk? Either the database knows it's hit disk (in which case it can delay sending confirmation until it knows that) or it doesn't (in which case it doesn't matter how you shut down, there's still a chance of data loss)

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 9:38 UTC (Tue) by kleptog (subscriber, #1183) [Link] (7 responses)

> My employer is a supermarket. If customers commit a transaction, the LAST thing we want is for the system to "crash" and lose those transactions. But if the system does not give the database time to flush those transactions to the log, that is exactly what will happen.

Then that database is broken. Once the database acknowledges the COMMIT from the user, it must not lose the transaction, even if the power fails, SIGKILL, whatever (asteroid impact may be an exception). If you require synchronisation with other systems, you have two-phase commit where the decision whether a transaction will be committed or aborted can be made even after a crash and restart.

There are many ways to do this; usually the database has its own transaction log. For individual files you have the write/sync/rename trick, which every editor should be doing. You're right that the filesystem doesn't guarantee anything without using one of the *sync() syscalls; it's debatable whether it should.
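
For readers who have not met it, the write/sync/rename trick can be sketched in a few lines of shell. This assumes a `sync` that accepts file arguments (GNU coreutils 8.24 or later does) and a `mktemp` that accepts a template path; `atomic_write` is a name invented for this example. The idea is that a reader of the target path sees either the complete old contents or the complete new contents, even if the process is killed part-way through.

```shell
#!/bin/sh
# Sketch: replace a file's contents atomically (write, sync, rename).
# Reads the new contents from stdin.

atomic_write() {
    target="$1"
    # Create the temporary file next to the target so that the final
    # rename stays within one filesystem and is therefore atomic.
    tmp="$(mktemp "$target.XXXXXX")" || return 1

    if cat > "$tmp" &&          # 1. write the new contents
       sync "$tmp" &&           # 2. flush the file's data to disk
       mv -f "$tmp" "$target"   # 3. atomically swap it into place
    then
        # Flush the directory too, so the rename itself is durable.
        sync "$(dirname "$target")"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would look like `printf 'new contents\n' | atomic_write /path/to/file`. Note that `mktemp` creates the temporary file with mode 0600, so a real implementation would also want to carry over the original file's ownership and permissions.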

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 9:49 UTC (Tue) by Wol (subscriber, #4433) [Link] (6 responses)

> Then that database is broken. Once the database acknowledges the COMMIT from the user, it must not lose the transaction, even if the power fails, SIGKILL, whatever (asteroid impact may be an exception). If you require synchronisation with other systems, you have two-phase commit where the decision whether a transaction will be committed or aborted can be made even after a crash and restart.

"The decision to commit or abort". LOOK AT IT FROM THE USER'S POINT OF VIEW. That decision is a no-brainer. ABORT IS UNACCEPTABLE.

Everybody's looking at it from the database's POV - "protect the database". Nobody's looking at it from the user's POV - "protect the data".

The correct sequence of events is the user confirms checkout, the database accepts the transaction, flushes it to the log, and doesn't return "purchase confirmed" until that log is safely on disk.
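
That ordering (append the record, force it to disk, only then report success) is small enough to sketch directly. The function name and log format here are invented for illustration, and the sketch relies on a `sync` that accepts a file argument, as GNU coreutils does. It also inherits the caveat raised in the next paragraph: the confirmation is only as trustworthy as the layers underneath that claim to have flushed.

```shell
#!/bin/sh
# Sketch: acknowledge a transaction only after its log record is on disk.

commit_transaction() {
    log="$1"      # path to an append-only transaction log
    record="$2"   # one-line description of the transaction

    # 1. Append the record to the log.
    printf '%s\n' "$record" >> "$log" || { echo "purchase failed"; return 1; }
    # 2. Ask the kernel to flush the log file to stable storage.
    sync "$log" || { echo "purchase failed"; return 1; }
    # 3. Only now tell the user that the purchase went through.
    echo "purchase confirmed"
}
```

If the process is killed before step 3, the user never saw a confirmation, so at worst a logged purchase goes unacknowledged; a confirmed purchase should never be missing from the log.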

But if you've got a bunch of containers/vms/whatever on a machine in the cloud/your own cloud, can you guarantee that that log is safe? With all those layers of indirection in the way? Especially if the *container* confirms receipt of the log to the database, without the container itself checking that the layer below - and the layer below that etc etc - has really truly flushed the transaction? It's Russian Roulette.

Yup. If the software is well written, everything is fine. But can you guarantee that the stack is behaving as you expect / as it should?

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 10:04 UTC (Tue) by unilynx (guest, #114305) [Link]

> But if you've got a bunch of containers/vms/whatever on a machine in the cloud/your own cloud, can you guarantee that that log is safe? With all those layers of indirection in the way? Especially if the *container* confirms receipt of the log to the database, without the container itself checking that the layer below - and the layer below that etc etc - has really truly flushed the transaction? It's Russian Roulette

That's the easy part. And even in your simple scenario without containers, there is a possibility that the data was actually committed but we were unable to tell the user. Containers/VMs/clouds/layers only add some additional latency to these scenarios. And the final result is easy: either the change is committed or it isn't

(yep, you have to assume the layers are bugfree - but I've seen more dataloss with local systems ('confirmed' data not actually persisted) than with cloud volumes)

It gets harder when you need to persist something to multiple locations (spare/multimaster databases, a database and a payment system) because you can't atomically commit over both systems and need to deal with 2 phase commits, split brains, and all that fun.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 14:32 UTC (Tue) by pizza (subscriber, #46) [Link] (3 responses)

> "The decision to commit or abort". LOOK AT IT FROM THE USER'S POINT OF VIEW. That decision is a no-brainer. ABORT IS UNACCEPTABLE.

Sure, and here's another pony for the user's menagerie.

(If it's so "unacceptable" then the user is paying top $$$$ for systems, software, and expertise to prevent this sort of thing from happening. Otherwise it's just a deluded naive fantasy)

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 16:25 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

And the contractor is making megabucks using wet-behind-the-ears consultants just out of Uni with no real-world experience ...

What! Me? Cynical???

I would have thought we had plenty of decent staff looking after our systems, but given the steady trickle of emails where we've had a double-commit here, lost data there, ... okay these are user-level errors not database-level, but ...

Given that your average member of staff is, well, average, defensive programming should be the norm, not the exception.

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 16:47 UTC (Tue) by pizza (subscriber, #46) [Link] (1 responses)

> Given that your average member of staff is, well, average, defensive programming should be the norm, not the exception.

Don't forget that "Defensive programming" also includes "design your services to be robust in the face of sudden failures" (and the corollary "sudden failures WILL happen")

This now-decade-old slide deck from Netflix is quite informative in that respect:

https://www.slideshare.net/adrianco/high-availability-arc...

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 17:22 UTC (Tue) by Wol (subscriber, #4433) [Link]

Very interesting read (apart from the damn ads every few slides!!!).

I noticed they use a NoSQL database ... :-)

Now to try and get my lot to use a proper database instead of BigQuery (and Oracle) and a bunch of Excel spreadsheets. So long as they don't insist I use a hammer I'll be happy :-)

BigQuery's own documentation says it's OLAP, and not to use it for OLTP, so if they insist I use BigQuery and Google Sheets to write our production database I will *not* be happy ... oh well, lemmings will be lemmings ...

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 18:37 UTC (Tue) by james (guest, #1325) [Link]

> LOOK AT IT FROM THE USER'S POINT OF VIEW. That decision is a no-brainer. ABORT IS UNACCEPTABLE.

What if the transaction arrives 2 ms later, when the database is already down? That transaction is not going to be applied.

So you need support for telling the user "that transaction couldn't be committed" anyway.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 15:24 UTC (Tue) by andresfreund (subscriber, #69562) [Link] (5 responses)

People on dev systems do things like disable fsync. And no, that won't stop them from complaining about corruption during shutdown.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 16:44 UTC (Tue) by madscientist (subscriber, #16861) [Link] (4 responses)

As I said, there are many many ways that a process can go down without having a chance to die "cleanly". I have no particular opinion on what the timeout for shutdown before kill should be, but saying that reducing it is unacceptable because something that could happen at almost any time, could now happen at some other time too isn't a compelling argument in any way.

If you care about your data and your system's integrity then you'll use software that makes the effort to care as well and you'll be sure to not subvert that software by disabling those features. If you don't care, then no complaints.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 18:02 UTC (Tue) by Wol (subscriber, #4433) [Link]

Well, take my system. I've got ext4 over lvm over raid over dm-integrity over hardware.

Dunno how long it takes for in-flight to get to disk, but probably noticeably more than many other systems.

This is where just cutting time before a force-shutdown could be a disaster. Okay, with gentoo and all that, it won't affect me, but if they're cutting timeouts to keep devs happy, that's exactly what'll cause production systems, that might need extra time, to get burnt...

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 19:26 UTC (Tue) by andresfreund (subscriber, #69562) [Link] (2 responses)

I am not saying that it's unacceptable to reduce shutdown times. In fact, I agree that the default should be decreased. We might differ on what a good value for services that need a longer shutdown is, but that's details.

My point was just that I don't think it's as binary as >>Any "database" which is corrupted by a hard stop cannot be called a "database"<< because people intentionally use settings like fsync=off, to reduce overhead on systems when durability is not paramount, and still would like data to not be corrupted outside of genuine crashes.

Changing Fedora's shutdown timeouts

Posted Jan 24, 2023 21:25 UTC (Tue) by Wol (subscriber, #4433) [Link]

> My point was just that I don't think it's as binary as >>Any "database" which is corrupted by a hard stop cannot be called a "database"<< because people intentionally use settings like fsync=off, to reduce overhead on systems when durability is not paramount, and still would like data to not be corrupted outside of genuine crashes.

s/database/filesystem/

You have the ext4 debacle where the DEFAULT would lose data in the event of a system crash, resulting in a corrupt filesystem and necessitating a "restore from backup". And it was all totally unnecessary.

The point is not to avoid all corruption (that's not possible), but don't ask for trouble. The best user-space in the world won't save you if the infrastructure underneath drops you in it without due cause ...

Cheers,
Wol

Changing Fedora's shutdown timeouts

Posted Jan 25, 2023 1:18 UTC (Wed) by madscientist (subscriber, #16861) [Link]

What I'm trying to say is that the definition of "genuine crashes" is so wide and variable that expanding it to include "can't shut down in 30s" (or whatever amount of time) is not worth worrying about.

If people disable safety features then they are at the mercy of any sort of problem that happens. If they don't care enough about their data to preserve it in the face of any other problem they might face then I don't see why "slow shutdown" being added to that list should impact the rest of us.

Of course others may have other opinions, but I'll simply have to agree to disagree with those people: I don't see any grey area between "your data is reliably durable" and "your data is not reliably durable" that's worth arguing about.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds