Changing Fedora's shutdown timeouts
On today's Fedora systems, a reboot cycle (for a kernel update, say) is normally a fairly quick affair, but that is not always true. The system waits for services to shut down cleanly, giving each one up to two minutes before killing it and moving on. A recent proposal to change the default timeout to 15 seconds, while still allowing some services to require more time, ran into more opposition than was perhaps anticipated. Not everyone was comfortable with shortening the timeout period; the decision has now been made to reduce it, though not as far as was proposed.
Change proposal
The proposal to shorten the timeout for Fedora 38, which is due in late April, was posted to the devel mailing list on December 22. The feature is owned by Michael Catanzaro and Allan Day; it would reduce the "extremely frustrating" delays that can occur when shutting down a Fedora system. The Fedora workstation working group has had an open bug for two years targeting the problem and has made efforts to change the upstream systemd default timeout, but to no avail. Thus, they are proposing that Fedora make a change to benefit its users:
The primary benefit of the change will be to mitigate a very annoying and - frankly - embarrassing bug. Our users shouldn't have to randomly sit waiting for their machine to shutdown.
An informal proposal to change the timeout was made to the Fedora Engineering Steering Committee (FESCo) late in the Fedora 37 cycle, but it was closed because more information (in the form of a Fedora change proposal) was needed. In that discussion and the one on the current proposal, the problem of simply hiding underlying bugs, where services should be shutting down cleanly but are not, was raised. The change proposed this time around—also available on the Fedora wiki—notes that concern:
Although this change will "paper over" bugs in services without fixing them, we emphasize that reducing the timeout is not merely a workaround for buggy services, but also the desired permanent design. Of course it is desirable to fix the underlying bugs as well, but it doesn't make sense to require this before fixing the service timeout to match our needs.
There are mechanisms to inhibit system shutdown when that is needed by a given service. In addition, packages can set a different timeout in their systemd unit files if that is required. But those timeouts can also stack up if multiple hanging service shutdowns are serialized, so the cumulative effect can be more than just one timeout period. The proposal would lower the current default timeouts (for services that do not set their own) to 15 seconds from either two minutes or 90 seconds currently, depending on the type of service.
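To see how per-service timeouts can add up, here is a minimal Python sketch of the arithmetic. It assumes a simplified model that is not taken from the proposal: hung services that must stop one after another form a chain, independent chains stop in parallel, and every hung service consumes its full timeout. The service names and the `worst_case_wait` helper are hypothetical.

```python
from typing import Dict, List


def worst_case_wait(chains: List[List[str]], timeouts: Dict[str, float],
                    default_timeout: float) -> float:
    """Return the longest possible shutdown stall, in seconds.

    Each chain lists hung services that stop strictly one after another,
    so their timeouts add up. Separate chains are assumed to stop in
    parallel, so the overall stall is the longest single chain.
    """
    longest = 0.0
    for chain in chains:
        # Services without their own setting fall back to the default.
        total = sum(timeouts.get(name, default_timeout) for name in chain)
        longest = max(longest, total)
    return longest


# Hypothetical example: three hung services stopping in sequence, plus one
# unrelated hung service that sets its own, longer timeout.
chains = [["app.service", "db.service", "storage.service"], ["vm.service"]]
overrides = {"vm.service": 120.0}

old_default = worst_case_wait(chains, overrides, default_timeout=90.0)
proposed = worst_case_wait(chains, overrides, default_timeout=15.0)
```

Under these assumptions, the 90-second default allows a stall of 270 seconds. The proposed 15-second default caps the three-service chain at 45 seconds, which leaves the service with its own 120-second setting as the longest wait.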
Reaction
Adam Williamson was concerned that the proposal was too aggressive; there may be situations where the system needs to cleanly shut down multiple virtual machines (VMs), which could take longer, so he thought that 30 seconds might be a more reasonable choice. "Going all the way from 90/120 down to 15 seems pretty radical." Chris Murphy wondered if it made sense to make the shorter timeouts opt-in or to provide a way for servers and other types of installations to opt out of the change. A concrete reason to wait longer was provided by "allan2016": "15 seconds will for sure kill the modem on the Pinephones for good." Removing the power without waiting the 20-30 seconds that the modem needs to shut down will apparently brick it.
Peter Boy was adamant that the timeout remain unchanged, at least for the Fedora server edition. Servers may have a lot of work to do before they can cleanly shut down (e.g. terminate VMs with their own delays, complete in-progress database transactions) and there is no available data on how long that might all take. The current values are generally working for servers; "this proposal brings no advantage at all for servers, only potential problems".
But Neal Gompa sees things differently; if the administrator is shutting the system down, they are doing so for a reason and, if the timeout is hit, it's likely because the service is hung. He suggested that either 15 or 30 seconds would be reasonable, especially in light of how systemd handles the timeout: "It's per service being shut down, rather than a global timeout." Boy disagreed, arguing that the current values "are empirically obviously a safe solution", but Gompa said: "If the end result is the same, it doesn't matter whether it's 30 seconds or 2 minutes."
Debugging
Trying to figure out what is causing a shutdown to time out is another part of the problem. The proposal notes that PackageKit is the most common offender, which is going to be difficult to fix, according to Gompa in the workstation bug entry, but there are others. Steve Grubb thought there should be a way to easily find out which service is holding things up, but Tomasz Torcz said that a message like that already exists. Debugging is still a problem though:
The problem is: at this points it is hardly debuggable. One cannot start a new shell, sshd is off already, journalctl too. No way to gather any information what's wrong with the process holding up shutdown. We only get a name. And usually you cannot reproduce the problem easy on next shutdown.
Grubb was unaware of the "trick" needed to access that information. Typing "Esc" at the stalled graphical console (which only shows "a black screen and a spinning circle") will show the textual messages, but Grubb thought that option was completely hidden by the interface. Fabio Valentini concurred with that:
Even if systemd prints nice diagnostic messages, they're useless if nobody is going to see them. And I doubt that many people know that pressing the Esc key makes plymouth go away. Would it be possible to print an informative message in Plymouth instead? Something like "Shutdown is taking longer than expected, please do not force off the computer".
In another part of the thread, Catanzaro noted that killing the services with a SIGKILL after the timeout did not really leave any information behind to figure out what went wrong: "Killing things silently makes it real hard to report bugs." He thought it would make sense to change FinalKillSignal for systemd to SIGQUIT so that a core dump would be created. Lennart Poettering suggested a different solution:
Don't use FinalKillSignal=SIGQUIT. Use TimeoutStopFailureMode=abort instead. (which covers more ground, and sends SIGABRT rather than SIGQUIT on failure, which has the same effect: coredumping).
He also cautioned that dumping core is not without costs, including time to write the core file. "You might end delaying things more than you hope shortening them." But Zbigniew Jędrzejewski-Szmek was not concerned about that particular problem; it would ultimately make the problems more visible:
It'll obviously delay the shutdown, making the whole thing even more painful. I assume that we would treat any such cases as bugs. If we get the coredumps reported though abrt, it'd indeed make it easier to diagnose those cases.
Catanzaro amended the proposal to follow Poettering's advice, but Kevin Fenzi wondered if it made more sense to selectively add shorter timeouts to services that are known to take too long, but that can be safely killed. Jędrzejewski-Szmek said that approach would mean that thousands of packages would need to be updated to get lower timeouts, which is not something that is realistically going to happen.
Instead, the idea is to attack the problem from the other end: reduce the timeout for everyone. Once this happens, we should start getting feedback about what services where this doesn't work. Some services legitimately need a long timeout (databases, etc), and for those the maintainers would usually have a good idea and can extend the timeout easily. Some services are just buggy, and with the additional visibility and tracebacks, it should be much easier to diagnose why they are slow. Approaching the problem from this side is much more feasible. We'll probably have to touch a dozen files instead of thousands.
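As an illustration of what extending a timeout for one service might look like, the following Python sketch builds the text of a systemd drop-in file. `TimeoutStopSec=` and `TimeoutStopFailureMode=` are real systemd service settings; the `render_stop_dropin` helper, the 180-second value, and the database unit mentioned afterward are hypothetical and do not come from the Fedora proposal.

```python
from typing import Optional


def render_stop_dropin(timeout_stop_sec: Optional[int] = None,
                       abort_on_timeout: bool = False) -> str:
    """Build the text of a systemd drop-in that tunes how a service stops."""
    lines = ["[Service]"]
    if timeout_stop_sec is not None:
        if timeout_stop_sec <= 0:
            raise ValueError("timeout_stop_sec must be positive")
        # TimeoutStopSec= overrides the system-wide default for this unit.
        lines.append(f"TimeoutStopSec={timeout_stop_sec}s")
    if abort_on_timeout:
        # On a stop timeout, send SIGABRT so that a core dump can be produced.
        lines.append("TimeoutStopFailureMode=abort")
    return "\n".join(lines) + "\n"


# Hypothetical service that legitimately needs three minutes to stop.
dropin_text = render_stop_dropin(timeout_stop_sec=180, abort_on_timeout=True)
print(dropin_text)
```

The resulting text could be saved as a drop-in such as `/etc/systemd/system/exampledb.service.d/override.conf` (the unit name is made up) and picked up with `systemctl daemon-reload`. A package maintainer would more likely set the same options directly in the shipped unit file.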
The existing timeout values were chosen arbitrarily when they were originally added to systemd, Poettering said. System V init had no timeouts at all, so the systemd developers chose "a conservative (i.e. overly long) value to not upset things too badly", though there were still some who were unhappy that there were timeouts. He is in favor of the change: "lowering the time-outs by default would make sense to me, but of course, people will be upset".
The FESCo issue for the change has more comments along the lines of those in the mailing-list discussion. The committee took up the question at its January 17 meeting. After a lengthy discussion, FESCo approved the proposal with two changes: the new default timeout would be 45 seconds and various Fedora editions (e.g. server) must be able to override the change. The timeout could potentially be lowered again in some future Fedora release.
There are few things more infuriating than waiting for one's computer to finally decide to give up and reboot, so it is nice to see a reduction in just how long that wait might be. Server administrators may have different needs and/or expectations, but even there, an infinite wait is not particularly tenable. Obviously, it would be even better if the services themselves got fixed so that they did not unnecessarily delay the inevitable, but it looks like this change will bring some more tools toward making that a reality.
