
"dnf update" considered harmful

By Jake Edge
October 5, 2016

Updating a Linux distribution has historically been done from the command line (using tools like Debian's apt-get, openSUSE's zypper, or Fedora's yum—or its successor dnf). A series of crashes during system updates on Fedora 24 led Adam Williamson to post a note to fedora-devel and other mailing lists warning people away from running "dnf update" within desktop environments. It turns out that doing so has never truly been supported—though it works the vast majority of the time. The discussion around Williamson's note, however, makes it clear that the command is commonly run that way and that at least some users are quite surprised (and unhappy) that it isn't a supported option.

The underlying problem is that when running an update in a graphical terminal under GNOME, KDE, or some other desktop environment, there are a number of components that could crash (or restart) due to the update process. If X, GNOME, or the terminal program crashes, they will take the update process with them—which may leave that process in an indeterminate state. Williamson reported that users were getting "duplicated packages" and other messages when trying to rerun the update. So he was blunt in his recommendations:

[...] but in the meantime - and this is in fact our standard advice anyway, but it bears repeating - DON'T RUN 'dnf update' INSIDE A DESKTOP.

[...] If you're using Workstation, the offline update system is expressly designed to minimize the likelihood of this kind of problem, so please do consider using it. Otherwise, at least run 'dnf update' in a VT - hit ctrl-alt-f3 to get a VT console login prompt, log in, and do it there. Don't do it inside your desktop.

That led to several replies indicating that some had been doing updates that way frequently, for years, and with no problems. It also led some to wonder why the process could not be made more robust against this kind of problem, especially since it was a relatively common thing to do. Andrew Lutomirski asked:

How hard would it be to make dnf do the rpm transaction inside a proper system-level service (transient or otherwise)? This would greatly increase robustness against desktop crashes, ssh connection loss, KillUserProcs, and other damaging goofs.

But Stephen Gallagher thought that would be a waste of time, given that the offline update process has been available since Fedora 18. That process downloads the packages in the background, then lets the user choose when to reboot to install them. It then boots to a minimal environment, which is meant to minimize the possibility of the update breaking something and leaving the system in an indeterminate state.
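
Under the hood, the Workstation offline-update flow is built on the systemd "offline updates" handshake described in systemd.offline-updates(7). As a rough, hedged sketch (the cache path is purely illustrative, and in practice GNOME Software/PackageKit drives these steps):

# packages are downloaded to a persistent location first
ln -s /var/cache/example-updates /system-update   # flag the pending update for the next boot
systemctl reboot                                  # system-update-generator sees the symlink...
# ...and diverts the boot into system-update.target, where an update service
# applies the packages, removes /system-update, and reboots again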

But rebooting every time there are updates is a pretty heavy-handed approach. Lutomirski noted that he would rather avoid that step and that, for servers, it isn't even obvious how to trigger the offline update (for GNOME, the desktop simply gives the option to reboot and update). Gerald B. Cox seemed incredulous that the recommended path required a reboot: "As far as rebooting after every update? Huh? Who does that? Are we Windows?" In another message, Cox suggested that dnf could be made more robust:

Seems to me it would be more worthwhile to build in better error recovery within DNF than to always require "offline" - especially since the incidence of failure (at least anecdotally) just isn't that high. Instead of dealing with the problem (failed updates and error recovery) - this approach just tries to avoid it by always requiring a reboot.

But Chris Murphy strongly disagreed:

Sufficiently impractical that it's not possible. This is why offline updates exists. It's why work is being done on ostree>rpm-ostree>atomic host, which affects the entire build system, deployments, updates, and eventually all of the mirrors. It's why Microsoft and Apple don't allow anything other than offline updates. It's why openSUSE has spent a ton of resources, and a few bloody noses, getting completely atomic updates working with Btrfs and snapper, with very fine rollback capabilities. There's a reason why so many different experts at system updates have looked at this problem and just say, yeah no, not anymore of that.

Sam Varshavchik pointed out that tmux can already handle a crash of the X server, so it should be possible to make dnf itself more resistant to those kinds of problems using those techniques—others mentioned screen as a possibility as well. In order to do that, though, dnf would need to add the functionality of tmux/screen, but that's "far outside the scope of a package manager", Chris Adams said. Furthermore, the lack of a controlling TTY after a crash means that dnf would need to ignore the SIGPIPE that would result from writing its output, which is not something it should do. He suggested that those who want that functionality "run it under something that handles that, like tmux or screen".
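
For readers who want to keep doing live updates anyway, the suggestions in the thread amount to detaching the update from anything a desktop crash can take down. A minimal, hedged sketch (the session names and the transient unit name are arbitrary):

tmux new-session -s update 'sudo dnf -y update'    # detach with C-b d; reattach with: tmux attach -t update
screen -S update sudo dnf -y update                # or with screen; reattach with: screen -r update
sudo systemd-run --unit=dnf-update dnf -y update   # transient system service, roughly what Lutomirski asked for;
                                                   # watch it with: journalctl -u dnf-update -f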

Varshavchik pointed to Android as a system that can do application updates without a reboot: "The only time you need to reboot an Android device is for a kernel-level update." For Fedora, though, the problem is that it follows "the 'distribution is just a big pile of RPMs' model", Williamson said. Fedora cannot distinguish between updates that are "system level" versus those that aren't, but Android can (and even it requires reboots on more than just kernel updates):

No, in fact, it's for any *system level* update. Any change to the underlying system (as opposed to an app) requires the full reboot treatment. Only updates to app packages don't.

The reason Android can do fairly good app updates is precisely because it does exactly what Flatpak and Snappy are trying to do for Linux: hard separation between app space and system space. Flatpak and Snappy didn't just spring fully formed from a vacuum, they're very obviously the product of someone using Android and/or iOS and going 'huh, maybe we should do that'.

Murphy expanded on that idea some:

Strictly speaking [rebooting is] not necessary for every update, there's just no mechanism for knowing for sure what updates entail more risk than others. You'll notice that once an application is installed, whether by dnf or Gnome Software, it's considered part of the system. There's no separation of OS upgrades from application updates.

But there is a misconception that dnf update completely updates the system without a reboot, Peter Larsen said:

People think that "yum/dnf update" leaves their system in a new updated stage. But it doesn't (completely). It never has. Only after a reboot are all your patches applied and active. Existing/running processes are rarely if ever reloaded. So when you update libraries, kernels etc. your system will keep running with the old versions of those libraries loaded. [...]

The only real complete update you can do is one that does a full reboot. We do have a few tricks with DNF which will attempt to let you know what needs restarting. But you'll find that a good part of our updates requires a restart of most if not all your system, in order for the updates to become fully active.
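
The "tricks" Larsen mentions are presumably tools along the lines of needs-restarting (shipped in yum-utils/dnf-utils) or the tracer plugin that comes up later in the comments; an illustrative invocation, assuming one of those packages is installed:

sudo needs-restarting    # list running processes still using files that have since been replaced by updates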

Perhaps partly because it normally works just fine, it is surprising to find that dnf update is not really a supported way to update a Fedora system—even from a virtual terminal outside of the desktop. Many will probably keep on doing it, but once in a while may get bitten. As Williamson put it: "It works fine all the time until it doesn't, and then you're left with a pile of broken bits that you get to spend all afternoon fixing."

Meanwhile, though, Williamson and others were working on tracking down just what caused the spate of problems that led to the original warning. It turns out that an update to the systemd-udev package would restart the service, which would result in the video adapter(s) being plugged in again, which caused X to crash, but only for devices with multiple graphics adapters (for example, hybrid graphics in laptops). As detailed in Williamson's blog post, the problem will be fixed at both ends, which is clearly to the good.

As far as rebooting after every update goes, though, one guesses that many longtime users will make their own decisions on when to do that, while newer users will likely just do what GNOME suggests. Some will still end up with "broken bits" occasionally, but won't have to reboot as frequently—with an update stream as constant as Fedora's, that tradeoff may well be worth it.




"dnf update" considered harmful

Posted Oct 6, 2016 3:15 UTC (Thu) by josh (subscriber, #17465) [Link] (8 responses)

This isn't specific to dnf, either. Debian's apt is *supposed* to support live updates. But increasingly, the response to critical bugs like "updated while the desktop was running and now I can't unlock the screen" is "upstream says you shouldn't have been upgrading from within the desktop anyway, wontfix".

"dnf update" considered harmful

Posted Oct 6, 2016 4:17 UTC (Thu) by xanni (subscriber, #361) [Link] (7 responses)

Debian and derivatives already offer to terminate your screen locking when doing an update that might prevent you from unlocking afterwards. And you can use "checkrestart" or the more recent "needrestart" which hooks into apt post-update to let you know what running code is still using the old libraries from before the upgrade, and offer to restart it.
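
(As an aside, not part of the comment: on a Debian-ish system those checks look roughly like the following, assuming the debian-goodies and needrestart packages are installed.)

sudo checkrestart    # from debian-goodies: processes still using deleted or replaced files
sudo needrestart     # scans for outdated libraries in use and offers to restart the affected services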

"dnf update" considered harmful

Posted Oct 6, 2016 4:32 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I think the reason that systems have stopped preferring online updates is that this is inherently not scalable to all the crazy software out in the wild. When Debian had a few hundred or a thousand packages, online upgrades could be tested, but with today's wide software ecosystem there will always be flakiness and unreliability in the online upgrade model, which is why offline upgrades like ChromeOS or fedup or Windows or OSTree are being implemented more widely.

"dnf update" considered harmful

Posted Oct 6, 2016 5:45 UTC (Thu) by josh (subscriber, #17465) [Link] (4 responses)

> Debian and derivatives already offer to terminate your screen locking when doing an update that might prevent you from unlocking afterwards.

I haven't seen that one, but the problem I usually run into isn't upgrading with the screen locked, but rather running an upgrade and then letting the screen lock without restarting the session first.

"dnf update" considered harmful

Posted Oct 6, 2016 5:52 UTC (Thu) by xanni (subscriber, #361) [Link] (3 responses)

Yes, that's exactly what it's supposed to prevent. If the screen locker process is killed, it won't lock during your upgrade.

"dnf update" considered harmful

Posted Oct 6, 2016 6:16 UTC (Thu) by josh (subscriber, #17465) [Link] (2 responses)

There's no "screen locker process" anymore; gnome-shell has the screensaver built in. (Also, it'd be incredibly dangerous to disable the screen locker, as someone might close their laptop lid and expect their system to be secure.)

"dnf update" considered harmful

Posted Oct 6, 2016 6:19 UTC (Thu) by xanni (subscriber, #361) [Link]

Ah, well, I use XFCE and LXDE, not GNOME.

"dnf update" considered harmful

Posted Oct 6, 2016 8:54 UTC (Thu) by sourcejedi (guest, #45153) [Link]

It sounds like it actually prompts before disabling the screensaver. Suspending (the default lid-close action) half-way through an upgrade is not necessarily a great idea either.

In GNOME, I think it could be implemented cleanly with an inhibitor. Just like when watching fullscreen video. (Maybe there's even a portable api / cross-desktop function to do it).

I agree it generally sounds hinky though. I've been known to use `pkcon update` myself, but when it kills my GNOME session some day I won't have anyone to blame but myself.
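
(Editor's sketch, not from the comment: logind inhibitor locks are already reachable from the command line via systemd-inhibit, which is one way to at least keep the machine from sleeping or powering off mid-update; it does not touch the screen locker.)

sudo systemd-inhibit --what=sleep:shutdown:handle-lid-switch \
    --who="updater" --why="package update in progress" \
    dnf -y update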

"dnf update" considered harmful

Posted Oct 7, 2016 15:52 UTC (Fri) by AdamW (subscriber, #48457) [Link]

dnf has a similar feature, but it's generally believed that such features can never really catch *every* case where a piece of software may need a restart. Just as an example, how do you handle the case where the plugins for some app are updated, and a running copy of the app won't be compatible with the new plugins, but none were loaded at the time the update transaction occurred? If the user then switches to the app window and tries to load a plugin, things are going to go wrong, but it's very hard for such a mechanism to catch cases like that.

"dnf update" considered harmful

Posted Oct 6, 2016 7:17 UTC (Thu) by mjthayer (guest, #39183) [Link] (2 responses)

Considering that we have live kernel patching, I would expect there to be people willing to pay to have live updates of a system - though as another poster pointed out, that might not be any random system with crazy combinations of packages. For Fedora, I would think the sensible option would be to allow live "dnf update" runs, but to warn the user first and let them cancel in time if they want. Then the people who do it will know there is a risk (which they should already) and can help debug the problems.

"dnf update" considered harmful

Posted Oct 6, 2016 19:28 UTC (Thu) by fandingo (guest, #67019) [Link] (1 responses)

Live kernel patching is academic at best.

You can only do very simple patches, and Fedora, at least, doesn't want to get into evaluating which updates would be compatible and having a separate update path that only applies to a fraction of kernel updates. All Fedora releases track the upstream kernel (with a few weeks' delay), and they don't commit to an official or unofficial LTS version; Fedora 23 (released December 2015) has 4.7.5, Fedora 24 (released May 2016) has the exact same version. It's just not practical whatsoever for Fedora to do live kernel updating.

"dnf update" considered harmful

Posted Oct 7, 2016 7:05 UTC (Fri) by mjthayer (guest, #39183) [Link]

I have always been doubtful about it too (but then my approach would be to pull anything which can be out of the kernel into user-space, which is clearly not a majority idea), but clearly many people see enough value in it to invest a lot of money. Presumably then there are cases where it seems to be worth the effort needed to get it right.

"dnf update" considered harmful

Posted Oct 6, 2016 8:32 UTC (Thu) by rsidd (subscriber, #2582) [Link] (1 responses)

I believe Ubuntu's upgrade command (do-release-upgrade) runs within a screen session, so it doesn't crash and you can always reconnect to it if need be. But I have never had an issue with plain "apt-get dist-upgrade", not sure how it deals with crashing programs (or maybe it's just that I don't use gnome/kde).

"dnf update" considered harmful

Posted Oct 7, 2016 15:55 UTC (Fri) by AdamW (subscriber, #48457) [Link]

Someone told me in a Phoronix thread that apt-get is designed to survive the death of the controlling terminal, but I haven't checked that for myself.

"dnf update" considered harmful

Posted Oct 6, 2016 8:40 UTC (Thu) by sourcejedi (guest, #45153) [Link] (9 responses)

I know `pkcon upgrade` is technically both evil and incomplete. It used to break running Firefox (not sure if changed). Debian say exactly the same about feature upgrades too: don't run them inside a desktop session. (It doesn't help how tangled the Debian upgrade docs are though). I'm very glad to have an offline upgrade mechanism for release upgrades, so actual humans are able to follow the procedure as documented.

It'd just be nice not to pretend this

which may leave that process in an indeterminate state. Williamson reported that users were getting "duplicated packages" and other messages when trying to rerun the update.

is a good reason to scare users here. We use filesystems and apps that survive power failure. It is not too much to expect the package management database to survive an interruption. The clue being in the word database (and use of the dreaded fsync()).

I remember dpkg-based systems surviving interruptions of `apt-get`. Sure you get cryptic errors, but it also then tells you the exact repair command. Now I can't help wondering if this is related to the state of rpm development, as mentioned in previous LWN articles.

AIUI there's a separate issue about linked updates, e.g. between two packages which must upgrade in lockstep. If bootstrapping depends on them, you can't interrupt the upgrade, and doing so would require a rescue. Which is not necessarily supported/documented very much... anyway, that seems a valid reason to warn people about.

I think both `dnf` and `apt-get` should be able to resume the upgrade, without starting the network (e.g. from `init=/bin/bash` + `mount / -o remount,rw`). `apt-get update` works from cache; `dnf` saves interrupted transactions such that you can resume them later. (And the dependencies of the package manager & shell would be architected to avoid lockstep transitions, at least provided you don't try to skip releases (Debian-style releases) or delay your update "too long" (rolling releases).) `pkcon update` I'm not sure about, though; I don't know if it saves interrupted transactions.
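
(For what it's worth, the Debian side of that recovery path is well-trodden; a hedged sketch of picking up a broken upgrade from `init=/bin/bash`, with no network:)

mount -o remount,rw /    # the emergency shell leaves the root filesystem read-only
dpkg --configure -a      # finish configuring anything left unpacked but unconfigured
apt-get -f install       # repair broken dependencies from the packages already in /var/cache/apt/archives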

"dnf update" considered harmful

Posted Oct 6, 2016 9:02 UTC (Thu) by sourcejedi (guest, #45153) [Link] (5 responses)

> then you're left with a pile of broken bits that you get to spend all afternoon fixing... it's more or less impossible to figure out and execute precisely whatever scriptlet actions should have happened but did not

In principle, this should have been obvious. Debian policy, section 6.2 ("Maintainer scripts idempotency"):

It is necessary for the error recovery procedures that the scripts be idempotent. This means that if it is run successfully, and then it is called again, it doesn't bomb out or cause any harm, but just ensures that everything is the way it ought to be. If the first call failed, or aborted half way through for some reason, the second call should merely do the things that were left undone the first time, if any, and exit with a success status if everything is OK

"dnf update" considered harmful

Posted Oct 6, 2016 9:08 UTC (Thu) by sourcejedi (guest, #45153) [Link] (4 responses)

FWIW, someone told me today (in rather colorful terms) that apt-get is designed exactly this way. It survives.

"dnf update" considered harmful

Posted Oct 6, 2016 11:21 UTC (Thu) by guillemj (subscriber, #49706) [Link] (3 responses)

dpkg itself should support recovering from arbitrary crashes. I'm actually surprised by Florian's claims (which I think I've seen now more than once?), and I'd like to see evidence to back them up, which I'd consider a bug that I'd happily fix. dpkg also supports being terminated with Ctrl-C at any point in time, and should be able to cope, in the same way as with power outages and similar.

Of course individual packages might or might not recover correctly in their maintainer scripts, but that's something that needs fixing in those specific packages. And dpkg should always be able to recover from an inconsistent state (obviously barring filesystem catastrophes, or rm -rf / or similar) by upgrading a package to a fixed version.

"dnf update" considered harmful

Posted Oct 6, 2016 11:54 UTC (Thu) by sourcejedi (guest, #45153) [Link]

I have a neglected deb8 box. Testing `apt-get upgrade`, it did not seem very responsive to Ctrl+C.

On two occasions it didn't exit immediately, but exited later with "W: Operation was interrupted before it could finish".

Sometimes it seemed to terminate immediately with no message... as standard for a unix command. At least one of those required a subsequent `apt-get -f install`.

Suspending it with ctrl+Z suspended the foreground process and the console output. However `top` showed processes like `gzip` still running. `fg` gave me a slew of log output. (I think this was rebuilding the initramfs).

It was as if `apt-get` was detaching child processes and buffering the output. Dunno if that's exactly what was going on, but if you suspend a normal multi-process job like "strace cat", it suspends both processes.

I wouldn't dare gainsay apt's behaviour myself, I was just curious about what happened :).

"dnf update" considered harmful

Posted Oct 8, 2016 18:10 UTC (Sat) by rvolgers (guest, #63218) [Link] (1 responses)

Interestingly, this seems like something that would be really easy to unit test. Install and reinstall the package a bunch of times with interruptions, or even just run maintainer scripts multiple times and verify that, at the least, the installation does not error out. Is it just a matter of scale and resources that this isn't done?
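
(A crude sketch of what such a test could look like; man-db is just an example of a package that ships maintainer scripts, and re-running a postinst by hand like this is strictly an experiment, not a dpkg-sanctioned interface:)

sudo apt-get install --reinstall -y man-db            # reinstall it once...
sudo /var/lib/dpkg/info/man-db.postinst configure     # ...then re-run its postinst directly
echo $?                                               # an idempotent script should still exit 0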

"dnf update" considered harmful

Posted Oct 14, 2016 22:39 UTC (Fri) by guillemj (subscriber, #49706) [Link]

If those interruptions are just random then the functional tests might take a long time until they hit interesting cases; I think something smarter would be needed here. The same applies to out-of-memory/fd/etc. conditions, which dpkg is designed to be resistant against, and in practice it should currently be (barring bugs), in the same way it should be resistant to the abrupt crashes mentioned above.

I'd actually expect a stream of bug reports if any of the above would happen, as we've seen in the past, for both types of bugs (a recentish example of the latter could be the infamous ext4 zero-length problems).

In any case, those are indeed things that I'd like to (eventually) add as part of its test suite, but have not found the time yet. Take into account that when I joined the maintainer team, around 10 years ago, dpkg didn't even have any kind of unit or functional tests! Now it's a bit better, but it could certainly be improved:

In-project test suite: http://dpkg.alioth.debian.org/coverage/
External functional test: https://anonscm.debian.org/cgit/dpkg/dpkg-tests.git/ (need coverage reports for this one)

And if you are interested, of course patches more than welcome!

"dnf update" considered harmful

Posted Oct 7, 2016 16:02 UTC (Fri) by AdamW (subscriber, #48457) [Link] (2 responses)

Well, I object to the use of the word 'pretend' there. Given that it's really what *happens* right now, I'd say it's an *excellent* reason to scare users, because it's a rather inconvenient situation to clean up if it happens.

As a QA person I'm generally in the business of dealing with the software as it exists, not as it *should be* in an ideal world ;) I don't disagree that dnf could do some things better in this area, and there's been a resurgence in activity on three bugs/feature requests for dnf in response to this bug (allowing access to the offline update workflow via dnf, designing dnf to survive the death of the controlling terminal, and a 'complete-transaction' feature for recovering from crashes). But the real DNF we have in the real world right now doesn't have those features, and I was simply trying to alert real-world users to a known case where things could go wrong right now.

For the record, when I sent the initial mail in the thread, all the information we had was that ~7 or 8 users had showed up in #fedora on Freenode (the main user support channel) with the symptoms of a crashed update process, and I was talking to one of them live at the time and was able to confirm that they definitely had had a crash during an update, and that it was X that had crashed.

That was the point where I thought, OK, we don't know everything yet, but that seems like enough information that it would be responsible to send out a mail at least advising people not to run updates live from within X for now. In retrospect it might have been better to skip the part about how that's *generally* a bad idea, because that kinda muddied the waters a bit, but I can't re-do it now :)

"dnf update" considered harmful

Posted Oct 7, 2016 19:42 UTC (Fri) by sourcejedi (guest, #45153) [Link] (1 responses)

Fair objection. Sorry for raising the tone, I enjoy the usually civil comments here.

As said elsewhere, I'd be liable to fall foul of this issue. And in reality I'd be just as annoyed if I hadn't been warned, as I would be annoyed at rpm's poor error recovery.

I shouldn't begrudge constructive discussions (e.g. whether or not dnf should detach). Nor do I begrudge discussion of tmux, and why it's not completely n00b-friendly.

Thanks for outlining how to recover downthread; understanding that makes me personally much more comfortable. Perhaps what I most objected to was being told rpmdb would screw up, without my knowing exactly what degree of screwed up. (I hope the bug reports you mentioned... achieve something, at least to provide similar reference material).

For... reasons, I currently prefer `pkcon update`. I suspect PackageKit (being a daemon) would fare rather better in this scenario. Very interesting if anyone can comment on that.

"dnf update" considered harmful

Posted Oct 7, 2016 23:31 UTC (Fri) by AdamW (subscriber, #48457) [Link]

I'm not actually sure exactly how the process works with pkcon - whether the actual transaction runs under the packagekitd process, or what. It'd be interesting to know.

"dnf update" considered harmful

Posted Oct 6, 2016 9:56 UTC (Thu) by tdz (subscriber, #58733) [Link] (8 responses)

All packages have their dependencies listed, and systemd is tracking most of the services on the computer. With this information, shouldn't it be possible to automatically solve the problem of live updates, at least partially?

For example, if you're updating the web server, let dnf shut down the web server's service (via systemd), update the package, and restart the service. If you're updating a desktop library, restarting the desktop during update might be required. If you're updating glibc, restarting the system, or at least the complete user space, would be required.

Figuring out correctly what to restart is certainly not easy, but it doesn't sound impossible either.

"dnf update" considered harmful

Posted Oct 6, 2016 10:37 UTC (Thu) by matthias (subscriber, #94967) [Link] (7 responses)

It is not that easy.

1. I doubt that restarting every possibly affected service is the best solution. A service like gdm might be considered affected because some dependency was updated, but restarting gdm will kill the active session. From a user's perspective, restarting the session is not much better than restarting the entire system. It will be a few seconds faster, but all applications will be closed. I noticed that in most cases a service like gdm will do quite fine without a restart, even if some libraries were updated. So, I prefer to delay the restart until it is more convenient.

2. Most applications are not system services. Live updating should be possible for user level applications. But those applications are not (yet?) tracked by systemd. E.g., systemd does not know if firefox is running or not. I noticed (on Gentoo) that firefox is quite funny between update and restart (of firefox). It should definitely be restarted after update.

"dnf update" considered harmful

Posted Oct 6, 2016 11:00 UTC (Thu) by pabs (subscriber, #43278) [Link]

1. is a long-standing bug in gdm that really needs fixing. Perhaps now that Xorg runs as the logged-in user that would be possible.

"dnf update" considered harmful

Posted Oct 6, 2016 12:08 UTC (Thu) by tdz (subscriber, #58733) [Link] (5 responses)

Sure, I agree that it's not *that* easy. I was rather asking a question than making a statement.

When uninstalling a package during the update, an 'lsof' for all the contained files should tell which programs are affected. From there, dnf could figure out which behavior is required. I would not want dnf to kill my browser's process, but it would be OK if it applies updates the next time I start the browser. I guess that would require further integration of dnf and desktop toolkits with systemd. *duck*

Right now, Fedora reboots twice for an update. It would already be nice if the update process would only restart once; especially on machines with longer boot times.
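
(A rough sketch of the lsof idea, not an existing dnf feature; the package name is illustrative:)

pkg=firefox                                          # example package to check
rpm -ql "$pkg" | xargs -r -d '\n' lsof -t -- 2>/dev/null \
    | sort -u | xargs -r -I{} ps -o pid=,comm= -p {} # PIDs and names of processes holding those files open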

"dnf update" considered harmful

Posted Oct 6, 2016 12:15 UTC (Thu) by mchapman (subscriber, #66589) [Link] (4 responses)

> When uninstalling a package during the update, an 'lsof' for all the contained files should tell which programs are affected. From there, dnf could figure out which behavior is required.

The "tracer" dnf plugin already does all that. It recommends actions to the sysadmin to restart affected services.

> Right now, Fedora reboots twice for an update. It would already be nice if the update process would only restart once; especially on machines with longer boot times.

I would like to see it make use of systemd's ability to isolate to various targets. Instead of the current approach:

1. Reboot.
2. Install updates.
3. Reboot.

it should be possible to do something like:

1. Isolate to a special "system-update" target.
2. Install updates.
3a. If the kernel was updated, reboot; otherwise
3b. Isolate back to the default boot target.
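
Something along those lines can be sketched with systemctl, with the large caveat that whatever runs the commands must itself survive the isolate (for instance a unit marked IgnoreOnIsolate=yes); this is an illustration of the proposal, not an existing Fedora workflow:

# run from something the isolate will not stop (e.g. a service with IgnoreOnIsolate=yes)
systemctl isolate system-update.target    # stop everything not pulled in by the update target
dnf -y update
systemctl isolate graphical.target        # back to normal, or 'systemctl reboot' if the kernel changed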

"dnf update" considered harmful

Posted Oct 12, 2016 3:16 UTC (Wed) by draco (subscriber, #1792) [Link] (3 responses)

If isolating to the system-update target kills off everything running, how is that meaningfully different than a reboot into a special update session?

[Most] people are concerned about being able to update without interrupting their work session -- not maintaining kernel uptime.

"dnf update" considered harmful

Posted Oct 12, 2016 7:04 UTC (Wed) by mchapman (subscriber, #66589) [Link] (2 responses)

> If isolating to the system-update target kills off everything running, how is that meaningfully different than a reboot into a special update session?

It isn't any different, but many systems take a long time to get through their firmware initialization.

Also, it opens the possibility of having *some* services remaining running during a package update -- you could have them WantedBy= the system-update target, and systemd won't touch them when isolating to this target or the original boot target.
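
(Illustrative only: the mechanics for that already exist, e.g. keeping sshd running across the switch by making it wanted by the update target.)

sudo systemctl add-wants system-update.target sshd.service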

"dnf update" considered harmful

Posted Oct 12, 2016 11:38 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

This is how Fedora offline updates are handled

https://www.freedesktop.org/software/systemd/man/systemd....

I would post any ideas to improve it to the systemd list.

"dnf update" considered harmful

Posted Oct 12, 2016 13:33 UTC (Wed) by mchapman (subscriber, #66589) [Link]

> I would post any ideas to improve it to the systemd list.

It's more of a PackageKit RFE than a systemd one. Except for providing system-update-generator, systemd has no part in offline upgrades.

But you're right. I should actually bring this up somewhere more useful.

"dnf update" considered harmful

Posted Oct 6, 2016 17:22 UTC (Thu) by juhah (subscriber, #32930) [Link] (2 responses)

Proper way to update from command line[1]:

sudo pkcon refresh force && \
sudo pkcon update --only-download && \
sudo pkcon offline-trigger && \
sudo systemctl reboot

[1] https://www.happyassassin.net/2016/10/04/x-crash-during-f...

"dnf update" considered harmful

Posted Oct 6, 2016 17:50 UTC (Thu) by atai (subscriber, #10977) [Link] (1 responses)

For Fedora?

"dnf update" considered harmful

Posted Oct 6, 2016 19:34 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

Yes, pkcon is from PackageKit.

Running outside the desktop session is not enough

Posted Oct 6, 2016 21:51 UTC (Thu) by cesarb (subscriber, #6266) [Link] (3 responses)

> If you're using Workstation, the offline update system is expressly designed to minimize the likelihood of this kind of problem, so please do consider using it. Otherwise, at least run 'dnf update' in a VT - hit ctrl-alt-f3 to get a VT console login prompt, log in, and do it there. Don't do it inside your desktop.

> How hard would it be to make dnf do the rpm transaction inside a proper system-level service (transient or otherwise)? This would greatly increase robustness against desktop crashes, ssh connection loss, KillUserProcs, and other damaging goofs.

I'm sorry, but this is not enough. The update system *must* be able to recover from being interrupted, unless by bad luck the interruption left the system unable to boot (and even then, it should be possible to recover by booting from a CD and running its copy of the update system with a parameter to use the installed system as the filesystem root). Either it should be able to roll back the update, or it should be able to roll forward and finish the partial update.

Last month, my laptop crashed during yum update (actually dnf update, through the dnf-yum package). It wasn't because I was running yum/dnf within a terminal window (I was). It was a hardware crash, and it wouldn't even turn on; I had to replace the motherboard.

Luckily, yum/dnf didn't break, probably because it had actually finished before the crash (I wasn't looking). But this shows that depending on the updater not being interrupted is not the way to go. And if you're using a desktop, a power outage can happen at any time.

Running outside the desktop session is not enough

Posted Oct 6, 2016 22:14 UTC (Thu) by davecb (subscriber, #1574) [Link]

Once the kernel bits are done, replacing running processes in a controlled manner is easy.

Indeed, it has been a "feature" since v6, where I first encountered it, but most people don't know about mv and atomic file(name) replacement: Google for "atomic replacement mv file". Windows reputedly has a similar implementation.

Running outside the desktop session is not enough

Posted Oct 6, 2016 23:14 UTC (Thu) by gerdesj (subscriber, #5446) [Link] (1 responses)

"The update system *must* be able to recover from being interrupted"

In my experience they generally do a pretty good job, depending on the job. My personal baseline is Gentoo and Arch and I use Debian and Ubuntu for servers (apart from the handcrafted Gentoo ones). I've got a couple of CentOS boxes as well.

With portage (~amd64 FTW) you live and laugh within a private hell that simply does not exist anywhere else. It takes a lot of time to fix up and yet rarely knocks the system offline for long. Linux is really, really, really resilient to things going awry for a while and the commonly used daemons are as well. I've had Exim + Spam Assassin + ClamAV and Squid with lots of other stuff and GlibC in a weird state and the init system halfway between OpenRC and SystemD and yet the service carries on working. I've turned MySQL into MariaDB on the fly and Apache into nginx as well.

I'll grant you that binary distros are quicker for updates but you can do some weird shit with source based rolling distros. Love 'em all to be honest.

Windows does not cut it in the updates stakes. A recent Exchange 2016 update (CU2) took bloody ages and the system was offline for at least an hour. That's a single box system for a few 10s of users. Rubbish.

After updates, try this recipe (can't remember where I found it) to look for what needs a restart:
# lsof | grep 'DEL.*lib' | cut -f 1 -d ' ' | sort -u

Running outside the desktop session is not enough

Posted Oct 7, 2016 15:50 UTC (Fri) by AdamW (subscriber, #48457) [Link]

The main problem you get with dnf/rpm if a transaction dies halfway through is that the RPM database usually gets into a messy state where it thinks it has two versions of a lot of packages installed; this is because the way rpm actually handles updates is that it installs the new package first, then removes the old package, so these 'duplicates' will exist pretty much from the time of the first update to the time of the last removal, and if the process dies anywhere in that period, you get duplicates to clean up.

The system itself *usually* continues to work pretty much fine (there can be exceptions to this if the update process died at a particularly unfortunate point), but the RPM database mess is a problem because it can result in spurious errors and stuff when you subsequently try to install updates or new packages or anything.

There's a tool called `package-cleanup` in the yum-utils package which has a `--dupes` and a `--cleandupes` option that can help with the cleanup; some people say it's great, I personally didn't find it worked that well the first time I ran into this case, but YMMV. If that tool doesn't work, you wind up basically twiddling with the rpm query format to get a nice list of duplicated packages, then run 'rpm -e --justdb --noscripts' on the old versions, and 'dnf reinstall' on the new ones.

dnf could definitely do some things to help out with this whole area, and this bug has revitalized a few bug reports and things about that.
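
For reference, the cleanup he describes looks roughly like this (the package names and versions are placeholders):

package-cleanup --dupes                                     # list the duplicated db entries (yum-utils)
sudo package-cleanup --cleandupes                           # try the automatic cleanup first
sudo rpm -e --justdb --noscripts oldpkg-1.0-1.fc24.x86_64   # manual fallback: drop the stale entry for the old version
sudo dnf reinstall newpkg                                   # then reinstall the new version to be safe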

"dnf update" considered harmful

Posted Oct 7, 2016 2:55 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

> As far as rebooting after every update? Huh? Who does that? Are we Windows?
The desktop crashing mid-update is a very Windows-esque thing too. Maybe Fedora and Debian should strive to be more like Gentoo instead... :-)

"dnf update" considered harmful

Posted Oct 9, 2016 21:19 UTC (Sun) by dirtyepic (guest, #30178) [Link] (1 responses)

Yeah, I don't get it. Over the last decade I've done major system rebuilds, upgraded glibc in place a couple dozen times, switched desktop environments, upgraded kde from 3 to 4 and now 5, and done everything else possible with portage all running on a live system without ever having to do anything other than restart the app or service if I was already using it. The only time I've ever needed to reboot is when updating the kernel. I'm not sure what the deal is but if you need to drop out to a minimal console just to update your system you're doing it wrong.

"dnf update" considered harmful

Posted Oct 10, 2016 0:48 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> Over the last decade I've done major system rebuilds

I'm sure you have, but this isn't about what a skilled administrator can do with tools and planning, this is about what completely dumb automation can do reliably and at great scale when there is no one who can fix things if they go wrong, so that even a minor error will effectively brick machines for some number of people. There is a continuum from more reliable to less: active/standby partitions, upgrades from a special single-user mode, live upgrades with testing and tool support, or just copying new files over old; some have more possibilities for failure and require more admin intervention than others.

"dnf update" considered harmful

Posted Oct 7, 2016 15:37 UTC (Fri) by AdamW (subscriber, #48457) [Link] (1 responses)

Thanks for the thoughtful article, and for doing a lot better than most news sites who just re-sent my original email, before we knew any of the bug details :)

Having said that, though, let me offer one muddification (it's the opposite of a clarification):

I probably went a bit far in suggesting that using 'dnf update' from a desktop terminal is 'not supported'. I don't know if I used those words exactly - I usually avoid the word 'supported' like the plague, as it's very difficult to interpret when it comes to a non-paid, no-formal-support-offered distribution - but I won't object to the inference.

I don't think we (Fedora as a project) really offer sufficient clarity on this to claim that any method is 'supported' or 'not supported', to be honest; after some replies in the thread I went looking and we really just don't offer much formal advice on this. So I'm gonna walk that back slightly and say that *I personally* strongly recommend against doing it, and we can say categorically as a project that doing it is riskier than offline updates, updating from a VT, or updating in a screen/tmux session. But I don't think we as a project offer sufficiently good guidance on this and we should probably fix that somehow.

Let me also offer a correction: I don't believe I ever stated or implied that running 'dnf update' from a VT is not 'supported' or 'recommended'. In fact that was one of my two recommendations for avoiding the bug. I'd say that theoretically speaking offline updates are slightly safer, but doing updates 'live' but not from within a desktop environment is very common and it's what *I* do on most of my systems, so I'm not going to tell people it's not recommended. We definitely don't as a project say Thou Must Do Offline Updates; it is the default workflow for Fedora Workstation, but not really for Server or Cloud (or KDE).

"dnf update" considered harmful

Posted Oct 12, 2016 21:14 UTC (Wed) by nix (subscriber, #2304) [Link]

Perhaps the right thing to say is that updates can cause a lot of churn and so it's better to have as little as possible that can potentially fail (for whatever reason) and bring down the updater: the updater should sit on as short a stack of stuff as possible. sshd is specially handled these days, but in days of yore a few distros tried to kill all running sshds when ssh was updated (a bad idea right there), and killed the update if it was running over ssh. Doing updates in a terminator terminal must be particularly fun, since it depends on Python, vte, gobject introspection, etc etc etc... most of this is read in only at startup, but are you willing to bet it all is?

I had a similar case a few days ago where I had a hang when running a systemwide debugger, but only when running with lots of internal debug output going to stderr and only when not on the console... because it had ptraced sshd(1) briefly and stopped it to stick breakpoints in, and printed a bunch of debugging output about that, which had gone to the sshd which was stopped and blocked until it was resumed, oops, deadlock!

Offline update

Posted Oct 16, 2016 11:58 UTC (Sun) by swilmet (subscriber, #98424) [Link] (2 responses)

In theory I prefer offline updates, I understand why it should be more robust.

In practice, on my computer, I get dropped to a minimal grub prompt if the kernel was updated during an offline update. This happens with Fedora 22, 23 and 24 (not yet tried with F25). With dnf update, there is no problem. The bug has been reported for a long time, but it doesn't seem to affect a lot of people (or when it happens to other people, they go install another distro, I don't know).

Offline update

Posted Oct 16, 2016 12:42 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

That certainly doesn't happen for me. Wonder what triggers it though.

Offline update

Posted Oct 17, 2016 20:04 UTC (Mon) by bfields (subscriber, #19510) [Link]

Huh. I also got into the habit of manual dnf updates after finding gnome updates always dumped me at a grub prompt. I guess I should have complained. It was too easy to just upgrade the way I always had....


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds