FreeBSD 13.0 released

Posted Apr 14, 2021 12:33 UTC (Wed) by nix (subscriber, #2304)
In reply to: FreeBSD 13.0 released by rsidd
Parent article: FreeBSD 13.0 released

Well, I updated my FreeBSD install (largely used for testing that GNU toolchain software still works on FreeBSD!) from 12 to 13 and... it stopped booting. It turns out that when ZFS decides to move things around, the bootloader does not (or, possibly, "sometime in the past did not") get appropriately updated. There is no message about this at all and the system doesn't try to fix it: you just get a stage 2 bootloader that is referring (possibly by raw sector number, though I thought the FreeBSD bootloader was smarter than that) to stuff that will eventually get overwritten. For me this showed itself as a system that worked fine until I rebuilt, not the core system, but the *ports*. This obviously involves a lot more writes to the filesystem, and after that the system failed to boot with the incredibly obvious error message

ZFS: out of temporary buffer space.

which, as far as I can find, nobody has ever reported happening after an update (so presumably I am somehow triggering a bug nobody else has triggered, on several installations, despite updating from one stable release to the next on a ZFS-using installation hardly being a rare operation). The fix, obviously, and documented nowhere at all that I could tell, was to do this before rebooting:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i1 vtbd0

or at least I hope that was the fix: it hasn't crashed afterwards. I wouldn't expect this with any bootloader more recent than LILO (and the FreeBSD bootloader is far more recent and much nicer than any Linux bootloader including GRUB), and at least with LILO the distro does the necessary /sbin/lilo'ing for you rather than leaving you to fall in a pit trap.

Presumably this is a ZFS-specific peccadillo. The FreeBSD documentation is excellent regarding things that date from more than some unknown number of years ago (10? longer?), but its docs around more recent things, like the nest of lethal trapdoors that doing anything at all with ZFS seems to be, are ropey to nonexistent. (Though this is the first time ZFS has broken badly enough that the system wouldn't boot despite my doing nothing unusual with it at all other than just using it as an ordinary filesystem.)

Hence this rant!

While the FreeBSD kernel is perfectly functional with some lovely features, and the system works fine, administering FreeBSD even enough to merely keep it updated is, in my experience, about as easy and trouble-free as doing the same on a linux-from-scratch system, even when using binary packages. Significant expertise appears to be required. You can expect routine operations to fail catastrophically a couple of times a year in ways that will require source-level debugging to fix: usually this is due to inadequate error handling. (Note that I am not a heavy user: I do little with this system other than keep it updated and do a lot of compiles, and it *still* fails horribly, routinely: both binary packages and source ports are troublesome in different ways.) FreeBSD handles failure in the same way as traditional Unix: i.e., it doesn't really try, and the system is full of arbitrary limits and terrible error checking. Apparent untrodden snow is everywhere.

For me, the last random failure I ran into on an update was an exemplar of this: in the middle of a portupgrade, pkg suddenly started hitting a segfault in one of the children it invoked, due to the usual crash-inducing failure in an updated bdb that hadn't had its databases db_upgraded (the sooner everyone migrates off bdb the happier I'll be). bdb-upgrade-related problems are not at all rare, but rather than report that something db-related had gone wrong or mention that the ports databases needed a db_upgrade run on them or mention that the databases even existed or god forbid db_upgrade them itself, pkg just exited with exitcode 72 (or something like that, a high nonzero exitcode <128 anyway) and then left the update as a whole to silently die in the same unhelpful fashion, with no indication of what had actually gone wrong. I suppose I should be glad that it noticed the crash at all, but if it did why didn't it at least fprintf a single line in even one of the crashing processes to indicate even vaguely what the problem was? Why's it this unhelpful? Is it because most of the code dates from so long ago that reporting errors properly was an intolerable waste of space, or is it just shoddy coding? I don't see a third option.

(This is just an example, but almost every failure I've encountered, and I've run into a lot by now, has similar "oops we could have told you what was wrong but we printed not a line or didn't even check for failure" at its root.)

This particular bug had already been reported by the time I ran into it. This is usually my fate, because I only start the thing a few times a month to update it and do test runs.

Linux is definitely usually better at reporting problems than this, and with an order-of-magnitude larger user base there are fewer serious problems during routine operations. In my experience as a *very* light FreeBSD user, on FreeBSD, they're incessant.

Software testing is hard and most error paths will not be tested. That makes it crucial that error paths report what is going wrong when they are hit. FreeBSD, like traditional Unix, by and large doesn't bother.
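To make that concrete, here is a minimal sh sketch of the least an error path can do: say which command failed and how. It is not pkg's or portupgrade's actual code; `run_reported` is a hypothetical name, and treating a status above 128 as "killed by signal" is the usual sh convention rather than a guarantee in every shell.

```sh
# Hypothetical wrapper: run a command and, if it fails, report which
# command it was and how it failed, instead of dying silently.
run_reported() {
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        return 0
    fi
    if [ "$rc" -gt 128 ]; then
        # Assumption: the shell encodes death-by-signal as 128 + signal number.
        echo "error: '$*' was killed by signal $((rc - 128))" >&2
    else
        echo "error: '$*' exited with status $rc" >&2
    fi
    return "$rc"
}
```

Wrapping each child invocation this way costs one line per call site and still hands the original exit status back to the caller.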

FreeBSD 13.0 released

Posted Apr 14, 2021 12:49 UTC (Wed) by nix (subscriber, #2304)

A lot of this is just the inherent fragility of systems too complex and with far too many code paths to test, of course. Heck, I just tried to build a kernel on a (Linux) system I thought I hadn't changed at all and, oops! coredumps all over the place, because of a change I was sure couldn't affect anything and thus had forgotten about. This has been routine for *decades*, so I suppose I shouldn't be surprised that systems that date from long ago, with the cavalier error handling typical of back then (and here I include C itself), are a source of considerable trouble at this date.

(I'd say "Lisp Machines were never this painful", but oh yes they were whenever anything went wrong enough. Just in different ways.)

FreeBSD 13.0 released

Posted Apr 14, 2021 15:14 UTC (Wed) by rsidd (subscriber, #2582)

Heh, I think you reminded me why I stopped using FreeBSD about 15 years ago!

At that time a guaranteed way to panic the system was to remove a USB drive while it was mounted. Just accidentally jiggling it was enough. The bug had been there for years, and their excuse was that the system was fundamentally designed with the assumption that drives won't disappear while in use, so just don't remove a mounted drive. It verged on a WONTFIX response. Ironic since FreeBSD boasted of getting USB support before Linux did.

Then Matt Dillon fixed it in Dragonfly BSD (which I used for a while).

FreeBSD 13.0 released

Posted Apr 14, 2021 22:36 UTC (Wed) by flussence (guest, #85566)

Ha, and now we're all going through this rigmarole again with Thunderbolt/USB-C...

FreeBSD 13.0 released

Posted Apr 15, 2021 2:49 UTC (Thu) by Comet (subscriber, #11646)

The gpart command sounded familiar, so I checked my logbooks, and that's pretty much the command you're told to run by zpool upgrade, which I notice I need to do when zpool tells me there are upgrades available.

So this sounds like a missing step in the automated upgrade flow. Normally, using new features in zpool is deferred until you choose to upgrade the pool after the reboot, so you get to see the warnings. At a guess (because I'm on FreeBSD 11.4 still), the OpenZFS migration forces the issue to do the zpool upgrade early and they missed the gpart requirement.

If you have a mirrored disk setup, make sure to run that on both disks in the mirror!
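As a rough sketch of doing that for every disk in the mirror, the gpart command quoted earlier in the thread can be looped over the disks. Everything here other than the gpart invocation itself is an assumption: `reinstall_bootcode` is a hypothetical helper, `vtbd1` is a made-up second disk name, and the partition index has to match your own layout. It defaults to a dry run that only prints the commands.

```sh
# Hypothetical helper: reinstall the ZFS boot code on each listed disk.
# Usage: reinstall_bootcode PARTITION_INDEX DISK...
# With DRY_RUN unset or set to 1, the commands are printed, not run.
reinstall_bootcode() {
    index=$1
    shift
    for disk in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i $index $disk"
        else
            gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i "$index" "$disk" || {
                # Say which disk failed rather than stopping silently.
                echo "reinstall_bootcode: gpart bootcode failed on $disk" >&2
                return 1
            }
        fi
    done
}
```

Something like `reinstall_bootcode 1 vtbd0 vtbd1` prints the two commands for review; running it again as root with `DRY_RUN=0` would execute them.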

I'm surprised to hear about bdb in relation to pkg; I thought that was sqlite all the way. You have my deepest sympathy here: I've fought and lost too many battles trying to run software which used bdb back when it was the sanest of the available choices and has never moved on while bdb's development processes have ... changed.

FreeBSD 13.0 released

Posted Apr 15, 2021 12:28 UTC (Thu) by nix (subscriber, #2304)

> that's pretty much the command you're told to run by zpool upgrade

FYI, it went wrong even without my running zpool upgrade. I just ran zpool upgrade because it felt like it was probably time, and it never suggested I run anything else. It completed silently. (I reinstalled the bootloader anyway, just in case.)

This was less painful than it sounds because it's all in a backed-up QEMU VM and after the first failure I restored it from backup and started experimenting using poor-man's-snapshots (i.e. cp --reflink of the image): the only painful part was that testing it involved a complete portupgrade -afR, which takes *forever*.

(I find using cp --reflink often preferable to real QEMU snapshots if I might want to decide to preserve a snapshot at any time without a shutdown *and* I don't know if I'll want to preserve the image I'm changing later *and* I don't want my experiments to bloat the image -- usually, QEMU in-image snapshots give the first of these properties and disk images with backing stores give the other two, but cp --reflink gives all at once :) ).
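For anyone who wants to try the poor-man's-snapshot approach, a minimal sketch follows. `snapshot_image` and the `.label` naming scheme are hypothetical, and it assumes GNU cp. It uses `--reflink=auto` rather than a bare `--reflink`, so on a filesystem without reflink support it falls back to a full copy instead of failing.

```sh
# Hypothetical helper: make a cheap copy of a disk image, named IMAGE.LABEL.
# Usage: snapshot_image IMAGE LABEL   (prints the snapshot's path)
snapshot_image() {
    image=$1
    label=$2
    snap="${image}.${label}"
    if [ -e "$snap" ]; then
        echo "snapshot_image: $snap already exists, refusing to overwrite" >&2
        return 1
    fi
    # Assumes GNU cp; shares blocks where the filesystem allows it,
    # otherwise makes an ordinary copy.
    cp --reflink=auto -- "$image" "$snap" || return 1
    printf '%s\n' "$snap"
}
```

Rolling back is then just a matter of copying the snapshot over the working image in the same way.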

FreeBSD 13.0 released

Posted Jun 14, 2021 15:47 UTC (Mon) by nix (subscriber, #2304)

FYI: this is https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256024, which originated in OpenZFS (and the switch to that is why the message went away). It's now been fixed in OpenZFS: https://github.com/openzfs/zfs/commit/65d9212aeeb531e9f98...