Growing pains for Fedora CoreOS

Posted Jun 3, 2021 5:42 UTC (Thu) by geuder (subscriber, #62854)
Parent article: Growing pains for Fedora CoreOS

Automatic updates, new features and full backwards compatibility are an unsolvable equation.

We only recently had a small breakage in our FCOS-based system: Podman used to have some defaults where it pulled images from. Our code worked. The defaults changed (or were removed, I don't remember off the top of my head) in some automatic update and our code stopped working. Just a little detail, but it demonstrated that after basically every automatic upgrade you need to test your system and be prepared to fix something.

Of course that's not completely different from manually upgraded systems, especially if you run something that others might consider fragile code or just not consider at all.

For a rolling distro, an additional difficulty is how and when to make these bigger changes, which are more likely to break something.

Maybe some selected automatic updates should bundle bigger changes and be announced as higher risk in advance???

Growing pains for Fedora CoreOS

Posted Jun 3, 2021 15:38 UTC (Thu) by mattdm (subscriber, #18) [Link]

> Maybe some selected automatic updates should bundle bigger changes and be announced as higher risk in advance???

That's basically what a Fedora Linux release is.

Growing pains for Fedora CoreOS

Posted Jun 3, 2021 17:55 UTC (Thu) by dbnichol (subscriber, #39622) [Link] (3 responses)

For Endless we use something referred to as a checkpoint release to help handle some of these upgrade issues when you have a rolling automatic ostree process. Normally, the updater pulls the tip of the ostree ref and deploys that. However, if the commit has some additional metadata, it will see that there's a new ref it should follow, but only after deploying and booting into the tip of the current ref. This allows us to stuff some migration code into the commits on the old ref and ensure it'll run before something tries to upgrade to the current ref. This is the only way we can truly remove old features or ensure systems are prepared for a major change. In a way it acts like a traditional upgrade tool.
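As a rough illustration of the checkpoint flow described above (not Endless's actual updater), the decision can be sketched as follows. The ref names, the commit names, and the `next_ref` metadata field are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    name: str
    # Hypothetical metadata: the ref that clients should follow next,
    # set only on a checkpoint commit.
    next_ref: Optional[str] = None


# Hypothetical refs; the last commit in each list is the tip.
REFS = {
    "os/v3": [Commit("v3.0"), Commit("v3.9", next_ref="os/v4")],
    "os/v4": [Commit("v4.0"), Commit("v4.1")],
}


def next_update(ref, deployed):
    """Return the (ref, commit name) to deploy next, or None if up to date."""
    tip = REFS[ref][-1]
    if deployed != tip.name:
        # Always reach the tip of the current ref first, so that any
        # migration code shipped there runs before a ref switch.
        return ref, tip.name
    if tip.next_ref is not None:
        # Booted into the checkpoint commit: now follow the new ref.
        new_ref = tip.next_ref
        return new_ref, REFS[new_ref][-1].name
    return None


def simulate(ref, deployed):
    """Apply updates one boot at a time and record each deployed state."""
    history = [(ref, deployed)]
    while (step := next_update(ref, deployed)) is not None:
        ref, deployed = step
        history.append(step)
    return history


print(simulate("os/v3", "v3.0"))
# [('os/v3', 'v3.0'), ('os/v3', 'v3.9'), ('os/v4', 'v4.1')]
```

The point of the sketch is only the ordering: a system on an old commit never jumps straight to the new ref, it always passes through the checkpoint commit first.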

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 3:12 UTC (Fri) by bgilbert (subscriber, #4738) [Link] (2 responses)

Fedora CoreOS has a barrier release mechanism that does something similar: all updates that traverse the barrier release must update to exactly that release before updating any further. The Fedora CoreOS update client selects the target OS release from a graph of permissible updates maintained outside of the ostree, so barrier releases can be accomplished without an ostree ref switch.
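A minimal sketch of how a barrier can fall out of an update graph, assuming a simplified model in which an edge from an older to a newer release is permitted only if it does not skip over a barrier. The release names and both function names are made up for illustration; this is not the actual Fedora CoreOS graph format or client logic.

```python
def build_update_graph(releases, barriers):
    """Map each release to the newer releases it may update to directly.

    `releases` is ordered oldest to newest. An edge is allowed up to and
    including the first barrier after the source, never beyond it.
    """
    graph = {release: [] for release in releases}
    for i, old in enumerate(releases):
        for newer in releases[i + 1:]:
            graph[old].append(newer)
            if newer in barriers:
                break  # nothing may skip past a barrier
    return graph


def update_path(graph, current):
    """Repeatedly pick the newest permitted target until none is left."""
    path = [current]
    while graph[current]:
        current = graph[current][-1]
        path.append(current)
    return path


# Hypothetical release names, with r3 marked as a barrier.
RELEASES = ["r1", "r2", "r3", "r4", "r5"]
BARRIERS = {"r3"}

graph = build_update_graph(RELEASES, BARRIERS)
print(update_path(graph, "r1"))  # ['r1', 'r3', 'r5']
print(update_path(graph, "r4"))  # ['r4', 'r5']
```

Because the barrier lives in the graph rather than in the ostree history, the client only needs to ask which targets are permitted from its current release.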

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 5:32 UTC (Fri) by dbnichol (subscriber, #39622) [Link] (1 responses)

Oh, that's neat. The ref switch is simple but pretty ugly. I hadn't considered anything besides "I want to be at the head of the ref", but having a client that negotiates specific commits is nice.

How often do you actually use a barrier release?

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 14:47 UTC (Fri) by dustymabe (guest, #107864) [Link]

> How often do you actually use a barrier release?

The barrier releases, each with a link to the reason behind it, are kept in https://github.com/coreos/fedora-coreos-streams/blob/main... Usually about once every 6 months or so.

Growing pains for Fedora CoreOS

Posted Jun 3, 2021 19:29 UTC (Thu) by walters (subscriber, #7396) [Link] (3 responses)

> We only recently had a small breakage in our FCOS-based system: Podman used to have some defaults where it pulled images from. Our code worked. The defaults changed (or were removed, I don't remember off the top of my head) in some automatic update and our code stopped working.

That's a really great example of a bug on the risk/reward spectrum around automatic updates and a relatively "fresh" Linux userspace.

Do you have a bit more detail on this? I'm guessing it was something around short names i.e. just `busybox` and not `docker.io/busybox` or so? Has it been fixed since? Did you engage with an upstream issue? How hard was the workaround?

Personally I think it's worse all around for everyone if admins stay on a relatively frozen userspace, or if we try to lump changes like this into e.g. 6 month windows. In practice, if it's just every 6 months, a good number of people fall out of the habit of upgrading at all (when it requires manual intervention) and drop off the train entirely. And that's bad because you're not applying critical kernel security updates etc. that are particularly relevant with containers.

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 9:37 UTC (Fri) by geuder (subscriber, #62854) [Link] (2 responses)

> Do you have a bit more detail on this? I'm guessing it was something around short names i.e. just `busybox` and not `docker.io/busybox` or so?

That was also my understanding after seeing the original error, because I have noticed the need to change that in my (very rare) manual use of podman. I neither debugged nor fixed the problem myself, and our git log shows:

 source /etc/os-release
 cat <<EOF >/usr/local/foo/Dockerfile
-FROM f${VERSION_ID}/fedora-toolbox:latest
+FROM registry.fedoraproject.org/fedora-toolbox:latest
 
 RUN dnf install foo
 EOF

(This code is being run on CoreOS)

So I wonder what they did there. Before, the code fetched f34/fedora-toolbox:latest, I believe from docker.io. Now they fetch fedora-toolbox:latest from registry.fedoraproject.org. Where did the version number go??? Of course LWN is not a code review site for our company's code, but what is interesting in the context of this article is

$ grep VERSION_\\\|VARIANT /etc/os-release 
VERSION_ID=34
VERSION_CODENAME=""
VARIANT="CoreOS"
VARIANT_ID=coreos

The article quoted without correction

> I think this is the fundamental difference here, Fedora CoreOS does not have a version number. It has 3 streams, stable, testing and next,

So is that really true??? No such number is being advertised AFAIK, but internally it is there and I guess at some point in the future it will change. With potentially surprising effects to those who have used it.
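For illustration only, here is one way the generated `FROM` line could carry both an explicit registry and the version from os-release. This is a sketch, not our actual code: the function names are made up, and it assumes that a `fedora-toolbox` tag matching `VERSION_ID` exists on registry.fedoraproject.org, which I have not verified.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines, dropping optional quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def toolbox_image(fields, registry="registry.fedoraproject.org"):
    """Build a fully qualified, version-pinned image reference.

    Assumption: the registry publishes a fedora-toolbox tag equal to
    VERSION_ID. Naming the registry avoids relying on short-name defaults.
    """
    return f"{registry}/fedora-toolbox:{fields['VERSION_ID']}"


# Sample input copied from the grep output above.
SAMPLE = '''VERSION_ID=34
VERSION_CODENAME=""
VARIANT="CoreOS"
VARIANT_ID=coreos
'''

fields = parse_os_release(SAMPLE)
print(f"FROM {toolbox_image(fields)}")
# FROM registry.fedoraproject.org/fedora-toolbox:34
```

Note that this pins the image to whatever `VERSION_ID` the host reports, so it would change along with FCOS, which is exactly the kind of hidden dependency on the version number discussed above.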

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 13:27 UTC (Fri) by zdzichu (guest, #17118) [Link] (1 responses)

> So is that really true??? No such number is being advertised AFAIK, but internally it is there and I guess at some point in the future it will change. With potentially surprising effects to those who have used it.

Yes, that's true. I would expect the “stable” branch to be the equivalent of the current stable Fedora release (which today is 34), but there are Fedora features missing in FCOS.

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 15:46 UTC (Fri) by geuder (subscriber, #62854) [Link]

But some day 34 will no longer be current, and by your logic of equivalence, on that day FCOS stable will need to make a bigger jump.

Growing pains for Fedora CoreOS

Posted Jun 4, 2021 14:44 UTC (Fri) by dustymabe (guest, #107864) [Link] (2 responses)

> Automatic updates, new features and full backwards compatibility are an unsolvable equation.

Part of the way we try to make automatic updates more reliable is by offering 3 different update streams (`next`, `testing`, and `stable`) to our users and encouraging everyone to run `next` and `testing` on a percentage of their systems. If your "testing" nodes encounter a problem, you can report it and we can hopefully get it fixed before the much larger pool of "stable" nodes is affected.

More info at https://docs.fedoraproject.org/en-US/fedora-coreos/update...

Growing pains for Fedora CoreOS

Posted Jun 8, 2021 17:00 UTC (Tue) by geuder (subscriber, #62854) [Link] (1 responses)

> encouraging everyone to run `next` and `testing` on a percentage of their systems.

Good point.

However, we have (only) 2 instances, not 200. One is for production and one for testing (of our systems, not of FCOS). Running our testing system with a different version than the production system does not sound like a great idea. All test results would potentially be non-reproducible.

So we would need to run a 3rd one just for FCOS testing, a 50% overhead. And of course someone would need to check the instance at every update and run some test set. Which is a bit against the idea of having automatic updates.

Well, no free lunch, I know...

Growing pains for Fedora CoreOS

Posted Jun 9, 2021 5:02 UTC (Wed) by raven667 (subscriber, #5198) [Link]

In this situation you'd be doing all your changes on the test system first, right? It's not that much of a departure, as there would often be a difference between what is running in test and what is in prod; test can't guarantee repro of problems found in prod unless you reset it back to the versions used in prod. There is value in finding upgrade-related problems in test first, but as you note, the initial advice is probably targeted more toward admins with dozens or hundreds of systems, where having a small cadre running bleeding-edge code is relatively low risk to the overall system health. The quality benefits of going from prod-only, to prod & qa, to prod, qa & test, to prod, qa, test & dev environments are diminishing while the cost increases; it's the scalability of work that goes up the most, which mainly benefits larger organizations.