Growing pains for Fedora CoreOS
When last we looked in on Fedora CoreOS back in December, it was under consideration to become an official Fedora edition. That has not happened, yet at least, but it would seem that the CoreOS "emerging edition" is still undergoing some difficulties trying to fit in with the rest of Fedora. There are differences between the needs of a container operating system and those of more general-purpose distributions, which still need to be worked out if Fedora CoreOS is going to "graduate".
Catching up
In mid-May, Dusty Mabe posted an announcement that the stable stream of Fedora CoreOS was being updated to Fedora 34. In it, he noted a few caveats (e.g. "systemd-resolved is still enabled but not used yet [1]"), some recently added features, and some new features that are coming soon. All pretty normal stuff except that Fedora 34 was released at the end of April and Mabe's post showed that Fedora CoreOS has not really kept up.
In fact, as Tomasz Torcz pointed out, the systemd-resolved change was made for Fedora 33, while an upcoming feature ("Move to cgroup v2 by default [5]") was originally made for Fedora 31, which was released in October 2019. That seems to indicate that Fedora CoreOS is lagging the main distribution, which may cause confusion for users, he said. "Should Fedora CoreOS use the same version number while not containing all the changes from main Fedora Linux?"
But Fedora CoreOS does not have version numbers like those of the editions, Clément Verna said:
I think this is the fundamental difference here, Fedora CoreOS does not have a version number. It has 3 streams, stable, testing and next, these streams are based on a version of Fedora Linux but that should just be a detail that most end users should not have to care about.
In addition, Fedora CoreOS has automatic updates, which need to be "rock solid" so that users will trust (and enable) them. But, up until recently, Docker has not had support for version 2 of control groups (cgroups), so a container distribution, which has many users dependent on Docker, could not roll out that change without major disruption. Verna suggested that user confusion might actually be "a good thing" if it leads them to investigate Fedora CoreOS and to learn more about how it works.
Neal Gompa said that Verna's response was "a cop-out and a bad answer". The problem, he said, is that the Fedora CoreOS (or FCOS as he and others abbreviate it) working group has historically not participated in the development of Fedora, and the Changes process in particular. Instead of adapting to the feature changes made for Fedora, FCOS generally just rolls them back, "which has frustrated pretty much everyone". Beyond that, it is not just FCOS that needs to have solid upgrades; breaking upgrades for Fedora are not acceptable either, Gompa said.
But Verna believes that the working group is actually participating in the process. He pointed to four GitHub issues tracking changes for Fedora 32-35 (e.g. for Fedora 32 and for Fedora 35) that were (or need to be) incorporated into FCOS. Vít Ondruch replied that most or all of that work is not visible within the rest of Fedora, though. Verna agreed and suggested that the working group should be more vocal on mailing lists and the like.
Verna was also concerned about changes that are not backward-compatible. Regular Fedora can make those kinds of changes when the major version of the distribution changes, but there is no such opportunity for FCOS:
Breaking or non backward compatible changes are acceptable in Fedora Linux tho between major version bump. Again here the cgroups v2 is a good example, folks using Docker had to perform some manual steps to switch back to cgroups v1 to keep using their workflow working. This is fine when you have a major version bump but this does not happen in FCOS.
One of Verna's questions remained unanswered, though: what should happen if a new Fedora feature conflicts with the needs of another edition (or emerging edition for that matter)? How are those differing needs going to be resolved?
[...] what happens when a Change proposals breaks FCOS (like cgroups v2 for example) ? Should that just be rejected ? AFAIK not all changes are adopted by every Editions or Spins.
As Fedora evolves and adds more official editions, those kinds of situations are likely to become more frequent. It may be difficult to be on the forefront of new features—part of Fedora's mission is to be "First" with Linux innovations, after all—if some environments and communities are unable to move as quickly. It is something that the Fedora project will need to resolve moving forward.
What's in a name?
Joe Doss disagreed with Verna's initial reply as well. Since FCOS has the Fedora name in it, "it should have the same fundamental features and changes that ship with each Fedora release". He found Verna's arguments "pretty dismissive". Verna was apologetic, but acknowledged that he has a bias that may not be universally shared:
I am a developer and I don't have a strong interest in the OS, I just expect it to work and provide me the tools needed to do my job. To me that's the beauty of FCOS, I get a solid, tested OS that get automated updates and just works, I honestly don't care to know which version of Fedora Linux it is based on or which features it has. I want to spin-up an instance make sure that my application works and forget about it. I also understand that there are other type of users that will care much more about the base OS than me :-).
It is the inclusion of "Fedora" in the FCOS name that is causing much of the problem, Ron Olson said. "I was surprised when I learned Fedora CoreOS didn't support cgroups v2 and that confused me; it's Fedora, of course it would have the latest-n-greatest." He noted that he had used CoreOS before Red Hat bought the company and did not have those kinds of expectations in those days. Though he recognized the likely futility of the idea, he suggested that a name change might help:
I'm guessing this is laughably not possible, but I'm going to suggest anyway that maybe it be renamed either back to simply "CoreOS" or something new like "Bowler" or whatever that indicates that it is its own special thing and expectations can be set accordingly.
Verna acknowledged that the Fedora name brought along some expectations, but he also noted that FCOS is less than two years old at this point, so it is to be expected that there will be some rough spots that need to be worked out:
FCOS has a different release model than Fedora Linux and I think it is fair to give it time to, on one hand continue to improve how features are making their way in FCOS, and on the other hand get people be more familiar with what FCOS is and what expectations to have about it.
The cgroups issue reared its head several times in the discussion, though Colin Walters thought that the issue had been beaten to death long before. In addition, as Mabe noted, FCOS does already support cgroups v2, it is just not the default. Over the next month, that will be changing so that v2 is the default going forward:
We're trying to make sure users have a good experience. Docker users are a big part of that. Changing the default before Docker supported cgroups v2 was really not an option for us at the time.
The proposal to make Fedora CoreOS into an edition was originally targeted for Fedora 34, but that was not to be. The Change entry has been pushed to Fedora 35 and the Fedora Engineering Steering Committee (FESCo) issue tracking the change proposal was closed at the end of February. So far, no change proposal has been submitted for Fedora 35, though there is still plenty of time to do so. This discussion might indicate that it is still a bit too early to make that change, but time will tell.
Posted Jun 3, 2021 5:42 UTC (Thu)
by geuder (subscriber, #62854)
We only recently had a small breakage in our FCOS-based system: Podman used to have some default registries that it pulled images from, and our code worked. The defaults changed (or were removed, I don't remember off the top of my head) in some automatic update and our code stopped working. Just a little detail, but it demonstrated that basically after every automatic upgrade you need to test your system and be prepared to fix something.
Of course that's not completely different from manually upgraded systems, especially if you run something that others might consider fragile code or just not consider at all.
For a rolling distro, an additional difficulty is how and when to make these bigger changes, which are more likely to break something.
Maybe some selected automatic updates should bundle bigger changes and be announced as higher risk in advance???
Posted Jun 3, 2021 15:38 UTC (Thu)
by mattdm (subscriber, #18)
That's basically what a Fedora Linux release is.
Posted Jun 3, 2021 17:55 UTC (Thu)
by dbnichol (subscriber, #39622)
Posted Jun 4, 2021 3:12 UTC (Fri)
by bgilbert (subscriber, #4738)
Posted Jun 4, 2021 5:32 UTC (Fri)
by dbnichol (subscriber, #39622)
How often do you actually use a barrier release?
Posted Jun 4, 2021 14:47 UTC (Fri)
by dustymabe (guest, #107864)
The barrier releases and a link to the reason behind it are kept in https://github.com/coreos/fedora-coreos-streams/blob/main... Usually about once every 6 months or so.
Posted Jun 3, 2021 19:29 UTC (Thu)
by walters (subscriber, #7396)
That's a really great example of a bug on the risk/reward spectrum around automatic updates and a relatively "fresh" Linux userspace.
Do you have a bit more detail on this? I'm guessing it was something around short names i.e. just `busybox` and not `docker.io/busybox` or so? Has it been fixed since? Did you engage with an upstream issue? How hard was the workaround?
Personally I think it's all around worse for everyone if admins stay on relatively frozen userspace or we try to lump things like this even around e.g. 6 month windows because I think in practice if it's just every 6 months, a good number of people fall out of habit of upgrading at all (when it requires manual intervention) and drop off the train entirely. And that's bad because you're not applying critical kernel security updates etc. that are particularly relevant with containers.
Posted Jun 4, 2021 9:37 UTC (Fri)
by geuder (subscriber, #62854)
Posted Jun 4, 2021 13:27 UTC (Fri)
by zdzichu (guest, #17118)
Yes, that's true. I would expect “stable” branch is equivalent of current stable Fedora release (which today is 34), but there are Fedora features missing in FCOS.
Posted Jun 4, 2021 15:46 UTC (Fri)
by geuder (subscriber, #62854)
Posted Jun 4, 2021 14:44 UTC (Fri)
by dustymabe (guest, #107864)
Part of the way we try to make automatic updates more reliable is by offering 3 different update streams (`next`, `testing`, and `stable`) to our users and encouraging everyone to run `next` and `testing` on a percentage of their systems. If your "testing" nodes encounter a problem, you can report it and we can hopefully get it fixed before the much larger pool of "stable" nodes is affected.
More info at https://docs.fedoraproject.org/en-US/fedora-coreos/update...
Posted Jun 8, 2021 17:00 UTC (Tue)
by geuder (subscriber, #62854)
Good point.
However, we have (only) 2 instances, not 200. One is for production and one for testing (of our systems, not of FCOS). Running our testing system with a different version than the production system does not sound like a great idea. All test results would potentially be non-reproducible.
So we would need to run a 3rd one just for FCOS testing, a 50% overhead. And of course someone would need to check the instance at every update and run some test set, which is a bit against the idea of having automatic updates.
Well, no free lunch, I know...
Posted Jun 9, 2021 5:02 UTC (Wed)
by raven667 (subscriber, #5198)
Posted Jun 3, 2021 17:07 UTC (Thu)
by highvoltage (subscriber, #57465)
Posted Jun 16, 2021 12:54 UTC (Wed)
by geuder (subscriber, #62854)
For Endless we use something referred to as a checkpoint release to help handle some of these upgrade issues when you have a rolling automatic ostree process. Normally, the updater pulls the tip of the ostree ref and deploys that. However, if the commit has some additional metadata, it will see that there's a new ref it should follow, but only after deploying and booting into the tip of the current ref.
This allows us to stuff some migration code into the commits on the old ref and ensure it'll run before something tries to upgrade to the current ref. This is the only way we can truly remove old features or ensure systems are prepared for a major change. In a way it acts like a traditional upgrade tool.
Fedora CoreOS has a barrier release mechanism that does something similar: all updates that traverse the barrier release must update to exactly that release before updating any further. The Fedora CoreOS update client selects the target OS release from a graph of permissible updates maintained outside of the ostree, so barrier releases can be accomplished without an ostree ref switch.
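The barrier rule described above can be illustrated with a minimal shell sketch. It is a simplified model, not the actual Fedora CoreOS update client: releases are represented as plain integers (real FCOS versions are dotted strings), and the function name `next_update` is hypothetical. Given the current release, the latest release, and a list of barrier releases, it picks the next release a node would be allowed to move to.

```shell
# Hypothetical model of barrier-release selection (not the real FCOS client).
# Usage: next_update CURRENT LATEST [BARRIER...]
# Releases are simplified to integers for illustration.
next_update() {
    current=$1
    latest=$2
    shift 2
    # Default: jump straight to the latest release.
    target=$latest
    for barrier in "$@"; do
        # A barrier strictly ahead of us, and earlier than the current
        # target, must be installed first.
        if [ "$barrier" -gt "$current" ] && [ "$barrier" -lt "$target" ]; then
            target=$barrier
        fi
    done
    echo "$target"
}
```

With barriers at releases 5 and 8 and a latest release of 10, a node on release 3 would step through 5, then 8, then 10 on successive calls, rather than jumping directly to 10.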
> Do you have a bit more detail on this? I'm guessing it was something around short names i.e. just `busybox` and not `docker.io/busybox` or so?
That was also my understanding after seeing the original error, because I have noticed the need to change that in my (very rare) manual use of podman. I neither debugged nor fixed the problem myself; our git log shows:
source /etc/os-release
cat <<EOF >/usr/local/foo/Dockerfile
-FROM f${VERSION_ID}/fedora-toolbox:latest
+FROM registry.fedoraproject.org/fedora-toolbox:latest
RUN dnf install foo
EOF
(This code is being run on CoreOS)
So I wonder what they did there. Before, the code fetched f34/fedora-toolbox:latest, I believe from docker.io. Now it fetches fedora-toolbox:latest from registry.fedoraproject.org. Where did the version number go???
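One way to keep both a fully qualified registry and a version pin is sketched below. This is only an illustration: the function name `toolbox_image` is hypothetical, and it assumes the registry publishes a `fedora-toolbox` tag matching `VERSION_ID`, which is not confirmed here. The function reads `VERSION_ID` from an os-release file in a subshell, so the sourced variables do not leak into the caller.

```shell
# Hypothetical helper: build a fully qualified, version-pinned image name.
# Assumes a tag equal to VERSION_ID exists on the registry (unverified).
# Usage: toolbox_image [PATH_TO_OS_RELEASE]
toolbox_image() {
    os_release=${1:-/etc/os-release}
    (
        # Source in a subshell so VERSION_ID and friends stay local.
        . "$os_release"
        echo "registry.fedoraproject.org/fedora-toolbox:${VERSION_ID}"
    )
}
```

The result could then replace the `FROM` line in the generated Dockerfile, for example `FROM $(toolbox_image)` inside the heredoc. The base image would then follow the Fedora version the stream is based on, so it would still change whenever `VERSION_ID` does.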
Of course lwn is not a code review site for the code of our company, but interesting in the context of this article is
$ grep VERSION_\\\|VARIANT /etc/os-release
VERSION_ID=34
VERSION_CODENAME=""
VARIANT="CoreOS"
VARIANT_ID=coreos
The article quoted without correction
> I think this is the fundamental difference here, Fedora CoreOS does not have a version number. It has 3 streams, stable, testing and next,
So is that really true??? No such number is being advertised AFAIK, but internally it is there and I guess at some point in the future it will change. With potentially surprising effects to those who have used it.
Today I get a motd
############################################################################
WARNING: This system is using cgroups v1. For increased reliability
it is strongly recommended to migrate this system and your workloads
to use cgroups v2. For instructions on how to adjust kernel arguments
to use cgroups v2, see:
https://docs.fedoraproject.org/en-US/fedora-coreos/kernel-args/
To disable this warning, use:
sudo systemctl disable coreos-check-cgroups.service
############################################################################
So they are proceeding, but as expected that won't work fully automatically in all cases.
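To check which cgroup hierarchy a node is actually running, a small shell sketch can help. It relies on the filesystem type mounted at `/sys/fs/cgroup`: `cgroup2fs` indicates the unified v2 hierarchy, while `tmpfs` indicates v1. The function names are hypothetical, and `detect_cgroup_version` assumes GNU `stat` is available.

```shell
# Hypothetical helper: map a filesystem type name to a cgroup version label.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
        *)         echo "unknown" ;;
    esac
}

# Hypothetical helper: inspect the live system (assumes GNU stat).
detect_cgroup_version() {
    fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null) || fstype=""
    cgroup_version_from_fstype "$fstype"
}
```

Running `detect_cgroup_version` on a node before and after changing the kernel arguments would show whether the migration took effect.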
