Moving Google toward the mainline
Two Google engineers came to Open Source Summit North America 2021 to talk about a project to change the way the company creates and maintains the kernel it runs on the production systems in its data centers. Andrew Delgadillo and Dylan Hatch described the current production kernel (Prodkernel) and the problems that occur because it is so far from the mainline. Project Icebreaker is an effort to change that and to provide a near-mainline kernel for development and testing within Google; the talk looked at the project, its risks, its current status, and its plans.
Prodkernel
Delgadillo began the talk with Prodkernel, which runs on most of the machines in Google's data centers; those systems are used for handling search requests and "other jobs both externally facing and internally facing", he said. Prodkernel consists of around 9000 patches on top of an older upstream Linux kernel. Those patches implement various internal APIs (e.g. for Google Fibers), provide hardware support, add performance optimizations, and contain other "tweaks that are needed to run binaries that we use at Google". Every two years or so, those patches are rebased onto a newer kernel version, which provides a number of challenges. For one thing, there are a lot of changes in the kernel in two years; even if the rebase of a feature seems to go well, tracking down any bugs that crop up involves a "very large search space".
![Andrew Delgadillo [Andrew Delgadillo]](https://static.lwn.net/images/2021/ossna-delgadillo-sm.jpg)
There were some specific internal needs and timelines that drove the need for Prodkernel, he said, which is why Google could not simply use the mainline and push any of the extra features needed into that. He gave some examples of the features that are needed for Google's production workloads but that are not available in the mainline; those included a need to be able to set quality of service values from user space for outgoing network traffic, to have specific rules for the out-of-memory (OOM) killer, and to add a new API for cooperative scheduling in user space.
One of the big problems with Prodkernel is that it detracts from upstream participation, he said. Any features that Google creates for production are developed and tested on Prodkernel, which can be as much as two years behind the mainline kernel. If the developer wants to propose the feature for the mainline, the Prodkernel model imposes two main hurdles. For one, the feature needs to be rebased to the mainline, which may be a difficult task when the delta between the versions is large. Even if that gets done, there is a question of the testing of the feature. The feature has been validated on Prodkernel with production workloads, but now it has been taken to a new environment and been combined with totally new source code. That new combination cannot be tested in production because the mainline lacks the other features of Prodkernel.
Google's workloads tend to uncover bottlenecks, deadlocks, and other types of bugs, but the use of Prodkernel means that the information is not really useful to others. If Google is running something close to an LTS stable kernel, for example, reporting the bug might lead the team to a fix that could be backported; generally, though, finding and fixing the bugs is something that Google has to do for itself, he said. In addition, any fixes are probably not useful to anyone else since they apply to a years-old kernel. Also, any bugs that have been fixed in more recent kernels do not get picked up until either they are manually found and backported or the next rebase is done.
The rebasing process is "extremely costly" because it takes away time that developers could be working on other things. Each patch has to have any conflicts resolved with respect to the upstream kernel; it may well be that the developer has not even looked at the code for a year or more since they developed it, but they have to dig back in to port it forward. Then, of course, the resulting kernel has to be revalidated with Google's workloads. Bugs that are found in that process can be difficult to track down. Kernel bisection is one way, of course, but conflicts from rebasing need to be resolved at every step; that could perhaps be automated but it still makes for a difficult workflow, he said.
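That sort of automation might look something like the following minimal sketch (not Google's actual tooling; the patch directory and the `run_workload_test.sh` hook are hypothetical), a helper script that "git bisect run" could invoke to reapply the out-of-tree patch stack at each bisection step before building and testing:

```python
#!/usr/bin/env python3
"""Bisection-step helper for "git bisect run": reapply an out-of-tree patch
stack on each candidate commit, build, and run a workload test.  Exit codes
follow the "git bisect run" convention: 0 = good, 1 = bad, 125 = skip."""
import glob
import subprocess
import sys

PATCH_DIR = "../prod-patches"            # hypothetical: output of git format-patch
TEST_CMD = ["./run_workload_test.sh"]    # hypothetical workload/regression test

def run(*cmd):
    return subprocess.run(cmd, check=False).returncode

def main():
    patches = sorted(glob.glob(f"{PATCH_DIR}/*.patch"))
    # Reapply the patch stack on top of the bisection candidate.
    if run("git", "am", "-3", "--quiet", *patches) != 0:
        run("git", "am", "--abort")
        sys.exit(125)                    # conflicts: cannot judge this commit, skip it
    # Build the kernel; a build failure also means "skip", not "bad".
    if run("make", "-s", "-j8") != 0:
        sys.exit(125)
    # The workload test decides good (0) or bad (non-zero).
    sys.exit(0 if run(*TEST_CMD) == 0 else 1)

if __name__ == "__main__":
    main()
```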
The delay associated with the rebasing effort worsens the problems with upstream participation, which makes the next rebase take that much more time. It is a pretty clear example of technical debt, Delgadillo said, and it just continues to grow. Each Prodkernel rebase increases the number of patches, which lengthens the time it takes to do the next one; it is not sustainable. So an effort is needed to reduce the technical debt, which will free up time for more upstream participation—thus further reducing the technical debt.
Icebreaker
![Dylan Hatch [Dylan Hatch]](https://static.lwn.net/images/2021/ossna-hatch-sm.jpg)
Hatch then introduced Project Icebreaker, which is a new kernel project at Google with two main goals. The first is to stay close to the upstream kernel; the idea is to release a new Icebreaker kernel for every major upstream kernel release. Those releases would be made "on time, we want to stay caught up with upstream Linux". That will provide developers with a platform for adding features that is close enough to the mainline that those features can be proposed upstream.
The second goal is to be able to run arbitrary Google binaries in production on that kernel. It would be "a real production kernel" that would allow validating the upstream changes in advance of the Prodkernel rebase. Under the current scheme, Google has been "piling everything into the tail end of this two-year period", he said. With Icebreaker, that testing can begin almost as soon as a new mainline kernel gets released.
Those goals are important because the team needs "better participation upstream". Developers working on features for kernels far removed from the current mainline have a hard time seeing the benefit of getting that feature upstream. There is a lot of work that needs to be done to untangle the feature from Prodkernel, test it on the mainline kernel, and then propose it upstream—all of which may or may not result in the feature being merged. The alternative is to simply wait for the rebase; time will be made available to do that work, but once the new Prodkernel is qualified, it is already too late for the feature to go upstream.
Having kernels closer to mainline will also help Google qualify and verify all of the upstream patches that much sooner. Rather than waiting until the two years are up and doing a huge rebase and retest effort, the work can be amortized over time.
Structure
There are two sides to consider when looking at the structure of the Icebreaker effort, he said. On one side is how features can be developed in order to get them deployed on an Icebreaker kernel. On the other is how those patches need to be upgraded in order to get them onto a new mainline kernel for the next Icebreaker release.
Icebreaker creates a fork from the mainline at the point of a major release. It adds a bunch of feature branches (also known as "topic branches") to that, each of which is a self-contained set of patches for one chunk of functionality that is being added by Google. That is useful in and of itself, because each of those branches is effectively a patch series that could be proposed upstream; "so you are starting with something upstreamable and not going the other way around", Hatch said.
Development proceeds on those feature branches, with bug fixes and new functionality added as needed. Eventually, those feature branches get merged into subsystem-specific staging branches for testing. The staging branches then get merged into a next branch for the release. The next branch is an Icebreaker kernel that is "ready to go, but it still has its roots in these feature branches", he said. After the release is made, a "fan-out merge into the staging branches" is done, in order to synchronize them with the release version. Importantly, this fan-out merge is not done into the feature branches. Those stay in a pristine upstreamable state.
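As a rough sketch of that flow (the branch names and the feature-to-staging mapping are invented for illustration, and this is not Google's actual tooling), the merges could be driven by a script along these lines:

```python
#!/usr/bin/env python3
"""Sketch of the merge flow described above, with made-up branch names.

Feature branches are merged into per-subsystem staging branches, the staging
branches into a "next" branch for the release, and after the release the
result is fanned back out into the staging branches only, never into the
feature branches, which stay in a pristine, upstreamable state."""
from collections import defaultdict
import subprocess

BASE = "v5.13"                           # hypothetical mainline release tag
FEATURES = {                             # hypothetical feature -> staging mapping
    "feature/user-fibers": "staging/sched",
    "feature/net-qos": "staging/net",
    "feature/oom-policy": "staging/mm",
}

def git(*args):
    subprocess.run(["git", *args], check=True)

def build_release(release="icebreaker/5.13-next"):
    by_staging = defaultdict(list)
    for feature, staging in FEATURES.items():
        by_staging[staging].append(feature)

    # 1. Merge each feature branch into its subsystem staging branch.
    for staging, features in by_staging.items():
        git("checkout", "-B", staging, BASE)
        for feature in features:
            git("merge", "--no-ff", feature)

    # 2. Merge the staging branches into the release ("next") branch.
    git("checkout", "-B", release, BASE)
    for staging in by_staging:
        git("merge", "--no-ff", staging)

    # 3. After the release: fan-out merge back into the staging branches only.
    for staging in by_staging:
        git("checkout", staging)
        git("merge", "--no-ff", release)

if __name__ == "__main__":
    build_release()
```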
By following the life of one of these feature branches, we can see how the upgrade process goes, he said. When a new mainline kernel is released, a new branch for the feature is created and the branch for the earlier kernel is merged onto it. The SHA1 values for the commits on the earlier feature branch are maintained and the conflict resolution is contained in the merge commit.
Bug handling is easier with this workflow. The bugs can be fixed on the earliest supported feature branch where they occur and then merged into all of the successive feature branches. The SHA1 of the commit that introduced the bug and that of the fix will remain the same on those other branches. There is no need to carry extra metadata to track the different fix patches in each of the different supported kernel versions.
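A minimal sketch of that upgrade and fix flow, again with invented branch and tag names, might look like this:

```python
#!/usr/bin/env python3
"""Sketch of upgrading one feature branch to a new mainline release, and of
merging a bug fix forward through the supported branches.  All branch and
tag names are hypothetical."""
import subprocess

def git(*args):
    subprocess.run(["git", *args], check=True)

def upgrade_feature(name, old="v5.13", new="v5.14"):
    # The new feature branch starts at the new mainline release ...
    git("checkout", "-b", f"feature/{name}-{new}", new)
    # ... and the previous feature branch is merged into it.  The original
    # commits keep their SHA1s; conflict resolution lives in the merge commit.
    git("merge", "--no-ff", f"feature/{name}-{old}")

def merge_fix_forward(name, releases=("v5.13", "v5.14", "v5.15")):
    # Assuming a fix was committed once on the earliest supported branch where
    # the bug exists, merge it into each successive branch; the fix keeps the
    # same SHA1 everywhere, so no per-release copies or extra metadata are needed.
    for older, newer in zip(releases, releases[1:]):
        git("checkout", f"feature/{name}-{newer}")
        git("merge", "--no-ff", f"feature/{name}-{older}")
```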
The Icebreaker model is much more upstream-friendly than the current Prodkernel scheme, Hatch said. The Icebreaker feature branches are done on an up-to-date mainline kernel and they get tested that way, so the test results are more relevant to upstream. This will allow developers to propose features for the mainline far more easily. Much of the Icebreaker branch structure and the like can be seen in the diagrams in the slides from the talk.
Risks
"There are some risks with Icebreaker, unfortunately", he said. One of the bigger ones is that there needs to be a lot of feature branch testing. There may be a tendency to treat those branches like a file cabinet, where patches are stored and merged into wherever they are needed. But that is not useful if it is not known whether "it builds or boots or passes any tests".
Thus it is important to validate just the feature branch before merging it elsewhere. If it is known that it was working before the merge, then any subsequent breakage will have been caused by something in the merge. Otherwise, it is just complicating the whole process to merge a feature in an unknown state into a new tree. The same goes when upgrading to a new mainline kernel version, he said.
The dependencies between features could be a risk for Icebreaker as well. The model is that features are mostly self-contained, but that is not completely true; there are some dependencies. They can range from APIs being used in another feature to performance optimizations that are needed for a feature to do its job correctly. That could be handled by resolving the dependencies on the staging branch, but those branches are not carried along to the next Icebreaker kernel, only the feature branches are.
The answer is to do merges between feature branches, which does work, but adds some complexities into the mix. There is a need to figure out which branches can be merged into each other. "How crazy can we let these merges become?", he asked. There are no rules for when two feature branches should simply be turned into a single feature branch or when there is utility in keeping them separate; those things will have to be determined on a case-by-case basis.
Another risk is that Icebreaker is much less centralized than the Prodkernel process is. Feature owners and subsystem maintainers within Google will need to participate and buy into this new workflow and model. They will need to trust that this new Icebreaker plan—confusing and complicated in the early going—will actually lead to better outcomes and a reduction in the technical debt.
The last risk that Hatch noted is that features in Icebreaker do actually need to get upstream or it will essentially devolve into Prodkernel again. If more and more patches are added to Icebreaker without a reduction from some being merged due to features going upstream, the team will not be able to keep up with the mainline. The production kernel team needs to take advantage of the fact that Icebreaker is so close to mainline and get its features upstream.
Status and plans
Delgadillo took over to talk about the status of Icebreaker. At the time of the talk, it was based on the 5.13 kernel, while the 5.15 kernel was in the release-candidate stage. So the project is essentially one major release behind the mainline, which is "a lot closer than we have ever been".
![Dylan Hatch (l) & Andrew Delgadillo [Dylan Hatch & Andrew Delgadillo]](https://static.lwn.net/images/2021/ossna-delgadillo-hatch-sm.jpg)
In the process, some 30 patches were dropped from the tree because they were merged upstream. Out of 9000 patches being carried, 30 may not sound like a lot, he said, but it is a start. It is not something that would have happened without a project like Icebreaker. The team is working on 5.14 now and was able to drop 12 feature branches as part of that. Those were for features Google was backporting from the mainline, but that does not need to be done for recent kernels. That is another reduction in the technical debt, he said. Hopefully that process will get "easier and easier as we go along".
In addition, issues have been found and fixed, then reported upstream or sent to the stable trees for backporting. That is not something that happened frequently with Prodkernel because it was so far behind the mainline. In general, they were build fixes and the like, he said, but they were useful to others, which is a plus.
Looking forward, Icebreaker plans to catch up to upstream at 5.15 or 5.16, which will be a turning point for the project. It will be "riding the wave" of upstream at that point, which will allow "us to relax the cadence at which we need to update our tree", he said. One of the problems that has occurred is that feature maintainers have had to rebase and fix conflicts every three or four weeks as Icebreaker worked on catching up; in the Prodkernel model, that would only happen every two years or so. Once the project has caught up, there will only need to be rebases every ten or so weeks, aligned with the mainline schedule.
Testing the Icebreaker feature branches on top of mainline kernel release candidates is also something the project would like to do. That would allow Google to participate in the release-candidate process and help test those kernels. Once Icebreaker is aligned with mainline, it will make upstream development of features possible in a way that simply could not be done with Prodkernel.
At that point, Delgadillo and Hatch took questions. The first was about the plans for Prodkernel: will there still be the giant, two-year rebase? Hatch said that for now Icebreaker and Prodkernel would proceed in parallel. Delgadillo noted that Icebreaker is new, and has not necessarily worked out all of its kinks. He also said that while Icebreaker is meant to be functionally equivalent to Prodkernel, it may not be at parity performance-wise. It is definitely a goal to run these kernels in production, but that has not really happened yet.
Readers may find it interesting to contrast this talk with one from the 2009 Kernel Summit that gives a perspective on how things worked 12 years ago.
[I would like to thank LWN subscribers for supporting my travel to Seattle for Open Source Summit North America.]
| Index entries for this article | |
|---|---|
| Kernel | |
| Conference | Open Source Summit North America/2021 |
Posted Oct 5, 2021 23:05 UTC (Tue) by ndesaulniers (subscriber, #110768)
There are cases where using the credit card makes sense, but it's easy to get into a case where you're stuck paying off the revolving balance without making a dent in the principal.
Getting prodkernel building with LLVM was an exercise in deja-vu; they weren't using branches of stable, so we had to chase backports again that we had already done for Android and CrOS. Makes me wonder how much duplicated effort goes into backports for distros not using stable...
Posted Oct 6, 2021 10:17 UTC (Wed) by Sesse (subscriber, #53779)
BTW, I find it amusing that “the prod kernel” now seemingly has the name “Prodkernel”.
Posted Oct 6, 2021 9:28 UTC (Wed) by taladar (subscriber, #68407)
This might make some sense with central, mature libraries in a stable distro but anything without major reverse dependencies (e.g. almost all binaries or other leaf nodes in the dependency graph) or with a very active development breaking APIs and ABIs all the time should not use the backport model.
Posted Oct 6, 2021 21:56 UTC (Wed) by gps (subscriber, #45638)
This is the Linux Kernel project we're talking about here. Upstream is effectively untested.
Especially when compared to the real world testing needed before mass multi-billion dollar production use.
"It boots on my dev box and can compile a new kernel" is not testing.
Posted Oct 6, 2021 22:21 UTC (Wed) by pizza (subscriber, #46)
Just because the tests they run don't include your particular workloads doesn't mean something isn't tested.
(Over the years, the "can it compile a new kernel" has been a far more useful "test" than most synthetic stress tests..)
Meanwhile, feel free to contribute your own test suite to those working upstream.
Posted Oct 7, 2021 15:43 UTC (Thu) by bfields (subscriber, #19510)
For what it's worth, as one of the knfsd maintainers, I have a set of NFS-focused test suites (xfstests, connectathon, pynfs, a few smaller tests, run over a variety of NFS protocol versions and security flavors) that I run on anything that I publish. I also run them nightly on the latest trees from Linus, stable, and linux-next, and a few other NFS developers.
I'm by no means the most conscientious. Maintainers do this kind of thing all the time.
I also get pretty regular mail from bots run on mainline and linux-next, and their coverage seems to be improving over time.
"It boots on a dev box and can compile a new kernel" isn't really the current situation.
I mean, I think I'm with you on the basic sentiment, testing is really important and we can and should do better.
Posted Oct 8, 2021 4:59 UTC (Fri) by zblaxell (subscriber, #26385)
All we can say about the second kind of software is that it hasn't been tested on our workload, so we don't know any of those things. Sure, it might have been tested by some group of domain-expert people, and certified by another group of accredited generalist people, and a third group of people with a lot of reddit upvotes swears it's awesome, and some robots didn't notice any of the more common problems--in fact, we'd insist on most or all of that before we bother downloading it for a test build. None of that fancy pedigree matters if we throw our production app on it, and it immediately falls over because we're doing something nobody else does. If we're providing a production service on a commercial scale with the software, it's highly likely we're doing something nobody else does. Even if others start doing what we do, we'd write some new code and be doing something different again. Maybe we're doing something wrong, and our tests (and only our tests) will make the problem visible.
The QA gatekeeper in front of the production server farm has one job: keep the server farm producing at least whatever it is producing now. They can keep running the kernel they already have, so they have no incentive to take risks that might jeopardize that. The gatekeeper will not accept broad assurances of quantity testing--they'll need to be *convinced* to upgrade, with evidence of monotonic improvement in the new versions, or dire and irreparable problems arising in the old versions. "Personally tested by the maintainer and a team of leading experts in the subsystem" is an excellent start, but we'll run our own test suite on it before we call it "tested."
At every node in the integration graph, from developer's laptop to integration tree to merge window to release, LTS, and production deployment, someone is doing testing and deciding whether the code they pulled as input to their node is good enough to push to the output of their node (or in the case of testing robots, snooping on the edges between nodes and advising the node owners). Every node must consider its inputs "effectively untested," or the integration graph doesn't work. That's the whole point of having an integration graph: to combine diverse and isolated pools of domain expertise into a comprehensive testing workflow.
Posted Oct 6, 2021 12:25 UTC (Wed) by geert (subscriber, #98403)
IMHO that's too late: I'm working on v5.16. That is (at the time of the talk) v5.15-rc3 + lots of for-next branches from subsystems I care about.
Posted Oct 6, 2021 20:38 UTC (Wed) by pbonzini (subscriber, #60935)
The team doing the rebases won't even be the same that is doing the upstream contributions.
Posted Oct 8, 2021 5:05 UTC (Fri) by rahvin (guest, #16953)
The Pixel 5, which until a few weeks ago was the newest Google handset, uses 4.19. When it pulls down Android 12 it might be slightly newer, but I doubt it will even be on a 5.x series kernel, and that means years old, not months.
Posted Oct 8, 2021 6:49 UTC (Fri) by patrick_g (subscriber, #44470)
I think Android 12 is using a 5.10 kernel.
Posted Oct 16, 2021 6:29 UTC (Sat) by codewiz (subscriber, #63050)
In theory it would be possible to "uprev" the kernel on a particular SoC, but most vendors just don't bother because adding features to discontinued SoCs doesn't help them sell more of their newer SoCs.
Posted Oct 12, 2021 8:11 UTC (Tue) by JanC_ (guest, #34940)
Most money is going to marketing, I suppose (and that’s probably necessary to get people to ignore the bad quality & support they get).
Posted Oct 15, 2021 21:01 UTC (Fri) by SomeOtherGuy (guest, #151918)
I did look at patching it, but load averages are surprisingly complicated - at least in implementation!
(ADVICE WELCOME AND YES I'D LOVE THIS LOAD-AVERAGE - don't question that please)
Posted Oct 18, 2021 9:43 UTC (Mon) by farnz (subscriber, #17727)
Depending on your precise use case for a CPU-only load average, you might want to look at Pressure Stall Information (PSI) as a mainline feature that gets some of what you want.
PSI is a different formulation of the load average concept - instead of looking at total utilization, it looks at how much time is spent with a task blocked completely on a resource. The Facebook PSI microsite has a good explanation of how the different numbers are calculated; basically, though, the `avg` numbers are the percentage of time during which either some tasks, or the full set of tasks, are blocked waiting for a given resource to become available. As per the source code comments, for CPU time, `full` only exists when tasks are restricted from using 100% of available CPU via cgroups, but `some` is present all the time.
A task is deemed stalled on a resource (CPU, memory, I/O) if the task would run now, but it's waiting for this resource. So, an I/O stall means that the task would be runnable if it wasn't waiting on I/O (whether via blocking I/O calls, or because it's blocked in epoll or the like on an I/O that's not yet completed), while a CPU stall means that the task is runnable, but none of the CPUs it's allowed to run on are idle.
The clever bit is that what's output is the stall percentage, not the time in which there's a stall; so a value avgXXX=10.0 means that with 10% more of this particular resource (memory, I/O, CPU), there would have been no tasks waiting for the resource. This matters when you see a CPU stall of the form "some avg60=10.00"; it means that with 10% more CPU cores, all tasks would have run as soon as they were able to; similarly, a CPU stall of the form "full avg300=5.00" means that something limited a cgroup to not use all CPU cores, but if that limit had been raised by 5%, nothing would have waited for a CPU core.
Not the same as a load average, but possibly of use to you in fixing whatever problem you're facing where a CPU-only load average is interesting.
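For anyone who wants to look at the raw numbers, PSI is exported under /proc/pressure/ when the kernel has CONFIG_PSI enabled; a small Python reader of the documented file format (just an illustration, not part of any tooling discussed above) could look like this:

```python
#!/usr/bin/env python3
"""Read pressure stall information (PSI) for one resource.

Parses /proc/pressure/{cpu,memory,io}; each line looks like
"some avg10=1.23 avg60=0.45 avg300=0.10 total=157656722".  At the system
level the CPU file may carry only a "some" line; "full" appears for
cgroup-level CPU pressure."""
import sys

def read_pressure(resource="cpu"):
    stats = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, *fields = line.split()
            stats[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return stats

if __name__ == "__main__":
    resource = sys.argv[1] if len(sys.argv) > 1 else "cpu"
    some = read_pressure(resource).get("some", {})
    print(f"{resource}: some avg10={some.get('avg10')}% "
          f"avg60={some.get('avg60')}% avg300={some.get('avg300')}%")
```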
Posted Oct 18, 2021 10:38 UTC (Mon) by teksturi (guest, #153896)
This way, every time a new rolling stable tag comes out, your feature branches get rebased on top of it (maybe onto a new branch), and your feature maintainers are notified if there is a conflict. Some of your maintainers will be a bit further from the rolling stable release than others, but that is totally OK; you can very easily see which ones are the bottlenecks.
You can start a new internal kernel at any time: just choose the highest version that all of your feature branches have been validated against on the rolling stable series. For example, feature A gets a big conflict in 5.14 and it takes three weeks to resolve it. Once those conflicts are resolved, there will most likely only be small conflicts against the newest rolling stable (say 5.14.5), and those get resolved quickly in feature A. Now that the bottleneck work in feature A is done, you decide to make a new internal kernel. You notice that the highest you can go right now is 5.14.3, because a couple of maintainers are still working on issues introduced in 5.14.4. So 5.14.3 is chosen and testing can start with it.
This way the kernel community will be very happy that you also take the stable updates. Taking 5.14.0 as the base makes no sense, since it gets fixes all the time and then you kind of have to resolve those internally. This is also nice for your maintainers, as they can resolve conflicts as they come and some branches can stay aligned with upstream all the time. Usually there will not be many conflicts in the y.y.x versions, so by the time your conflict-resolution work is done you can probably choose the highest stable kernel as the internal base.
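As a tiny sketch of that selection rule (the feature names and version numbers below are made up), the next internal base is just the lowest stable tag that every feature branch has been validated against:

```python
# Made-up data: for each feature branch, the newest stable tag it has been
# cleanly rebased onto and validated against.
validated = {
    "feature/net-qos":     (5, 14, 7),
    "feature/oom-policy":  (5, 14, 3),   # the current bottleneck
    "feature/user-fibers": (5, 14, 5),
}

# The internal kernel can only be based on a tag that every branch has reached.
base = min(validated.values())
print("Base the next internal kernel on v%d.%d.%d" % base)   # -> v5.14.3
```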