
Challenges for KernelCI

By Jake Edge
August 1, 2023

EOSS

Kernel testing is a perennial topic at Linux-related conferences and the KernelCI project is one of the larger testing players. It does its own testing but also coordinates with various other testing systems and aggregates their results. At the 2023 Embedded Open Source Summit (EOSS), KernelCI developer Nikolai Kondrashov gave a presentation on the testing framework, its database, and how others can get involved in the project. He also had some thoughts on where KernelCI is falling short of its goals and potential, along with some ideas of ways to improve it.

Kondrashov works for Red Hat on its Continuous Kernel Integration (CKI) project, which is an internal continuous-integration (CI) system for the kernel that also aims to run tests for kernel maintainers who are interested in participating. CKI works with KernelCI by contributing data to its KCIDB database, which is the part of KernelCI that he works on. He noted that he was giving the talk from the perspective of someone developing a CI system and participating in KernelCI, rather than as a KernelCI maintainer or developer. His hobbies include embedded development, which is part of why he was speaking at EOSS, he said.

[Nikolai Kondrashov]

There are a number of different kernel-testing efforts going on, including the Intel 0-day testing, syzbot, Linux Kernel Functional Testing (LKFT), CKI, KernelCI, and more. Each system has its own report email format and its own dashboard for viewing results. KernelCI has its own set of results from tests run in the labs it works with directly; those results, along with the results from other testing efforts, flow into KCIDB. Having a single report format and dashboard for the myriad of testing results is one of the things that the KCIDB project is working on. "Conceptually it is very simple", he said; various submitters send JSON results from their tests, the failures get reported by email to those who have subscribed to receive them, and the results also get put into a database, which can then be viewed through the dashboard.
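As a rough illustration of that flow, a submission is just structured JSON describing checkouts, builds, and test results. The sketch below builds such a report in Python; the field names are approximations of the general shape, not the exact KCIDB schema.

    import json

    # Illustrative only: the real KCIDB schema is versioned and more detailed;
    # these field names approximate the general shape rather than the spec.
    report = {
        "version": {"major": 4, "minor": 0},   # hypothetical schema version
        "checkouts": [{
            "id": "example_ci:checkout-1",     # IDs namespaced by submitter
            "origin": "example_ci",
            "git_repository_url": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git",
            "git_commit_hash": "0123456789abcdef0123456789abcdef01234567",
        }],
        "builds": [{
            "id": "example_ci:build-1",
            "checkout_id": "example_ci:checkout-1",
            "origin": "example_ci",
            "architecture": "arm64",
            "valid": True,
        }],
        "tests": [{
            "id": "example_ci:test-1",
            "build_id": "example_ci:build-1",
            "origin": "example_ci",
            "path": "ltp.syscalls.openat01",   # dot-separated test path
            "status": "FAIL",
        }],
    }

    print(json.dumps(report, indent=2))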

Currently, KCIDB gets around 300K test results per day, from roughly 10K different builds. He briefly put up a screen shot of the Grafana-based KCIDB dashboard (which can be seen in his slides or the YouTube video of the presentation). He also showed an example of the customized report email that developers and maintainers can get; the one he displayed aggregated the results from four different CI systems, with links to the dashboard for more information.

CI metrics

He gave a somewhat simplified definition of CI for the purposes of his talk: test every change to a code base, or as many changes as possible, and provide feedback to the developers of that code. Given that, there are four metrics that can be defined for a CI system, starting with its coverage, which is how much functionality is being tested. The latency metric is based on how quickly the feedback is being generated, while the reliability measure is about "how much can you trust the results"—does a reported failure correspond to a real failure and does a "pass" really mean that? The last metric is accessibility, which is "how easy it is to understand the feedback"; that is, whether the reports provide enough information of the right kind to allow developers to easily track down a problem and fix it.

Using those metrics, the ideal CI system would cover everything, provide instant feedback as changes are created, the feedback "is always true", and the report "just says what's broken, so you don't have to figure it out". On the flipside, the worst CI system "is not covering anything useful, takes forever, and never tells you the right thing, and you cannot understand what it is saying". That CI system "is worse than no CI", Kondrashov said.

KernelCI obviously falls between those two extremes, so he wanted to look at the project in terms of those metrics. For coverage, "nobody really seems to know" how much of the kernel is being tested by the various systems that report to KernelCI. Each testing project has its own set of tests that it runs and there is no entity coordinating all of the testing. He has heard that some of the testing projects have done some measurements, but those results are not really available. He put up an LCOV coverage report that was generated from the CKI tests for Arm64. It showed an overall coverage of 12%, but that was only for the most important subset of the kernel tree.

In an unscientific sampling of the mailing lists, he sees latencies of several hours after a change is posted, "which is quite good", up to "a few weeks"; the latencies are typically faster for changes that are actually merged into a public branch. Pre-merge testing is rare overall.

The results are not particularly reliable, however. Many people who run CI systems have to do manual reviews of the test results before sending them to developers, "because things go wrong quite often". The accessibility of the results "is quite good in places" as some CI systems make an effort for their results to be understandable and actionable.

There are certain "hard limits" on what can actually be accomplished. In terms of coverage, the amount of hardware that is available to be used for testing is a hard limit; the kernel is an abstraction over hardware, so it needs to be tested on all sorts of different hardware. The latency of feedback is also limited by hardware availability; more hardware equates to more tests running in parallel, which reduces the time for producing feedback.

The reliability of the tests is governed by the reliability of the hardware and of the kernel, "but tests contribute to improving kernel reliability, so that's good". The reliability of the tests themselves would also seem to be a big factor here, though Kondrashov did not mention it. The limits of accessibility are partially governed by hardware availability, yet again, because it is difficult to fix a bug that is reported on hardware that the developer does not have access to. The complexity of the kernel also plays a role in limiting the accessibility of the results.

Challenges

He thinks that there are a lot of people who want to write tests, a lot of tests already in the wild, and a lot of companies that have test suites, all of which could lead to more coverage, but integrating those new tests is being held back by other problems.

Doing CI on code that has not yet been merged is dangerous. Anybody can post to the mailing list, so picking up those patches to test can cause problems: "you don't want them to start mining Bitcoin and you don't want them to wreck your hardware". The need for "slow human reviews" of the results also contributes to the latency problems.

He thinks that a big reason why the tests can be unreliable is because they get out of sync with the kernel being tested. Kernels change, as do the tests, but the lack of synchronization means that a test may not be looking for the proper kernel behavior. That leads to tests that repeatedly fail until the two get back in sync; meanwhile, the maintainers do not want to hear about the repeated failures that are not actually related to real bugs in their code. Nobody wants to "waste their time investigating a problem that they had nothing to do with".

The main challenge he sees for accessibility is the proliferation of report formats and dashboards, which makes it difficult for developers. That is something that he thinks KCIDB can improve.

The challenges also compound: low reliability and accessibility lead to low developer trust in what the CI system produces. If a developer knows that the tests often fail due to problems completely outside of their control, "their trust and interest for these results plummets". Likewise, if the reports are hard to understand because the developer does not have access to the hardware where it broke, or the reports leave out important information, they will be ignored. That means the results will not be used for gating patches into the kernel. Since the results are generally ignored, the test developers do not get feedback about the tests, so the tests do not improve, and any actual bugs that the tests do find are not acted upon; the whole improvement feedback loop breaks down.

High latency also leads to a lack of gating; you cannot wait a week for test results to decide whether a patch is sensible to be merged. That leads to bugs getting into the kernel that would have been caught in a lower-latency system. That all leads to greater latency as time is wasted on finding and fixing bugs that could have been detected; the extra time spent cannot go into improving the tests. "It's a vicious loop", he said.

He summarized his takeaway from the challenges with a meme: "Feedback latency is too damn high!" After that, though, he wanted to move on to what can be done to fix the problems: "that's enough gloom".

What to do?

First up was a look at what cannot be done, however. The kernel community is not a single team, working for a single employer; that is also the case with most other open-source projects. It all means that open-source developers cannot be forced to look at test results. In a company, you can bootstrap the testing into the development process by getting the tests just good enough to start gating merges on them; after that, the tests start improving and the positive feedback loop initiates. "After a bit of fighting and stalling, it starts up." In an open-source project, though, the tests need to be in good shape in order to gain developer trust; "without developer trust, it's not going to work".

Turning to things that can be done, he started with coverage. Companies have the most hardware, so attracting more companies into the testing fold will lead to more hardware, more tests, and more results, thus more coverage. Companies that have their own CI system and want to contribute to KernelCI can send their results to KCIDB. Another way to contribute is by setting up a LAVA lab and connecting it with KernelCI; developers will be able to submit trees and tests to be run on the hardware. The right place to get started with either of those is the KernelCI mailing list.

Kondrashov said that he thinks more pre-merge testing is needed to try to head off bugs before they get into public code and to shorten the feedback loop for developers. There are multiple approaches to doing pre-merge testing; some are using Patchwork to pick up patches from the mailing list for testing, which is working well. There is still a problem with authentication, however, since anyone can send a patch to the list; some patches could be malicious.

There are around 50 entries in the kernel MAINTAINERS file that refer to a GitHub or GitLab repository. Those systems provide a way to authenticate the patches that are submitted and connect them to a CI system. Something that KernelCI is exploring is to add integration with those "Git forges" so that, for example, there could be a GitHub Action that submits a patch to KernelCI and gets back a success or failure indication. The benefit is that those patches can be tested on real hardware as part of the pre-merge workflow.

If that all can be made to work, he would like to encourage more maintainers to use the forges. "I know this is controversial, it's been discussed to death in the community." But a few kernel trees are already using the pull-request-based workflow; he thinks more could benefit from doing so. The "selling point" is the CI integration and early testing of pull requests.

In order to get the process going, "CI systems have to start talking to maintainers". A CI system can offer to test a staging branch from the maintainer's repository; the maintainer's merge of a patch into their branch provides the authentication. That is not pre-merge testing, but is a starting point to help prove that the CI system and its tests are reliable and useful. To start with, a few of the most stable tests can be chosen. The KCIDB subscription feature will allow developers to get reports of other, related test results; users can filter the reports that they get on a variety of criteria, such as Git branch, architecture, compiler, tester, and so on.
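Purely as a sketch of the kind of filtering such a subscription expresses (the record fields and helper below are invented for illustration, not KCIDB's actual subscription API), a subscription boils down to a predicate over incoming results:

    # Hypothetical result-filtering predicate for report subscriptions.
    def wants_report(result: dict) -> bool:
        """Return True if this test result matches the developer's criteria."""
        return (
            result.get("branch") == "master"
            and result.get("architecture") in {"arm64", "x86_64"}
            and result.get("compiler", "").startswith("gcc")
            and result.get("status") == "FAIL"   # only hear about failures
        )

    incoming = [
        {"branch": "master", "architecture": "arm64",
         "compiler": "gcc-12", "tester": "example_ci", "status": "FAIL"},
        {"branch": "linux-next", "architecture": "riscv",
         "compiler": "clang-16", "tester": "other_ci", "status": "PASS"},
    ]

    for result in incoming:
        if wants_report(result):
            print("notify:", result["tester"], result["architecture"], result["status"])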

There are so many tests and so many failures that manually reviewing all of them is inefficient. CI systems are starting to set up automatic triaging to analyze the results in order to more efficiently find real problems. KCIDB is working on such a system, but other CI efforts, such as the Intel GFX CI (for Intel graphics), CKI, and syzbot, already have working versions of this triage. The best triaging is currently done by syzbot, which analyzes the kernel log output of each crash so that it does not emit multiple reports for the same underlying bug.
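A minimal sketch of that idea follows; the patterns and structure are invented for illustration and are not taken from syzbot or any particular CI system. Kernel log output is matched against regular expressions for known issues, so that repeat hits are grouped instead of being reported as new failures:

    import re

    # Invented example patterns; real systems maintain curated databases of these.
    KNOWN_ISSUES = {
        "example-bug-1": re.compile(r"BUG: KASAN: use-after-free in \w+"),
        "example-bug-2": re.compile(r"WARNING: .* at drivers/gpu/drm/\S+"),
    }

    def triage(kernel_log: str):
        """Return the known-issue ID matching this log, or None if it looks new."""
        for issue_id, pattern in KNOWN_ISSUES.items():
            if pattern.search(kernel_log):
                return issue_id
        return None

    log = "BUG: KASAN: use-after-free in do_something+0x12/0x40"
    print(triage(log) or "new failure: needs a human (or a new signature)")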

Another controversial suggestion that he has is to avoid the synchronization problem between the kernel and its tests by moving more tests into the kernel tree. That allows fixes or changes to the kernel functionality to come with the corresponding changes to the tests. He suggested starting with popular, well-developed tests, such as those from the Linux Test Project (LTP). In order to make that work, though, it needs to be integrated into the kernel documentation and best practices, so that the tests become a "more official" part of the kernel workflow.

Currently, LTP is being run on mainline, stable, and other kernels, so it has to be able to handle all of the test differences among them. If those tests got integrated into the kernel tree, that would no longer be needed; the tests in the tree would (or should) simply work for that branch. If a fix that gets backported to a stable branch changes a test somehow, the test change would be backported with it; that would greatly simplify the tests. In order to keep those in-tree tests functioning well, they would need to be prioritized in the CI systems, so that the feedback loop for the tests themselves is shortened as well.

Accessibility can be improved by standardizing the report formats and dashboards. The KCIDB project is working on some of that, but needs feedback from maintainers and developers. He also encouraged people to get involved with the development of KCIDB to help make it better.

In the Q&A after the talk, several attendees agreed with Kondrashov's analysis and suggestions. There were also invitations to work with other testing efforts, such as for the media subsystem. Finding a way to allow regular developers to test their code on a diverse set of hardware was also deemed important, but depends on being able to authenticate the requester, Kondrashov reiterated; the Git forges provide much or all of the functionality needed for that. He closed by noting that there are few who are working on KCIDB right now, largely just Kondrashov—who is busy with other Red Hat work—and an intern, so there is a real need for more contributors; he has lots of ideas and plans, but needs help to get there.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]


Index entries for this article
Kernel: Development tools/Testing
Conference: Embedded Open Source Summit/2023



Challenges for KernelCI

Posted Aug 2, 2023 2:32 UTC (Wed) by mupuf (subscriber, #86890) [Link] (2 responses)

The analysis looks fine to me, but incomplete in major ways: the lack of reliability is the nature of integration testing, and who is in charge of developing the drivers' test plans and looking over the results for the driver?

IMO, this is why kernel CI appears stuck with build and boot testing, and little about driver testing when projects like the Intel gfx ci and Mesa CI went much further and with great success.

Let's start with reliability. Now how do you address that? There are many ways:

* add tests to a per-device blocklist when found to be flaky (the Mesa CI way, more applicable when having hundreds of thousands of tests), remove when the driver or test is fixed (tracking is important)
* find a way to create signatures for known issues, and filter them out of the report unless an additional regression happened (the Intel gfx CI way, better suited for big integration tests)

In practice, the former looks like a ton of test names in files stored along with the code, and the latter is tons of bugs in bugzilla with regular expressions matching logs and a subset of trees/machines (along with cryptic emails/pages: https://patchwork.freedesktop.org/series/121233/)
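For illustration, here is a minimal sketch of the first (blocklist) approach, with an invented file format and example test names rather than Mesa CI's actual ones:

    # Hypothetical per-device blocklist: one known-flaky test name per line,
    # kept in a file next to the code (e.g. "flakes-some-board.txt").
    FLAKES = """
    kms_cursor_crc@pipe-a-cursor-64x64-random
    gem_exec_suspend@basic-s3
    """.split()

    def filter_results(results):
        """Drop failures of tests that are blocklisted as flaky on this device."""
        for name, status in results:
            if status == "fail" and name in FLAKES:
                continue    # known flake: track it, but don't report it
            yield name, status

    results = [
        ("kms_cursor_crc@pipe-a-cursor-64x64-random", "fail"),  # flaky, filtered
        ("gem_exec_basic@basic", "fail"),                       # real regression
    ]
    print(list(filter_results(results)))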

I have experience in both ways (I mostly work on MesaCI nowadays, but I designed and operated the latter at my previous job at Intel). In both cases, the vast majority of the job is tending to results, reworking the tests and test plans with the other developers, and keeping the system afloat.

IMO, the big centralised approach for testing the kernel is a mistake: create tools and maintain CI farms, but let driver developers make up their test plans (provide full control to them), bug filing, ... The idea that outsiders can just come and test drivers is a recipe for pain and disappointment, even if the devs provided the automated test suite... This is at least true in the accelerator subsystems where flakiness is guaranteed, but I'm sure it also applies to other subsystems.

Results and test plans can then be shared between trees, but that's the cherry on the cake, not a requirement. Fix the maintainer trees first, then work your way up!

Challenges for KernelCI

Posted Aug 8, 2023 8:34 UTC (Tue) by spbnick (subscriber, #93917) [Link] (1 responses)

The talk's author here.

Thanks for a thoughtful response, Martin. Sorry it took me a while to answer.

> The analysis looks fine to me, but incomplete in major ways:

I had a lot more to say, but it's hard to put things together in a coherent manner, and not just ramble for 25 minutes.

> the lack of reliability is the nature of integration testing,

This is only true for certain sets of things you're integrating, and depends on overall complexity. If you're integrating a few pieces of generally-synchronous software or hardware, it is likely to turn out reliable. Then, "integration testing" is relative, and from different points of view *all* testing is, and is not "integration testing", at the same time.

So you can't really say that integration testing in general is unreliable, but I get what you mean. Kernel, hardware, and concurrency are all complicated, and it's hard to test those things together reliably.

> and who is in charge of developing the drivers' test plans and looking over the results for the driver?

That's a good question. The current answer is a major problem with the current CI situation in the Linux kernel. The developers should be in control of what test results they receive, and be able to fix/disable the tests quickly in response to breakage.

> IMO, this is why kernel CI appears stuck with build and boot testing, and little about driver testing when projects like the Intel gfx ci and Mesa CI went much further and with great success.

First, let's separate "kernel CI" from "KernelCI". I know it's hard, given the name. The talk is mostly about the former, that is CI for the Linux kernel project in general. The latter is the name of the project I work with, developing KCIDB.

Looking at the KCIDB database (I don't work on testing myself), I can see that, aside from boot tests ("baseline"), in the past 30 days KernelCI (the project) executed a bunch of tests, including:

* Chrome OS EC
* igt-gpu
* kselftest
* libcamera
* libhugetlbfs
* LTP
* preempt-rt
* v4l2-compliance
* vdso

See for yourself at https://kcidb.kernelci.org/d/test-node/test-node?orgId=1&...

Lack of developer control doesn't really stop CI systems from *executing* the tests, but it slows down the feedback loop, keeping result quality low, which in turn reduces developer trust in those results.

That is the reason I am advocating for moving as many tests as possible into the kernel repository itself. Let the developers take care of choosing the tests that go with a particular kernel version, and let the CI systems simply run what they thusly prescribe for that version, and report the results. This would help avoid failures due to code/bug/test desync, and give developers more control of what CI systems execute. Apart from that, it would make the tests easier to discover, make them more "official", and make it easier to tell contributors (how) to run them.

> Let's start with reliability. Now how do you address that? There are many ways:

> * add tests to a per-device blocklist when found to be flaky (the Mesa CI way, more applicable when having hundreds of thousands of tests), remove when the driver or test is fixed (tracking is important)
> * find a way to create signatures for known issues, and filter them out of the report unless an additional regression happened (the Intel gfx CI way, better suited for big integration tests)

> In practice, the former looks like a ton of test names in files stored along with the code, and the latter is tons of bugs in bugzilla with regular expressions matching logs and a subset of trees/machines (along with cryptic emails/pages: https://patchwork.freedesktop.org/series/121233/)

> I have experience in both ways (I mostly work on MesaCI nowadays, but I designed and operated the latter at my previous job at Intel). In both cases, the vast majority of the job is tending to results, reworking the tests and test plans with the other developers, and keeping the system afloat.

Oh, absolutely. I can see this unfolding every day at Red Hat's CKI, where I work, I feel the pain.

We also use both approaches at CKI: controlling which tests to execute on which hardware/kernels/distros to manage breakage and fixes in progress, and a database of patterns looking for known issues in test results automatically, for the same purpose, but more dynamically. Same thing more-or-less is happening with other CI systems. I presented an overview of these approaches at FOSDEM last year:

https://archive.fosdem.org/2022/schedule/event/masking_kn...

There's an effort to do the same in KernelCI (the project) now, too:
https://lore.kernel.org/kernelci/f40c35d3-5558-6a46-4ad4-...

> IMO, the big centralised approach for testing the kernel is a mistake: create tools and maintain CI farms, but let driver developers make up their test plans (provide full control to them), bug filing, ...

> The idea that outsiders can just come and test drivers is a recipe for pain and disappointment, even if the devs provided the automated test suite... This is at least true in the accelerator subsystems where flakiness is guaranteed, but I'm sure it also applies to other subsystems.

> Results and test plans can then be shared between trees, but that's the cherry on the cake, not a requirement. Fix the maintainer trees first, then work your way up!

I agree that it's best to give developers as much control as possible over where, how, and which tests to execute. Yes, developer-owned hardware farms are the best. However, there's a tradeoff: you can get *more* hardware if you are prepared to relinquish *some* control.

The next best thing is KernelCI Native (the project's CI system, soon "KernelCI API") - a lot of its tests execute on hardware made available by companies and private contributors, and being a community project it's very much open to discussing which tests to run and how. You can increase your coverage by using hardware available in KernelCI. Yes, you don't own the hardware and cannot access it directly, but you get your tests to execute there.

After that comes KCIDB. It doesn't run tests itself, it just receives reports from other CI systems. Those CI systems include ones maintained by (big) businesses: Google, Microsoft, ARM, Red Hat, and Intel, with more working on joining, but also community CI systems, including the KernelCI system that is running the tests.

The main thing is those businesses have *still more* hardware, likely including some of what you're interested in. However, they want to choose themselves which tests they run, according to their areas of interests. Let's say, ARM is interested in testing on their hardware, Intel - on their, Red Hat on their partner's hardware, as well as just testing a wide swath of server functionality, Microsoft is concerned about its Azure performance, and so on.

Yet the only reason they send their results to KCIDB is to alert the developers of any issues they find, so they could fix them sooner. I.e. even though they choose how to use their hardware, which tests to run, there is no reason for them to do it, *unless* they can get the attention of developers. So they're prepared to do their best to make the results good enough, and many of them do.

This could be seen as a trade: businesses offer test runs (potentially on special hardware), developers offer attention. Both want the kernel to work.

The idea of KCIDB is to provide a sort of a marketplace for this trade to happen, a place where those CI systems can offer their results, in the best shape possible. And a place where developers (and maintainers) can pick and choose which test results to see and be alerted about.

We're implementing a known-issue detection and tracking system in KCIDB, using the techniques other CI systems use (including some of Mesa CI ones), and the corresponding data from the submitting CI systems to get our results in a better shape. The plan also includes using issue descriptions and patterns from one CI system on results from all CI systems, which should make them even better. We plan to offer a UI for submitting those issue patterns manually as well, so CI system maintainers, test maintainers, *and* kernel developers can come in and mark particular results as invalid.

The subscription system KCIDB already offers allows developers to subscribe to particular results of specific tests, executed on code appearing in specific branches, on particular architectures, and so on. You can also select which CI systems you want to trust, if you want, and that would be a clear signal to other CI systems to shape up.

We would be glad to add more conditions as requested by any interested developers, so that they get just the results they need. Essentially and ultimately we want to give you the ability to build your own "test plan", or if you'd like, "test result shopping list" out of what the CI systems have to offer.

Of course, it's best if you go and talk to the maintainers of the corresponding CI systems to ask them for particular tests and hardware and to fix any issues with the testing they do. And, as I said, making the tests "official" and tying them to a particular kernel version by merging them into the kernel repo could help communicate what should be executed (and improve result quality). But KCIDB is aiming to provide you with the tools to get the best out of the results CI systems send, and to protect you from the noise.

Challenges for KernelCI

Posted Aug 8, 2023 12:44 UTC (Tue) by mupuf (subscriber, #86890) [Link]

> Thanks for a thoughtful response, Martin. Sorry it took me a while to answer.

You're welcome, and don't worry about it, we are all busy fighting regressions and improving systems, aren't we? :p

> Then, "integration testing" is relative, and from different points of view *all* testing is, and is not "integration testing", at the same time.

Right, I get what you mean, but testing functions in isolation (by controlling all inputs and outputs, AKA unit testing) will always be the most reliable. I am really happy that the kernel is moving towards it (KUnit), but using it to the fullest extent requires refactoring code which takes time and motivation.

IMO, in an ideal world we could test all the common code in isolation, and focus hardware testing to the hardware-specific code. Basic primitives would be tested using kselftests, and full integration tests would be used to detect bad interactions (in the gfx world, one would replay vulkan/GL traces or decoding videos).

In a more practical world, common code could be tested by creating "fake" drivers that allow emulating different scenarios, so that existing user-space test suites could be used. In the DRM world, this would be the VKMS driver. This isn't as reliable, but at least everyone can run the tests locally, which is great!

I hope I explained better what I meant with "Integration testing is by nature unreliable"!

> First, let's separate "kernel CI" from "KernelCI". I know it's hard, given the name.

Ha ha, yeah, thanks for letting me know!

> Lack of developer control doesn't really stop CI systems from *executing* the tests, but it slows down the feedback loop, keeping result quality low, which in turn reduces developer trust in those results.

Indeed :) And it also does not incentivize developers to take an active role in the testing. More on that later ;)

> That is the reason I am advocating for moving as many tests as possible into the kernel repository itself. Let the developers take care of choosing the tests that go with a particular kernel version, and let the CI systems simply run what they thusly prescribe for that version, and report the results.

That's Mesa's approach... ish! Mesa has the test lists, results expectations, and how to build the test environment in the tree. The latter will be rebuilt and shared between the test machines thanks to Gitlab CI.

In the case of the kernel, I don't believe we want to include the test suites inside the projects needing them... especially for the kernel, which depends on maintainer trees! Indeed, if a test suite can be used by more than one driver... how do you make changes there, since it may require changing files maintained by multiple trees?

Rather than having the test suite in the kernel, I would prefer it if drivers just specified the test list, results expectations, and the test environment they want to use (using... a Dockerfile and a unique tag which is bumped every time the Dockerfile is changed?). Test farms would build the test containers and cache them using their unique tag, and test machines could run the test container directly (using boot2container), or the container could be extracted to make a rootfs that LAVA could use. It would be less of a pain if kernel.org was hosting a forge and container registry... but it is not necessary.

That being said, I know for a fact this would not work for i915 and probably many other complex drivers. Integration test results are not binary (always pass, or always fail): multiple intermittent issues may be affecting them (some of them hardware-related). So, you can't just skip all of these tests or mark them all as flaky... unless you want to end up in the situation Kernel CI is in with IGT (only ~0.5% of the test suite is executed) and miss a lot of the regressions.

The issue is that, unless you test every merge request hundreds of times, you won't catch many of these issues and thus will fail to document them in the commit that introduced them. As the number of tests grows, you will tend towards 100% of merge requests failing due to CI. AKA, the system would be unusable. One solution may be to document all these low-probability flakes only at the time of asking Linus to merge the tree (and before every rc), so that the commit may be amended or at least updated in a later commit... but that sucks, because you really want such a system to work for both pre- and post-merge! Developers will ignore CI results if most runs bring false positives.

In the interest of transparency though, I would like to say that while I believe such an approach would be an improvement over the status quo, I do not believe it would serve us well in the end. Due to hardware/timing differences, the reality is that expectations are very much tied to the machine(s) that generated them. So shipping expectations in the tree, while it would work 90% of the time, feels a little silly. They could just as well have been stored inside the CI farm that generated them (and will be testing it in the future), and there would never be a need to amend commits. One could then decide to use git notes to store or amend expectations, but that seems error-prone in practice.

This is why I went for a centralised approach when I developed CI Bug Log (https://gitlab.freedesktop.org/gfx-ci/cibuglog). It isn't perfect at tracking 10+ trees, but that's fine for driver developers since they should only care about a handful of them (their development tree, linux-next, Linus' tree, and stable trees). In any case, it was enough to get i915 from being constantly in the news to instead being praised for its quality a couple of years later. During this time, we went from running tens of thousands of tests per week to 6+ million by increasing our machine count and test count by over 10x, and running them 24/7 (idle runs, used to catch more flakiness). The tool ingests execution reports, filters out known issues, forces a bug filer to document all unknown issues found post-merge, keeps track of reproduction rates for you, gives you a prioritized list of bugs to work on (based on the impact rate), and even helps you triage and stay on top of bugs filed for your project.

It may feel a bit corporate, but if you want to achieve production-readiness with upstream code... you'll need strict processes like that.
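For illustration, a rough sketch of the reproduction-rate bookkeeping described above; the data shape and numbers are invented and this is not CI Bug Log's actual schema:

    from collections import Counter

    # Invented data shape: (run_id, matched_known_issue_or_None) per failure seen.
    failures = [
        ("run-1", "issue-42"), ("run-2", "issue-42"), ("run-3", None),
        ("run-3", "issue-17"), ("run-4", "issue-42"),
    ]
    total_runs = 4

    hits = Counter(issue for _, issue in failures if issue)
    for issue, count in hits.most_common():      # prioritized by impact
        print(f"{issue}: seen in {count}/{total_runs} runs "
              f"({100 * count / total_runs:.0f}% reproduction rate)")

    unknown = [run for run, issue in failures if issue is None]
    print("runs with undocumented failures needing a filed bug:", unknown)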

> Apart from that, it would make the tests easier to discover, make them more "official", and make it easier to tell contributors (how) to run them.

Absolutely!

>> In both cases, the vast majority of the job is tending to results, reworking the tests and test plans with the other developers, and keeping the system afloat.
> Oh, absolutely. I can see this unfolding every day at Red Hat's CKI, where I work, I feel the pain.

Thanks for doing this, this is a thankless job!

> I presented an overview of these approaches at FOSDEM last year

Niiiiice, I see we are definitely thinking alike! Thanks for looking into the Intel GFX CI and CI Bug Log too, that must have been quite a bit of work :)

> Yes, developer-owned hardware farms are the best. However, there's a tradeoff: you can get *more* hardware if you are prepared to relinquish *some* control.

Oh yes, I agree here! Physical access to the machines should not be a requirement, but it is also hard to trust that farm technicians did not change the GPU of the wrong machine :D

This is why the CI system I am developing as an alternative to LAVA verifies that the DUT's hardware config still matches the one expected by the test job. This is done by automatic labeling of the machine and comparing the set of tags with what is found in the database. If the set differs, the machine is taken offline and subjected to multiple boots to check for the stability of such tags, then re-exposed for testing jobs with the updated set if it booted reliably and the set of tags was consistent. Of course, there are more changes than that (container-based testing from an initramfs, automatic discovery of machines, lower reliance on serial consoles, ...)
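For illustration, a minimal sketch of that tag comparison (the tag names and probing helper are invented, not the actual system): the tags discovered on the machine are compared with the set stored in the database, and the machine is pulled from the pool on mismatch:

    # Invented example: tags a machine might report after automatic labeling.
    def discover_tags():
        """Stand-in for probing the DUT; a real system inspects the hardware."""
        return {"cpu:amd-ryzen-5600", "gpu:amd-navi21", "ram:32g"}

    expected = {"cpu:amd-ryzen-5600", "gpu:amd-navi10", "ram:32g"}  # from the database

    found = discover_tags()
    if found != expected:
        # The hardware changed (or the probe is unstable): stop scheduling jobs
        # here and re-verify with multiple boots before re-exposing the machine.
        print("tag mismatch, taking machine offline;",
              "missing:", expected - found, "unexpected:", found - expected)
    else:
        print("hardware config matches, machine stays in the pool")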

> This could be seen as a trade: businesses offer test runs (potentially on special hardware), developers offer attention. Both want the kernel to work.

Ha ha, yeah! I secured a deal like this with another tech giant when I was at Intel. Got us 70 machines that we could install in our CI farm, no strings attached!

Of course, having hardware companies expose the machines to begin with would be even better :) Just need to make it as maintenance-free as possible!

> The idea of KCIDB is to provide a sort of a marketplace for this trade to happen, a place where those CI systems can offer their results, in the best shape possible. And a place where developers (and maintainers) can pick and choose which test results to see and be alerted about.

What you describe sounds good from an infrastructure PoV! In userspace, this is largely what gitlab.freedesktop.org provides us.

I would however really like to pick your brain on the human side of this. From your FOSDEM presentation, I saw that you live in Finland, and it turns out we live in the same city. Shall we meet up?

Challenges for KernelCI

Posted Aug 2, 2023 2:48 UTC (Wed) by mupuf (subscriber, #86890) [Link] (2 responses)

The security concern is real, and I do not have a complete solution here... What I do for Mesa CI (which is pretty open to contributions) is to:

* Not allow internet access from the test machines
* Run the tests from (privileged) containers, from an initramfs I developed: https://gitlab.freedesktop.org/gfx-ci/boot2container
* Have un-overridable timeouts on execution time

Running tests from containers is not so much for security as for reducing the chances of a job affecting the results of another job because it modified the config of the OS. With containers, you get fresh boots every time... along with a standardised transport mechanism, runtime, and cache-ability of its layers for fast boots! Not even mentioning that it enables composition of test suites/monitoring services, and it makes reproducing the test environment on devs' machines waaaaay easier, am I right?

Anyway, the above tricks do not cover attacks on the firmware, or the bootloader (in the case of arm boards where we need to flash u-boot in order to netboot), but at least generic attacks won't persist across jobs \o/

Unprivileged containers for Mesa tests

Posted Aug 2, 2023 15:12 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Could the Mesa tests use unprivileged containers, or even virtual machines?

Unprivileged containers for Mesa tests

Posted Aug 8, 2023 13:05 UTC (Tue) by mupuf (subscriber, #86890) [Link]

So sorry, I forgot to answer Marie!

I would not want to use VMs for testing since device pass-through is more likely to introduce issues that Mesa really doesn't care about (it is more of a kernel testing thing to do).

As for the containers, we can definitely use unprivileged containers provided we mount /dev/dri/ and /sys/class/drm/ in the test container. That being said, what's your threat model if the container runs from an initramfs?

Challenges for KernelCI

Posted Aug 2, 2023 9:53 UTC (Wed) by metan (subscriber, #74107) [Link]

I do not think that we can simply merge tests such as LTP into the kernel tree and forget about backward compatibility. In many cases we want a regression test to work all the way back to 10-year-old enterprise kernels, since that is what the people writing these tests are usually paid for. In the end I do not think that backporting a test into ten different stable kernel branches works better than having a single version with a few special cases for different kernel versions. Generally, LTP works mostly against the kernel's userspace API, which is mostly stable, and where possible it tries not to assume internal kernel implementation details that are not guaranteed or easily inferred at runtime. In the end I think that the most reasonable split is that things that prod heavily into kernel internals and assume a lot of details should go into kernel selftests, while more generic tests should be put into LTP. Such a split seems to work fine so far and I do not see how changing it would make things better.

If anything, throwing more resources at testing would help; we have been struggling with manpower for years. If people want to get better coverage and CI, throwing more hands at the problem is something that will improve the situation a lot.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds