Continuous-integration testing for Intel graphics
Two separate talks, at two different venues, give us a look into the kinds of testing that the Intel graphics team is doing. Daniel Vetter had a short presentation as part of the Testing and Fuzzing microconference at the Linux Plumbers Conference (LPC). His colleague, Martin Peres, gave a somewhat longer talk, complete with demos, at the X.Org Developers Conference (XDC). The picture they paint is a pleasing one: there is lots of testing going on there. But there are problems as well; that amount of testing runs afoul of bugs elsewhere in the kernel, which makes the job harder.
Developing for upstream requires good testing, Peres said. If the development team is not doing that, features that land in the upstream kernel will be broken, which is not desirable. Using continuous integration (CI) along with pre-merge testing allows the person making a change to make sure they did not break anything else in the process of landing their feature. That scales better as the number of developers grows and it allows developers to concentrate on feature development, rather than on bug fixing when someone else finds a problem. It also promotes a better understanding of the code base; developers learn more "by breaking stuff", which lets them see the connections and dependencies between different parts of the code.
![Martin Peres [Martin Peres]](https://static.lwn.net/images/2017/xdc-peres-sm.jpg)
CI also helps keep the integration tree working at all times. That means developers can rebase frequently so that their patches do not slowly fall behind as the feature is developed. They don't have to fight with breakage in the integration tree because it is tested and working. The rebases are generally small, since they are frequent, which allows catching and fixing integration and other problems early on.
The CI testing that Intel does has a number of objectives. It is meant to provide an accurate view of the current state of the hardware and software. In order to do that, there are some requirements on the test results. They must include all of the information needed to diagnose and reproduce any problems found, including the software and hardware configurations and log files. The tests also need to run quickly so that developers can get fast feedback on a proposed change. To that end, Intel has two levels of testing: one that gives results in 30 minutes or less and another that is more complete but takes up to half a day.
The results need to be visible. Publishing the test results on a web site would work, he said, but it is better to make the results hard to miss. If the patch being tested is something that was posted to a mailing list, the results should be posted as a reply. "Put the data where the developer is." In addition, false positive test failures make developers less apt to believe in the test results, so they must be avoided. Any noise in the results needs to be aggressively suppressed.
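One straightforward way to "put the data where the developer is" is for the CI system to thread its report as an email reply to the patch submission. A minimal sketch of that idea (the SMTP host, addresses, and results text are made up for illustration; this is not Intel's actual tooling) might look like this:

```python
# Minimal sketch: reply to the original patch email with the CI results so
# that they appear in the same thread on the mailing list.
# The SMTP host, addresses, and results text are made up for illustration.
import smtplib
from email.message import EmailMessage

def send_results_reply(orig_message_id, patch_subject, results_text):
    msg = EmailMessage()
    msg["From"] = "ci-bot@example.org"
    msg["To"] = "intel-gfx@lists.example.org"
    msg["Subject"] = "Re: " + patch_subject
    # These two headers make mail clients show the report under the patch.
    msg["In-Reply-To"] = orig_message_id
    msg["References"] = orig_message_id
    msg.set_content(results_text)
    with smtplib.SMTP("smtp.example.org") as smtp:
        smtp.send_message(msg)
```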
Vetter went into some detail of the actual tests that are being run. The fast test suite uses IGT, which consists of mostly Intel-specific tests, on actual hardware. There are 250 test cases that run on 30 machines. That test takes about ten minutes to run, but may not complete that quickly depending on the load on the testing systems.
![Daniel Vetter [Daniel Vetter]](https://static.lwn.net/images/2017/lpc-vetter-sm.jpg)
The next step is the full test suite, which takes six hours of machine time. It tests against multiple trees, including the mainline, linux-next, and various direct rendering manager (DRM) subsystem trees (drm-tip, fixes, next-fixes). Those tests run with lockdep and the kernel address sanitizer (KASAN) enabled. There is a much bigger list of a few thousand test cases that gets run as well, he said. The tests, results, and more are all available from the Intel GFX CI web page.
Pre-merge testing is one of the more interesting and important parts of the test system. It picks up patches from the mailing lists using a modified version of Patchwork and runs the test suites on kernels with those patches. If the fast tests pass, that kicks off the full test; meanwhile the patches can start to be reviewed. Metrics such as test-system usage, test latency compared to the 30-minute and six-hour objectives, and bugs found are also gathered and reported. The test matrix (example) was "much smaller and full of red" a year ago, Vetter said. "CI is definitely awesome".
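The pre-merge flow can be pictured as a small polling loop. The sketch below assumes a Patchwork-like REST endpoint and a run_suite() helper, both hypothetical; it is only meant to illustrate the fast-then-full ordering, not Intel's actual implementation:

```python
# Rough sketch of pre-merge testing: watch a Patchwork-style API for new
# series, give fast feedback first, and only then spend six hours of machine
# time on the full suite. Endpoint paths, field names, and run_suite() are
# assumptions for illustration only.
import time
import requests

PATCHWORK = "https://patchwork.example.org/api"

def run_suite(mbox_url, suite):
    """Placeholder: apply the series from mbox_url, build a kernel, and run
    the named test suite on the farm; return True if everything passed."""
    print(f"would run the {suite} suite on {mbox_url}")
    return True

def poll_once(seen):
    for series in requests.get(f"{PATCHWORK}/series/").json():
        if series["id"] in seen:
            continue
        seen.add(series["id"])
        # Fast results (~10-30 minutes) arrive while review is starting.
        if run_suite(series["mbox"], "fast"):
            run_suite(series["mbox"], "full")   # the six-hour run

if __name__ == "__main__":
    seen = set()
    while True:
        poll_once(seen)
        time.sleep(300)
```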
Peres filled in some more of the details. He said there were 40 different systems (up from Vetter's 30) and 19 different graphics platforms. Those range from Gen 3 from 2004 to the announced, but not yet available, Gemini Lake. There are a number of "sharded" machines, which are six or eight duplicate systems that can be used to parallelize the testing. The number of tests run has increased from around 22,000 per day in August 2016 to 400,000 per day in August 2017.
There are some things that have been learned along the way, Vetter said. Noise, in the form of false positives, will kill any test system. "Speed matters"; if it takes three days to get initial results, people will ignore them. In addition, Vetter said that pre-merge testing is the only time to catch regressions. Once a feature has been merged, it is impossible to get it reverted because people get attached to their work, but they don't always fix any regressions in a timely manner.
Linux-next is difficult to test with because it requires lots of fixup patches and reverts to get to something that will function. Part of the Intel testing does suspend-resume cycles on the machines which finds a lot of regressions in other subsystems, Vetter said. Greg Kroah-Hartman suggested posting those regressions as a reply to the linux-next announcement, but Vetter said there would be way too much noise with that approach.
Beyond that, trees like linux-next make bisecting problems way too hard. It takes "a lot of hand holding" to make bisect work so that these regressions can be found. Asking the subsystem maintainer to get them fixed takes even more work, so the Intel team ends up reverting things or changing the configuration to avoid those problems. They do try to report them to the maintainers, but the root of the problem is that some subsystem maintainers put unready code into linux-next in the hopes that others will test it for them; that makes it less than entirely useful as an integration tree.
The problems are generally outside of the graphics subsystem and are often exposed by the thousands of suspend-resume cycles that are part of the Intel graphics testing. The 0-Day test service does not do suspend and resume testing, though Kroah-Hartman suggested that it be added if there is an automatic way to test it. Regressions are hard to get fixed even in graphics, Vetter said, and reverts are difficult to sell. That's why pre-merge testing to find problems before the code gets merged is so important.
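The suspend-resume cycling itself can be driven with a standard tool like rtcwake from util-linux, which arms the real-time clock to wake the machine after a delay. A simple stress loop along those lines (the cycle count, wake delay, and log paths are arbitrary choices, not Intel's setup) could look like:

```python
# Sketch of a suspend/resume stress loop: suspend to RAM, let the RTC wake
# the machine 20 seconds later, and keep the kernel log from each cycle so
# that regressions can be spotted. Needs root; counts and paths are arbitrary.
import subprocess

CYCLES = 100

for i in range(CYCLES):
    # rtcwake sets an RTC alarm and suspends; the alarm resumes the system.
    subprocess.run(["rtcwake", "-m", "mem", "-s", "20"], check=True)
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    with open(f"/tmp/suspend-cycle-{i:03d}.log", "w") as log:
        log.write(dmesg)
    if "Call Trace" in dmesg or "BUG:" in dmesg:
        print(f"possible regression after cycle {i}")
        break
```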
Peres also had a list of lessons learned, some of which, unsurprisingly, overlapped Vetter's. For one thing, kernel CI testing is unlike CI testing for any other project because it requires booting the machine, sometimes with broken kernels. In the year since the project started, they have realized that anything that is not being continuously tested is likely broken. Once again, this is something that is part and parcel of kernel testing because there are so many different configuration options, many of which cannot even be tested without the right hardware.
New tools were needed as Bugzilla is not a good tool to track test failures. Peres has been working on a CI bug log tool to fill that gap. He hopes to release the code for it by the end of the year once the database format has stabilized. It is also important that the developers own the CI system and that the CI team works for them. It should not be a separate team that reports to a different manager outside of the development team. As the developers start to see the value of the CI system, they will suggest improvements to the system and the tests that will help make the testing better.
Other graphics teams that have an interest in replicating the Intel work should have an easier time of it because much of the groundwork has been laid, Peres said. There is still a need for infrastructure and hardware, of course, along with software for building kernels, deploying them, scheduling test jobs, and the like. Several components are particularly important, including ways to power cycle systems and to resume them from suspend; an external watchdog will also help by restarting systems that are no longer responsive. There is a need for qualified humans both to file bug reports and to respond quickly to those bugs and fix them.
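The external watchdog can be as simple as a script that pings each machine and toggles a network-controlled power switch when one stops answering. In the sketch below the machine names and the PDU's HTTP interface are purely hypothetical, since real power-distribution units each speak their own protocol:

```python
# Sketch of an external watchdog: ping the test machines and power cycle any
# that stop responding. Hostnames and the PDU URL are made up; a real PDU
# might use SNMP, telnet, or a vendor-specific HTTP API instead.
import subprocess
import time
import urllib.request

MACHINES = {"test-box-1": 1, "test-box-2": 2}   # hostname -> PDU outlet number

def alive(host):
    ping = subprocess.run(["ping", "-c", "3", "-W", "2", host],
                          stdout=subprocess.DEVNULL)
    return ping.returncode == 0

def power_cycle(outlet):
    urllib.request.urlopen(f"http://pdu.example.org/outlet/{outlet}/cycle")

while True:
    for host, outlet in MACHINES.items():
        if not alive(host):
            print(f"{host} is unresponsive, power cycling it")
            power_cycle(outlet)
    time.sleep(60)
```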
There are a number of challenges that are specific to CI testing for the kernel, Peres said. The first is that various kernels will not boot or function properly; it could be the network drivers, filesystem corruption, or something else that may be difficult to automatically diagnose. Getting tracing and log information out of the system, especially during suspend and resume failures or if there is a random crash while running the tests, can be difficult. Using pstore for EFI systems and serial consoles for the others will provide a way to get information out of a broken system. Note that memory corruption can lead to all sorts of nastiness, including trashing the partitions of the disk, so an automated way to re-deploy the system will be needed.
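Harvesting those pstore records after an unexpected reboot is straightforward: once the pstore filesystem is mounted, crash-time kernel logs show up as files (dmesg-efi-* for the EFI backend) under /sys/fs/pstore. A small collection helper, with an arbitrary destination directory, might be:

```python
# Sketch: copy any pstore records (crash-time dmesg, etc.) out of
# /sys/fs/pstore after a reboot, then delete them to free the backing store
# (EFI variable space is small). Destination directory is an arbitrary choice.
import shutil
from pathlib import Path

PSTORE = Path("/sys/fs/pstore")
DEST = Path("/var/lib/ci/crash-logs")

def collect_pstore_records():
    DEST.mkdir(parents=True, exist_ok=True)
    saved = []
    for record in PSTORE.iterdir():
        target = DEST / record.name
        shutil.copy(record, target)
        record.unlink()          # removing the file drops the stored record
        saved.append(target)
    return saved
```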
Slides [PDF] and a YouTube video of Peres's presentation are available for interested readers.
[I would like to thank the Linux Foundation and the X.Org Foundation for travel assistance to Los Angeles for LPC and Mountain View for XDC.]
| Index entries for this article | |
| --- | --- |
| Conference | Linux Plumbers Conference/2017 |
| Conference | X.Org Developers Conference/2017 |
Posted Oct 11, 2017 16:19 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link] (27 responses)
Posted Oct 11, 2017 16:39 UTC (Wed)
by drag (guest, #31333)
[Link] (2 responses)
With the Linux kernel it's big enough and popular enough that there are people running RC kernels and can file bags.
If nobody is testing the hardware or certain configs then it's likely to get broken and remain broken. This is a problem on older or more obscure hardware. Linux may have drivers for it, but chances are they won't work without some effort.
Posted Oct 11, 2017 16:44 UTC (Wed)
by bnorris (subscriber, #92090)
[Link]
I like this misspelling. It'd be much more fun to spend all my day in a bag tracker, rather than a bug tracker.
Posted Oct 11, 2017 22:17 UTC (Wed)
by roc (subscriber, #30627)
[Link]
Because minimal testing is performed by the kernel developers themselves, kernel RCs often contain regressions that break rr. We try to run RC1 at least through the rr test suite to see if they've broken anything. Of course regressions can land later, and do --- a patch late in 4.12 broke rr, and the release shipped that regression, so rr was broken in the stable kernel for a month.
So the answer to "how the LKML community survives without it" is really "by regressing often and letting kernel consumers deal with the fallout".
Posted Oct 11, 2017 16:45 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (10 responses)
1. From very early, much earlier than for a comparable commercial system like NT, the Linux kernel's developers subsisted largely and in some cases entirely on dogfood. So all core systems that must work to have an environment in which you can edit text files, compile and link a large project and then ship it somewhere were tested de facto continuously by their developers. If you broke chown() then it didn't fail a unit test that would result in an email and public shame - it broke your computer, and you spent miserable hours figuring out what was wrong. If you broke the filesystem your files got trashed and you didn't have anything to show for it.
2. Almost all drivers and subsystems which aren't actively used by developers rot and die. In some ways this is "better" than a traditional CI because traditional CI causes a bias where the "loudest screams win" - effort may be expended on fixing something that failed unit tests even though it's actually not very important, and that's always effort which could have been directed at something which _is_ important. In Linux the developers were unavoidably focused on making their own computers actually work. When those computers had NE2000 ISA cards they made sure the driver for NE2000 ISA cards worked. Today they have Intel graphics chips.
So: Lots of things broke, but, relatively few of them were things people cared about.
Posted Oct 11, 2017 22:23 UTC (Wed)
by roc (subscriber, #30627)
[Link] (9 responses)
Every time my laptop freezes I feel a little bit grumpy about how i915 is the poster child for how Linux kernel graphics should be done.
Posted Oct 12, 2017 1:05 UTC (Thu)
by ras (subscriber, #33059)
[Link] (8 responses)
Me too. The headline prompted me to read the article. I was hoping to read a mea culpa from the Intel devs, along with how they were addressing their quality issues. Instead I got Intel devs pontificating about testing, and it does not sit well with me.
I can only hope that they are talking about some different driver. The i915 driver was terrible. In fact maybe that driver's quality issues are what drove them to implement CI. In 4.12 it's not so bad - but it's taken 2 years(!) after the chip was released to get a stable driver. Their first 10 or so releases were so bad people were returning Dell laptops as unusable and subsequently ranting at Dell on every forum they could find. All the noise came from people running Windows - but I suspect only because the people running Linux knew who was really to blame, and trusted it would be fixed. I was one of those. But I never dreamed it would take so long.
Then there's whatever driver xbacklight depends on. Promised for 4.11, but still not delivered in 4.12. https://bugs.freedesktop.org/show_bug.cgi?id=96572#c11 That's been over 2 years(!).
Posted Oct 12, 2017 7:32 UTC (Thu)
by blackwood (guest, #44174)
[Link] (7 responses)
If you read the article it says clearly that 1 year ago the entire board was red, which is around 4.10/11. We're not blind idiots who can only pontificate; the reason we pontificate is that, in less than a year, CI dug us out of the huge hole we got into over 2-3 years of no testing at all due to reorg madness (you can't see all that yet because development is 4-6 months ahead of the latest release). So yeah, CI is pretty much the only way to get quality on track.
And yes skl didn't work on those older kernels. Per CI, it still didn't work well on 4.12, but at least it's better (4.14 should be actually good).
Posted Oct 12, 2017 7:46 UTC (Thu)
by andreashappe (subscriber, #4810)
[Link]
If there were a way of upvoting a comment, I would do that. Thanks for your work.
Posted Oct 12, 2017 7:48 UTC (Thu)
by ras (subscriber, #33059)
[Link] (1 responses)
A rational explanation. It also explains why 4.12 was a marked improvement. Thank $DIETY for that. Well, I guess I should be thanking you guys.
Sounds like you are back on track. It would be interesting to know how an engineering firm like Intel fell off it in the first place, but I guess that explanation will have to wait until someone moves on.
Posted Oct 14, 2017 0:45 UTC (Sat)
by rahvin (guest, #16953)
[Link]
Posted Oct 13, 2017 20:12 UTC (Fri)
by JFlorian (guest, #49650)
[Link] (3 responses)
I just wish I could buy Intel graphics chipsets on add-in cards. The integrated video easily becomes too dated while the mainboard remains otherwise sufficient.
Posted Oct 14, 2017 1:40 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link]
Posted Oct 14, 2017 10:08 UTC (Sat)
by excors (subscriber, #95769)
[Link] (1 responses)
Then you'd need to add gigabytes of dedicated VRAM to make it work as a discrete card. And in terms of performance the current highest-end Intel GPUs would still only compete with perhaps a $70 NVIDIA card, so it doesn't seem there's much opportunity for profit there.
Posted Oct 17, 2017 15:13 UTC (Tue)
by JFlorian (guest, #49650)
[Link]
The Nvidia cards I do buy are often in the $70 range. I don't play games so most anything is overkill.
Posted Oct 11, 2017 17:42 UTC (Wed)
by arjan (subscriber, #36785)
[Link] (4 responses)

the 0day project does a heck of a lot of tests (compile, boot, performance, some functional) on all the maintainer and pre-maintainer git trees before they hit linus.... if that is not CI then what is?
Posted Oct 11, 2017 18:00 UTC (Wed)
by blackwood (guest, #44174)
[Link] (2 responses)

Too little (in terms of coverage, even on intel hardware), too late (in the patch merge process, abusing linux-next to get your stuff tested really isn't cool imo).

Note that on the gfx side we still have a few years of catch up to do until we are where we need to be. And that's just for gfx.
Posted Oct 12, 2017 14:05 UTC (Thu)
by dunlapg (guest, #57764)
[Link] (1 responses)
Remember that people submit things to Linux that the core developers themselves don't use and often don't care about. It would be completely infeasible for the core developers themselves to actually implement regression testing for every driver and every bit of functionality that anyone ever used.
Which means there are basically two options:

The second one is what they do, and I think it's probably the best: The people who care about a feature should be the ones testing it and making sure it doesn't bitrot.

The main thorny issue with this is when changes to your functionality break other people's functionality. It doesn't seem fair that (say) Intel should have to keep chasing down driver failures caused by bugs introduced by someone tweaking some TLB-flushing algorithm somewhere. But on the other hand, it also doesn't seem fair that someone who just wants to tweak a TLB-flushing algorithm should have to fix (or work around) dozens of buggy drivers whose authors disappeared years ago.
Posted Oct 12, 2017 14:44 UTC (Thu)
by blackwood (guest, #44174)
[Link]
Of course you can't cover everything, and the occasional regression on an odd combo of hw gets through.
But that's far away from the current world where patches land in linux-next that just take out 2/3rd of our machines, and intel desktop/laptop chips are probably the most well-tested platforms there are. That's just pushing untested crap in.
We want to spend the time of our CI engineers improving intel gfx CI, not chasing around the regression-of-the-day from some random other place that just took out the lab. Of course we have defense-in-depth and try to run linux-next before we have to rely on it when it all hits the merge window. But then it takes weeks (or even months) to get even simple regressions fixed, and often you get the 0 day garbage that lands in the merge window (or even later) without any kind of serious testing except maybe a day in linux-next - hey at least it compiles!
Posted Oct 12, 2017 14:48 UTC (Thu)
by mupuf (subscriber, #86890)
[Link]
Martin Peres here. I agree that 0-day is doing a lot of useful testing (and yes, it is CI). However, I think I came across wrongly here, I was mostly talking about functional testing (especially for Graphics).
0day has been very helpful for the platform-agnostic parts. However, it does not try to deal with the hardware and firmware much (outside of booting) and I cannot blame you for that, as this makes everything harder to deal with (and requires a LOT MORE lab space). 0-Day would be helping us a lot if it was also testing suspend and suspend to disk, as we are fighting with these in linux-next and every -rc1.
I also think that my recommendation to have a public page showing the current state of Linux would be good! It would show what kind of HW coverage you have, and what the current state of Linux is on the test suites you run :)
Keep up the good work!
PS: BTW, for performance, you may be interested in ezbench, as it takes better care of the variance, handles bisection of multiple performance changes, and makes reports nicer to read (because you only output what really changed). It also supports bisecting multiple things at the same time (which should help with throughput). Ezbench also works for unit tests and rendering testing, but that may not be your goal :)
Posted Oct 11, 2017 20:48 UTC (Wed)
by ballombe (subscriber, #9523)
[Link] (6 responses)
Posted Oct 11, 2017 23:00 UTC (Wed)
by roc (subscriber, #30627)
[Link] (3 responses)
In fact, as people have pointed out, Linux does have some CI, there just aren't nearly enough tests running.
Posted Oct 12, 2017 7:15 UTC (Thu)
by blackwood (guest, #44174)
[Link] (2 responses)
The same way monolithic source control won't work for linux, monolithic CI won't work for linux (and everyone just hoping that 0day catches everything is silly because of that).
Posted Oct 12, 2017 13:54 UTC (Thu)
by arjan (subscriber, #36785)
[Link] (1 responses)

it tests a few hundred git trees, and you can ask it even to test dev branches etc etc....
Posted Oct 12, 2017 14:49 UTC (Thu)
by blackwood (guest, #44174)
[Link]
"mostly compiles" and "boots on that virtual machines" aren't really quality standards I deem sufficient. Most of the stuff 0-day catches it only catches once your patches are in linux-next (like combinatorial build testing, or the more time-consuming functional tests, e.g. it also runs igt tests, but tends to be a few weeks behind our CI).
All the other quality checkers we have (static checkers, the cocci army, the gcc warning army, and so on) also only check linux-next. At best.
I know rather well how much 0day tests, it's just plain not good enough for the kernel overall.
Posted Oct 12, 2017 9:57 UTC (Thu)
by sorokin (guest, #88478)
[Link]
I would say that some CI testing is better than no CI testing and only manual testing. Also, time requirements are different for different types of testing. If one requires some kind of fuzz-testing for weeks then yes, the time is too short for that.
What I'm thinking is some kind of hand-written tests (regression or unit). Normally this type of tests runs very fast.
More expensive testing (like fuzz-testing) could be done in background and as it discovers new problems tests are added to the hand-written test corpus.
Posted Oct 12, 2017 14:14 UTC (Thu)
by drag (guest, #31333)
[Link]
There really isn't any such thing as a 'comprehensive test'. You are really running into the halting problem if you try to do that.
How CI pipelines work is generally something like this:
1. You commit your code to a branch and push.
2. CI tool picks up on the push and using instructions provided in the git repo it launches a runner process.
3. The runner process then builds the code, creates the virtual machine or uses docker to create the application environment (or anything you can really think of using really), which then launches the code and then performs a series of tests against it.
4. Based on the results of the tests it sends a email back or leaves a note on a dashboard (or whatever) indicating the result.
The tests evolve organically over time and are saved and committed along with the code in most cases.
You run into a serious bug you make a test that will check that bug and similar things in the future. So CI pipelines are there to catch common anti-patterns and mistakes that people make all the time. It may involve static analysis tools or anything else you can think of.
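To make that concrete, here is a toy runner loop in Python; it is not how any particular CI tool is implemented, and the repository path, branch, and ci/run-tests.sh script are invented for the example:

```python
# Toy version of the pipeline steps above: check out the pushed branch,
# build it, run the test script shipped in the repo, and report the outcome.
import subprocess

def notify(result):
    print("CI result:", result)      # stand-in for the email/dashboard step

def run_pipeline(repo="/srv/ci/project.git", branch="testing"):
    workdir = "/tmp/ci-checkout"
    subprocess.run(["rm", "-rf", workdir], check=True)
    subprocess.run(["git", "clone", "--branch", branch, repo, workdir],
                   check=True)
    if subprocess.run(["make", "-C", workdir, "-j8"]).returncode != 0:
        notify("build failed")
        return
    tests = subprocess.run([f"{workdir}/ci/run-tests.sh"])
    notify("tests passed" if tests.returncode == 0 else "tests failed")
```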
I am sure that everybody has set it up so that when they save code some test or compile is triggered on their laptop at some point. Same idea, but on steroids.
The benefit from this is that you are running a suite of tests on every small and trivial change. Incremental testing. That way if you accidentally introduce a common or known type of bug it gets caught while it's still fresh in your mind. Quicker and better feedback on smaller changesets makes things easier for everybody.
Of course this doesn't replace other types of testing that are already being done. Obviously something like network performance regressions or problems with aging file systems can't be tested in this way without some serious time commitment. So this is very much not an 'instead of' and more of an 'in addition to'.
Posted Oct 12, 2017 20:17 UTC (Thu)
by SEJeff (guest, #51588)
[Link]
I'm virtually certain that Ingo Molnar was working on a large-ish one when he became the x86 maintainer a few years ago as well, but there are many places that rely on linux which run infra such as this, and they report bugs!
Posted Oct 11, 2017 22:36 UTC (Wed)
by error27 (subscriber, #8346)
[Link] (1 responses)
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3219/fi-g...
It would be better if, instead of 1.20-g136100c2, it just gave you a git tree and a commit to check out. Check out how the 0day bug reports do it.
Posted Oct 12, 2017 7:22 UTC (Thu)
by blackwood (guest, #44174)
[Link]
Also note that the CI we run on the branches does _not_ test anything more than what you get when you submit a patch (like I said, only pre-merge is the time to catch crap, after that it's too late). The main reason we have all that stuff is to baseline the test results so that we can compare against results with a patch series applied.
Aside: The build link has everything you want, plus lots more. The sha1 you're seeing here is just the standard output of our tests.
Posted Oct 12, 2017 3:13 UTC (Thu)
by ajdlinux (subscriber, #82125)
[Link]
FWIW, I'm contributing to the fragmentation and working on some tooling which we'll use for doing more pre-merge testing (building, static analysis, qemu/kvm testing, eventually baremetal testing) for PowerPC kernel and firmware. Hopefully will have more to report in the next few months after we get POWER9 out the door...
Posted Oct 12, 2017 8:45 UTC (Thu)
by Kamiccolo (subscriber, #95159)
[Link] (1 responses)
Posted Oct 12, 2017 10:41 UTC (Thu)
by blackwood (guest, #44174)
[Link]
It will happen.