
Advances in Mesa continuous integration

By Jake Edge
October 9, 2018

X.Org Developers Conference

Continuous integration (CI) has become increasingly prevalent in open-source projects over the last few years. Intel has been active in building CI systems for graphics, both for the kernel side and for the Mesa-based user-space side of the equation. Mark Janes and Clayton Craft gave a presentation on Intel's Mesa CI system at the 2018 X.Org Developers Conference (XDC), which was held in A Coruña, Spain in late September. The Mesa CI system is one of the earliest successful CI initiatives in open source that he knows of, Janes said. It is a core component of Mesa development, especially at Intel.

Like many companies, Intel is a large organization with an "old school development model". He likened it to a Roman army, where there are legions made up of smaller groups, each of which has procedures for all of its activities; tents are set up and arranged the same way each time. When Intel first encountered Mesa development, it was something of a shock: there were no architects in the group, yet the Mesa developers were simply running right through the Intel army.

[Mark Janes]

There is a tendency to be smug and dismissive of the other parts of the companies that we work in, he said. He thinks it makes more sense to take the time to explain why the open-source methodology is superior. For 50 years, building software has been about differentiation and competing, but part of the open-source model is to collaborate with competitors. He thinks open source has an unbeatable advantage, but it is still the underdog in many respects; it is "a lot of fun to be an underdog with an advantage", Janes said.

Technical details

He turned things over to Craft, who went over some of the technical details of the CI system. It has roughly 200 machines with full Intel hardware coverage going back to 2007. It runs multiple test suites, including piglit and the conformance test suites (CTS) for Vulkan, OpenGL, and OpenGLES. It does build tests of non-Intel platforms for Android as well. Each commit to the mainline kicks off millions of tests. Users can add their development branches to the CI system, as well, so that commits to those get tested before the code ever ends up in the mainline.

The target execution time for these tests is 30 minutes or less; otherwise, it doesn't really fit well with a developer's workflow. There is a "very very low false positive rate", Craft said, of about 0.0001%, so developers can be confident that the reports they get are accurate. For various performance tests, it can generate trend lines that make performance regressions easy to spot. The implementation of the CI system is open source and available on GitLab; "steal our stuff", he suggested.

Janes said that when the project started, there were no real success stories for testing in open source, nor was there a testing framework available to be used. Being transparent with the results of the testing and the progress of the CI system has helped with its adoption. Having a really low false positive rate, wide coverage, and rapid turnaround has helped as well. He likened working in a project to living with friends as roommates; at first it is great, but eventually the dishes start to pile up and no one wants to deal with them. The CI system is meant to help developers get along; it is the automatic dishwasher for the community, he said.

Stats and graphs

Statistics are useful, but they can also be misused. Some stats are easy to collect and highly visible, but do not necessarily correlate with the qualities they are taken to measure. For example, tracking and prominently showing the number of commits in a particular area can lead to more white-space commits or to splitting commits with an eye toward boosting those numbers, he said.

That said, he started showing graphs from their slides [PDF]. He started with a graph that showed both the number of commits and the number of CI builds for each developer with more than 30 commits since the CI system started in August 2014. One thing it shows is that an active developer can also be an active user of the CI system—several of the most active developers have nearly as many CI builds as commits. Some other developers are perhaps trying to be extra safe and do more CI builds than commits, or maybe they are simply new to Mesa and are learning. In general, for those using the CI system, the number of builds was roughly on the same level as the number of commits.

Janes then described two ways to track bugs that get introduced into Mesa: reverts and fixes. Reverts are as old as Git, but fixes are relatively new. The "Fixes" tag was introduced in 2017; it records the ID of the commit that introduced the bug being fixed. In both cases, the "offending" commit ID can be extracted, so the developer who introduced the bug can be identified. The Fixes tag is used to help automate the inclusion of commits into bug-fix releases.
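To give a rough idea of how that kind of data can be mined (a hypothetical sketch, not Mesa's actual tooling; the function names and output format are invented), a short script could walk the Git history, collect the "Fixes:" trailers, and attribute each one back to the commit, and thus the author, that introduced the bug:

    #!/usr/bin/env python3
    # Hypothetical sketch: map "Fixes:" trailers in a Git history back to
    # the commits (and authors) that introduced the bugs.  Not Mesa's
    # actual CI tooling.
    import re
    import subprocess
    from collections import Counter

    def commits_with_fixes(repo="."):
        """Yield (fixing_sha, offending_sha) pairs found in the git log."""
        log = subprocess.run(
            ["git", "-C", repo, "log", "--pretty=format:%H%x00%B%x01"],
            capture_output=True, text=True, check=True).stdout
        for entry in log.split("\x01"):
            if "\x00" not in entry:
                continue
            sha, body = entry.split("\x00", 1)
            for match in re.finditer(r"^Fixes:\s*([0-9a-f]{7,40})", body,
                                     re.MULTILINE | re.IGNORECASE):
                yield sha.strip(), match.group(1)

    def offender_counts(repo="."):
        """Count how often each author wrote a commit later tagged by Fixes."""
        counts = Counter()
        for _fixing, offender in commits_with_fixes(repo):
            author = subprocess.run(
                ["git", "-C", repo, "show", "-s", "--pretty=%an", offender],
                capture_output=True, text=True).stdout.strip()
            if author:
                counts[author] += 1
        return counts

    if __name__ == "__main__":
        for author, n in offender_counts().most_common(10):
            print(f"{n:4d}  {author}")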

For the following graphs, he removed the names of the developers, and just identified them by the number of commits they made over the period that the graph covered. He showed a graph of the percentage of fixes per commit as well as the number of CI builds since the Fixes tag was adopted in 2017. As might be guessed, those who are heavy users of CI tend to have much lower rates of patches needing Fixes. He joked that since the names were not on the graph, he would be happy to accept beer from any developer who wanted to ensure that no one would ever know which entry on the graph was theirs.

Looking at CI builds and reverts back to the start of the CI era in 2014 shows a similar correlation to the Fixes graph, though the signal is somewhat noisier. But heavier users of the CI system tend to have fewer reverts, in general.

[Clayton Craft]

There is quite a bit of work that goes into ensuring that the CI test results are useful and do not contain known failure cases, Craft said. There are lots of tests that will never run on certain platforms, so those need to be filtered out of the test results. That way, spurious failures will not clutter inboxes or the CI status web site.

In addition, once a bug stemming from a particular commit has been identified and the developer has been notified, it does not help other developers to keep getting notified of that problem. Craft said that he and Janes identify these commits and configure the CI system to expect the failing test(s).

At that point, though, when someone fixes the problem, the test will start passing, which could look like a failure to the CI system. Switching back to the pre-bug CI configuration would fix that, but a development branch created before the fix would still fail the test, so it cannot use the post-fix configuration. So the CI configuration files are tracked in Git, and the changes to the expected test results are tagged with commit IDs. That way, branches will only see failures caused by changes on the branch, and the mainline will reflect known bugs and fixes properly.
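One way to picture that scheme (a sketch only; the expected-results format and function names here are invented, and Mesa's real configuration lives in the project's repository on GitLab) is to tag every expected-result change with the mainline commit that made it necessary, and then apply only the entries whose tagged commit is already an ancestor of the branch being tested:

    # Hypothetical sketch of per-commit expected-result resolution; the
    # configuration format is invented for illustration.
    import subprocess

    # Expected-result changes in mainline order; later entries override
    # earlier ones.  Each entry: (test name, expected status, mainline
    # commit that introduced or fixed the failure).  The hashes are fake.
    EXPECTED = [
        ("spec/arb_example/subtest", "fail", "1d2c3b4a"),  # known regression
        ("spec/arb_example/subtest", "pass", "f4e3d21c"),  # subsequent fix
    ]

    def is_ancestor(commit, head, repo="."):
        """True if `commit` is reachable from `head` in the repository."""
        return subprocess.run(
            ["git", "-C", repo, "merge-base", "--is-ancestor", commit, head],
            capture_output=True).returncode == 0

    def expected_status(test, head, repo="."):
        """Expected status of `test` for the branch whose tip is `head`.

        Only entries whose tagged commit is already part of the branch are
        applied, so a branch created before the fix still expects the
        failure, while mainline (which contains the fix) expects a pass.
        """
        status = "pass"
        for name, result, commit in EXPECTED:
            if name == test and is_ancestor(commit, head, repo):
                status = result
        return status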

Those configuration files can be used to gather some statistics about test regressions, Janes said. For the i965 driver, commits that regressed the tests were generally fewer for those who used the CI system more. He summed up his case by saying that CI is helpful for developers; those who use it love it because they do not like having their commits reverted, and the CI system allows them to avoid that. At Intel, it is believed that use of CI has a multiplicative effect on the rate of development, he said.

CI hardware

There are around 200 computers in the CI system, Craft said. On average, they are about $400 systems but, in reality, most of the money goes for the high-end, new systems. The project got the older systems for free or for pennies on the dollar from eBay.

There are servers, desktops, and laptops represented, though the builds for those systems are done on dedicated Xeon systems. Craft showed some pictures of the CI hardware, which consists of "a lot of systems in a small amount of space". Various tricks have been used to squeeze more into that space, including an innovative use of Ikea napkin holders to store laptops vertically.

As part of the talk, Intel announced a new, public CI results site. Previously, those outside of Intel would get an email with the results of their CI run that they would need to parse to figure out what went wrong. Those inside of Intel could use the Jenkins web site, but it took a long time for pages to appear. The old site would take nearly a minute and a half to load a page, while the new site takes just over half a second.

Janes and Craft took a test drive through the new site, which provides lots of information about what builds were tested and what the outcome was. If tests failed, one can drill down to see the test output, history, and more. For example, piglit did not build earlier on the day of the talk, but because of the CI system, it was noticed and fixed before many developers even knew it was broken, Janes said.

In the Q&A, Janes suggested that those who want to have their own tests run should get them added to piglit or another of the community test suites. He also said that it is difficult to identify a small subset of tests that gives a good indication of driver health in, say, one minute. He noted that the work on the new CI web site was mostly done to support those outside of Intel; other than speeding up page loading, it didn't really provide anything new to Intel developers. Janes and Craft clearly hope that more Mesa developers will use the system and that all of Mesa development will benefit.

[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to A Coruña for XDC.]

Index entries for this article
Conference: X.Org Developers Conference/2018



Advances in Mesa continuous integration

Posted Oct 9, 2018 21:16 UTC (Tue) by roc (subscriber, #30627) [Link] (13 responses)

Did they really say "there were no real success stories for testing in open source" before 2014? ("when the project started" ... "the CI system started in August 2014")???

Advances in Mesa continuous integration

Posted Oct 9, 2018 21:37 UTC (Tue) by jake (editor, #205) [Link] (9 responses)

> Did they really say "there were no real success stories for testing in open source" before 2014?

Taking a peek at the video: https://www.youtube.com/watch?v=nm3U7jaNJQQ (which was posted about the same time I finished my writeup), I do have that wrong. Mark said: "there weren't a lot of great success stories for testing in open source" (about 9:32 in the video)

The project may well have started before 2014, but it started being used in 2014.

jake

Advances in Mesa continuous integration

Posted Oct 9, 2018 21:56 UTC (Tue) by roc (subscriber, #30627) [Link] (8 responses)

I guess it depends on when they "started" their project, and what they meant by "a lot" and "great". Then when they say "there weren't any open source test projects" I guess it depends on what they mean by "open source test project".

But by the time they got their system going in 2014, Firefox and Chromium had been running millions of tests on every commit for many years. All Mozilla's infrastructure, at least, is/was open source. I'm pretty sure many other open source projects had CI by then. Travis CI started getting popular in 2012. Jenkins has been around since 2005.

Perhaps they really meant "kernel-related" or "at Intel".

Advances in Mesa continuous integration

Posted Oct 10, 2018 3:29 UTC (Wed) by marcH (subscriber, #57642) [Link] (7 responses)

> > It has roughly 200 machines with full Intel hardware coverage going back to 2007

> But by the time they got their system going in 2014, Firefox and Chromium had been running millions of tests on every commit for many years. [...] Perhaps they really meant "kernel-related" or "at Intel".

Basically yes; I suspect Mark and Clayton were implicitly narrowing the scope to drivers and other hardware testing. This project is barely starting to fill the huge gap between open and closed source in *hardware* validation.

See also https://blog.ffwll.ch/2013/11/testing-requirements-for-dr...
https://www.google.com/search?q=site%3Ablog.ffwll.ch+test
https://groups.google.com/a/chromium.org/forum/#!msg/chro...

Advances in Mesa continuous integration

Posted Oct 10, 2018 4:51 UTC (Wed) by roc (subscriber, #30627) [Link] (6 responses)

From the article (I haven't watched the whole video) it sounds like they're focused on testing software, not hardware. I.e. the hardware is fixed and software changes are tested.

I assume they test on a wider variety of hardware than most open source projects ... but Firefox and Chromium test on a pretty broad set of platforms and hardware too, including various Android platforms which are a real pain. Chromium test on specific GPUs, Firefox too to a lesser extent.

Advances in Mesa continuous integration

Posted Oct 10, 2018 5:29 UTC (Wed) by marcH (subscriber, #57642) [Link] (3 responses)

> it sounds like they're focused on testing software, not hardware. I.e. the hardware is fixed and software changes are tested.

Neither is fixed.

In some other, more... "traditional?" hardware validation activities, both are fixed, i.e. 1:1 relationship between software branch and hardware branch.

> but Firefox and Chromium test on a pretty broad set of platforms and hardware too, [...] Chromium test on specific GPUs, Firefox too to a lesser extent.

OK but guess where they report hardware-specific issues? Drivers are by definition what's supposed to abstract the hardware away.

> including various Android platforms which are a real pain.

Can you elaborate?

Advances in Mesa continuous integration

Posted Oct 10, 2018 11:57 UTC (Wed) by roc (subscriber, #30627) [Link]

Combinations of slow emulators and racks of phones and other devices.

Advances in Mesa continuous integration

Posted Oct 10, 2018 12:09 UTC (Wed) by roc (subscriber, #30627) [Link] (1 responses)

Drivers are supposed to abstract over the hardware but in practice they're all very buggy and users update them sporadically so browsers have to deal with pretty much all possible combinations of hardware and drivers. Firefox and Chrome maintain blacklists of driver versions with known bugs, and work around the bugs (hopefully they report the bugs as well). Often the workaround is "don't use the GPU". Firefox and Chrome also isolate driver usage to a GPU process (Firefox only on Windows so far, but Linux eventually) and detect and recover from user-space driver crashes, by disabling more GPU paths. Video decoding hardware is also part of this problem.

Advances in Mesa continuous integration

Posted Oct 10, 2018 13:34 UTC (Wed) by markjanes (guest, #58426) [Link]

You are right, the very mature testing projects for Firefox and Chrome will also test many code paths in the Linux graphics stack. To my knowledge, i965 Mesa hasn't had many issues in bugzilla that arrive through those systems. We have recently been getting bugs through the test process that Google and Intel have developed for the WebGL conformance tests.

If we see new automation that does a good job identifying bugs, we'd like to incorporate it -- and reduce the amount of work that needs to be done by other communities.

Advances in Mesa continuous integration

Posted Oct 10, 2018 19:15 UTC (Wed) by robclark (subscriber, #74945) [Link] (1 responses)

> From the article (I haven't watched the whole video) it sounds like they're focused on testing software, not hardware. I.e. the hardware is fixed and software changes are tested.

testing driver software, ie. what enables the hardware.. it requires a pretty broad array of hardware to get good coverage of many generations of hardware

Advances in Mesa continuous integration

Posted Oct 10, 2018 23:27 UTC (Wed) by pizza (subscriber, #46) [Link]

Also, with device drivers in particular, it's rather difficult to test out all possible error paths or the seemingly infinite combination of events/timings that the real world supplies with ease. I've lost count of the number of times I've seen "impossible" conditions occurring.

Advances in Mesa continuous integration

Posted Oct 9, 2018 21:58 UTC (Tue) by flussence (guest, #85566) [Link]

Well that'd explain why xf86-video-intel hasn't had a stable release in years… it needs to get on the CI train!

Advances in Mesa continuous integration

Posted Oct 10, 2018 13:08 UTC (Wed) by markjanes (guest, #58426) [Link] (1 responses)

I apologize for that incorrect characterization -- my comments should have been limited to the context of XDC - the Linux graphics stack.

There were some efforts to automate graphics testing before 2014, but they were not open source and were not incorporated into the workflow of Mesa developers as they are now.

As we all know, continuous integration is a practice that has been effectively deployed by all sorts of projects going back to the start of extreme programming well over a decade ago.

Advances in Mesa continuous integration

Posted Oct 11, 2018 0:29 UTC (Thu) by roc (subscriber, #30627) [Link]

No problem, thanks for the clarification.

Advances in Mesa continuous integration

Posted Oct 10, 2018 10:15 UTC (Wed) by error27 (subscriber, #8346) [Link] (2 responses)

I invented the Fixes tag, and I'm so happy to see it used like this. Some people thought it was just a more granular version of CCing stable, but it was always meant to be about improving the process to see how bugs are introduced and get fixed.

Advances in Mesa continuous integration

Posted Oct 10, 2018 10:57 UTC (Wed) by karkhaz (subscriber, #99844) [Link]

Cool! I had not heard of the Fixes tag. To improve its visibility, I wonder if it would be worth suggesting to GitLab that they incorporate it into their web UI? Something like hyperlinks in the list of a project's commits: on a commit that fixes a bug, "This commit fixes a bug introduced in 1d2c3b4a," and on a commit that introduces bugs, "This commit introduced bugs fixed in f4e3d21c and aabbccdd".

There are lots of cool visualisations and other data that a service like GitLab could generate from this if a project applied it consistently, e.g. average time from the introduction of a bug to a fix, who amongst the project contributors fixes the most bugs, etc.

Advances in Mesa continuous integration

Posted Oct 10, 2018 15:28 UTC (Wed) by markjanes (guest, #58426) [Link]

As I mention in the talk, the statistics from this category of analysis can easily be misused, having unintended consequences for the project. The most important use of Fixes is to automate the generation of quality stable releases. The most obvious-but-incorrect use of the tag is to systematically blame developers for their mistakes -- I hope no one thinks this is a good idea.

In the Mesa data, many commits fix "radv: add initial non-conformant radv vulkan driver".

With respect to improving kernel development process, it might be helpful to analyze fixes by subsystem, to identify development practices that could be more widely employed. Another interesting statistic would be mean-time-to-fix, to see what type of kernel testing quickly identifies bugs.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds