Preparing for the realtime future
The realtime microconference at the virtual 2020 Linux Plumbers Conference had a different feel from many of the previous gatherings of the Linux realtime developers. Instead of asking when and how to get the feature into the mainline, two of its sessions looked at what happens after the realtime patches are upstream. That has not quite happened yet, but it is likely for the 5.10 kernel, so the developers were looking to the future of the stable realtime trees and, relatedly, to plans for continuous-integration (CI) testing of realtime kernels.
Stable trees
Since the realtime patch set will be fully upstream "any release now", a plan needs to be made for what will happen with the realtime stable (RT-stable) kernels, Mark Brown said to start his session. Currently, there are RT-stable kernels maintained for each of the long-term support (LTS) stable kernel versions; the realtime patch set is backported to those kernels. But once the patches are in the mainline, there will no longer be a separate realtime tree to backport from.
![Mark Brown](https://static.lwn.net/images/2020/lpc-brown-sm.png)
He wondered if people should simply be told to use the mainline stable kernels if they want the realtime feature. If so, any realtime performance regressions that occur in the stable trees will need to be addressed and those fixes will need to be accepted by the stable maintainers. Realtime developers will need to help with any conflicts that arise in backporting fixes to the stable kernels as well.
Testing is another area that will need to be handled; in particular, realtime performance needs to be tested as part of the stable release process. Right now, Greg Kroah-Hartman largely outsources testing of specific use cases and workloads on stable kernels to those who are interested in ensuring those things continue to function well. Testing of realtime performance will need to be part of that.
Steven Rostedt was volunteered for the testing job by Clark Williams; Rostedt did not exactly disagree, noting that he had done that kind of thing in the past. Automating the realtime testing is something that needs to be done, he said. Ideally, each new stable kernel would be downloaded automatically, built, and run through a series of realtime-specific tests. Brown wryly noted that the next session in the microconference was on CI testing. He also said that it would make more sense to test the stable candidates, rather than the released kernels, so that any problems could be found before they get into the hands of users.
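As a thought experiment, that automation might look something like the following sketch; the stable-rc repository URL is real, but the branch choice, the PREEMPT_RT switch (which assumes the post-merge world), and the run-rt-tests helper are placeholders for whatever the realtime developers settle on:

```sh
#!/bin/sh
# Hypothetical watcher for a stable release-candidate branch: fetch,
# build, then hand the kernel off to a realtime test run.
set -e

RC_TREE=https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
BRANCH=linux-5.4.y    # one watcher instance per LTS branch

# Clone on the first run, fast-forward on subsequent ones.
git clone --depth 1 -b "$BRANCH" "$RC_TREE" linux-rc 2>/dev/null || \
    git -C linux-rc pull --ff-only

cd linux-rc
make defconfig
./scripts/config --enable PREEMPT_RT   # assumes the patches are upstream
make olddefconfig
make -j"$(nproc)"

# Placeholder: boot the result somewhere and run the realtime tests.
run-rt-tests "$(make -s kernelrelease)"
```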
At that point Kroah-Hartman popped up to say that the realtime kernel is not unique in any way; "you're special just like anybody else". He will take regression fixes into the tree as needed and can provide various ways to trigger the building and testing of the kernels for realtime. Rostedt agreed that realtime is not special in any way from the perspective of the stable maintainers; but the realtime developers need to work out how to automate their testing.
Brown said that currently it is up to the RT-stable maintainers to apply the patches to a stable tree and manually test the resulting kernel. Kroah-Hartman suggested adding the realtime testing to the KernelCI infrastructure, so that it will be automatically built and tested whenever a stable candidate is released. Currently, the realtime patches are not merged into the stable tree right away, Rostedt said, because the stable changes often conflict with the realtime patches, but that should not be a problem once it is all upstream.
Getting into KernelCI is "very easy", Kroah-Hartman said, but Brown noted that the kinds of testing that need to be done for realtime are different from those for other parts of the kernel. The realtime tests have performance criteria rather than functional criteria, Williams said. But Kroah-Hartman said that KernelCI has both functional and performance testing now, so there should be no real barrier to adding the realtime tests. Brown agreed, but said that someone needs to get the tests into a form that fits into the infrastructure.
As an example, Rostedt said that he runs a test that builds the kernel over and over again on multiple cores, while also running hackbench multiple times. All of that runs over a weekend, while he runs cyclictest with realtime tasks to record their latencies; he does not expect to find any latencies greater than 50µs. That kind of test would simply need to be packaged up and automated so that it can be run by bots of various sorts.
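A rough sketch of that kind of setup, using the cyclictest and hackbench tools from the rt-tests package; the priorities, loop counts, and durations here are illustrative rather than Rostedt's actual parameters:

```sh
#!/bin/sh
# Load the machine with kernel builds and hackbench while cyclictest
# measures latency from realtime threads on every CPU.

# Background load: kernel builds in a loop...
( while true; do
      make -C ~/linux clean
      make -C ~/linux -j"$(nproc)"
  done ) &
BUILD=$!

# ...plus repeated hackbench runs.
( while true; do hackbench -g 8 -l 10000; done ) &
HACK=$!

# Realtime measurement threads on all CPUs for 48 hours; -b 50 stops
# the run and reports if any latency over 50us is observed.
cyclictest --smp -m -p 98 -i 1000 -D 48h -b 50

kill $BUILD $HACK
```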
Another question is whether realtime should have its own separate staging tree to try out new features, such as a new futex() interface, Rostedt said. Would it make sense to turn the current RT-stable tree into a "testing playground" for new features, he asked. If those features were deemed useful for the mainline, they could be backported to the stable kernels as well. But Williams wondered if it was time to "come back into the fold and not stay out in the cold"; he sees the value in an "RT-next" for development purposes, but does not think it would work well to support these features in earlier kernel series. While it did not come up in the discussion, those kinds of changes might also run afoul of the stable kernel rules about only fixing actual bugs.
Rostedt more or less agreed with Williams but noted that there is a kind of "catch-22" for API design, in that you cannot get a good API without users testing it, but that it is hard to get users to test without having a good API. Williams agreed that there is a problem there, but did not think backporting from RT-next would really help solve it—it is likely to just bring headaches for the realtime developers. Testers could build and use RT-next itself, he said.
The main thing that needs to happen after realtime is in the mainline is to make sure there is a team paying attention to it going forward, Rostedt said. That team would ensure that realtime does not get broken in the stable kernels. Williams asked if there would be designated handlers for realtime bugs, but Rostedt thought that, once again, there is nothing special about realtime once it gets upstream. People will report bugs in the usual fashion, and the stable maintainers will direct the bugs to the realtime developers as needed.
Now is a good time to get the automated testing in place, Sasha Levin said; it is more difficult to do that after the feature is in the mainline. Most of the RT-stable patches will apply automatically on the stable candidates at this point, Williams said, so those can be used to start working up the automated testing strategy. A plan soon formed to use Daniel Wagner's scripts for the 4.4-rt tree as a starting point to try to automatically merge the stable release candidates and the realtime patches; if that succeeds, then testing could be kicked off to see if there are any realtime-specific problems in the resulting kernel. Once realtime is in the mainline, the merging step would simply be dropped.
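The shape of that plan might look like the following sketch (this is not Wagner's actual script; the remote name, branch, and the start-rt-tests trigger are all hypothetical):

```sh
#!/bin/sh
# Try to merge a stable release candidate into the rt tree; only
# kick off testing if the merge applies cleanly.
set -e

cd linux-rt                      # assumed: a checkout of the rt tree
git fetch stable-rc              # assumed: a remote for linux-stable-rc

if git merge --no-edit stable-rc/linux-4.4.y; then
    start-rt-tests               # hypothetical hook to launch the CI run
else
    git merge --abort
    echo "4.4-rt: stable-rc merge needs manual attention" >&2
    exit 1
fi
```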
Continuous integration
As the first session wound down, it segued nicely into a look at CI for realtime in the mainline led by Bastian Germann. There is some automated testing in place for realtime, he noted, though it was apparently not well known: the CI-RT system. It is a Jenkins-based CI system that is tailored to the needs of testing the realtime kernel. There is one known lab running it, at Linutronix (Germann's employer), on hardware donated by members of the Linux Foundation Real-Time Linux project.
Realtime developers can configure tests in CI-RT via a Git repository. The results of the tests are reported on the CI-RT site and also by email to the developer who is running them. The kernels are built on a build server, then booted on the target hardware, which serves as the first level of test. After that, the system runs tests somewhat similar to what Rostedt had described earlier. It uses cyclictest on both idle and stressed systems; the stress is created by hackbench coupled with other processes, such as a recursive grep that will generate a lot of interrupts, he said. The cyclictest results are then recorded for the systems.
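In sketch form, that idle-versus-stressed comparison might be driven like this; the durations, group counts, and the source tree being grepped are illustrative, not CI-RT's actual parameters:

```sh
#!/bin/sh
# First measure latency on an otherwise idle system...
cyclictest --smp -m -p 98 -i 1000 -D 1h -q > idle.txt

# ...then repeat the measurement under load: hackbench stresses the
# scheduler while a recursive grep generates a steady stream of I/O
# interrupts.
( while true; do hackbench -g 8 -l 1000; done ) &
HACK=$!
( while true; do grep -r unlikely /usr/src/linux >/dev/null; done ) &
GREP=$!
cyclictest --smp -m -p 98 -i 1000 -D 1h -q > stressed.txt
kill $HACK $GREP
```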
Once realtime gets into the mainline, the CI-RT system can be used as is, he said, just by reconfiguring the Git source being used. Beyond the mainline itself, there are some other trees that should get tested, including some that came out in the previous session, Germann said. The current release candidate for the mainline and linux-next should be tested; the stable kernels should be tested as well, including their release candidates as was discussed. The test frequency and duration will need to be established for each tree; for example, he suggested that linux-next could be tested for eight hours every night.
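For example, the nightly linux-next run could be as simple as a cron entry along these lines, where rt-ci-run stands in for whatever harness ends up driving the tests:

```
# Hypothetical crontab entry: an eight-hour realtime run against
# linux-next, started at 22:00 every night.
0 22 * * * /usr/local/bin/rt-ci-run linux-next --duration 8h
```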
No other CI systems currently run realtime tests, he said, though Brown wants to get them working in KernelCI. Germann said that more labs should be testing the realtime kernel once it gets merged. That will cover more hardware as well as raise the awareness of realtime among kernel developers. In order for that to happen, the realtime project needs to support other CI systems; KernelCI support is in the works, but he asked if there are other test or CI systems that should have support for realtime tests.
After something of a digression into how to handle signing Git tags in an automated fashion, which was deemed undesirable, Nikolai Kondrashov suggested that CI-RT send its reports to KernelCI. He and others are working on collecting and unifying test results in a common database.
Germann asked about the kinds of data that could be collected; ideally, CI-RT would want to present more than just a "pass" or "fail" and would include the latency measurements that were used to make that determination. Currently, the schema only provides a way to report the status of the test, Kondrashov said, but there is a way to attach additional data. The project is trying to work with the developers and operators of the various testing systems to determine what additional information should be added to the JSON schema. Veronika Kabatova mentioned that the Red Hat Continuous Kernel Integration (CKI) project would be willing to start running realtime tests, which would come with integration into the KernelCI unified reporting for free.
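A test record carrying that extra information might look something like this fragment; it is purely illustrative, and apart from the pass/fail status, every field name here is an assumption rather than the actual KernelCI schema:

```json
{
    "id": "ci-rt:cyclictest-20200909-1",
    "origin": "ci-rt",
    "path": "rt-tests.cyclictest",
    "status": "PASS",
    "misc": {
        "max_latency_us": 42,
        "threshold_us": 50
    }
}
```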
Mel Gorman said that SUSE also runs a Jenkins-based CI system that uses some of the realtime tests as part of its performance testing. He had some suggested configurations for his MMTests that could be used to help with realtime testing. Those could be combined with hackbench or kernel compilation runs and cyclictest to determine if the realtime latency requirements are being met. It might make sense to integrate the realtime tests into some other existing testing client framework (such as MMTests), rather than trying to make multiple versions of those tests targeted at each different CI system, he said.
The various CI efforts tend to congregate in the #kernelci channel on freenode or on the automated-testing@lists.yoctoproject.org mailing list. Attendees plan to work with those groups to determine the right path forward in order to get more CI testing for the realtime kernel. Once the realtime patches are finally merged, the CI-RT system should provide a good starting point for CI testing moving forward.
As noted, these sessions had a rather different focus than most of those in the past. The final merging of the realtime patch set will make a big difference in how the project interacts with the rest of the kernel and the overall kernel ecosystem. It is important to get ahead of the game with plans for stable-tree maintenance, along with ideas on how to make sure that the feature stays functional in the fast-moving mainline. The microconference would seem to have helped with both.
| Index entries for this article | |
| --- | --- |
| Kernel | Development model/Stable tree |
| Kernel | Development tools/Testing |
| Kernel | Realtime |
| Conference | Linux Plumbers Conference/2020 |
Posted Sep 10, 2020 7:47 UTC (Thu) by metan (subscriber, #74107)

These are mostly unmaintained and rotting; it would be nice if someone would have a look to see if we should keep them or not or, even better, help to find these test cases a new home, in case they are useful.
Posted Sep 10, 2020 8:02 UTC (Thu) by weberm (guest, #131630)

https://www.osadl.org/QA-Farm-Realtime.linux-real-time.0....
Posted Sep 10, 2020 17:23 UTC (Thu) by glenn (subscriber, #102223)
Click on a system link in the "Box" column to view a detailed report about the hardware, interrupt configuration, .config, and a latency histogram (click the link "Display most recent latency plot").
There is very little documentation on best practices for configuring a PREEMPT_RT kernel. I frequently use the OSADL configs as a point of reference.
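For instance, a few of the options one might cross-check against an OSADL reference .config (an illustrative fragment, not a recommended configuration; CONFIG_PREEMPT_RT_FULL is the symbol used by the out-of-tree patch set):

```
# Illustrative fragment only; take the full settings from an OSADL
# reference configuration for similar hardware.
CONFIG_PREEMPT_RT_FULL=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_NO_HZ_FULL=y
# CONFIG_DEBUG_PREEMPT is not set
```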
Posted Sep 13, 2020 23:04 UTC (Sun) by cemde (subscriber, #51020)
Yes, everybody wins. The continuous integration done by the kernel developers on the one hand and the RT system engineering that we provide on the other are two different activities with a well-defined handover. We start our work when a new RT kernel is available, compiles flawlessly, and runs for a while with the expected RT capabilities. But whether this kernel behaves as expected on a particular hardware and firmware combination in an industrial device over long periods of time is another matter. Skylake, for example, took us more than a year to get a suitable BIOS and to work out appropriate settings of BIOS variables. And even Apollo Lake, which Intel equipped with special RT properties, took us about six months to convince BIOS and board manufacturers to actually implement the recommended variables and let us select and configure them. Or take the RPi as another example: it didn't work with RT until OSADL took care of it and published a patch (https://www.osadl.org/Single-View.111+M5c03315dc57.0.html). You cannot expect this work to be done by the RT kernel developers; this is why you need OSADL.

We know that some of the imperfections of the current RT patches will only be solved after they are completely merged into the mainline, because that is the only way subsystem maintainers can test their changes against RT early enough. Until then, we try to do our best to support the merging activities and not jeopardize them by submitting our "exotic errors"; instead, we search for workarounds by not using certain configurations and functionality. People may then use the configuration of the respective QA Farm system to create a state-of-the-art Linux RT kernel.

When some of the diagnostic and auxiliary functions that we need for our assessments, but that may require extra work to merge into the mainline, were dropped from the official RT patches, we decided to take over their maintenance (https://www.osadl.org/?id=2943).

A number of systems in our QA Farm run in so-called shadow mode. These are twin systems with identical hardware and software that are ideally suited to testing the effect of changing a single variable. We often use these shadow systems to study the effect of a particular setup, such as hyperthreading, graphical mode, or various kernel configurations, on RT capabilities, and we document the findings in technical assessments. Some of them have been made available to the LF Real-Time Linux collaborative project on request.

In conclusion, there is nothing wrong with having separate organizations take care of the various stages and aspects of the RT patches in the interest of all of us. We are glad that users of the RT patches continue to join OSADL as members and help us grow and thus do even more, and better, for the Linux RT community.
Posted Sep 14, 2020 6:02 UTC (Mon) by xz (guest, #86176)
Where can I learn about these properties and their recommended values?
Posted Sep 14, 2020 18:18 UTC (Mon) by cemde (subscriber, #51020)

> Where can I learn about these properties and their recommended values?

Depends on the BIOS. You may wish to contact your board and/or your BIOS manufacturer.
Posted Sep 13, 2020 21:22 UTC (Sun) by cemde (subscriber, #51020)
BTW: Even if the host kernel is RT, a fully virtualized guest cannot have RT capabilities irrespective of what kernel you use.
The article(s)

Posted Sep 11, 2020 21:20 UTC (Fri) by corbet (editor, #1)

That's not just one article, though...that's 16 years of history, much of which has been lovingly chronicled on these pages. Look over here and enjoy...:)
Posted Sep 11, 2020 16:27 UTC (Fri) by alison (subscriber, #63752)
I see no mention of tglx in the article! I hope that he is well.