A report from the Realtime Summit

November 6, 2017

This article was contributed by Mathieu Poirier

The 2017 Realtime Summit (RT-Summit) was hosted by the Czech Technical University on Saturday, October 21 in Prague, just before the Embedded Linux Conference. It was attended by more than 50 individuals with backgrounds ranging from academic to industrial, and some local students daring enough to spend a day with that group. What follows is a summary of some of the presentations held at the event.

Beyond what is covered here, Thomas Gleixner, who is the lead for the Real-Time Linux project, gave an update on that project's status in a session that was covered in a separate article.

Realtime trouble, lessons learned, and open questions

Gratian Crisan began his presentation on the problems his group has run into while using the realtime patch set by underlining that the work presented was done by multiple people and that some of it is admittedly hackish. He then listed the problems his group recently encountered and how they were addressed.

The first problem originates from a sequence of write operations to a memory-mapped input/output (MMIO) region that is followed by a read operation. The I/O devices (examples of the e1000e and tpm_tis drivers were presented) will usually be connected to a bus running at a lower frequency and with a different bit width than the CPU's. Buffering and arbitration are required in the I/O fabric, which causes write operations to be queued up along the way. When a long sequence of writes to an MMIO region is followed by a read operation from the same region, ordering guarantees mandate that the read operation wait until all writes are flushed before it can complete. This stalls the CPU in the middle of the MMIO read instruction, preventing the servicing of timer interrupts. Realtime priority threads running on the CPU will then wake up late because the timer interrupt was delivered late.

To address the situation, long stretches of MMIO operations can be broken up by introducing delays, allowing time for the writes to be committed to the device and for high-priority realtime threads to preempt the driver code so that they can execute. Another way around this is to follow each MMIO write with an MMIO read when a kernel is configured with the PREEMPT_RT_FULL option. That way, the amount of time a CPU is stalled is bounded since only one store operation has to complete. Crisan then asked if the same problem has been seen elsewhere, something that Gleixner confirmed. There is no known solution other than education via the realtime wiki site and testing.
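The write-then-read approach can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names are hypothetical, the "registers" are ordinary memory standing in for an ioremap()ed region, and kernel code would use writel() and readl() rather than plain volatile accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write one value to a device register, then read the register back.
 * On real hardware the read completes only after the posted write has
 * reached the device, so the CPU never waits for more than one
 * outstanding write.
 * ASSUMPTION: 'reg' stands in for a mapped MMIO address.
 */
static inline uint32_t mmio_write_flush(volatile uint32_t *reg, uint32_t val)
{
    assert(reg != NULL);
    *reg = val;     /* posted write, may sit in a bus buffer */
    return *reg;    /* read-back forces the write to complete */
}

/* Program 'n' consecutive registers, flushing after every write. */
static inline size_t mmio_write_block(volatile uint32_t *regs,
                                      const uint32_t *vals, size_t n)
{
    size_t flushed = 0;

    assert(regs != NULL && vals != NULL);
    for (size_t i = 0; i < n; i++) {
        (void)mmio_write_flush(&regs[i], vals[i]);
        flushed++;
    }
    return flushed;
}
```

The trade-off is throughput: every write now pays for a round trip to the device, in exchange for a bounded stall.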

The second problem relates to the aggregation of high-resolution timers (hrtimers) from SCHED_OTHER threads and the large amount of latency that processing their wakeups induces on a system. Crisan provided a small code snippet that reproduces the pattern observed in the real use case. A patch that moves all hrtimer wakeup processing for non-realtime tasks to the softirq thread has been submitted to fix the problem.

The third problem stems from a lack of priority inheritance support in the standard glibc pthread library; only the pthread_mutex_*() functions are tailored to handle priority inheritance. A short discussion in the audience revealed that not much has happened on the topic since last year's realtime summit and a comment in the bugzilla entry notes that there is no solution in sight.

The fourth and last problem revolves around the management of interrupt thread priorities and the risk of priority inversion if the interrupt handler needs to do a bus transfer that involves another interrupt thread. It is hard to associate an interrupt number with its corresponding kernel interrupt thread process ID so that its priority can be configured properly. A related problem pertains to the configuration of priorities for threads that are not created at boot time. For example, some ethernet drivers create an interrupt thread dynamically when a cable is plugged in. A patch that adds the ability to call poll() on /proc/interrupts has been created to address the issue. From there a daemon or service like rtctl can react to changes and assign the right priority to the newly created interrupt thread. A sysfs interface has been proposed as a better alternative.
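Until such an interface exists, one workaround is to match threads by name, since the kernel names interrupt threads "irq/<number>-<name>". The sketch below is an assumption about how a priority-management daemon might do this, not a description of rtctl; both function names are hypothetical. The caller would obtain the name from /proc/<pid>/comm.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Extract the IRQ number from an interrupt thread name such as
 * "irq/45-eth0". Returns 1 and stores the number in *irq on a match,
 * 0 otherwise (*irq is then left untouched).
 */
int irq_from_thread_name(const char *comm, int *irq)
{
    int nr;
    char sep;

    assert(comm != NULL && irq != NULL);
    if (sscanf(comm, "irq/%d%c", &nr, &sep) != 2 || sep != '-' || nr < 0)
        return 0;
    *irq = nr;
    return 1;
}

/* Give thread 'pid' a SCHED_FIFO priority; needs suitable privileges. */
int set_fifo_priority(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

Scanning /proc this way is inherently racy for threads created at runtime, which is why a notification mechanism is being sought.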

The presentation ended with advice for people when working with realtime systems. First, always check the kernel configuration after a kernel upgrade as some options may have changed that introduce unwanted latency. Second, verify that clock sources are properly configured so that realtime latency is kept to a minimum. Using a trusted clock source when instrumenting the kernel is also important to keep in mind. That way one can trust that real latency problems are being investigated rather than a side effect of the tracing code behaving differently after an upgrade. Last, but not least, never underestimate the value of running reboot tests. They have proven to unearth interesting conditions that often lead to malfunctions.
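The clock-source check is easy to automate after each upgrade. The sketch below reads the active clock source from sysfs and compares it with an expected name. The helper names are hypothetical, and the expected value (for example "tsc" on many x86 systems) is an assumption that depends on the platform.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Usual sysfs location of the active clock source on Linux. */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Read the first line of 'path' into 'buf', dropping the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(path != NULL && buf != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Return 1 if the clock source in 'path' is 'expected', 0 if not, -1 on error. */
int clocksource_is(const char *path, const char *expected)
{
    char cur[64];

    assert(expected != NULL);
    if (read_first_line(path, cur, sizeof(cur)) != 0)
        return -1;
    return strcmp(cur, expected) == 0;
}
```

A boot-time test could call clocksource_is(CLOCKSOURCE_PATH, "tsc") and flag any other result for investigation.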

Using Coccinelle to detect and fix nested execution context violations

In her presentation Julia Cartwright focused on problems related to operations executed in a context where they are not valid and how she used Coccinelle to find them.

The first context violation of interest to Cartwright is code executing within an interrupts-disabled region that calls spin_lock(). When the realtime patch set is applied, spinlocks are turned into sleeping locks and will eventually call schedule(), which is something that triggers a "sleeping while in atomic context" complaint from the kernel lock debugging mechanism.
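The rule being checked can be illustrated with a toy model. The sketch below is not Coccinelle and is not how Cartwright's scripts work; it only walks a flat list of function names (a hypothetical "trace") and counts spin_lock() calls made while interrupts are disabled. Calls to raw_spin_lock() are not counted, since raw spinlocks keep spinning on a realtime kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Count spin_lock() calls that occur between local_irq_disable() and
 * local_irq_enable() in a linear sequence of call names.
 * local_irq_enable() re-enables interrupts unconditionally, so a flag
 * (not a nesting counter) tracks the state.
 */
size_t count_irqs_off_spin_locks(const char *const calls[], size_t n)
{
    size_t violations = 0;
    bool irqs_off = false;

    assert(calls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(calls[i], "local_irq_disable") == 0)
            irqs_off = true;
        else if (strcmp(calls[i], "local_irq_enable") == 0)
            irqs_off = false;
        else if (strcmp(calls[i], "spin_lock") == 0 && irqs_off)
            violations++;
    }
    return violations;
}
```

Real code has branches, helper functions, and indirect calls, which is why a semantic tool that understands C is needed in practice.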

The second context violation also involves calls to spin_lock(), but this time from interrupt dispatch routines that are running in hardirq context. Once again, implicitly or explicitly calling schedule() will result in a warning from the kernel. From Cartwright's point of view, occurrences of the above context violations in the kernel come from developers not understanding when to use raw spinlocks and how to properly use preempt_disable().

Finding even obvious context violations in the kernel is arduous and calls for help from runtime and static-analysis tools, which leads us to Coccinelle. Cartwright then went on to give a short introduction to Coccinelle and present the scripts she used to address various context violation scenarios. To date, 38 context violations have been identified in the mainline kernel (as of v4.14-rc5) and 22 in the realtime patch set. Now that the context violations have been located, the hard work of addressing each one remains, since each case demands careful assessment and a tailored solution.

SCHED_DEADLINE: what's next?

The goal of this talk was to present the latest developments in the area of deadline scheduling, along with ongoing work and topics being considered for future enhancement.

Claudio Scordino started with the greedy reclaiming of unused bandwidth (GRUB) algorithm that was merged in the 4.13 kernel. The motivation is for scenarios where deadline tasks that need more bandwidth (CPU time) than reserved when they entered SCHED_DEADLINE may use bandwidth from other deadline tasks that haven't fully used their own reservation. The main requirement is obviously to do so without breaking any reservation guarantees. Scordino went on to give a summary of how GRUB works and presented graphics showing the positive effect of the feature.
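The core idea can be shown with the simplified textbook form of the reclaiming rule, in which a running task's budget is depleted at a rate equal to the total active utilization rather than at a rate of one. The sketch below is not the kernel's code, which is more involved; the function name and the permille unit are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BW_UNIT 1000u   /* utilization in permille (assumption) */

/*
 * Budget to charge a deadline task that has just run for 'delta_ns'.
 * Without reclaiming the charge would be delta_ns itself. The simplified
 * GRUB rule charges delta_ns * U_act, where U_act is the summed
 * utilization of the deadline tasks currently active on the CPU.
 */
uint64_t grub_charge_ns(uint64_t delta_ns, uint32_t u_active_permille)
{
    assert(u_active_permille <= BW_UNIT);
    return delta_ns * u_active_permille / BW_UNIT;
}
```

With only 25% of the bandwidth active, 1ms of execution costs 250µs of budget, so the task can run four times longer than its reservation alone would allow. When every reservation is active, the charge is the full 1ms and no guarantee is affected.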

The presentation continued with work that is currently under development, more specifically the integration with the schedutil CPU-frequency governor so that deadline tasks may run at frequencies below the maximum operating point. The idea is to use the reclaimed bandwidth metrics provided by GRUB to lower the frequency of the system and scale the runtime reservation according to the current operating point. Graphics showing experimental results between GRUB-PA (power aware) and a mainline kernel were presented, highlighting the possibility of honoring bandwidth reservation at still lower frequencies. A new version of the patch set is expected to appear on the mailing lists in the coming weeks.
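The two halves of that idea, picking a frequency from the utilization and scaling the accounted runtime to match, can be sketched as follows. This is an illustration only: the function names are hypothetical, and a real governor also adds headroom and snaps to the operating points the hardware supports.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lowest frequency that still provides 'util_permille' of the capacity
 * available at 'max_khz', rounded up so the reservation is never short.
 */
uint32_t min_freq_khz(uint32_t util_permille, uint32_t max_khz)
{
    assert(util_permille <= 1000);
    return (uint32_t)(((uint64_t)max_khz * util_permille + 999) / 1000);
}

/*
 * Convert 'delta_ns' of wall-clock execution at 'cur_khz' into the
 * equivalent execution time at 'max_khz'. Charging this scaled value
 * keeps a runtime reservation expressed at full speed meaningful when
 * the CPU is running slower.
 */
uint64_t scale_exec_ns(uint64_t delta_ns, uint32_t cur_khz, uint32_t max_khz)
{
    assert(max_khz > 0 && cur_khz <= max_khz);
    return delta_ns * cur_khz / max_khz;
}
```

For example, a CPU whose active deadline utilization is 50% could run at half its maximum frequency, and 2ms spent at that speed would be charged as 1ms of the reserved runtime.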

Next up was the hierarchical group scheduling feature, first published in March. Scordino asked the audience for guidance on the expected behavior and whether the sysfs interface to configure the feature should be modified. That was followed by a discussion between the scheduler maintainers and attendees from Scuola Sant'Anna University on how best to implement the admission test on a hierarchical scheduler in order to improve the current, conservative, worst-case scenario implementation. The Scuola Sant'Anna University attendees had some ideas and plan to follow up with a proposal.

The semi-partitioning scheduling feature was presented by Daniel Bristot de Oliveira. It concentrates on use cases where a task would fail the current deadline acceptance test while bandwidth is still available in the system. The proposal is to use the theoretical approach for static task partitioning and try to achieve the same results for dynamically scheduled tasks. It works by splitting a task between CPUs at runtime, but keeping the reservation proportion as it was when first accepted by the system.
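A toy example shows the kind of situation being targeted. With two CPUs that each have 50% of their bandwidth free, a task needing 60% fits on neither, yet the system as a whole has room for it. The sketch below is not the algorithm from the proposal; the names and the permille unit are assumptions, and it only deals with bandwidth, ignoring the timing constraints that keep the pieces of a split task from running concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first CPU able to host the whole task, or -1 if none can. */
int first_fit(const unsigned int free_bw[], size_t ncpus, unsigned int need)
{
    assert(free_bw != NULL);
    for (size_t i = 0; i < ncpus; i++)
        if (free_bw[i] >= need)
            return (int)i;
    return -1;
}

/*
 * Split 'need' (permille of one CPU) across the CPUs, filling each in
 * turn. share[i] receives the slice placed on CPU i. Returns 1 if the
 * whole utilization was placed. Returns 0 if the system lacks bandwidth,
 * in which case every share is zeroed so nothing is partially admitted.
 */
int split_task(const unsigned int free_bw[], unsigned int share[],
               size_t ncpus, unsigned int need)
{
    unsigned int left = need;

    assert(free_bw != NULL && share != NULL);
    for (size_t i = 0; i < ncpus; i++) {
        share[i] = free_bw[i] < left ? free_bw[i] : left;
        left -= share[i];
    }
    if (left == 0)
        return 1;
    for (size_t i = 0; i < ncpus; i++)
        share[i] = 0;
    return 0;
}
```

In the two-CPU case above, first_fit() fails while split_task() places 50% on the first CPU and the remaining 10% on the second.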

Scordino then discussed areas of future development. One of those is the idea of bandwidth reclaiming by demotion, where a deadline task is demoted to the SCHED_OTHER class rather than being throttled at the end of its time budget. Another is throttled signalling, where user-level signals are sent to throttled tasks. Scordino concluded by expressing the desire to see a closer relationship between deadline scheduler developers and the community. The goal is to come up with development priorities so that time isn't wasted on features that aren't important to users.

Future of tracing

Steven Rostedt said that he wanted to use his time for a conversation on the future of tracing rather than a classic presentation. In order to give context to that discussion, though, he started by giving an overview of the tracing infrastructure and the various features it supports.

Rostedt then moved on to cover some of the features currently under development, such as more advanced histogram support with full customization, synthetic events, IRQ/preempt disable events, and the storage of variable events. Another area being worked on relates to the tracing of functions called by kernel modules that aren't already loaded so that tracing would start when the modules get loaded.

On his wish list, he would like the ability to have zero overhead on lock events as well as more interaction between eBPF and Ftrace. Also on the list is the capability to trace function parameters and convert the trace.dat file (from trace-cmd) to the common trace format (CTF). There is work on KernelShark to get rid of GTK2 and convert it to use Qt and there are plans to add plugins for customized views and flame graphs.

Rostedt asked for input from the audience, something that led to a request for adding support to feed the output of an Ftrace dump to KernelShark for further analysis.

In closing

This event has once again proven to be helpful to the realtime Linux community. A good number of presentations triggered conversations between audience members on the common problems they face, the way to address them, and some of the remaining challenges to overcome.

[I would like to thank the speakers for their time reviewing this article and the valuable input they have provided.]

Real-Time Summit Videos

Posted Nov 6, 2017 19:49 UTC (Mon) by k.stewart (subscriber, #32117)

For those that would like more details on any of the topics covered in this article, the videos from the Real-Time Summit are now available at:
https://www.youtube.com/playlist?list=PLbzoR-pLrL6r4xoc1P...