
A realtime developer's checklist

November 16, 2020

This article was contributed by Marta Rybczyńska


OSS EU

Realtime application development under Linux requires care to make sure that the critical realtime tasks do not suffer interference from other applications and the rest of the system. During the Embedded Linux Conference (ELC) 2020, John Ogness presented a checklist (slides [PDF]) for realtime developers, with practical recipes to follow. There are a lot of tools and features available for realtime developers, even on systems without the PREEMPT_RT patches applied.

What is realtime?

We want all applications to be correct and bug-free, Ogness began; in the realtime domain, correctness "means running at the correct time". The application must wake up within a bounded time limit when there is time-critical work to do. Ogness highlighted that, in realtime systems, the right timing of tasks is a requirement; things will go wrong if the constraints are not met. Developers need to define which tasks and applications are time-critical; he noted that a lot of people mistakenly think that all tasks in a realtime system are realtime, while most of them are not.

The good news for developers is that, under Linux, they can write realtime applications using only the POSIX API with the realtime extensions. The code will look familiar, and only three additional header files are required: sched.h (a member of the audience noted that the musl C library does not implement this one), time.h, and pthread.h.

There are three properties that a realtime operating system must have: deterministic scheduling behavior, interruptibility (the CPU is always running something, so there should be a way to interrupt a task), and a way to avoid priority inversion, which happens when a high-priority task must wait for a lower-priority one. The third property, which might be less familiar to non-realtime developers, was described with an example.

Consider three tasks (task1, task2, and task3 with high, medium, and low priority, respectively). Task3 is holding a lock when task1 comes along, takes the CPU, and requests that lock. As task1 is high priority, the scheduler puts task3 back on the CPU so that it can finish its work and release the lock. Before that happens, though, task2 comes along. It has nothing to do with the lock, but it has a higher priority than task3. The scheduler then makes a seemingly reasonable decision to give the CPU to task2, but that indirectly blocks task1.

In this case we have a priority inversion between task1 and task2. A situation like that can come about easily in a complex system like Linux, Ogness noted. The way Linux handles priority inversion is to temporarily boost the priority of task3 to the priority level of task1. When task3 gives back the lock, it will be de-boosted and task1 can run with the lock.

Scheduling and affinity

Ogness started his realtime checklist with scheduling policies, which vary between the realtime and non-realtime domains. Non-realtime policies implement fixed time slices; if an application is running an infinite loop, it will still lose the CPU when its time slice runs out. On a realtime system, most tasks, including logging daemons, web servers, and databases, will still use non-realtime policies. "These processes should not be running as realtime", he said. In addition, the developer can configure those non-realtime tasks to limit how much CPU they can have using nice and control groups; for example, a web browser can be constrained to never take more than 20% of the available CPU time. He "strongly encourages" developers to evaluate all tasks, not only the realtime ones, for their resource requirements.

Realtime tasks, instead, normally run until they give up the CPU or are preempted by a higher-priority task. If there is a task with priority 30, it will run until a task with a higher priority comes in (like priority 31). For realtime tasks, one needs to be especially careful when writing the code, and to avoid infinite loops, which can render the system unusable.

The SCHED_FIFO realtime scheduling policy is what people typically use, Ogness said. Tasks running under this policy execute until they block (waiting) or voluntarily give up the CPU. Priorities range from one to 99 (highest). "Please never use 99", Ogness said, as kernel threads using this priority are "more important than your application". Another policy, SCHED_RR, is similar, but it uses time slices for tasks at the same priority level. The third scheduling policy is SCHED_DEADLINE, where the scheduler runs the task with the closest deadline. Ogness did not cover this policy, but noted that, if it is used, SCHED_DEADLINE wins over the highest-priority tasks from other scheduling policies; mixing scheduling classes is "strange".

Ogness pointed out one important kernel feature with regard to realtime scheduling policies: by default, the scheduler limits the amount of CPU time allocated to all realtime tasks together to 95% of the available time per second. In the remaining 50ms of each second, no realtime task is allowed to run. This policy gives an administrator a chance to intervene if a realtime task goes out of control. This situation should be avoided, though, since it constitutes a sort of priority inversion. He noted that, if the kernel hits the limit, it prints a throttling message in the kernel log, but the realtime system is already broken at this point. This feature can be disabled by writing -1 into /proc/sys/kernel/sched_rt_runtime_us. The change is not permanent, so it needs to be added to boot scripts.

Developers can set priorities and scheduling policies using the chrt tool; the -p option sets the priority of a given task. The tool can also be used to start an application with a given priority. The same operations are possible in the code by using the sched_setscheduler() system call.

CPU affinity is another important element in a realtime system. It might be critical to isolate code onto some CPUs. Ogness gave an example of an eight-CPU system divided into six CPUs for non-realtime applications and two for realtime. In Linux, CPU affinity is defined for each task and takes the form of a bitmask with one bit set for each allowed CPU. In addition to user-space tasks, interrupts and kernel threads can also have their affinity set. This is important, as they may interfere with realtime code. Ogness noted that the internal architecture of a processor will influence the realtime configuration. If two CPUs are sharing L2 caches, then they should be both realtime, as cache sharing between realtime and non-realtime applications may have an effect on realtime latency.

The tool to set and query affinities is taskset, which can either start a task or modify an existing one. taskset also applies to threads. If run without a new mask, it will show what the current mask is. As with priorities, an associated system call exists: sched_setaffinity(). Ogness noted that it requires that _GNU_SOURCE be defined, since the sched_setaffinity() wrapper is a GNU C library (glibc) extension and not part of the POSIX API.

It is also possible to influence CPU affinity in a deeper way using the maxcpus and isolcpus boot parameters. maxcpus limits the number of CPUs the kernel can see. If set to four in an eight-CPU system, it means that the kernel will see only four processors and the other four CPUs can be used differently, such as by a bare-metal realtime application. On the other hand, isolcpus indicates to the kernel that it should not put kernel threads on the indicated processors. When using isolcpus, Linux is aware of all of the CPUs and can use the isolated CPUs when threads are explicitly set to run on them.

For interrupt affinities, the default settings for new interrupts can be viewed and changed in /proc/irq/default_smp_affinity. For an existing interrupt, intr, the file /proc/irq/intr/smp_affinity allows doing the same. Developers should be aware, Ogness said, that there might be hardware limitations when setting affinities; after setting an interrupt affinity they should consult /proc/irq/intr/effective_affinity to check what was set after taking all limitations into account.

Avoiding page faults

Memory management is "probably most important" in terms of latency, Ogness said. He explained that, when an application allocates memory, it is not really allocated in the kernel, but only marked in the page tables. The actual allocation takes place when that memory is first accessed, resulting in a page fault. Page faults are "quite expensive", as the kernel needs to find a physical page and assign it before the application can continue. Developers can verify this by observing that malloc() calls are fast, but the application slows down when it starts using the allocated memory for the first time. This applies not only to the heap, but to all memory, including code and the stack. Moving to the next stack page will cause a page fault just like the allocation of new memory.

Ogness described three steps to avoid page faults. The first is to tune glibc to use the heap only for new allocations. malloc() has two ways to allocate memory: the heap or mmap() for separate chunks. Realtime developers do not want separate chunks, as they may go back to the kernel after free(); then if they are reallocated they will fault again. The developers can direct glibc to use heap memory only by using mallopt():

    mallopt(M_MMAP_MAX, 0);

Another useful option is to disable the possibility that glibc gives unused memory back to the kernel:

    mallopt(M_TRIM_THRESHOLD, -1);

The second step is to lock down allocated pages so that they "never get back to the kernel". This can happen when the kernel reclaims memory or starts swapping. Embedded developers may think their system will not swap, but this is not true, Ogness explained. When low on memory, the kernel starts reclaiming any memory it can, including the text segments of applications. For realtime "this is horrible", because when the application runs again it will need to get the pages back from the disk. To lock all pages into memory, a realtime application should use mlockall():

    mlockall(MCL_CURRENT | MCL_FUTURE);

The third and final step is to perform pre-faulting, which means causing all page faults ahead of time and having all memory "ready to go". Heap pre-faulting is done by allocating all of the memory that might be needed, then writing a non-zero value to each page to make sure that the writes are not optimized away. Stack pre-faulting should be done in a similar way by creating and writing a big stack frame, something that developers have "learned never to do" in general.

Synchronization without priority inversion

In realtime systems, developers should use POSIX mutexes for locking, Ogness said. This is important because mutexes have owners, and only the owner can release the lock. This gives the kernel the information needed to boost a lock-holding task when a higher-priority task needs that lock. By default, however, priority inheritance is not activated (even with PREEMPT_RT, he pointed out), so the developers need to activate it to avoid priority inversion using:

    pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);

For signaling between threads, Ogness recommended condition variables, as they can be associated with mutexes. If an application needs to wait and then take a lock, then pthread_cond_wait() is the tool for the job. Realtime developers should avoid signals, because the environment that a signal handler runs in is hard to predict.

Clocks and measurements

"Use the monotonic clock", Ogness recommended. This is the clock that always moves forward, without regard to time zones and leap seconds. The absolute time it gives is ideal for calculating when the task should next wake up. The technique he presented consists of getting the current time once, at the beginning of the program, and then just incrementing that timestamp thereafter. This way the task will wake up at the right moment "even in 10 years", he said.

The last part of the talk covered tools that can be used to evaluate a realtime system. The first two tools come from the rt-tests package. cyclictest measures latencies at a given priority level. It repeatedly sleeps, then checks the clock to measure the difference between the expected and real wake-up times. This tool can prepare histograms with the resulting data. Developers will want to run it with a high load on the system to measure the latencies that the system may generate. The other tool, hackbench, can generate system load with packets, CPU load, and even out-of-memory events. A realtime system should run well even in those conditions. Ogness noted that developers should be doing stress tests for all components; for example, if the system uses Bluetooth, then it requires a Bluetooth test. He also reminded developers to include idle-mode testing, as the system latency might be different when going into low-power modes.

perf can also help developers, as it can show the number of page faults and cache misses. Ogness mentioned the kernel tracing infrastructure that can help to show not only what happens during an application's execution, but also when. A member of the audience asked if there is a way to calculate the worst response time of a task, and Ogness replied that the only solution is to test.

Ogness finished with a list of kernel configuration options to check. For example, CONFIG_PREEMPT_* activates more kernel realtime properties (more still if the PREEMPT_RT patches have been applied). CONFIG_LOCKUP_DETECTOR and CONFIG_DETECT_HUNG_TASK run at realtime priority 99, so they should be disabled if not explicitly needed. CONFIG_NO_HZ eliminates kernel clock ticks and can reduce power consumption, but it can also increase latency. He said that setting any particular option is not necessarily an error, but they need to be analyzed and checked; there might be decisions to make for example between power saving and realtime.

At the end of the talk, he walked through the checklist again, highlighting the need to verify the realtime behavior. For the members of the audience who would like to learn more there is a realtime wiki. "Have fun with realtime Linux", he concluded.

Index entries for this article
GuestArticles: Rybczynska, Marta
Conference: Open Source Summit Europe/2020



A realtime developer's checklist

Posted Nov 16, 2020 21:31 UTC (Mon) by dancol (guest, #142293) [Link]

Keep in mind that even if *your* real-time application pre-allocates and locks all memory, the *kernel* code you invoke via system calls isn't written with the same discipline in mind. Kernel mutexes, for example, are not real-time mutexes by default, so you can still get into priority inversion situations with the rest of the system by invoking kernel code that takes locks owned by non-RT tasks -- for example, the task list lock or an inode lock. (The RT patchset is meant to mitigate this problem.)

On a vanilla kernel, you can use *some* OS services without risking priority inversion, but you have to be careful about which ones they are and audit the kernel-side code paths you invoke and make sure that one of your fancy important SCHED_FIFO threads isn't going to get blocked behind some slowpoke non-RT part of the system because the SCHED_FIFO thread made a system call that takes a lock while it's running in the kernel.

If I had my way, we'd just apply the RT patchset universally, *and* make pthreads in glibc use PI by default, *and* implement surrogate execution so that priority inheritance did something for SCHED_OTHER threads (i.e. normal non-RT processes). (Right now, priority inheritance is useless for SCHED_OTHER.) I'm not likely to get my way because my proposal would tank throughput on batch workloads, but personally, I value consistency of performance over throughput.

A realtime developer's checklist

Posted Nov 17, 2020 3:58 UTC (Tue) by glenn (subscriber, #102223) [Link] (4 responses)

> ...mixing scheduling classes is "strange".

On the RT kernel as-is, this is strange. However, you should be able to schedule unrelated threads on disjoint sets of CPUs under different scheduling policies (i.e., a SCHED_FIFO cluster and a SCHED_DEADLINE cluster).

There are theoretical grounds for mixing SCHED_FIFO- and SCHED_DEADLINE-style schedulers on shared CPUs. This technique is used in some mixed-criticality schedulers. These schedulers also consider the application-level importance of real-time threads, in addition to scheduling priority (application importance != scheduling priority!!). For example, a simple mixed-criticality scheduler might schedule application-critical threads with fixed priorities (e.g., SCHED_FIFO) on single CPUs, while scheduling less-application-critical, but still real-time, threads with SCHED_DEADLINE and letting these migrate among all CPUs. We can't do this today with the current Linux scheduler, since SCHED_DEADLINE preempts SCHED_FIFO. This could be done if we were able to reorder the relative priority of the scheduling classes, or simply let SCHED_FIFO preempt SCHED_DEADLINE. This would weaken (violate) the SCHED_DEADLINE admission test guarantees, but I find this feature less compelling than support for a flavor of mixed-criticality.

A realtime developer's checklist

Posted Nov 17, 2020 14:21 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

> This would weaken (violate) the SCHED_DEADLINE admission test guarantees, but I find this feature less compelling than support for a flavor of mixed-criticality.

So you're quite happy for a nuclear power plant to blow up because an operator wanted to play a shoot-em-up game on the same computer?

The whole point of Real Time is that deadlines are deadlines. As in "you're dead if they're missed".

I agree mixed criticality would be lovely, but not at the expense of true real-time operation.

Cheers,
Wol

A realtime developer's checklist

Posted Nov 17, 2020 15:46 UTC (Tue) by gus3 (guest, #61103) [Link]

> So you're quite happy for a nuclear power plant to blow up because an operator wanted to play a shoot-em-up game on the same computer?

If the operator wants to play a shoot-em-up while on duty, then the priority inversion is taking place within the operator, not the computer.

> mixed criticality would be lovely, but not at the expense of true real-time operation.

If the constraints of RT can be satisfied in two different domains, whether cores or hosts, it's still RT. Just because they have different scheduling requirements, doesn't make either of them "fake RT".

A realtime developer's checklist

Posted Nov 17, 2020 17:20 UTC (Tue) by glenn (subscriber, #102223) [Link]

> So you're quite happy for a nuclear power plant to blow up because an operator wanted to play a shoot-em-up game on the same computer?

Actually, no. Quite the opposite. I make no such argument.

Mixed-criticality scheduling is a real field of study in real-time systems. It has wide applications in hard real-time environments. In the scheme I described, where SCHED_FIFO can preempt SCHED_DEADLINE, threads with hard real-time requirements are scheduled by SCHED_FIFO, while threads with soft real-time requirements (bounded deadline tardiness, in this case), are scheduled by SCHED_DEADLINE. These ideas are backed by solid real-time scheduling theory.

Mixed-criticality scheduling is not widely supported outside of research kernels, but it is becoming a centerpiece feature of the seL4 microkernel.

Let me defend my specific claim that the SCHED_DEADLINE admission test is not useful for many users. An admission test is useful in an "open system," where the sysadmin (or system designer) cannot predict or control what is run under the SCHED_DEADLINE policy. If I had to guess at who could benefit from an open linux-rt system, I would guess multi-media and gaming. Contrast this with a "closed system," where the sysadmin/system-designer controls everything that runs on the system. Who are these users? Robots (including autonomous vehicles), high-frequency trading systems, and I would argue pro-audio tools. Admission tests are less useful in a closed system because the system designer can account for CPU requirements ahead of time. After all, they already know enough about their execution-time requirements to provide execution-time parameters to SCHED_DEADLINE. In short, there is no need for an admission test in a closed system. And this is where it gets in the way of more robust real-time scheduling: Linux could support a flavor of mixed-criticality scheduling if it did away with the admission test that no one is really using!

Please note that elimination of the admissions test does not mean eliminating budget enforcement. Budget enforcement can still be used to isolate run-away threads.

A realtime developer's checklist

Posted Nov 17, 2020 16:21 UTC (Tue) by Jonno (subscriber, #49613) [Link]

> We can't do this today with the current Linux scheduler, since SCHED_DEADLINE preempts SCHED_FIFO. This could be done if we were able to reorder the relative priority of the scheduling classes, or simply let SCHED_FIFO preempt SCHED_DEADLINE. This would weaken (violate) the SCHED_DEADLINE admission test guarantees, but I find this feature less compelling than support for a flavor of mixed-criticality.

In an ideal world, the scheduler would be able to tell whether it would be possible to meet all deadlines even if no SCHED_DEADLINE tasks gets scheduled until the next scheduler tick, and if so give priority to SCHED_FIFO, while still giving priority to SCHED_DEADLINE when necessary to meet deadlines.

Unfortunately that would require doing admission control every time the scheduler has to pick a task (rather than just when changing the scheduling policy of a task) which would probably have a prohibitively high overhead.

musl C and sched.h

Posted Nov 17, 2020 13:05 UTC (Tue) by binkley (subscriber, #8537) [Link] (3 responses)

Perhaps the situation is rapidly changing, but when I check the musl C source repo, I do see `sched.h`:

https://git.musl-libc.org/cgit/musl/log/?qt=grep&q=sc...

re: _a member of the audience noted that the musl C library does not implement [sched.h]_

What might be meant is that musl's implementation is incomplete for the purposes of the talk.

musl C and sched.h

Posted Nov 18, 2020 5:35 UTC (Wed) by mrybczyn (subscriber, #81776) [Link] (2 responses)

Thanks for the comment (I'm the author of the article).

Musl does include the sched_* functions, but they are not really implemented. Please check out https://git.musl-libc.org/cgit/musl/tree/src/sched/sched_... for example. It just returns an error. This is what the discussion covered.

musl C and sched.h

Posted Dec 3, 2020 3:05 UTC (Thu) by Hello71 (subscriber, #103412) [Link] (1 responses)

This is not accurate. musl implements the majority of the sched_* functions, including sched_setaffinity, sched_getaffinity, sched_get_priority_max, sched_get_priority_min, sched_getcpu, sched_rr_get_interval, and sched_yield. The only ones not implemented are sched_{get,set}{param,scheduler}. As explained by musl author Rich Felker at https://www.openwall.com/lists/musl/2016/03/01/4, this is because on Linux, these syscalls operate on a per-thread basis, but POSIX requires them to operate on a per-process basis. However, as he explains, the fix is to simply use pthread_{get,set}schedparam instead.

musl C and sched.h

Posted Dec 4, 2020 7:37 UTC (Fri) by mrybczyn (subscriber, #81776) [Link]

Thank you for the clarification!

A realtime developer's checklist

Posted Nov 18, 2020 18:15 UTC (Wed) by meyert (subscriber, #32097) [Link] (6 responses)

So what is a typical realtime application for Linux? Can you give an example of a Linux realtime "killer" app?

A realtime developer's checklist

Posted Nov 18, 2020 19:27 UTC (Wed) by hummassa (subscriber, #307) [Link] (3 responses)

Videoconference applications.
VoIP (voice over Internet Protocol)
Online gaming.
Community storage solutions.
Some e-commerce transactions.
Chatting.
IM (instant messaging)

A realtime developer's checklist

Posted Nov 19, 2020 13:04 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

Is instant messaging typically RT? Wouldn't the typing latency be the highest thing? Or is this more about "Typing…" notifications? Things interacting with hardware (vsync, constant FPS targets, machine controllers) or wetware deadlines (audio processing, audio/video sync) certainly make sense, but I would have thought typing would have interfered too much. Though if it is, it's almost certainly got longer deadlines than other use cases(?).

Note, I don't do RT development myself, so any misconceptions or naivete is certainly on my side :) .

A realtime developer's checklist

Posted Nov 19, 2020 14:03 UTC (Thu) by hummassa (subscriber, #307) [Link] (1 responses)

WRT RT, it's not the length of the deadline, but its predictability :-)

A realtime developer's checklist

Posted Nov 19, 2020 18:56 UTC (Thu) by glenn (subscriber, #102223) [Link]

Yes, it's more about predictability than latency, but optimizing latency can improve predictability (or allows one to achieve predictability on less powerful hardware).

Paul McKenney has a great paper on this: "'Real Time' vs. 'Real Fast': How to Choose?".

A realtime developer's checklist

Posted Nov 19, 2020 15:23 UTC (Thu) by adam820 (subscriber, #101353) [Link] (1 responses)

Any kind of audio processing app, something like Ardour (JACK). It's hard to play guitar, process the effects, and hear the output if the output lags behind your string hit by half a second. It'd be like talking and hearing your speech played back a half second behind; it's hard to do.

A realtime developer's checklist

Posted Nov 20, 2020 16:13 UTC (Fri) by azz (subscriber, #371) [Link]

And in particular, given the present circumstances, real-time music collaboration applications like Jamulus. I've been using Jamulus a lot over the last few months with an RT-patched kernel, and I've seen several non-technical users enthusing lately about JamulusOS (an Ubuntu Studio-based distribution that seems to be pretty decent at getting low-latency/jitter behaviour out of machines that Windows' audio drivers struggle on).

"priority"

Posted Dec 4, 2020 21:10 UTC (Fri) by dfsmith (guest, #20302) [Link] (1 responses)

Is there a reason that the priority (English word) is sometimes inverted from the English meaning of "priority"? It seems that "importance" would be less ambiguous. B-)
Examples:
"safety is our number one priority" -> set priority to 99.
"snacks are low priority" -> set priority to 1 ("low"/"high" priority works!)
"our second priority is to ensure lunch is ready" -> set priority to 98.
"coffee is way down our priority list" -> set priority to 10 (and ignore the yelling).

"priority"

Posted Dec 5, 2020 1:55 UTC (Sat) by neilbrown (subscriber, #359) [Link]

With tongue firmly in cheek, I might suggest that "priority" relates to "prior" meaning "before". And numbers "before" zero are normally negative.

So the first priority has a numerical priority of "-1". This is highest.
My second priority would have the value "-2", which isn't quite as high.

A positive value suggests a "laterity", meaning it can be left until later.


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds