Safety-critical realtime with Linux
Realtime processing, as many have said, is not synonymous with "real fast". It is, instead, focused on deterministic response time and repeatable results. Getting there involves quantifying the worst-case scenario and being prepared to handle it — a 99% success rate is not good enough. The emphasis on worst-case performance is at the core of the difference from performance-oriented processing, which uses caches, lookahead algorithms, pipelines, and more to optimize the average case.
Mauerer divided realtime processing into three sub-cases. "Soft realtime" is concerned with subjective deadlines, and is used in situations where "nobody dies if you miss the deadline", though the deadline should be hit most of the time. Media rendering was an example of this type of work. "95% realtime" applies when the deadline must be hit most of the time, but an occasional miss can be tolerated. Data acquisition and stock trading are examples; a missed deadline might mean a missed trade, but life as a whole goes on. The 100% realtime scenario, instead, applies to areas like industrial automation and aviation; if a deadline is missed, bad things happen.
Getting to 100% realtime performance requires quite a bit of work. To the extent possible, the worst-case execution time (WCET) of any task must be determined. Statistical testing is used to validate that calculation. The best approach is formal verification, where the response time is proved, but that can only be done for the smallest applications and operating systems. Formal verification has been performed for the L4 system, but that kernel has only 10,000 lines of code. Formal verification is not possible for a kernel like Linux.
100% Realtime performance is hard enough to achieve, but "safety-critical" adds another dimension of reliability requirements. You do not, he said, want to see a segmentation fault when you hit the brakes. Safety-critical computing is not the same as realtime, but the two tend to go together. There is a long list of standards covering various aspects of safety-critical computing; they all come together under the IEC 61508 umbrella standard — "10,000 pages of bureaucratic poetry" according to Mauerer.
Compliance with those standards is one of three routes to safety. The second is called "proven in use", which is essentially a way of saying that a system has been used for twenty years and hasn't shown problems yet. It is scary how far the "proven in use" claim can be pushed, he said. The final approach is "compliant non-compliant development", which is how many of these systems are actually built.
Components of a realtime safety-critical system
There is a list of design patterns used to put together safety-critical realtime systems; Mauerer described some of them:
- Run a traditional realtime operating system in a dedicated "side device" to handle the safety-critical work. There are a lot of these devices and systems available; they are simple and come pre-certified. The WCET of a given task can be calculated relatively easily. These systems can be hard to extend, though; they also suffer from vendor lock-in and unusual APIs.
- Use a realtime-enhanced kernel — a solution that is common in the Linux community. With this approach, it's possible to keep and use existing Linux know-how and incorporate high-level technologies. The downside is that certification is difficult, the resulting systems are complex, and only statistical assurance is possible.
- Run a "separation kernel" on hardware that enforces partitioning. This solution is common in the proprietary world. It offers a clean split between the realtime and non-realtime parts of the system, and there is a lot of certification experience with these systems. But there is strong coupling between the two parts at the hardware level, and there are vendor lock-in issues.
- Run a co-kernel on the same hardware — like the separation kernel, but without the hardware partitioning. Once again, there is a clean division between the two parts of the system, and this solution is resource-efficient. But the necessary code is not in the mainline kernel, leading to maintenance difficulties, and there can be implicit couplings between the two kernels.
- Use asymmetric multiprocessing; this solution is becoming more popular, he said. A multiprocessor system is partitioned, with some CPUs dedicated to realtime processing. Performance tends to be good, but there can be implicit coupling between the two parts. This solution is also relatively new and fast-moving, complicating maintenance.
One common feature among all of these approaches (except the second) is that they use some sort of partitioning to separate the realtime processing from everything else the system is doing. The exception (the realtime-enhanced kernel) was brought about by adding support for full preemption, deterministic timing behavior, and avoidance of priority inversion. All that was needed to accomplish this, he said, was the world's best programmers and about two decades of time.
If you want to create a system with such a kernel, you need to start by avoiding "stupid stuff". No page faults can be taken; memory must be locked into RAM. Inappropriate system calls — those involving networking, for example — must be avoided; access to block devices is also not allowed. If these rules are followed, a realtime Linux system can achieve maximum latencies of about 50µs on an x86 processor, or 150µs on a Raspberry Pi. This is far from perfect, but it is enough for most uses.
There are a lot of advantages to using a realtime Linux kernel. The patches are readily available, as is community support. Existing engineering knowledge can be reused. Realtime Linux offers multi-core scalability and the ability to run realtime code in user space. On the other hand, the resulting system is hard to certify. If much smaller latencies are required, one needs specialized deep knowledge — it's best to have Thomas Gleixner available. It is also easy to mix up the realtime and non-realtime parts of the system; this does happen in practice.
One could, instead, use Xenomai, which Mauerer described as a skin for a traditional realtime operating system. It can run over Linux or in a co-kernel arrangement, but some patches need to be applied. I-pipe is used to dispatch interrupts to the realtime or Linux kernel as needed. Xenomai can achieve 10µs latencies on x86 systems, or 50µs on a Raspberry Pi. It offers a clean split between the parts of the system and is a lightweight solution. On the other hand, there are few developers working with Xenomai, and it tends to experience regressions when the upstream kernel changes.
Yet another approach, along the separation kernel lines, is an ARM system with a programmable realtime unit (PRU). The PRU has its own ARM-like processors and its own memory, so there is no contention with the main CPU. The main core can run Linux and communicate with the PRU via the remoteproc interface. Such systems are highly deterministic, cleanly split out the realtime work, and are simple. But they are also hardware-specific and require more maintenance.
Getting to safety-critical
There are, he said, two common approaches to the creation of safety-critical Linux systems. The first is called SIL2LinuxMP; it works by partitioning the system's CPUs and running applications in containers. Dedicated CPUs can then be used to isolate the safety-critical work from the rest of the system. This work is aiming for SIL certification, but at the SIL2 level only. SIL3 is considered to be too hard to reach with a Linux-based system.
The alternative is the Jailhouse system. It is a hypervisor that uses hardware virtualization to partition the system. Realtime code can be run in one partition, while safety-critical code can run in another. There are a couple of Jailhouse-like systems, but they have some disadvantages. SafeG relies on the Arm TrustZone mechanism, and only supports two partitions. The Quest-V kernel [PDF] is purely a research system that is not suitable for real-world use. So Jailhouse is Mauerer's preferred approach.
The Jailhouse project is focused on simplicity, he said. It lets an ordinary Linux kernel bring up the system and deal with hardware initialization issues; the partitioning is done once things are up and running. The "regular" Linux system is then shunted over into a virtual machine that is under the hypervisor's control. It works, but there are still some issues to deal with. Jailhouse cannot split up memory-mapped I/O regions, for example, leading to implicit coupling between the system parts. There are other hardware resources that cannot be partitioned; he mentioned clocks as being particularly problematic. The "unsafe" part of the system might manage to turn off a clock needed by the safety-critical partition, for example.
Overall, though, he said that he is happy with the Jailhouse approach. It is able to achieve 15µs maximum latencies in most settings. The obvious conclusion of the talk was thus a recommendation that Jailhouse is the best approach for safety-critical systems at this time.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the Open Source Summit.]
Posted Sep 25, 2017 18:43 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (15 responses)
The article also does not mention an important metric: jitter. Often we also do not want to do things that quickly, just at the right time.
(disclaimer: I develop safety-critical systems according to DO178B/C and DO254, so I'm probably a bit paranoid)
Alvie
Posted Sep 25, 2017 19:00 UTC (Mon)
by SEJeff (guest, #51588)
[Link] (2 responses)
Posted Sep 25, 2017 19:06 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (1 responses)
Alvie
Posted Sep 27, 2017 7:00 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
The cost of an airplane is not dominated by the HI-3717 (a chip that implements ARINC 717). Re-implementing that in software doesn't make much sense, much less doing it while dealing with RT guarantees.
Posted Sep 25, 2017 19:04 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (2 responses)
Are you really claiming that when I did real time systems with clock rates of under 1 MHz (and reliable interrupt timing thus limited to around 50 µs, as interrupts took several clock cycles), it wasn't realtime because the system wasn't as fast as a modern CPU? If so, then you're saying that the Apollo Guidance System was not real time, either, as it has comparable latencies to the systems I was working with.
The general definition of a real time system is a system where the upper and lower bounds on response time are fixed and quantified before the program runs. We tend to ignore the lower bound, as it's usually simple to add delay if the lower bound is too low, but you're real time as long as failing to deliver results before the upper bound is hit results in degraded functionality.
There's a further split between "soft" real time and "hard" real time - the usual definition is that a soft real time system recovers by itself if it resumes meeting the upper bound having missed it for some processing (e.g. a music player), while a hard real time system cannot recover if it ever misses the upper bound (e.g. automated braking system, evidential-grade sound recorder).
Posted Sep 25, 2017 19:11 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (1 responses)
My distinction between "soft" and "hard" real time relates to jitter: both have upper-bound (though not necessarily lower-bound) delays, but "soft" real time does not have an upper bound on jitter, in contrast with "hard" real time, which guarantees it.
Alvie
Posted Sep 28, 2017 15:40 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
That definition makes audio playback (traditionally seen as a soft real time task) into a hard real time task; if the jitter is high, playback fails (the audio is unlistenable).
This is why computer science textbooks tend to distinguish based on the effect of a missed deadline; miss a hard deadline, and the task will not recover without external intervention, while missing a soft deadline can be recovered from by meeting a later deadline.
Posted Sep 25, 2017 19:06 UTC (Mon)
by jmspeex (subscriber, #51639)
[Link] (1 responses)
Posted Sep 26, 2017 4:06 UTC (Tue)
by johnjones (guest, #5462)
[Link]
Simply measuring the response time is not going to work unless you explicitly flush all caches/predictions before each run and aim for the worst case each time... even then the hardware could betray you...
The TrustZone approach is interesting as it explicitly switches mode and disables IRQs; since it was designed for security, repositioning it for RT is a logical approach...
TrustZone as a bonus gives the ability to verify settings and versions: exactly what you need in an RT system.
So the question is: has anyone run a Linux kernel in a TEE while aiming for RT?
Posted Sep 25, 2017 23:31 UTC (Mon)
by drag (guest, #31333)
[Link] (1 responses)
> The article also does not mention an important metric: jitter. Often we also do not want to do things that quickly, just at the right time.
My understanding is that 'realtime' is not necessarily about doing things 'quickly' but about doing them 'deterministically'. I think that is the common understanding used in 'realtime Linux' and is the definition used in the article.
So while it doesn't mention jitter explicitly, the 15µs latency figure is going to include the jitter... so it seems safe to assume that 15µs is the upper bound of what you are going to see. If that is true, then it's accurate to say that if your application can handle a maximum of 15µs of jitter, then it's something that may work out well with the realtime patches and appropriate hardware.
Posted Sep 26, 2017 9:29 UTC (Tue)
by jki (subscriber, #68176)
[Link]
But these discussions about numbers are indeed pointless without looking at the concrete closed loop on a concrete system. I mean, I could tell you that we already measured <1 µs timer interrupt latency in a bare-metal Jailhouse cell besides a loaded Linux on a nice Intel box. But that did not measure any real I/O. Depending on your hardware, the I/O path may be affected by parallel activity of other partitions in the system or hardware-inherent significant jitters or even indeterminism.
So the key takeaways should be:
- Jailhouse allows removing itself deterministically from the time-critical path on suitable systems, or at least adds only deterministic delays to the rest.
- Hardware plays the major role in what can actually be achieved, deterministically and safely.
Specifically, the latter point is what makes it challenging right now to certify any software (OSes, hypervisors, open source or proprietary) on today's complex multicore boxes for safety-critical purposes.
Posted Oct 28, 2018 7:20 UTC (Sun)
by nevets (subscriber, #11875)
[Link] (4 responses)
That's not what defines "Real Time". Again, real-time does not mean real-fast, and yes, tight latency requirements fall under real fast. "Real Time" means that you can 100% guarantee that you will never exceed your WCET, and that you can calculate what that time actually is (or at least a time greater than the actual WCET). Real Time simply means that there are no outliers. If you require 1 microsecond response times, then some RTOSes may not fit your needs. But that doesn't make them any less RT.
I know of Real Time systems in the industrial area that only require a millisecond response time. They are just as "real-time" as ones that require 1 microsecond, as really bad things will happen if there's an outlier.
The response time is only a variable in the requirements, it does not define it as a real-time system.
Posted Oct 28, 2018 17:34 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Oct 28, 2018 17:39 UTC (Sun)
by nevets (subscriber, #11875)
[Link] (2 responses)
Posted Oct 29, 2018 11:39 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Oct 29, 2018 12:41 UTC (Mon)
by tao (subscriber, #17563)
[Link]
Posted Sep 26, 2017 7:37 UTC (Tue)
by seanyoung (subscriber, #28711)
[Link]
I know some (of my) drivers are just never going to work, e.g. drivers/media/rc/gpio-ir-tx.c.
Posted Sep 26, 2017 13:44 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
In aeroplanes, as mentioned above, they have no choice; but other vehicle safety systems may be considered just as vital even though they always have the last-ditch "I give up, stop everything and let the humans fix it" option, which in an aeroplane would violate the "nobody dies" condition because we can't turn off gravity while we fix the problem.
The example we used to teach students about real-time systems was a toy elevator. If the students screwed up their elevator would physically crash and the toy would begin to tear itself apart. In a real elevator of course this outcome is prevented by a mechanical limit, nobody will die if the software has a bug. But I hope we'd agree that in practice what we want here is not soft real-time!
Posted Oct 5, 2017 6:23 UTC (Thu)
by filssavi (guest, #109018)
[Link]
Case in point: a buffer overflow in Toyota ECUs could get your car stuck accelerating at full tilt even with the foot completely off the pedal; at least one family has been killed by said bug.
Brakes are also fully computer-controlled; in fact, the ABS/ESP controller can independently apply or release the brakes as it pleases.
Basically, a car is more a computer that you walk into and that lets you drive it as it sees fit than a mechanical machine.
Posted Oct 7, 2017 23:40 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
They still use that in aircraft - George will give up and hand back control to the pilots. The Air France crash over the South Atlantic was caused by that exact scenario.
And the problem is that even that scenario is real-time: it can often take longer for humans to take back control than the time that is available...
Cheers,
Wol
Posted Sep 26, 2017 14:01 UTC (Tue)
by k3ninho (subscriber, #50375)
[Link] (2 responses)
K3n.
Posted Sep 26, 2017 16:30 UTC (Tue)
by excors (subscriber, #95769)
[Link]
Posted Sep 28, 2017 8:18 UTC (Thu)
by wolfgang (subscriber, #5382)
[Link]
Posted Sep 30, 2017 8:33 UTC (Sat)
by kev009 (guest, #43906)
[Link]
Generally, you should take these numbers as guidelines for what to expect on "typical" systems. Your mileage may vary considerably when, for instance, SMIs come into play on x86.