Safety-critical realtime with Linux
Realtime processing, as many have said, is not synonymous with "real fast". It is, instead, focused on deterministic response time and repeatable results. Getting there involves quantifying the worst-case scenario and being prepared to handle it — a 99% success rate is not good enough. The emphasis on worst-case performance is at the core of the difference from performance-oriented processing, which uses caches, lookahead algorithms, pipelines, and more to optimize the average case.
Mauerer divided realtime processing into three sub-cases. "Soft realtime" is concerned with subjective deadlines, and is used in situations where "nobody dies if you miss the deadline", though the deadline should be hit most of the time. Media rendering was an example of this type of work. "95% realtime" applies when the deadline must be hit most of the time, but an occasional miss can be tolerated. Data acquisition and stock trading are examples; a missed deadline might mean a missed trade, but life as a whole goes on. The 100% realtime scenario, instead, applies to areas like industrial automation and aviation; if a deadline is missed, bad things happen.
Getting to 100% realtime performance requires quite a bit of work. To the extent possible, the worst-case execution time (WCET) of any task must be determined. Statistical testing is used to validate that calculation. The best approach is formal verification, where the response time is proved, but that can only be done for the smallest applications and operating systems. Formal verification has been performed for the L4 system, but that kernel has only 10,000 lines of code. Formal verification is not possible for a kernel like Linux.
100% Realtime performance is hard enough to achieve, but "safety-critical" adds another dimension of reliability requirements. You do not, he said, want to see a segmentation fault when you hit the brakes. Safety-critical computing is not the same as realtime, but the two tend to go together. There is a long list of standards covering various aspects of safety-critical computing; they all come together under the IEC 61508 umbrella standard — "10,000 pages of bureaucratic poetry" according to Mauerer.
Compliance with those standards is one of three routes to safety. The second is called "proven in use", which is essentially a way of saying that a system has been used for twenty years and hasn't shown problems yet. It is scary how far the "proven in use" claim can be pushed, he said. The final approach is "compliant non-compliant development", which is how many of these systems are actually built.
Components of a realtime safety-critical system
There is a list of design patterns used to put together safety-critical realtime systems; Mauerer described some of them:
- Run a traditional realtime operating system in a dedicated "side device" to handle the safety-critical work. There are a lot of these devices and systems available; they are simple and come pre-certified. The WCET of a given task can be calculated relatively easily. These systems can be hard to extend, though; they also suffer from vendor lock-in and unusual APIs.
- Use a realtime-enhanced kernel — a solution that is common in the Linux community. With this approach, it's possible to keep and use existing Linux know-how and incorporate high-level technologies. The downside is that certification is difficult, the resulting systems are complex, and only statistical assurance is possible.
- Run a "separation kernel" on hardware that enforces partitioning. This solution is common in the proprietary world. It offers a clean split between the realtime and non-realtime parts of the system, and there is a lot of certification experience with these systems. But there is strong coupling between the two parts at the hardware level, and there are vendor lock-in issues.
- Run a co-kernel on the same hardware — like the separation kernel, but without the hardware partitioning. Once again, there is a clean division between the two parts of the system, and this solution is resource-efficient. But the necessary code is not in the mainline kernel, leading to maintenance difficulties, and there can be implicit couplings between the two kernels.
- Use asymmetric multiprocessing; this solution is becoming more popular, he said. A multiprocessor system is partitioned, with some CPUs dedicated to realtime processing. Performance tends to be good, but there can be implicit coupling between the two parts. This solution is also relatively new and fast-moving, complicating maintenance.
One common feature among all of these approaches (except the second) is that they use some sort of partitioning to separate the realtime processing from everything else the system is doing. The exception (the realtime-enhanced kernel) was brought about by adding support for full preemption, deterministic timing behavior, and avoidance of priority inversion. All that was needed to accomplish this, he said, was the world's best programmers and about two decades of time.
If you want to create a system with such a kernel, you need to start by avoiding "stupid stuff". No page faults can be taken; memory must be locked into RAM. Inappropriate system calls — those involving networking, for example — must be avoided; access to block devices is also not allowed. If these rules are followed, a realtime Linux system can achieve maximum latencies of about 50µs on an x86 processor, or 150µs on a Raspberry Pi. This is far from perfect, but it is enough for most uses.
There are a lot of advantages to using a realtime Linux kernel. The patches are readily available, as is community support. Existing engineering knowledge can be reused. Realtime Linux offers multi-core scalability and the ability to run realtime code in user space. On the other hand, the resulting system is hard to certify. If much smaller latencies are required, one needs specialized deep knowledge — it's best to have Thomas Gleixner available. It is also easy to mix up the realtime and non-realtime parts of the system; this does happen in practice.
One could, instead, use Xenomai, which Mauerer described as a skin for a traditional realtime operating system. It can run over Linux or in a co-kernel arrangement, but some patches need to be applied. I-pipe is used to dispatch interrupts to the realtime or Linux kernel as needed. Xenomai can achieve 10µs latencies on x86 systems, or 50µs on a Raspberry Pi. It offers a clean split between the parts of the system and is a lightweight solution. On the other hand, there are few developers working with Xenomai, and it tends to experience regressions when the upstream kernel changes.
Yet another approach, along the separation kernel lines, is an ARM system with a programmable realtime unit (PRU). The PRU has its own ARM-like processors and its own memory, so there is no contention with the main CPU. The main core can run Linux and communicate with the PRU via the remoteproc interface. Such systems are highly deterministic, cleanly split out the realtime work, and are simple. But they are also hardware-specific and require more maintenance.
Getting to safety-critical
There are, he said, two common approaches to the creation of safety-critical Linux systems. The first is called SIL2LinuxMP; it works by partitioning the system's CPUs and running applications in containers. Dedicated CPUs can then be used to isolate the safety-critical work from the rest of the system. This work is aiming for SIL certification, but at the SIL2 level only. SIL3 is considered to be too hard to reach with a Linux-based system.
The alternative is the Jailhouse system. It is a hypervisor that uses hardware virtualization to partition the system. Realtime code can be run in one partition, while safety-critical code can run in another. There are a couple of Jailhouse-like systems, but they have some disadvantages. SafeG relies on the Arm TrustZone mechanism, and only supports two partitions. The Quest-V kernel [PDF] is purely a research system that is not suitable for real-world use. So Jailhouse is Mauerer's preferred approach.
The Jailhouse project is focused on simplicity, he said. It lets an ordinary Linux kernel bring up the system and deal with hardware initialization issues; the partitioning is done once things are up and running. The "regular" Linux system is then shunted over into a virtual machine that is under the hypervisor's control. It works, but there are still some issues to deal with. Jailhouse cannot split up memory-mapped I/O regions, for example, leading to implicit coupling between the system parts. There are other hardware resources that cannot be partitioned; he mentioned clocks as being particularly problematic. The "unsafe" part of the system might manage to turn off a clock needed by the safety-critical partition, for example.
Overall, though, he said that he is happy with the Jailhouse approach. It is able to achieve 15µs maximum latencies in most settings. The obvious conclusion of the talk was thus a recommendation that Jailhouse is the best approach for safety-critical systems at this time.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting your editor's travel to the Open Source Summit.]
Posted Sep 25, 2017 18:43 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (15 responses)
The article also does not mention an important metric: jitter. Often we also do not want to do things that quickly, just at the right time.
(disclaimer: I develop safety-critical systems according to DO178B/C and DO254, so I'm probably a bit paranoid)
Alvie
Posted Sep 25, 2017 19:00 UTC (Mon)
by SEJeff (guest, #51588)
[Link] (2 responses)
Posted Sep 25, 2017 19:06 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (1 responses)
Alvie
Posted Sep 27, 2017 7:00 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
The cost of an airplane is not dominated by the HI-3717 (a chip that implements ARINC 717). Re-implementing that in software doesn't make much sense, much less doing it while dealing with RT guarantees.
Posted Sep 25, 2017 19:04 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (2 responses)
Are you really claiming that when I did real time systems with clock rates of under 1 MHz (and reliable interrupt timing thus limited to around 50 µs, as interrupts took several clock cycles), it wasn't realtime because the system wasn't as fast as a modern CPU? If so, then you're saying that the Apollo Guidance System was not real time, either, as it has comparable latencies to the systems I was working with.
The general definition of a real time system is a system where the upper and lower bounds on response time are fixed and quantified before the program runs. We tend to ignore the lower bound, as it's usually simple to add delay if the lower bound is too low, but you're real time as long as failing to deliver results before the upper bound is hit results in degraded functionality.
There's a further split between "soft" real time and "hard" real time - the usual definition is that a soft real time system recovers by itself if it resumes meeting the upper bound having missed it for some processing (e.g. a music player), while a hard real time system cannot recover if it ever misses the upper bound (e.g. automated braking system, evidential-grade sound recorder).
Posted Sep 25, 2017 19:11 UTC (Mon)
by alvieboy (guest, #51617)
[Link] (1 responses)
My distinction between "soft" and "hard" real time relates to jitter: both have upper-bound (though not necessarily lower-bound) delays, but "soft" real time does not have an upper bound on jitter, in contrast with "hard" real time, which guarantees it.
Alvie
Posted Sep 28, 2017 15:40 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
That definition makes audio playback (traditionally seen as a soft real time task) into a hard real time task; if the jitter is high, playback fails (the audio is unlistenable).
This is why computer science textbooks tend to distinguish based on the effect of a missed deadline; miss a hard deadline, and the task will not recover without external intervention, while missing a soft deadline can be recovered from by meeting a later deadline.
Posted Sep 25, 2017 19:06 UTC (Mon)
by jmspeex (subscriber, #51639)
[Link] (1 responses)
Posted Sep 26, 2017 4:06 UTC (Tue)
by johnjones (guest, #5462)
[Link]
Simply measuring the response time is not going to work unless you explicitly flush all caches/predictions before each run and aim for the worst case each time... even then the hardware could betray you...
The TrustZone approach is interesting as it explicitly switches mode and disables IRQs; since it was designed for security, repositioning it for RT is a logical approach...
TrustZone as a bonus gives the ability to verify settings and versions: exactly what you need in an RT system.
So the question is: has anyone run a Linux kernel in a TEE while aiming for RT?
Posted Sep 25, 2017 23:31 UTC (Mon)
by drag (guest, #31333)
[Link] (1 responses)
> The article also does not mention an important metric: jitter. Often we also do not want to do things that quickly, just at the right time.
My understanding is that 'realtime' is not necessarily about doing things 'quickly' but about doing them 'deterministically'. I think that is the common understanding used in 'realtime Linux' and is the definition used in the article.
So while it doesn't mention jitter explicitly, the 15µs latency figure is going to include the jitter... so it seems safe to assume that 15µs is the upper bound of what you are going to see. If that is true, then it's accurate to say that if your application can handle a maximum of 15µs of jitter, then it's something that may work out well with the realtime patches and appropriate hardware.
Posted Sep 26, 2017 9:29 UTC (Tue)
by jki (subscriber, #68176)
[Link]
But these discussions about numbers are indeed pointless without looking at the concrete closed loop on a concrete system. I mean, I could tell you that we already measured <1 µs timer interrupt latency in a bare-metal Jailhouse cell besides a loaded Linux on a nice Intel box. But that did not measure any real I/O. Depending on your hardware, the I/O path may be affected by parallel activity of other partitions in the system or hardware-inherent significant jitters or even indeterminism.
So the key takeaways should be:
- Jailhouse allows removing itself deterministically from the time-critical path on suitable systems, or at least adds only deterministic delays to the rest.
- Hardware plays the major role in what can actually be achieved, deterministically and safely.
Specifically, the latter point is what makes it challenging right now to certify any software (OSes, hypervisors, open source or proprietary) on today's complex multicore boxes for safety-critical purposes.
Posted Oct 28, 2018 7:20 UTC (Sun)
by nevets (subscriber, #11875)
[Link] (4 responses)
That's not what defines "Real Time". Again, real-time does not mean real-fast, and yes, tight latency requirements fall under real fast. "Real Time" means that you can 100% guarantee that you will never exceed your WCET, and that you can calculate what that time actually is (or at least a time greater than the actual WCET). Real Time simply means that there are no outliers. If you require 1 microsecond response times, then some RTOSes may not fit your needs. But that doesn't make them any less RT.
I know of Real Time systems in the industrial area that only require a millisecond response time. They are just as "real-time" as ones that require 1 microsecond, as really bad things will happen if there's an outlier.
The response time is only a variable in the requirements, it does not define it as a real-time system.
Posted Oct 28, 2018 17:34 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Oct 28, 2018 17:39 UTC (Sun)
by nevets (subscriber, #11875)
[Link] (2 responses)
Posted Oct 29, 2018 11:39 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Oct 29, 2018 12:41 UTC (Mon)
by tao (subscriber, #17563)
[Link]
Posted Sep 26, 2017 7:37 UTC (Tue)
by seanyoung (subscriber, #28711)
[Link]
I know some (of my) drivers are just never going to work, e.g. drivers/media/rc/gpio-ir-tx.c.
Posted Sep 26, 2017 13:44 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
In aeroplanes, as mentioned above, they have no choice; but other vehicle safety systems may be considered just as vital even though they always have the last-ditch "I give up, stop everything and let the humans fix it" option, which in an aeroplane would violate the "nobody dies" condition because we can't turn off gravity while we fix the problem.
The example we used to teach students about real-time systems was a toy elevator. If the students screwed up their elevator would physically crash and the toy would begin to tear itself apart. In a real elevator of course this outcome is prevented by a mechanical limit, nobody will die if the software has a bug. But I hope we'd agree that in practice what we want here is not soft real-time!
Posted Oct 5, 2017 6:23 UTC (Thu)
by filssavi (guest, #109018)
[Link]
Case in point: a buffer overflow in Toyota ECUs could get your car stuck accelerating at full tilt even with the foot completely off the pedal; at least one family has been killed by said bug.
Brakes are also fully computer-controlled; in fact, the ABS/ESP controller can independently apply or release the brakes as it pleases.
Basically, a car is more a computer that you walk into and that lets you drive it as it sees fit than a mechanical machine.
Posted Oct 7, 2017 23:40 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
They still use that in aircraft - George will give up and hand back control to the pilots. The Air France crash over the South Atlantic was caused by that exact scenario.
And the problem is that even that scenario is real-time: it can often take longer for humans to take back control than the time that is available...
Cheers,
Wol
Posted Sep 26, 2017 14:01 UTC (Tue)
by k3ninho (subscriber, #50375)
[Link] (2 responses)
K3n.
Posted Sep 26, 2017 16:30 UTC (Tue)
by excors (subscriber, #95769)
[Link]
Posted Sep 28, 2017 8:18 UTC (Thu)
by wolfgang (subscriber, #5382)
[Link]
Posted Sep 30, 2017 8:33 UTC (Sat)
by kev009 (guest, #43906)
[Link]
Generally, you should take these numbers as guidelines for what to expect on "typical" systems. Your mileage may vary considerably when, for instance, SMIs come into play on x86.