
OSS in Space (Linux Journal)

Linux Journal takes a look at how OSS might have benefited the 1997 Mars Pathfinder mission. "At first glance, this dialogue is merely interesting; I think every hardware and software engineer/tinkerer should read them. On deeper reflection, however, I was struck by something more. Although I assume it was not their intention, the authors quite clearly demonstrate how open-source software (OSS) and the OSS development model would have helped this project enormously, not only in finding the bug but, in all probability, preventing the bug in the first place. The extracts from these e-mails and my comments below should make more sense to you after you've read the original postings."

It probably wouldn't have helped.

Posted Feb 14, 2004 0:33 UTC (Sat) by AnswerGuy (subscriber, #1256) [Link]


I don't think that the use of Linux would have helped in this situation.
It would be at least as easy to make this sort of mistake using Linux
as any other system.

For argument's sake, let's imagine that they built this around RTLinux or
RTAI. Since the Linux kernel runs as the idle task under the RT
microkernel/subsystem in those cases ... it would be all too easy
to create deadlocks between the user space processes and the RT
threads.

I realize that the watchdog circuits on these systems rebooted them,
repeatedly. Effectively, something in the start-up code was
re-initiating the error condition --- a lock-up/reboot loop. It's
easy to create similar situations on Linux (I've done it myself with
the watchdog device drivers and daemons, and caught the problem in testing).

Regardless of what OS you'd select, it's necessary to do a full
failure mode analysis, to look at every line of code between startup
and operation asking: "How could this particular function fail?"
That's (barely) possible in a single-threaded code sequence. As
we introduce multi-tasking or multi-threading, process control,
and I/O handling, we create opportunities for races and deadlocks.

I don't think modern software engineering can provide *FULL*
failure mode analysis for modern, general purpose, multi-tasking,
multi-threading systems, with "normal" I/O requirements.

The best we can hope for seems to be: make the system sufficiently
robust to reach some sort of debugger or remote access state, fail
over fully to a known working state, and simplify the problem domain
to the point where simpler, specialized systems can be used in
place of general purpose OSes.
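
As a rough illustration of "fail over to a known working state", here is
a minimal sketch of a boot-loop guard. Everything in it is hypothetical:
the names choose_boot_mode and mark_boot_healthy, the threshold of 3, and
the dict standing in for non-volatile storage are assumptions made for
the example, not anything from the Pathfinder software or a Linux
watchdog driver.

```python
# Hypothetical boot-loop guard: count boots that never reached a healthy
# state, and fall back to a minimal "safe" mode once the count is too high.

MAX_FAILED_BOOTS = 3  # assumed threshold, chosen only for illustration


def choose_boot_mode(state):
    """Decide how to boot. `state` is a dict standing in for NVRAM."""
    # Assume the boot will fail until mark_boot_healthy() proves otherwise.
    state["failed_boots"] = state.get("failed_boots", 0) + 1
    if state["failed_boots"] > MAX_FAILED_BOOTS:
        return "safe"    # skip the start-up code suspected of locking up
    return "normal"


def mark_boot_healthy(state):
    """Call once the system has run long enough to be considered stable."""
    state["failed_boots"] = 0


# Simulate five watchdog resets in a row with no healthy boot in between.
nvram = {}
modes = [choose_boot_mode(nvram) for _ in range(5)]
print(modes)
```

The idea is only that the counter survives the reset, so a
lock-up/reboot loop eventually lands in a state where someone can get
remote access instead of repeating forever.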

OSS in Space (Linux Journal)

Posted Feb 14, 2004 1:02 UTC (Sat) by ajax (subscriber, #7251) [Link]

Although the OSS *process* may have been useful, the conclusion that Linux itself would have been useful is unwarranted. 1) Linux does not implement ordered wait queues, therefore the highest priority process waiting on the queue is not the next one awoken from that queue, 2) Linux does not implement priority inheritance on any of its kernel-level or user-available semaphores, 3) Linux likes to hold spinlocks for very long periods. We have measured lock hold times (via the lockmeter patch) of up to 190 milliseconds (almost .5 seconds) in a 2.4 kernel. While a lock is held nothing else other than the servicing of interrupts can happen. Then, of course, there is the kernel's love of the Big Kernel Lock, which while held forces serialization of all process-level activities and is the most popular lock used in the kernel. Its elimination is a long and difficult process which is still a work in progress for 2.6 and (presumably) 2.7.
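
To make point 2 concrete, here is a toy, tick-based simulation of what priority inheritance buys you. It is a sketch under stated assumptions, not a model of any real scheduler: the three tasks, their priorities, arrival ticks and work amounts are all invented for illustration, and simulate() is a hypothetical helper.

```python
def simulate(inherit):
    """Run three tasks on one CPU; return the tick at which each finishes.

    "low" and "high" both need the same lock for all of their work;
    "medium" never touches it. With inherit=True, the lock owner runs at
    the priority of the highest-priority task waiting for the lock.
    """
    tasks = {
        "low":    {"prio": 1, "arrive": 0, "work": 3, "needs_lock": True},
        "medium": {"prio": 2, "arrive": 1, "work": 5, "needs_lock": False},
        "high":   {"prio": 3, "arrive": 1, "work": 1, "needs_lock": True},
    }
    owner = None   # name of the task currently holding the lock
    finish = {}
    tick = 0
    while len(finish) < len(tasks):
        ready = [n for n, t in tasks.items()
                 if t["arrive"] <= tick and n not in finish]

        def effective_prio(name):
            prio = tasks[name]["prio"]
            if inherit and name == owner:
                # Boost the owner to the priority of its blocked waiters.
                waiters = [tasks[w]["prio"] for w in ready
                           if w != owner and tasks[w]["needs_lock"]]
                prio = max([prio] + waiters)
            return prio

        # A task is runnable unless it needs a lock someone else holds.
        runnable = [n for n in ready
                    if not tasks[n]["needs_lock"] or owner in (None, n)]
        if not runnable:
            tick += 1      # idle tick; does not occur with these tasks
            continue
        current = max(runnable, key=effective_prio)
        if tasks[current]["needs_lock"]:
            owner = current
        tasks[current]["work"] -= 1
        tick += 1
        if tasks[current]["work"] == 0:
            finish[current] = tick
            if owner == current:
                owner = None
    return finish


without_pi = simulate(inherit=False)
with_pi = simulate(inherit=True)
print("high finishes at tick", without_pi["high"], "without inheritance")
print("high finishes at tick", with_pi["high"], "with inheritance")
```

Without inheritance, the medium-priority task preempts the low-priority lock holder, so the high-priority task is stuck behind work that has nothing to do with the lock. With inheritance, the lock holder is boosted just long enough to release the lock.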

OSS in Space (Linux Journal)

Posted Feb 14, 2004 1:04 UTC (Sat) by ajax (subscriber, #7251) [Link]

That is supposed to be ... almost .2 seconds ...

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds