By Jake Edge
March 6, 2013
On day two of the 2013 Embedded
Linux Conference, Robert Rose of SpaceX spoke about the "Lessons
Learned Developing Software for Space Vehicles". In his talk, he discussed
how SpaceX develops its Linux-based software for a wide variety of tasks
needed to put spacecraft into orbit—and eventually beyond. Linux
runs everywhere at SpaceX, he said, on everything from desktops to spacecraft.
Rose is the lead for the avionics flight software team at SpaceX. He is
a former video game programmer, and said that some lessons from that work
were valuable in his current job. He got his start with Linux in
1994 with Slackware.
SpaceX as a company strongly believes in making humans into a
multi-planetary species. A Mars colony is the goal, but in order to get
there, you need rockets and spaceships, he said. It is currently expensive
to launch space vehicles, so there is a need to "drive costs down" in order
to reach the goal.
The company follows a philosophy of reusability, which helps in driving
costs down, Rose
said. That has already been tried to some extent with the space shuttle
program, but SpaceX takes it further. Not only are hardware
components reused between different spacecraft, but the software is shared
as well. The company builds its rockets from the ground up at its
facility, rather than contracting out various pieces. That allows for
closer and more frequent hardware-software integration.
One thing that Rose found hard to get used to early on in his time at
SpaceX is the company's focus on the "end goal". When decisions are
being made, people will often ask: "is this going to work for
the Mars mission?" Mars doesn't always win, but that concern is always
examined, he said.
Challenges
Some of the challenges faced by the company are extreme, because the
safety of people and property is involved. The spacecraft are dangerous
vehicles that could cause serious damage if their fuel were to explode, for
example. There is "no undo", no second chance to get things right; once
the rocket launches "it's just gonna go". Another problem that he didn't
encounter until he started working in the industry is the effect of
radiation in space, which can "randomly flip bits"; the
system design needs to take that into account.
There are some less extreme challenges that SpaceX shares with other industries,
Rose said. Dealing with proprietary hardware and a target platform that is
not the same as the development platform are challenges shared with
embedded Linux, for example. In addition, the SpaceX team has had to face
the common problem that "no one outside of software understands software".
SpaceX started with the Falcon rocket and eventually
transitioned the avionics code to the Dragon spacecraft. The obvious
advantage of sharing code is
that bugs fixed on one platform are automatically fixed on the other. But
there are differences in the software requirements for the launch
vehicles and spacecraft, largely having to do with the different reaction
times available.
As long as a spacecraft is not within 250 meters of the International Space
Station (ISS), it can take some time to react to any problem. For a
rocket, that luxury is not available; it must react in short order.
False positives are one problem that needs to be taken into
account. Rose mentioned the heat shield indicator on the Mercury 6 mission
(the first US manned orbital flight) which showed that the heat shield had
separated. NASA tried to figure out a way to do a re-entry with no heat
shield, but "eventually just went for it". It turned out to be a false
positive. Once again, the amount of time available to react is different
for launch vehicles and spacecraft.
Gathering data
Quoting Fred Brooks (of The Mythical Man-Month fame), Rose said
"software is invisible". To make software more visible, you need to know what
it is doing, he said, which means creating "metrics on everything you can
think of". With a rocket, you can't just connect via JTAG and "fire up
gdb", so the software needs to keep track of what it is doing. Those
metrics should cover areas like performance, network utilization, CPU load,
and so on.
The metrics gathered, whether from testing or real-world use, should be
stored, because it is "incredibly valuable" to be able to go back through
them, he said. For his systems, telemetry data is stored with the program
metrics, as is the version of all of the code that was running, so that
everything can be reproduced if needed.
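As a rough illustration of that idea (not SpaceX's actual format, which was not described in the talk), each metrics sample could be written out together with the version of the code that produced it. The function names, the field names, and the JSON-lines file layout below are all assumptions made for the sketch.

```python
import json
import time


def make_record(metrics, code_version, timestamp=None):
    """Bundle one metrics sample with the code version that produced it.

    The field names here are hypothetical, chosen only for illustration.
    """
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "code_version": code_version,
        "metrics": dict(metrics),
    }


def append_record(path, record):
    """Append one record as a single JSON line so old data is never rewritten."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_records(path):
    """Read every stored record back, in the order it was written."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because every record carries its own version string, a later analysis can check out exactly that code when trying to reproduce a run.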
SpaceX has programs to parse the metrics data and raise an
alarm when "something goes bad". It is important to automate that, Rose
said, because forcing a human to do it "would suck". The same programs run on
the data whether it is generated from a developer's test, from a run on the
spacecraft, or from a mission. Any failures should be seen as an
opportunity to add new metrics. It takes a while to "get into the rhythm"
of doing so, but it is "very useful". He likes to "geek out on error
reporting", using tools like libSegFault and ftrace.
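A minimal sketch of such an automated check might look like the following. The metric names and limits are invented for the example; the talk did not say which metrics SpaceX watches or what thresholds it uses. The point is only that the same function can be run over data from any source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Allowed range for one metric (hypothetical names and values)."""
    low: float
    high: float


# Illustrative limits only; they are not taken from the talk.
LIMITS = {
    "cpu_load_pct": Limit(0.0, 85.0),
    "net_util_pct": Limit(0.0, 70.0),
}


def find_alarms(samples, limits=LIMITS):
    """Return (sample_index, metric_name, value) for each out-of-range value."""
    alarms = []
    for index, sample in enumerate(samples):
        for name, limit in limits.items():
            if name not in sample:
                continue  # a metric missing from a sample is not an alarm here
            value = sample[name]
            if not (limit.low <= value <= limit.high):
                alarms.append((index, name, value))
    return alarms
```

When a failure turns up that no limit caught, adding a new entry to the table is the code-level equivalent of treating the failure as a chance to add a metric.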
Automation is important, and continuous integration is "very valuable",
Rose said. He suggested building for every platform all of the time, even
for "things you don't use any more". SpaceX does that and has found
interesting problems when building unused code. Unit tests are run from
the continuous integration system any time the code changes. "Everyone
here has 100% unit test coverage", he joked, but running whatever tests are
available, and creating new ones, is useful. When he worked on video games,
they had a test that would just "warp" the character to random locations in
a level and have it look in the four directions, which regularly found problems.
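The "warp" test is a form of randomized testing, and the general shape is easy to sketch. In the example below, `render_view` stands in for whatever the game would do to draw a view; it, the grid-based level, and the parameter names are assumptions made for illustration, not details from the talk.

```python
import random

DIRECTIONS = ("north", "east", "south", "west")


def warp_test(render_view, width, height, iterations=1000, seed=0):
    """Warp to random positions, look in four directions, collect failures.

    render_view(x, y, direction) is a caller-supplied (hypothetical) function
    that raises an exception when something goes wrong.
    """
    rng = random.Random(seed)  # a fixed seed makes any failure reproducible
    failures = []
    for _ in range(iterations):
        x, y = rng.randrange(width), rng.randrange(height)
        for direction in DIRECTIONS:
            try:
                render_view(x, y, direction)
            except Exception as exc:
                failures.append((x, y, direction, repr(exc)))
    return failures
```

Recording the position and direction with each failure matters as much as finding it, since that is what lets a developer go back and reproduce the problem.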
"Automate process processes", he said. Things like coding standards,
static analysis, spaces vs. tabs, or detecting the use of Emacs should be
done automatically. SpaceX has a complicated process where changes cannot
be made without tickets, code review, signoffs, and so forth, but all of
that is checked automatically. If static analysis is part of the workflow,
make it such that the code will not build unless it passes that analysis
step.
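One way to picture a check that gates the build is a small script whose exit status the build system honors. The sketch below enforces a single invented rule (no tabs in indentation) purely as an example; SpaceX's actual checks and tooling were not described, and the function names are hypothetical.

```python
import sys


def find_tab_indents(source_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


def check_files(paths):
    """Return a process exit status: 0 if every file passes, 1 otherwise."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number in find_tab_indents(handle.read()):
                print(f"{path}:{number}: tab in indentation", file=sys.stderr)
                status = 1
    return status
```

A build step could call `sys.exit(check_files(sys.argv[1:]))` so that any violation produces a non-zero exit status and stops the build, rather than just printing a warning that is easy to ignore.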
When the build fails, it should "fail loudly" with a "monitor that
starts flashing red" and email to everyone on the team. When that happens,
you should "respond immediately" to fix the problem. In his team, they
have a full-size Justin Bieber cutout that gets placed facing the team
member who broke the build. They found that "100% of software
engineers don't like Justin Bieber", and will work quickly to fix the build
problem.
Project management
In his transition to becoming a manager, Rose has had to learn to worry
about different things than he did before. He pointed to the "Make
the Invisible More Visible" essay from the 97
Things Every Programmer Should Know project as a source of
inspiration. For hardware, it's obvious what its integration state is
because you can look at it and see, but that's not true for software.
There is "no progress bar for software development". That has led his team
to experiment with different methods to try to do project planning.
Various "off the shelf" project management methodologies and ways to
estimate how long projects will take do not work for his team. It is
important to set something up that works for your people and
set of tasks, Rose said. They have tried various techniques for estimating
time requirements, from Wideband Delphi to evidence-based
scheduling, and found that no technique by itself works well for the
group. Since they are software engineers, "we wrote our own tool", he said
with a chuckle, that is a hybrid of several different techniques. There is
"no silver bullet" for scheduling, and it is "unlikely you could pick up
our method and apply it" to your domain. One hard lesson he learned is
that once you have some success using a particular scheduling method, you
"need to do a sales job" to show the engineers that it worked. That will
make it work even better the next time because there will be more buy-in.
Some technical details
Linux is used for everything at SpaceX. The Falcon, Dragon, and
Grasshopper vehicles use it for flight control, the ground stations run
Linux, as do the developers' desktops. SpaceX is "Linux, Linux, Linux", he
said.
Rose went on to briefly describe the Dragon flight system, though he
said he couldn't give
too many details. It is a fault-tolerant system in order to satisfy NASA
requirements for when it gets close to the ISS. There are rules about how
many faults a craft needs to be able to tolerate and still be allowed to
approach the station. It uses triply redundant computers to achieve the
required level of fault tolerance. The Byzantine
generals' algorithm is used to handle situations where the computers do
not agree. That situation could come about because of a radiation event
changing memory or register values, for example.
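To give a feel for why three computers help, here is a sketch of plain majority voting over redundant outputs. This is deliberately much simpler than a Byzantine agreement protocol, which also has to cope with a faulty unit telling different peers different things, and it is not a description of Dragon's implementation. The function name and return convention are assumptions.

```python
from collections import Counter


def majority_vote(outputs):
    """Pick the value reported by a strict majority of redundant computers.

    Returns (value, dissenting_indices). If no value has a strict majority,
    returns (None, all_indices), leaving the caller to decide what to do.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three computers, a single flipped bit in one of them shows up as one dissenter, for example `majority_vote([42, 42, 46])` yields `(42, [2])`, and the other two still agree on the result.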
For navigation, Dragon uses positional information that it receives from
the ISS, along with GPS data it calculates itself. As it approaches the
station, it uses imagery of the ISS and the relative size of the
station to compute the distance to the station. Because it might well be
in darkness, Dragon
uses thermal imaging as the station is slightly warmer than the background.
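Rose did not describe the calculation, but the basic relationship between apparent size and distance can be sketched with a simple pinhole-camera model: an object of known size that appears smaller in the image is proportionally farther away. The camera model, the parameter names, and the numbers used in the example are all assumptions made for illustration.

```python
def distance_from_apparent_size(real_size_m, apparent_size_px, focal_length_px):
    """Estimate range to an object of known size from its size in an image.

    Uses the pinhole-camera relation:
        distance = focal_length_px * real_size_m / apparent_size_px
    All three inputs are hypothetical calibration or measurement values.
    """
    if apparent_size_px <= 0:
        raise ValueError("object must be visible in the image")
    return focal_length_px * real_size_m / apparent_size_px


# Made-up numbers: a 100 m wide target that spans 400 pixels, seen through
# a camera with a focal length of 2000 pixels, would be 500 m away.
estimate_m = distance_from_apparent_size(100.0, 400.0, 2000.0)
```

Under this model, halving the apparent size doubles the estimated distance, which is why the measurement gets more precise as the spacecraft closes in and the station fills more of the image.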
His team does not use "off-the-shelf distro kernels". Instead, they spend
a lot of time evaluating kernels for their needs. One of the areas they
focus on is scheduler performance. They do not have hard realtime
requirements, but do care about wakeup latencies, he said. There are tests
they use to quantify the performance of the scheduler under different
scenarios, such as while stressing the network. Once a kernel is chosen,
"we try not to change it".
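The specific tests were not described, but the idea of measuring wakeup latency can be sketched as follows: ask to sleep for a fixed interval, then see how much later than requested the process actually woke up. This Python version includes interpreter overhead and is only meant to show the shape of the measurement; the function names and defaults are assumptions, and real evaluations would use a purpose-built tool.

```python
import time


def measure_wakeup_latency(interval_s=0.001, samples=200):
    """Sleep repeatedly and return how far each wakeup overshot the request.

    Each entry is (actual elapsed time - requested interval), in seconds.
    """
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        overshoots.append(elapsed - interval_s)
    return overshoots


def summarize(overshoots):
    """Return the min, median, and max overshoot; the max is often what matters."""
    ordered = sorted(overshoots)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }
```

Running the measurement once on an idle machine and again while something else stresses the network or CPU gives two summaries to compare, which is the kind of scenario-based comparison the paragraph above describes.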
The development tools they use are "embarrassingly non-sophisticated", Rose
said. They use GCC and gdb, while "everyone does their own thing" in terms
of editors and development environments. Development has always targeted
Linux, but it was not always the desktop used by developers, so they
have also developed a lot of their own POSIX-based tools. The main reason for
switching to Linux desktops was the development tools that "you
get out of the box", such as ftrace, gdb (which can be directly attached to
debug your target platform), netfilter, and iptables.
Rose provided an interesting view inside the software development for a
large and complex embedded Linux environment. In addition, his talk was
more open than a previous SpaceX talk we
covered, which was nice to see. Many of the techniques used by the
company
will sound familiar to most programmers, which makes it clear that the
process of creating
code for spacecraft is not exactly rocket science.
[ I would like to thank the Linux Foundation for travel assistance to attend ELC. ]