ELC: SpaceX lessons learned

By Jake Edge
March 6, 2013

On day two of the 2013 Embedded Linux Conference, Robert Rose of SpaceX spoke about the "Lessons Learned Developing Software for Space Vehicles". In his talk, he discussed how SpaceX develops its Linux-based software for a wide variety of tasks needed to put spacecraft into orbit—and eventually beyond. Linux runs everywhere at SpaceX, he said, on everything from desktops to spacecraft.

Rose is the lead for the avionics flight software team at SpaceX. He is a former video game programmer, and said that some lessons from that work were valuable in his current job. He got his start with Linux in 1994 with Slackware.

SpaceX as a company strongly believes in making humans into a multi-planetary species. A Mars colony is the goal, but in order to get there, you need rockets and spaceships, he said. It is currently expensive to launch space vehicles, so there is a need to "drive costs down" in order to reach the goal.

The company follows a philosophy of reusability, which helps in driving costs down, Rose said. That has already been tried to some extent with the space shuttle program, but SpaceX takes it further. Not only are hardware components reused between different spacecraft, but the software is shared as well. The company builds its rockets from the ground up at its facility, rather than contracting out various pieces. That allows for closer and more frequent hardware-software integration.

One thing that Rose found hard to get used to early on in his time at SpaceX is the company's focus on the "end goal". When decisions are being made, people will often ask: "is this going to work for the Mars mission?" Mars doesn't always win, but that concern is always examined, he said.

Challenges

Some of the challenges faced by the company are extreme, because the safety of people and property is involved. The spacecraft are dangerous vehicles that could cause serious damage if their fuel were to explode, for example. There is "no undo", no second chance to get things right; once the rocket launches "it's just gonna go". Another problem, one that he didn't encounter until he started working in the industry, is the effect of radiation in space, which can "randomly flip bits"; the system design needs to take that into account.

There are some less extreme challenges that SpaceX shares with other industries, Rose said. Dealing with proprietary hardware and a target platform that is not the same as the development platform are challenges shared with embedded Linux, for example. In addition, the SpaceX team has had to face the common problem that "no one outside of software understands software".

SpaceX started with the Falcon rocket and eventually transitioned the avionics code to the Dragon spacecraft. The obvious advantage of sharing code is that bugs fixed on one platform are automatically fixed on the other. But there are differences in the software requirements for the launch vehicles and spacecraft, largely having to do with the different reaction times available. As long as a spacecraft is not within 250 meters of the International Space Station (ISS), it can take some time to react to any problem. For a rocket, that luxury is not available; it must react in short order.

False positives are one problem that needs to be taken into account. Rose mentioned the heat shield indicator on the Mercury-Atlas 6 mission (the first US manned orbital flight), which showed that the heat shield had separated. NASA tried to figure out a way to do a re-entry with no heat shield, but "eventually just went for it". It turned out to be a false positive. Once again, the amount of time available to react is different for launch vehicles and spacecraft.

Gathering data

Quoting Fred Brooks (of The Mythical Man-Month fame), Rose said "software is invisible". To make software more visible, you need to know what it is doing, he said, which means creating "metrics on everything you can think of". With a rocket, you can't just connect via JTAG and "fire up gdb", so the software needs to keep track of what it is doing. Those metrics should cover areas like performance, network utilization, CPU load, and so on.

The metrics gathered, whether from testing or real-world use, should be stored as it is "incredibly valuable" to be able to go back through them, he said. For his systems, telemetry data is stored with the program metrics, as is the version of all of the code running so that everything can be reproduced if needed.

SpaceX has programs to parse the metrics data and raise an alarm when "something goes bad". It is important to automate that, Rose said, because forcing a human to do it "would suck". The same programs run on the data whether it is generated from a developer's test, from a run on the spacecraft, or from a mission. Any failures should be seen as an opportunity to add new metrics. It takes a while to "get into the rhythm" of doing so, but it is "very useful". He likes to "geek out on error reporting", using tools like libSegFault and ftrace.
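
Rose did not show any of SpaceX's tooling, so the following is only a minimal sketch of the idea: one checker that scans recorded metrics and reports anything out of range, regardless of where the data came from. The metric names, the limits, and the `find_alarms()` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Allowed range for one metric (hypothetical names and values)."""
    low: float
    high: float


# Illustrative limits only; none of these numbers come from the talk.
LIMITS = {
    "cpu_load_pct": Limit(0.0, 80.0),
    "net_util_pct": Limit(0.0, 60.0),
}


def find_alarms(samples, limits=LIMITS):
    """Return (sample_index, metric, value) for every out-of-range value.

    `samples` is a list of dicts mapping metric names to numbers. Metrics
    without a configured limit are ignored rather than treated as errors.
    """
    alarms = []
    for index, sample in enumerate(samples):
        for name, value in sample.items():
            limit = limits.get(name)
            if limit is not None and not (limit.low <= value <= limit.high):
                alarms.append((index, name, value))
    return alarms
```

Because the function only sees a list of samples, the same check can be pointed at a developer's test log or at stored mission telemetry; a failure that slipped through would be handled by adding a metric and a limit rather than by changing the checker.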

Automation is important, and continuous integration is "very valuable", Rose said. He suggested building for every platform all of the time, even for "things you don't use any more". SpaceX does that and has found interesting problems when building unused code. Unit tests are run from the continuous integration system any time the code changes. "Everyone here has 100% unit test coverage", he joked, but running whatever tests are available, and creating new ones, is useful. When he worked on video games, they had a test that would just "warp" the character to random locations in a level and have it look in the four directions, which regularly found problems.
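
The "warp" test is easy to picture in code. The sketch below is not the game studio's test; it assumes a hypothetical `look(x, y, direction)` hook into the program under test and uses a toy stand-in with a deliberate bug to show what a finding looks like.

```python
import random

DIRECTIONS = ("north", "east", "south", "west")


def warp_test(look, width, height, trials=1000, seed=0):
    """Warp to random positions and look in all four directions.

    `look(x, y, direction)` is a hypothetical hook into the program under
    test. Any exception it raises is recorded along with the position and
    direction needed to reproduce it.
    """
    rng = random.Random(seed)  # seeded so that a failing run can be repeated
    failures = []
    for _ in range(trials):
        x, y = rng.randrange(width), rng.randrange(height)
        for direction in DIRECTIONS:
            try:
                look(x, y, direction)
            except Exception as exc:
                failures.append((x, y, direction, repr(exc)))
    return failures


def toy_look(x, y, direction):
    """Stand-in renderer with a deliberate bug at the western edge."""
    if x == 0 and direction == "west":
        raise IndexError("looked past the edge of the level")
```

Calling `warp_test(toy_look, 4, 4)` returns only failures at `x == 0` looking west. The fixed seed is a design choice: a random test that cannot be replayed is much less useful in a continuous integration system.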

"Automate process processes", he said. Things like coding standards, static analysis, spaces vs. tabs, or detecting the use of Emacs should be done automatically. SpaceX has a complicated process where changes cannot be made without tickets, code review, signoffs, and so forth, but all of that is checked automatically. If static analysis is part of the workflow, make it such that the code will not build unless it passes that analysis step.

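A build that refuses to proceed unless every check passes can be sketched in a few lines. This is an assumption-laden illustration, not SpaceX's process: `run_gates()` and `main()` are hypothetical helpers, and each gate is simply a command line (an analyzer, a style checker, a ticket check) whose nonzero exit status means failure.

```python
import subprocess
import sys


def run_gates(gates):
    """Run each (name, argv) gate in order; return the names that failed.

    A gate fails when its command exits with a nonzero status. The commands
    themselves are supplied by the caller; none are assumed here.
    """
    failed = []
    for name, argv in gates:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
            # Report loudly, including whatever the tool printed.
            print(f"GATE FAILED: {name}\n{result.stdout}{result.stderr}",
                  file=sys.stderr)
    return failed


def main(gates):
    """Return a process exit status: 0 only if every gate passed."""
    return 1 if run_gates(gates) else 0
```

A build script would pass the value of `main(...)` to `sys.exit()` before compiling anything, so a failed analysis step stops the build in the same way a compiler error would.
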
When the build fails, it should "fail loudly" with a "monitor that starts flashing red" and email to everyone on the team. When that happens, you should "respond immediately" to fix the problem. In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build. They found that "100% of software engineers don't like Justin Bieber", and will work quickly to fix the build problem.

Project management

In his transition to becoming a manager, Rose has had to learn to worry about different things than he did before. He pointed to the "Make the Invisible More Visible" essay from the 97 Things Every Programmer Should Know project as a source of inspiration. For hardware, it's obvious what its integration state is because you can look at it and see, but that's not true for software. There is "no progress bar for software development". That has led his team to experiment with different methods to try to do project planning.

Various "off the shelf" project management methodologies and ways to estimate how long projects will take do not work for his team. It is important to set something up that works for your people and set of tasks, Rose said. They have tried various techniques for estimating time requirements, from Wideband Delphi to evidence-based scheduling, and found that no technique by itself works well for the group. Since they are software engineers, "we wrote our own tool", he said with a chuckle, that is a hybrid of several different techniques. There is "no silver bullet" for scheduling, and it is "unlikely you could pick up our method and apply it" to your domain. One hard lesson he learned is that once you have some success using a particular scheduling method, you "need to do a sales job" to show the engineers that it worked. That will make it work even better the next time because there will be more buy-in.

Some technical details

Linux is used for everything at SpaceX. The Falcon, Dragon, and Grasshopper vehicles use it for flight control; the ground stations run Linux, as do the developers' desktops. SpaceX is "Linux, Linux, Linux", he said.

Rose went on to briefly describe the Dragon flight system, though he said he couldn't give too many details. It is a fault-tolerant system in order to satisfy NASA requirements for when it gets close to the ISS. There are rules about how many faults a craft needs to be able to tolerate and still be allowed to approach the station. It uses triply redundant computers to achieve the required level of fault tolerance. The Byzantine generals' algorithm is used to handle situations where the computers do not agree. That situation could come about because of a radiation event changing memory or register values, for example.
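
Rose gave no details of the agreement logic, and a full Byzantine agreement protocol involves rounds of message exchange that are beyond a short example. As a much simpler stand-in, the sketch below shows plain majority voting across redundant outputs, which is enough to mask a single corrupted value out of three. The `majority_vote()` function and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, dissenting_indices) when a strict majority agrees.

    This is simple majority voting, not a Byzantine agreement protocol.
    It raises RuntimeError when no value is held by more than half of
    the redundant computers.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among redundant computers")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With outputs `[0x2A, 0x2A, 0x6A]`, where the third value differs from the others by a single flipped bit, the vote returns `0x2A` and identifies computer 2 as the dissenter. Three mutually different outputs raise an error rather than guessing.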

For navigation, Dragon uses positional information that it receives from the ISS, along with GPS data it calculates itself. As it approaches the station, it uses imagery of the ISS and the relative size of the station to compute the distance to the station. Because it might well be in darkness, Dragon uses thermal imaging as the station is slightly warmer than the background.
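
The talk did not say how the distance is computed from the imagery. Under a simple pinhole-camera assumption, an object of known size appears smaller in proportion to its distance, which gives a one-line estimate. The function name, the 100 m object size, and the camera parameters below are illustrative assumptions, not values from the talk.

```python
def range_from_apparent_size(real_size_m, apparent_size_px, focal_length_px):
    """Estimate distance to an object of known size with a pinhole model.

    distance = real_size * focal_length / apparent_size, with the focal
    length and the apparent size both expressed in pixels.
    """
    if apparent_size_px <= 0:
        raise ValueError("object must span at least a fraction of a pixel")
    return real_size_m * focal_length_px / apparent_size_px


# Illustrative numbers only: a 100 m wide target, a 2000 px focal length.
far = range_from_apparent_size(100.0, 400.0, 2000.0)   # spans 400 px
near = range_from_apparent_size(100.0, 800.0, 2000.0)  # spans 800 px
```

Doubling the apparent size halves the estimated range, which is the relationship the paragraph above relies on; a real system would also have to account for viewing angle and lens distortion.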

His team does not use "off-the-shelf distro kernels". Instead, they spend a lot of time evaluating kernels for their needs. One of the areas they focus on is scheduler performance. They do not have hard realtime requirements, but do care about wakeup latencies, he said. There are tests they use to quantify the performance of the scheduler under different scenarios, such as while stressing the network. Once a kernel is chosen, "we try not to change it".
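
SpaceX's scheduler tests were not shown. The general technique for quantifying wakeup latency, used by tools such as cyclictest, is to sleep until a deadline and record how late the wakeup actually was. The sketch below does that in user space; the function names and defaults are hypothetical, and interpreter overhead means the numbers are only indicative.

```python
import time


def measure_wakeup_latency(interval_s=0.001, samples=200):
    """Return how late each periodic wakeup was, in microseconds."""
    latencies_us = []
    next_wake = time.monotonic() + interval_s
    for _ in range(samples):
        delay = next_wake - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        now = time.monotonic()
        # Difference between when we asked to run and when we did run.
        latencies_us.append((now - next_wake) * 1e6)
        next_wake = now + interval_s
    return latencies_us


def summarize(latencies_us):
    """Reduce a list of latencies to min, average, and max."""
    return {
        "min": min(latencies_us),
        "avg": sum(latencies_us) / len(latencies_us),
        "max": max(latencies_us),
    }
```

Running `summarize(measure_wakeup_latency())` once on an idle machine and once while a network stress test is active gives the kind of comparison described above; the maximum is usually the more interesting number than the average.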

The development tools they use are "embarrassingly non-sophisticated", Rose said. They use GCC and gdb, while "everyone does their own thing" in terms of editors and development environments. Development has always targeted Linux, but it was not always the desktop used by developers, so they have also developed a lot of their own POSIX-based tools. The main reason for switching to Linux desktops was the set of development tools that "you get out of the box", such as ftrace, gdb (which can be directly attached to debug your target platform), netfilter, and iptables.

Rose provided an interesting view inside the software development for a large and complex embedded Linux environment. In addition, his talk was more open than a previous SpaceX talk we covered, which was nice to see. Many of the techniques used by the company will sound familiar to most programmers, which makes it clear that the process of creating code for spacecraft is not exactly rocket science.

[ I would like to thank the Linux Foundation for travel assistance to attend ELC. ]


Posted Mar 8, 2013 0:44 UTC (Fri) by giraffedata (guest, #1954)

"In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build."

I've been part of projects in the past where people were paralyzed by fear of breaking the build - where they spent unjustifiable amounts of time developing code just to avoid Justin Bieber.

A better approach is to have the code control system automatically detect the build breakage and back out the change that caused it, send a polite notice to the person responsible for the change, and let work proceed.

It's also helpful to pay some attention to what makes the build so fragile.

Posted Mar 24, 2013 18:19 UTC (Sun) by jedbrown (subscriber, #49919)

Tweaking workflow can also help make new features much less stressful. For example, if you use topic branches with a 'master' and 'next', most merges with a chance of breaking are to 'next' (topics only go to 'master' after having "cooked" in 'next' for a while). Since no other developers base their work on 'next' (it's really just used for integrated testing), a broken 'next' is far less disruptive than a broken 'master'. Additionally, reverting a merge on 'next' does not imply messy history because 'next' is rerolled after a release.

Posted Mar 24, 2013 23:29 UTC (Sun) by cabo (guest, #90020)

"detecting the use of Emacs" -- WTF?

Posted Mar 26, 2013 0:47 UTC (Tue) by nix (subscriber, #2304)

It's dry wit. Recalibrate your humour detectors. :)

Posted Mar 27, 2013 2:03 UTC (Wed) by sitaram (guest, #5959)

oh no! I was just about to ask him how he did that and can I also have a way to detect anything non-vim!

;-)