
The seven deadly sins of software deployment

August 8, 2013

This article was contributed by Josh Berkus

Through me pass into the site of downtime,
Through me pass into eternal overtime
Through me pass and moan ye in fear
All updates abandon, ye who enter here.

A decade ago, software deployments were something you did fairly infrequently; at most monthly, more commonly quarterly, or even annually. As such, pushing new and updated software was not something developers, operations (ops) staff, or database administrators (DBAs) got much practice with. Generally, a deployment was a major downtime event, requiring the kind of planning and personnel NASA takes to land a robot on Mars ... and with about as many missed attempts.

Not anymore. Now we deploy software weekly, daily, even continuously. And that means that a software push needs to become a non-event, notable only for the exceptional disaster. This means that everyone on the development staff needs to become accustomed to and familiar with the deployment drill and their part in it. However, many developers and ops staff — including, on occasion, me — have been slow to make the adjustment from one way of deployment to another.

That's why I presented "The Seven Deadly Sins of Software Deployment [YouTube]" at OSCON Ignite on July 22. Each of the "sins" below is a chronic bad habit I've seen in practice, which turns what should be a routine exercise into a periodic catastrophe. While a couple of the sins aren't an exact match to their medieval counterparts, they're still a good checklist for "am I doing this wrong?".

Sloth

Why do you need deployment scripts?
That's too much work to get done.
I'll just run the steps by hand,
I know I won't forget one.

And the same for change docs;
wherefore do you task me.
For info on how each step works,
when you need it you just ask me.

Scripting and documenting every step of a software deployment process are, let's face it, a lot of work. It's extremely tempting to simply "improvise" it, or just go from a small set of notes on a desktop sticky. This works fine — until it doesn't.

Many people find out the hard way that nobody can remember a 13-step process in their head. Nor can they remember whether it's critical to the deployment that step four succeed, or whether step nine is supposed to return anything on success. If your code push needs to happen at 2:00 AM in order to avoid customer traffic, it can be hard to remember even a three-step procedure.

There is no more common time for your home internet to fail, the VPN server to lose your key, or your pet to need an emergency veterinary visit than ten minutes before a nighttime software update. If the steps for the next deployment are well-scripted, well-documented, and checked into a common repository, one of your coworkers can just take it and run it. If not, well, you'll be up late two nights in a row after a very uncomfortable staff meeting.

Requiring full scripting and documentation has another benefit; it makes developers and staff think more about what they're doing during the deployment than they would otherwise. Has this been tested? Do we know how long the database update actually takes? Should the ActiveRecord update come before or after we patch Apache?
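
To make "well-scripted" concrete, here is a minimal sketch of a deployment runner in Python. The steps and commands are placeholders rather than a real procedure; the point is that every step is written down, ordered, logged, and fails loudly instead of living in someone's memory:

    #!/usr/bin/env python3
    """Run the deployment steps in order, logging each one.

    The steps below are placeholders; the real list lives in
    version control next to the change documentation.
    """
    import subprocess
    import sys

    # Each step is (description, command); order matters.
    STEPS = [
        ("put app in maintenance mode", ["touch", "/tmp/maintenance.flag"]),
        ("run database migrations",     ["echo", "migrate"]),
        ("deploy new application code", ["echo", "rsync new release"]),
        ("restart application servers", ["echo", "restart"]),
        ("leave maintenance mode",      ["rm", "-f", "/tmp/maintenance.flag"]),
    ]

    for number, (description, command) in enumerate(STEPS, start=1):
        print("step %d: %s" % (number, description))
        if subprocess.run(command).returncode != 0:
            # Fail loudly and immediately; a half-applied deployment
            # is worse than a stopped one.
            print("step %d FAILED; aborting" % number)
            sys.exit(1)
    print("deployment complete")

Checked into a common repository, a script like this is something a coworker can pick up and run when your home internet dies ten minutes before the push.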

Greed

Buy cheap staging servers, no one will know:
they're not production, they can be slow.
They need not RAM, nor disks nor updates.
Ignore you QA; those greedy ingrates.

There are a surprising number of "agile" software shops out there that either lack staging servers entirely, or that use retired production servers from two or three generations back. Sometimes these staging servers have known, recurring hardware issues. Other times they are so old, or so unmaintained, that they can't run the same OS version and libraries that are run in production.

In cases where "staging" means "developer laptops", there is no way to check for performance issues or for how long a change will take. Modifying a database column on an 8 MB test database is a fundamentally different proposition from doing it on the 4 TB production database. Changes which cause new blocking actions between threads or processes also tend not to show up in developer tests.

Even when issues do show up during testing, nobody can tell for certain if the issues are caused by the inadequate staging setup or by new bugs. Eventually, QA staff start to habitually ignore certain kinds of errors, especially performance problems, which makes doing QA at all an exercise of dubious utility. Why bother to run response time tests if you're going to ignore the results because the staging database is known to be 20 times slower than production?

The ideal staging system is, of course, a full replica of your production setup. This isn't necessarily feasible for companies whose production includes dozens or hundreds of servers (or devices), but a scaled-down staging environment should be scaled down in an intelligent way that keeps performance at a known ratio to production. And definitely keep those staging servers running the exact same versions of your platform that production is running.

Yes, having a good staging setup is expensive; you're looking at spending at least a quarter of what you spent on production, and possibly as much. On the other hand, how expensive is unexpected downtime?
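
If staging can't match production hardware, one way to keep performance tests meaningful rather than habitually ignored is to measure the staging-to-production slowdown and scale test thresholds by it. A sketch, with made-up numbers; the measured ratio and the service-level target are assumptions you would substitute:

    # Keep staging at a *known* ratio to production, then scale
    # the pass/fail line instead of ignoring slow-staging results.
    STAGING_SLOWDOWN = 2.0        # staging measured at ~2x slower
    PRODUCTION_LIMIT_MS = 250.0   # response-time target in production

    def check_response_time(measured_ms):
        """Fail a staging measurement that exceeds the scaled limit."""
        staging_limit = PRODUCTION_LIMIT_MS * STAGING_SLOWDOWN
        if measured_ms > staging_limit:
            raise AssertionError(
                "staging response %.0f ms exceeds scaled limit %.0f ms"
                % (measured_ms, staging_limit))

    check_response_time(430.0)    # passes: under the 500 ms scaled limit

Once the ratio is known and enforced, a failed performance test on staging means something again, instead of being noise that QA has learned to tune out.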

Gluttony

Install it! Update it! Do it ASAP!
I'll have Kernel upgrades,
a new shared lib or three,
a fat Python update
and four new applications!
And then for dessert:
Sixteen DB migrations.

If you work at the kind of organization where deployments happen relatively infrequently, or at least scheduled downtimes are once-in-a-blue-moon, there is an enormous temptation to "pile on" updates which have been waiting for weeks or months into one enormous deployment. The logic behind this usually is, "as long as the service is down for version 10.5, let's apply those kernel patches." This is inevitably a mistake.

Each change you add to a particular deployment increases the chances that the deployment will fail somehow, both because each change carries its own risk of failure and because layered application and system changes can mess each other up (for example, a Python update can cause an update to your Django application to fail due to API changes). Additional changes also make the deployment procedure itself more complicated, increasing the chances of an administrator or scripting error, and they make it harder and more time-consuming to test all of the changes, both in isolation and together. To make this into a rule:

The odds of deployment failure approach 100% as the number of distinct change sets approaches seven.

Obviously, the count of seven is somewhat dependent on your infrastructure, the nature of your application, and your testing setup. However, even if you have an extremely well-trained crew and an unmatched staging platform, you're really not going to be able to tolerate many more distinct changes to your production system before making failure all but certain.
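
To see why the odds climb so fast, assume, purely for illustration, that each independent change set has a modest 15% chance of causing trouble; the chance that at least one of them fails compounds quickly:

    # Illustration only: if each change set fails independently with
    # probability p, the chance that at least one fails is
    # 1 - (1 - p) ** n.
    p = 0.15
    for n in (1, 3, 5, 7, 9):
        print("%d changes -> %3.0f%% chance something fails"
              % (n, 100 * (1 - (1 - p) ** n)))
    # Seven changes already gives roughly a 68% chance; at p = 0.3
    # (riskier changes), seven changes is about 92%.

And that simple model is optimistic, since it ignores the interactions between changes described above.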

Worse, if you have many separate "things" in your deployment, you've also made rollback longer and more difficult — and more likely to fail. This means, potentially, a serious catch-22, where you can't proceed because deployment is failing, and you can't roll back because rollback is failing. That's the start of a really long night.

The solution to this is to make deployments as small and as frequent as possible. The ideal change is only one item. While that goal is often unachievable, doing three separate deployments which change three things each is actually much easier than trying to change nine things in one. If the size of your update list is becoming unmanageable, you should think in terms of doing more deployments instead of larger ones.

Pride

Need I no tests, nor verification.
Behold my code! Kneel in adulation.
Rollback scripts are meant for lesser men;
my deployments perfect, as ever, again.

Possibly the most common critical deployment failure is when developers and administrators don't create a rollback procedure at all, let alone rollback scripts. A variety of excuses are given for this, including: "I don't have time", "it's such a small change", or "all tests passed and it looks good on staging". Writing rollback procedures and scripts is also a bald admission that your code might be faulty or that you might not have thought of everything, which is hard for anyone to admit to themselves.

Software deployments fail for all sorts of random reasons, up to and including sunspots and cosmic rays. One cannot plan for the unanticipated, by definition. So you should be ready for it to fail; you should plan for it to fail. Because when you're ready for something to fail, most of the time, it succeeds. Besides, the alternative is improvising a solution or calling an emergency staff meeting at midnight.

The rollback plan doesn't need to be complicated or comprehensive. If the deployment is simple, the rollback may be as simple as a numbered list of steps on a shared wiki page. There are two stages to planning a proper rollback:

  1. write a rollback procedure and/or scripts
  2. test that the rollback succeeds on staging

Many people forget to test their rollback procedure the way they test the original deployment. In fact, it's more important to test the rollback, because if it fails, you're out of other options.
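
As a sketch of how simple the scripted form can be, here is a rollback runner that mirrors the deployment script from the Sloth section. The commands are placeholders; the real list belongs in the same repository as the deployment script, and gets the same rehearsal on staging:

    # Placeholder rollback steps, versioned alongside the deployment
    # script and rehearsed on staging before every push.
    import subprocess
    import sys

    ROLLBACK_STEPS = [
        ("put app in maintenance mode", ["touch", "/tmp/maintenance.flag"]),
        ("restore previous release",    ["echo", "switch back to old release"]),
        ("reverse database migrations", ["echo", "migrate down"]),
        ("restart application servers", ["echo", "restart"]),
        ("leave maintenance mode",      ["rm", "-f", "/tmp/maintenance.flag"]),
    ]

    for number, (description, command) in enumerate(ROLLBACK_STEPS, start=1):
        print("rollback step %d: %s" % (number, description))
        if subprocess.run(command).returncode != 0:
            # A failing rollback is the worst case; stop and escalate.
            print("rollback step %d FAILED; escalate now" % number)
            sys.exit(1)
    print("rollback complete")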

Lust

On production servers,
These wretches had deployed
all of the most updated
platforms and tools they enjoyed:
new releases, alpha versions,
compiled from source.
No packages, no documentation,
and untested, of course.

The essence of successful software deployments is repeatability. When you can run the exact same steps several times in a row on both development and staging systems, you're in good shape for the actual deployment, and if it fails, you can roll back and try again. The cutting edge is the opposite of repeatability. If your deployment procedure includes "check out latest commit from git HEAD for library_dependency", then something has already gone wrong, and the chances of a successful deployment are very, very low.

This is why system administrators prefer known, mainstream packages and are correct to do so, even though this often leads to battles with developers. "But I need feature new_new_xyz, which is only in the current beta!" is a whine which often precipitates a tumultuous staff meeting. The developer only needs to make their stack work once (on their laptop) and can take several days to make it work; the system administrator or devops staff needs to make it work within minutes — several times.

In most cases, the developers don't really need the latest-source-version of the platform software being updated, and this can be settled in the staff meeting or scrum. If they really do need it, then the best answer is usually to create your own packages and documentation internally for the exact version to be deployed in production. This seems like a lot of extra work, but if your organization isn't able to put in the time for it, it's probably not as important to get that most recent version as people thought.
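
One lightweight way to hold the line on "the exact version to be deployed" is to make the deployment script refuse to run when installed packages don't match the versions that were tested on staging. A sketch using Python's standard importlib.metadata; the pinned manifest is a made-up example:

    # Refuse to deploy if the environment doesn't match what was
    # tested on staging. The package list here is hypothetical.
    from importlib.metadata import PackageNotFoundError, version

    PINNED = {
        "django": "1.5.2",
        "psycopg2": "2.5.1",
    }

    problems = []
    for package, wanted in PINNED.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append("%s: not installed" % package)
            continue
        if installed != wanted:
            problems.append("%s: %s installed, %s pinned"
                            % (package, installed, wanted))
    if problems:
        raise SystemExit("refusing to deploy:\n  " + "\n  ".join(problems))

The same check, run on staging first, is what makes "it worked on staging" actually say something about production.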

Envy

I cannot stand meetings,
I will not do chat
my scripts all are perfect,
you can count on that.
I care only to keep clean my name
if my teammates fail,
then they'll take the blame.

In every enterprise, some staff members got into computers so that they wouldn't have to deal with other people. These antisocial folks will be a constant trial to your team management, especially around deployment time. They want to do their piece of the large job without helping, or even interacting with, anyone else on the team.

For a notable failed deployment at one company, we needed a network administrator to change some network settings as the first step of the deployment. The administrator did so: he logged in, changed the settings, logged back out, and told nobody what he'd done. Then he went home. When it came time for step two, the devops staff could not reach the administrator, and nobody still online had the permissions to check whether the network settings had been changed. Accordingly, the whole deployment had to be rolled back and tried again the following week.

Many software deployment failures can be put down to poor communication between team members. The QA people don't know what things they're supposed to test. The DBA doesn't know to disable replication. The developers don't know that both features are being rolled out. Nobody knows how to check if things are working. This can cause a disastrously bad deployment even when every single step would have succeeded.

The answer to this is lots of communication. Make doubly sure that everyone knows what's going to happen during the deployment, who's going to do it, when they're going to do it, and how they'll know when they're done. Go over this in a meeting, follow it up with an email, and have everyone on chat or a VoIP conference during the deployment itself. You can work around your antisocial staff by giving them other ways to keep team members updated, such as wikis and status boards, but ultimately you need to impress on them how important coordination is. Or encourage them to switch to a job which doesn't require teamwork.

Wrath

When failed the deployment,
again and again and again they would try,
frantically debugging
each failing step on the fly.
They would not roll back,
but ground on all night,
"the very next time we run it
the upgrade will be all right."

I've seen (and been part of) teams which did everything else right. They scripted and documented, communicated and packaged, and had valid and working rollback scripts. Then, something unexpected went wrong in the middle of the deployment. The team had to make a decision whether to try to fix it, or to roll back; in the heat of the moment, they chose to press on. The next dawn found the devops staff still at work, trying to fix error after error, now so deep into ad-hoc patches that the rollback procedure wouldn't work if they tried to follow it. Generally, this is followed by several days of cleaning up the mess.

It's very easy to get sucked into the trap of "if I fix one more thing, I can go to bed and I don't have to do this over again tomorrow." As you get more and more into overtime, your ability to judge when you need to turn back gets worse and worse. Nobody can make a rational decision at two in the morning after a 15-hour day.

To fight this, Laura Thompson at Mozilla introduced the "three strikes" rule. This rule says: "If three or more things have gone wrong, roll back." While I was working with Mozilla, this saved us from bad decisions about fixing deployments on the fly at least twice; it was a clear rule which could be easily applied even by very tired staff. I recommend it.
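
The rule is mechanical enough to build straight into a deployment runner. Here is a sketch with simulated steps and a stubbed-out roll_back(); the only logic that matters is the strike counter, which takes the 2 AM judgment call away from tired humans:

    MAX_STRIKES = 3   # Mozilla's "three strikes" rule

    def roll_back():
        print("rolling back -- run the tested rollback script")

    def deploy(steps):
        strikes = 0
        for description, step in steps:
            while True:
                try:
                    step()
                    print("ok: %s" % description)
                    break
                except Exception as error:
                    strikes += 1
                    print("strike %d on '%s': %s"
                          % (strikes, description, error))
                    if strikes >= MAX_STRIKES:
                        roll_back()   # no heroics; try again tomorrow
                        return False
                    # Under three strikes: apply a fix and retry.
        return True

    class Flaky:
        """Simulated step that fails the first `failures` times."""
        def __init__(self, failures):
            self.remaining = failures
        def __call__(self):
            if self.remaining > 0:
                self.remaining -= 1
                raise RuntimeError("unexpected error")

    deploy([("update config", Flaky(2)),     # strikes one and two
            ("migrate database", Flaky(1)),  # strike three: roll back
            ("restart workers", Flaky(0))])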

Conclusion

To escape DevOps hell
avoid sin; keep to heart
these seven virtues
of an agile software art.

Just as the medieval seven deadly sins have seven virtues to counterbalance them, here are seven rules for successful software deployments:

  1. Diligence: write change scripts and documentation
  2. Benevolence: get a good staging environment
  3. Temperance: make small deployments
  4. Humility: write rollback procedures
  5. Purity: use stable platforms
  6. Compassion: communicate often
  7. Patience: know when to roll back

You can do daily, or even "continuous", deployments if you develop good practices and stick to them. These seven rules aren't the totality of what you need for rapid, reliable, and trouble-free updates and pushes, but following them will help you avoid some of the common pitfalls which turn routine deployments into hellish nights.

For more information, see the video of my "The Seven Deadly Sins of Software Deployment" talk, the slides [PDF], and verses. See also the slides [PDF] from Laura Thompson's excellent talk "Practicing Deployment", and Selena Deckelmann's related talk: Mistakes Were Made [YouTube].



The seven deadly sins of software deployment

Posted Aug 8, 2013 17:16 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Hey! We intentionally use weak servers for staging, but with real datasets. So if staging works fine then production should most probably be OK.

The seven deadly sins of software deployment

Posted Aug 8, 2013 17:57 UTC (Thu) by jberkus (guest, #55561)

That works if you have a practice which says "no matter how weak they are, the response time/performance/load tests MUST pass on staging". Where it fails is when QA starts ignoring performance test failures because they know staging is slow. I've seen some horrible performance bugs creep into production that way.

The seven deadly sins of software deployment

Posted Aug 8, 2013 22:38 UTC (Thu) by jameslivingston (guest, #57330)

Unfortunately there are quite a lot of problems which only appear with more powerful systems.

If production has more CPU cores, then it's much more likely to run into issues that only show up under highly-concurrent load.

If you're using Java or similar applications, GC tuning is highly dependent on the system. Doubling the memory in production could make certain problems disappear, or cause many more. The throughput collector's pause times increase super-linearly, and CMS has many settings that depend on the exact used-memory to available-memory ratio.

Latency of various components will be different, affecting all sorts of timing.

Some of the worst things I've seen:
* Production being split across two datacentres, where staging was only in one (unfortunately quite common)
* Running Solaris on SPARC in production but Linux on x86 in staging because it's cheaper.
* Running production on physical machines, but staging on virtual machines
* "Staging" being a desktop machine under someone's desk :(

The seven deadly sins of software deployment

Posted Aug 11, 2013 1:24 UTC (Sun) by pr1268 (guest, #24648)

Just when I thought I'd been cured of deployment nightmares, your last bullet just set me back a ways. ;-)

"Deployment" servers don't belong under someone's desk, because the overnight cleaning crew might unplug it when it makes funny noises. Spoken from personal experience1.

[1] We were deploying remotely from home at 11 PM, and while the cleaning crew was emptying wastebaskets, right when they got to our sysadmin's cubicle, the computer's disk drives started churning loudly (in an otherwise very quiet office), and perhaps the janitor thought it was possessed or something, so she unplugged it. Needless to say, our release had to be rescheduled. Fortunately the sysadmin later convinced the guy holding the purse strings to cough up the $$$ for a real deployment server.

The seven deadly sins of software deployment

Posted Aug 11, 2013 20:40 UTC (Sun) by jberkus (guest, #55561)

Hah! Don't be surprised if I re-use that anecdote in a future talk ;-)

The seven deadly sins of software deployment

Posted Aug 8, 2013 18:19 UTC (Thu) by halla (subscriber, #14185)

This miserable sinner has been through purgatory more than once... Awesome article, which deserves to become a classic :-)

The seven deadly sins of software deployment

Posted Aug 8, 2013 20:05 UTC (Thu) by jberkus (guest, #55561)

Thanks for the praise! I had fun doing the talk ;-)

The seven deadly sins of software deployment

Posted Aug 8, 2013 19:03 UTC (Thu) by kjp (guest, #39639)

If I may be so bold as to add a sub-bullet or even counter point to team work: If you have multiple teams, try as hard as you can to make deployments NOT require multiple teams to go 'at once' in 'all or nothing' fashion. There should be a clear dependency graph, and dependencies should be rolled out in a backward-compatible way and enabled (days) before another team cuts over to using them. Multi team coordinated deployments are the worst for coordination, rollback, and testing. Maybe that should be so obvious it goes without saying, but sadly we had to experience pain a few times before we wised up.

The seven deadly sins of software deployment

Posted Aug 8, 2013 19:53 UTC (Thu) by jberkus (guest, #55561)

Yes, definitely. I'd actually go further than that and say "as much as possible, don't have multi-team deployments at all". See if you can get each team's deployment to happen separately. If not, you're gonna have to do all of the work you just outlined, and I guarantee you at least one person won't actually disclose all of their info.

The seven deadly sins of software deployment

Posted Aug 8, 2013 20:25 UTC (Thu) by jengelh (guest, #33263)

>We needed a network administrator to change some network settings as the first step of the deployment. The administrator did this,[...] and telling nobody what he'd done.

But you just told him what to do! By Unix tradition, if the task has been carried out as per the request, and there were no deviations and no errors, no output means success. :)

The seven deadly sins of software deployment

Posted Aug 8, 2013 21:35 UTC (Thu) by ewen (subscriber, #4772)

In Unix tradition you do also get to see that the process was started, and that the process completed, which seems to be the two things missing in the Case of That Network Administrator (tm). (An email saying "Step 1: Done." probably would have been sufficient to say "process started, process finished".)

Ewen (a fan of the Unix tradition, but _much_ less of a fan of lack of communication)

PS: Excellent article.

The seven deadly sins of software deployment

Posted Aug 9, 2013 6:22 UTC (Fri) by iabervon (subscriber, #722)

Ah, but Unix tradition says that, after he exits, he should hang around as a zombie until you wait() and then give you an exit code before he disappears entirely.

The seven deadly sins of software deployment

Posted Aug 8, 2013 21:01 UTC (Thu) by dpquigl (guest, #52852)

I was in the audience when Josh did this keynote at OSCON and I have to say it was even better watching him recite it in person.

The seven deadly sins of software deployment

Posted Aug 9, 2013 6:25 UTC (Fri) by dune73 (guest, #17225)

Now that was a good read.

The seven deadly sins of software deployment

Posted Aug 9, 2013 13:54 UTC (Fri) by micka (subscriber, #38720)

While reading this, I knew I was missing something: most of the technical part made sense, but some of the form didn't. I actually had to search for the "medieval seven deadly sins" to discover the mythological background I was missing.

Now I think I understand a bit more. I had actually encountered the concept before (though in my native language, so I didn't make the connection until I checked how it translates), but had never taken the time to look it up. Now it's done.

The seven deadly sins of software deployment

Posted Aug 10, 2013 4:33 UTC (Sat) by maxiaojun (guest, #91482)

Watched "The Seven Deadly Sins of Software Deployment [YouTube]" video.

Don't understand why the presenter put two country names after "outsourced code".

The seven deadly sins of software deployment

Posted Aug 11, 2013 20:50 UTC (Sun) by jberkus (guest, #55561)

(a) because it rhymed.

(b) because in the current outsourced development market, China and the Philippines seem to be supplying the biggest chunk (but not all) of the lowest-end and most incompetent coding (not blaming the providers; they're also doing it really cheaply, and You Get What You Pay For). If you browse the DailyWTF a bit, you'll find lots of examples of "inventive" code from these kinds of shops.

Five years ago I would have said "India and the Ukraine" and had to find a different rhyme. However, code quality from Indian/Ukrainian outsourced companies has gotten a lot better (and more expensive). I expect China and the Philippines to follow the same trajectory, and in five years it'll be Zimbabwe or Trinidad or Albania or somewhere else supplying the plurality of low-cost, low-quality outsourced code.

The seven deadly sins of software deployment

Posted Aug 16, 2013 14:34 UTC (Fri) by professor.matic (subscriber, #82368)

Thoroughly enjoyed reading this article: many thanks for writing it up.

How long have you found it takes to get a team from 'sinners' to 'saints'?


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds