Through me pass into the site of downtime,
Through me pass into eternal overtime
Through me pass and moan ye in fear
All updates abandon, ye who enter here.
A decade ago, software deployments were something you did fairly
infrequently; at most monthly, more commonly quarterly, or even
annually. As such, pushing new and updated software was not something
developers, operations (ops) staff, or database administrators (DBAs) got much practice with. Generally, a deployment was a major downtime event, requiring the kind of planning and personnel NASA takes to land a robot on Mars ... and with about as many missed attempts.
Not anymore. Now we deploy software weekly, daily, even
continuously. And that means that a software push needs to become a
non-event, notable only for the exceptional disaster. This means that
everyone on the development staff needs to become accustomed to and familiar with the deployment drill and their part in it. However, many developers and ops staff — including, on occasion, me — have been slow to make the adjustment from one way of deployment to another.
That's why I presented "The Seven Deadly Sins of
Software Deployment [YouTube]" at OSCON Ignite on July 22. Each of the "sins" below is a chronic bad habit I've seen in practice, which turns what should be a routine exercise into a periodic catastrophe. While a couple of the sins aren't an exact match to their medieval counterparts, they're still a good check list for "am I doing this wrong?".
Why do you need deployment scripts?
That's too much work to get done.
I'll just run the steps by hand,
I know I won't forget one.
And the same for change docs;
wherefore do you task me.
For info on how each step works,
when you need it you just ask me.
Scripting and documenting every step of a software deployment process
are, let's face it, a lot of work. It's extremely tempting to simply
"improvise" it, or just go from a small set of notes on a desktop
sticky. This works fine — until it doesn't.
Many people find out the hard way that nobody can remember a 13-step process in their head. Nor can they remember whether or not it's critical to the deployment that step four succeed, or whether step nine is supposed to return anything on success or not. If your code push needs to happen at 2:00AM in order to avoid customer traffic, it can be hard to even remember a three-step procedure.
There is no more common time for your home internet to fail, the VPN server to lose your key, or your pet to need an emergency veterinary visit than ten minutes before a nighttime software update. If the steps for the next deployment are well-scripted, well-documented, and checked into a common repository, one of your coworkers can just take it and run it. If not, well, you'll be up late two nights in a row after a very uncomfortable staff meeting.
Requiring full scripting and documentation has another benefit; it makes developers and staff think more about what they're doing during the deployment than they would otherwise. Has this been tested? Do we know how long the database update actually takes? Should the ActiveRecord update come before or after we patch Apache?
Buy cheap staging servers, no one will know:
they're not production, they can be slow.
They need not RAM, nor disks nor updates.
Ignore you QA; those greedy ingrates.
There's a surprising number of "agile" software shops out there who either lack staging servers entirely, or who use the old former production servers from two or three generations ago. Sometimes these staging servers will have known, recurring hardware issues. Other times they will be so old, or so unmaintained, they can't run the same OS version and libraries which are run in production.
In cases where "staging" means "developer laptops", there is no way to check for performance issues or for how long a change will take. Modifying a database column on an 8MB test database is a fundamentally different proposition from doing it on the 4 terabyte production database. Changes which cause new blocking actions between threads or processes also tend not to show up in developer tests.
Even when issues do show up during testing, nobody can tell for certain
if the issues are caused by the inadequate staging setup or by new
bugs. Eventually, QA staff start to habitually ignore certain kinds of
errors, especially performance problems, which makes doing QA at all an
exercise of dubious utility. Why bother to run response time tests if
you're going to ignore the results because the staging database is known to
be 20 times slower than production?
The ideal staging system is, of course, a full replica of your production setup. This isn't necessarily feasible for companies whose production includes dozens or hundreds of servers (or devices), but a scaled-down staging environment should be scaled down in an intelligent way that keeps performance at a known ratio to production. And definitely keep those staging servers running the exact same versions of your platform that production is running.
Yes, having a good staging setup is expensive; you're looking at spending at least ¼ as much as you spent on production, maybe as much. On the other hand, how expensive is unexpected downtime?
Install it! Update it! Do it ASAP!
I'll have Kernel upgrades,
a new shared lib or three,
a fat Python update
and four new applications!
And then for dessert:
Sixteen DB migrations.
If you work at the kind of organization where deployments happen relatively infrequently, or at least scheduled downtimes are once-in-a-blue-moon, there is an enormous temptation to "pile on" updates which have been waiting for weeks or months into one enormous deployment. The logic behind this usually is, "as long as the service is down for version 10.5, let's apply those kernel patches." This is inevitably a mistake.
As you add additional changes to a particular deployment, each change increases the chances it will fail somehow, both because each change has a chance of failure, and because layered application and system changes can mess each other up (for example, a Python update can cause an update to your Django application to fail due to API changes). Additional changes also make the deployment procedure itself more complicated and thus increase the chances of an administrator or scripting error, and you make it harder and more time-consuming to test all of the changes both in isolation or together. To make this into a rule:
The odds of deployment failure approach 100% as the number of distinct change sets approaches seven.
Obviously, the count of seven is somewhat dependent on your
infrastructure, nature of the application, and testing setup. However, even
if you have an extremely well-trained crew and an unmatched staging
platform, you're really not going to be able to tolerate many more distinct
changes to your production system before making failure all but certain.
Worse, if you have many separate "things" in your deployment, you've also made rollback longer and more difficult — and more likely to fail. This means, potentially, a serious catch-22, where you can't proceed because deployment is failing, and you can't roll back because rollback is failing. That's the start of a really long night.
The solution to this is to make deployments as small and as frequent as possible. The ideal change is only one item. While that goal is often unachievable, doing three separate deployments which change three things each is actually much easier than trying to change nine things in one. If the size of your update list is becoming unmanageable, you should think in terms of doing more deployments instead of larger ones.
Need I no tests, nor verification.
Behold my code! Kneel in adulation.
Rollback scripts are meant for lesser men;
my deployments perfect, as ever, again.
Possibly the most common critical deployment failure is when developers and administrators don't create a rollback procedure at all, let alone rollback scripts. A variety of excuses are given for this, including: "I don't have time", "it's such a small change", or "all tests passed and it looks good on staging". Writing rollback procedures and scripts is also a bald admission that your code might be faulty or that you might not have thought of everything, which is hard for anyone to admit to themselves.
Software deployments fail for all sorts of random reasons, up to and including sunspots and cosmic rays. One cannot plan for the unanticipated, by definition. So you should be ready for it to fail; you should plan for it to fail. Because when you're ready for something to fail, most of the time, it succeeds. Besides, the alternative is improvising a solution or calling an emergency staff meeting at midnight.
You don't need to be complicated or comprehensive. If the deployment is
simple, the rollback may be as simple as a numbered list of steps on a
shared wiki page. There are two stages to planning to roll back properly:
- write a rollback procedure and/or scripts
- test that the rollback succeeds on staging
Many people forget to test their rollback procedure just like they test the original deployment. In fact, it's more important to test the rollback, because if it fails, you're out of other options.
On production servers,
These wretches had deployed
all of the most updated
platforms and tools they enjoyed:
new releases, alpha versions,
compiled from source.
No packages, no documentation,
and untested, of course.
The essence of successful software deployments is repeatability. When
you can run the exact same steps several times in a row on both development
and staging systems, you're in good shape for the actual deployment, and if it fails, you can roll back and try again. The cutting edge is the opposite of repeatability. If your deployment procedure includes "check out latest commit from git HEAD for library_dependency", then something has already gone wrong, and the chances of a successful deployment are very, very low.
This is why system administrators prefer known, mainstream packages and
are correct to do so, even though this often leads to battles with
developers. "But I need feature new_new_xyz, which is only in the
current beta!" is a whine which often precipitates a tumultuous staff
meeting. The developer only needs to make their stack work once (on their
laptop) and can take several days to make it work; the system administrator
or devops staff needs to make it work within minutes — several times.
In most cases, the developers don't really need the
latest-source-version of the platform software being updated, and this can be settled in the staff meeting or scrum. If they really do need it, then the best answer is usually to create your own packages and documentation internally for the exact version to be deployed in production. This seems like a lot of extra work, but if your organization isn't able to put in the time for it, it's probably not as important to get that most recent version as people thought.
I cannot stand meetings,
I will not do chat
my scripts all are perfect,
you can count on that.
I care only to keep clean my name
if my teammates fail,
then they'll take the blame.
In every enterprise, some staff members got into computers so that they wouldn't have to deal with other people. These antisocial folks will be a constant trial to your team management, especially around deployment time. They want to do their piece of the large job without helping, or even interacting with, anyone else on the team.
For a notable failed deployment at one company, we needed a network administrator to change some network settings as the first step of the deployment. The administrator did this, logging in, changing the settings, and logging back out, and telling nobody what he'd done. He then went home. When it came time for step two, the devops staff could not contact the administrator, and nobody still online had the permissions to check if the network settings were changed. Accordingly, the whole deployment had to be rolled back, and tried again the following week.
Many software deployment failures can be put down to poor communication between team members. The QA people don't know what things they're supposed to test. The DBA doesn't know to disable replication. The developers don't know that both features are being rolled out. Nobody knows how to check if things are working. This can cause a disastrously bad deployment even when every single step would have succeeded.
The answer to this is lots of communication. Overdetermine that everyone knows what's going to happen during the deployment, who's going to do it, when they're going to do it, and how they'll know when they're done. Go over this in a meeting, follow it up with an email, and have everyone on chat or VoIP conference during the deployment itself. You can work around your antisocial staff by giving them other ways to keep team members updated such as wikis and status boards, but ultimately you need to impress on them how important coordination is. Or encourage them to switch to a job which doesn't require teamwork.
When failed the deployment,
again and again and again they would try,
each failing step on the fly.
They would not roll back,
but ground on all night,
"the very next time we run it
the upgrade will be all right."
I've seen (and been part of) teams which did everything else right. They scripted and documented, communicated and packaged, and had valid and working rollback scripts. Then, something unexpected went wrong in the middle of the deployment. The team had to make a decision whether to try to fix it, or to roll back; in the heat of the moment, they chose to press on. The next dawn found the devops staff still at work, trying to fix error after error, now so deep into ad-hoc patches that the rollback procedure wouldn't work if they tried to follow it. Generally, this is followed by several days of cleaning up the mess.
It's very easy to get sucked into the trap of "if I fix one more thing, I can go to bed and I don't have to do this over again tomorrow." As you get more and more into overtime, your ability to judge when you need to turn back gets worse and worse. Nobody can make a rational decision at two in the morning after a 15-hour day.
To fight this, Laura Thompson at Mozilla introduced the "three strikes" rule. This rule says: "If three or more things have gone wrong, roll back." While I was working with Mozilla, this saved us from bad decisions about fixing deployments on the fly at least twice; it was a clear rule which could be easily applied even by very tired staff. I recommend it.
To escape DevOps hell
avoid sin; keep to heart
these seven virtues
of an agile software art.
Just as the medieval seven deadly sins have seven virtues to counterbalance them, here are seven rules for successful software deployments:
- Diligence: write change scripts and documentation
- Benevolence: get a good staging environment
- Temperance: make small deployments
- Humility: write rollback procedures
- Purity: use stable platforms
- Compassion: communicate often
- Patience: know when to roll back
You can do daily, or even "continuous", deployments if you
develop good practices and stick to them. While not the totality of what
you need to do for more rapid, reliable, and trouble-free updates and
pushes, following the seven rules of good practice will help you avoid some of the common pitfalls which turn routine deployments into hellish nights.
For more information, see the video of my "The Seven Deadly Sins of
Software Deployment" talk, the slides
and verses. See
also the slides
[PDF] from Laura Thompson's excellent talk "Practicing Deployment", and
Selena Deckelmann's related talk: Mistakes Were Made [YouTube].
to post comments)