May 6, 2011
This article was contributed by Josh Berkus
Let me tell you a secret. I don't fix databases. I fix applications.
Companies hire me to "fix the database" because they think it's the source
of their performance and downtime problems. This is very rarely the case.
Failure to scale is almost always the result of poor management decisions
— often a series of them. In fact, these anti-scaling decisions are
so often repeated that they have become anti-patterns.
I did a little talk about these anti-patterns at the last MySQL Conference and Expo. Go watch it and then come on back.
Now that you've seen the five-minute version (and hopefully laughed at it), you're ready for some less sarcastic detail which explains how to recognize these anti-patterns and how to avoid them.
Trendiness
"Now, why are you migrating databases? You haven't had a downtime in three months, and we have a plan for the next two years of growth. A migration will cause outages and chaos."
"Well ... our CTO is the only one at the weekly CTO's lunch who uses PostgreSQL. The other CTOs have been teasing him about it."
Does this sound like your CTO? It's a real conversation I had. It also describes more technical executives than I care to think about: more concerned with their personal image and career than they are with whether or not the site stays up or the company stays in business. If you start hearing any of the following words in your infrastructure meetings, you know you're in for some serious overtime: "hip", "hot", "cutting-edge", "latest tech", or "cool kids". References to magazine surveys or industry trends articles are also a bad sign.
Scaling an application is all about management of resources and administrative repeatability. This means using technology which your staff is extremely familiar with and which has been tested and proven to be reliable — and is designed to do the thing you want it to do. Hot new features are less important than consistent uptime without constant attention. More importantly, web technology usually makes big news while it's still brand new, which also means poorly documented, unstable, unable to integrate with other components, and full of bugs.
There's also another kind of trendiness to watch out for: the one which says, "If Google or Facebook does it, it must be the right choice." First, what's the right choice for them may not be the right choice for you, unless your applications and platform are very similar to theirs.
Second, not everything that Google and Facebook did with their infrastructures is something they would do again if they had to start over. Like everyone else, the top internet companies make bad decisions and get stuck with technology which is painful to use, but even more painful to migrate away from. So if you're going to copy something "the big boys" do, make sure you ask their staff what they think of that technology first.
No metrics
"Have we actually checked the network latency?"
"I'm sure the problem is
HBase."
"Yes, but have we checked?"
"I told you, we don't need to check. The problem is always HBase."
"Humor me."
"Whatever. Hmmmmmm ... oh! I think something's wrong with the network ..."
Scaling an application is an arithmetic exercise. If one user consumes X amount of CPU time on the web server, how many web servers do you need to support 100,000 simultaneous users? If the database is growing at Y per day, and Z% of the data is "active", how long until the active data outgrows RAM?
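To make that concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is invented for illustration, and that's the point: you can't write even this much without real measurements behind X, Y, and Z.

    # Rough capacity arithmetic; all figures are made-up examples.

    # X: CPU-seconds of web-server time consumed per simultaneous user, per second
    cpu_per_user = 0.002            # 0.2% of one core per simultaneous user
    cores_per_server = 16
    target_users = 100_000

    servers_needed = (target_users * cpu_per_user) / cores_per_server
    print(f"web servers needed: {servers_needed:.0f}")      # ~13, before any headroom

    # Y: database growth per day; Z: the fraction of data that is "active"
    growth_gb_per_day = 5.0
    active_fraction = 0.20          # Z
    current_active_gb = 40.0
    ram_gb = 128.0

    days_left = (ram_gb - current_active_gb) / (growth_gb_per_day * active_fraction)
    print(f"days until active data outgrows RAM: {days_left:.0f}")   # ~88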
Clearly, you cannot do any of this kind of estimation without at least approximate values for X, Y, and Z. If you're planning to scale, you should be instrumenting every piece of your application stack, from the storage hardware to the JavaScript. The thing you forget to monitor is the one which will most likely bring down your whole site. Most software these days has some way to monitor its performance, and software that doesn't is software you should probably avoid.
Despite this common-sense idea, a surprising number of our clients were doing nothing more sophisticated than Nagios alerts on their hardware. This meant that when a response-time problem or outage occurred, they had no way to diagnose what caused it, and they usually ended up fixing the wrong component.
Worse, if you don't have the math for what resources your application is
actually consuming, then you have no idea how many servers, and of what kind, you need in order to scale up your site. That means you will be massively overbuilding some components, while starving others, and spending twice as much money as you need to.
Given how many companies lack metrics, or ignore them, how do they make decisions? Well ...
Barn door decision making
"When I was at Amazon, we used a squid reverse proxy ..."
"Dan, you were an ad sales manager at Amazon."
In the absence of data, staff tend to troubleshoot problems according to their experience, which is usually wrong. Especially when an emergency occurs, there's a tendency to run to fix whatever broke last time. Of course, if they fixed the thing which broke last time, it's unlikely to be the cause of the current outage.
This sort of thinking gets worse when it comes time to plan for growth. I've seen plenty of IT staff purchase equipment, provision servers, configure hardware and software, and lay out networks according to what they did on their last project or even on their previous job. This means that the resources available for the current application are not at all matched to what that application needs, and either you over-provision dramatically or you go down.
Certainly you should learn from your experience. But you should learn appropriate lessons, like "don't depend on VPNs being constantly up". Don't misapply knowledge, like copying the caching strategy from a picture site to an online bank. Learning the wrong lesson is generally heralded by announcements in one or all of the following forms:
- "when I was at name_of_previous_employer ..."
- "when we encountered not_very_similar_problem before, we used random_software_or_technique ..."
- "name_of_very_different_project is using random_software_or_technique, so that's what we should use."
(For non-native English speakers, "barn door" refers to the expression "closing the barn door after the horses have run away.")
Now, it's time to actually get into application design.
Single-threaded programming
"So, if I monkey-patch a common class in Rails, when do the changes affect concurrently running processes?"
"Instantly! It's like magic."
The parallel processing frame of mind is a challenge for most developers. Here's a story I've seen a hundred times: a developer writes his code single-threaded, tests it with a single user and a single process on his own laptop, then deploys it to 200 servers, and the site goes down.
Single-threading is the enemy of scalability. Any portion of your application which blocks concurrent execution of the same code is going to limit you to the throughput of a single core on a single machine. I'm not just talking here about application code which takes a mutex, although that can be bad too. I'm talking about designs which block the entire application while it waits on one exclusively locked component.
For example, a popular beginning developer mistake is to put every single asynchronous task in a single non-forwarded queue, limiting the pace of the whole application to the rate at which messages can be pulled off that queue. Other popular mistakes are the frequently updated single-row "status" table, explicit locking of common resources, and total ignorance of which actions in one's programming language, framework, or database require exclusive locks on pages in memory.
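Here's a minimal Python sketch of the single-queue mistake and one way out of it; the partition count and the process() handler are hypothetical placeholders. Work routed by key onto several independent queues can be drained in parallel, while one queue with one consumer serializes everything behind it.

    import queue
    import threading

    NUM_PARTITIONS = 8              # hypothetical; size this to your workers

    # Anti-pattern: one queue with one consumer moves the whole application at
    # the pace of a single worker. Instead, hash each task onto one of several
    # queues, each drained by its own worker, so tasks don't wait on each other.
    partitions = [queue.Queue() for _ in range(NUM_PARTITIONS)]

    def process(payload):
        """Hypothetical task handler; stands in for the real work."""
        print("processed", payload)

    def enqueue(task_key, payload):
        """Route a task to a partition by key, for example a user ID."""
        partitions[hash(task_key) % NUM_PARTITIONS].put(payload)

    def worker(q):
        while True:
            process(q.get())
            q.task_done()

    for q in partitions:
        threading.Thread(target=worker, args=(q,), daemon=True).start()

In a real deployment those workers would be separate processes or separate machines rather than threads in one interpreter, but the shape of the fix is the same: no single choke point.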
One application I'm currently working on has a distributed data-processing cloud of 240 servers. However, assignment of chunks of data to servers for processing is done by a single-process daemon running on a single dispatch server, rate-limiting the whole cloud to 4,000 jobs per minute and leaving it 75% idle.
An even worse example was a popular sports web site we worked on. The site would update sports statistics by holding an exclusive lock on transactional database tables while waiting for a remote data service over the internet to respond. The client couldn't understand why adding more application servers to their infrastructure made the timeouts worse instead of better.
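The cure for that one is structural, not a tuning parameter. Here is a sketch in Python, using SQLite purely as a stand-in for the real database, with a fake remote service and an invented schema; the shape of the change is what matters: do the slow network call with no transaction open, then hold locks only long enough to write the result.

    import sqlite3
    import time

    def fetch_stats_from_provider(game_id):
        """Stand-in for the slow remote data service (hypothetical)."""
        time.sleep(2)                                   # the internet is slow today
        return '{"home": 3, "away": 1}'

    # Anti-pattern: the first UPDATE takes locks, which are then held for
    # as long as the remote service takes to answer.
    def update_stats_bad(conn, game_id):
        with conn:                                      # one long transaction
            conn.execute("UPDATE scores SET status = 'updating' WHERE game = ?",
                         (game_id,))                    # locks taken here...
            stats = fetch_stats_from_provider(game_id)  # ...and held during the slow call
            conn.execute("UPDATE scores SET data = ?, status = 'done' WHERE game = ?",
                         (stats, game_id))

    # Better: do the slow I/O first, then open a transaction that lasts milliseconds.
    def update_stats_good(conn, game_id):
        stats = fetch_stats_from_provider(game_id)      # no locks held yet
        with conn:
            conn.execute("UPDATE scores SET data = ?, status = 'done' WHERE game = ?",
                         (stats, game_id))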
Any time you design anything for your application which is supposed to scale, ask yourself "how would this work if 100 users were doing it simultaneously? 1000? 1,000,000?" And learn a functional language or map/reduce. They're good training for parallel thinking.
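In that spirit, here is a tiny illustration of the map/reduce habit of mind using nothing but Python's standard library; score_game() is a made-up placeholder. Because the items are independent and share no mutable state, the work spreads across every core without a global lock, and the results are combined only at the end.

    from concurrent.futures import ProcessPoolExecutor

    def score_game(game):
        """Placeholder per-item work; the point is that items are independent."""
        return sum(game)

    games = [[3, 1], [2, 2], [0, 5]] * 1000

    if __name__ == "__main__":
        # The "map" half: independent items, no shared state, no single thread
        # that everything funnels through.
        with ProcessPoolExecutor() as pool:
            scores = list(pool.map(score_game, games))
        # The "reduce" half: combine the independent results at the end.
        print(sum(scores))

The same shape scales past one machine: the map step becomes workers on many servers, and the reduce step becomes a final aggregation.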
Coming in part 2
I'm sure you recognized at least one of the anti-patterns above in your own company, as most of the audience at the Ignite talk did. In part two of this article, I will cover component scaling, caching, and SPoFs, as well as the problem with The Cloud.
[ Note about the author: to support his habit of hacking on the
PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a
database and applications consulting company which helps clients make their
PostgreSQL applications more scalable, reliable, and secure. ]