May 20, 2011
This article was contributed by Josh Berkus
In Part One of Scale Fail, I discussed some of the major issues which prevent web sites and applications from scaling. As was said there, most scalability issues are really management issues. The first article covered a few of the chronic bad decisions — or "anti-patterns" — which companies suffer from, including compulsive trendiness, lack of metrics, "barn door troubleshooting", and single-process programming. In Part Two, we'll explore some more general failures of technology management which lead to downtime.
No Caching
"Your query report shows that you're doing 7,000 read queries per second. Surely some of these could be cached?"
"We have memcached installed somewhere."
"How is it configured? What data are you caching? How are you invalidating data?"
"I'm ... not sure. We kind of leave it up to Django."
I'm often astonished at how much money web companies are willing to spend on faster hardware, and how little effort they will put into simple changes that would make their applications much faster with far less pain. For example, if you're trying to scale a website, the first question you should be asking yourself is: "where can I add more useful caching?"
While I mention memcached above, I'm not just talking about simple data caching. In any really scalable website you can add multiple levels of caching, each of them useful in their own way:
- database connection, parse and plan caching
- complex data caching and materialized views
- simple data caching
- object caching
- server web page caching
Detailing the different types of caching and how to employ them would be an article series on its own. However, every form of caching shares the same virtues: it brings data closer to the user, serves that data in a more stateless way, and reduces response times. More importantly, by reducing the amount of resources consumed by repetitive application requests, caching improves the efficiency of your platform and thus makes it more scalable.
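As a concrete example of the "simple data caching" layer, here is a minimal cache-aside sketch using Django's low-level cache API (which can sit in front of memcached). The model, key names, and timeout are hypothetical; the point is that the application decides what gets cached and, just as importantly, when it gets invalidated, rather than leaving both questions to the framework:

    from django.core.cache import cache
    from myapp.models import Auction   # hypothetical model

    ITEM_TIMEOUT = 300   # seconds; tune to how stale the data is allowed to get

    def get_auction(auction_id):
        """Cache-aside read: check the cache first, fall back to the database."""
        key = "auction:%d" % auction_id
        auction = cache.get(key)
        if auction is None:
            auction = Auction.objects.get(pk=auction_id)
            cache.set(key, auction, ITEM_TIMEOUT)
        return auction

    def save_auction(auction):
        """Explicit invalidation: drop the cached copy whenever the row changes."""
        auction.save()
        cache.delete("auction:%d" % auction.pk)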
Seems obvious, doesn't it? Yet I can count on one hand the clients who were employing an effective caching strategy before they hired us.
A common mistake we see clients making with data caching is to leave it entirely up to the object-relational mapper (ORM). The problem is that, out of the box, the ORM is going to be very conservative about how it uses the cache, returning cached data only for a request that is absolutely identical to an earlier one, which drives the cache hit rate to nearly zero. For example, I have yet to see an ORM which, on its own, dealt well with caching the data for a paginated application view.
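To make the pagination case concrete, here is one possible approach, again sketched with Django's low-level cache API and an invented model and key scheme: cache each page of results under a key that includes a version number, and bump the version on writes so that every cached page goes stale at once:

    from django.core.cache import cache
    from myapp.models import Item   # hypothetical model

    PAGE_SIZE = 25
    PAGE_TIMEOUT = 120   # seconds

    def _list_version():
        # The version counter itself lives in the cache; default to 1.
        return cache.get("items:list_version", 1)

    def get_item_page(page):
        """Return one page of results, cached per (version, page number)."""
        key = "items:v%d:page:%d" % (_list_version(), page)
        rows = cache.get(key)
        if rows is None:
            offset = (page - 1) * PAGE_SIZE
            rows = list(Item.objects.order_by("-created")[offset:offset + PAGE_SIZE])
            cache.set(key, rows, PAGE_TIMEOUT)
        return rows

    def invalidate_item_pages():
        """Call after a write: bumping the version orphans every cached page."""
        try:
            cache.incr("items:list_version")
        except ValueError:   # counter not in the cache yet
            cache.set("items:list_version", 2)

Bumping a single version number avoids having to track and delete every cached page key individually; the stale pages simply expire on their own.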
The worst case I've seen of this was an online auction application where every single thing the user did ... every click, every pagination, every mouse-over ... resulted in a query to the back-end PostgreSQL database. Plus the auction widget polled the database for auction updates 30 times a second per user. This meant that each active application user resulted in over 100 queries per second to the core transactional database.
As common as a lack of caching is, it's really symptomatic of a more general anti-pattern I like to call:
Scaling the Impossible Things
"In Phase III, we will shard the database into 12 segments, dividing along the lines of the statistically most common user groupings. Any data which doesn't divide neatly will need to be duplicated on each shard, and I've invented a replication scheme to take care of that ..."
"Seems awfully complicated. Have you considered just caching the most common searches instead?"
"That wouldn't work. Our data is too dynamic."
"Are you sure? I did some query analysis, and 90% of your current database hits fall in one of these four patterns ..."
"I told you, it wouldn't work. Who's the CTO here, huh?"
Some things which you need for your application are very hard to scale, consuming large amounts of system resources, administration time, and staff creativity to get them to scale up or scale out. These include transactional databases, queues, shared filesystems, complex web frameworks (e.g. Django or Rails), and object-relational mappers (ORMs).
Other parts of your infrastructure are very easy to scale to many user requests, such as web servers, static content delivery, caches, local storage, and client-side software (e.g. JavaScript).
Basically, the more stateful, complex, and featureful a piece of infrastructure is, the more resources it's going to use per application user and the more prone it's going to be to locking — and thus the harder it's going to be to scale out. Given this, you would think that companies who are struggling with rapidly growing scalability problems would focus first on scaling out the easy things, and put off scaling the hard things for as long as possible.
You would be wrong.
Instead, directors of development seem to be in love with trying to scale the most difficult item in their infrastructure first. Sharding plans, load-balanced master-slave replication, forwarded transactional queues, 200-node clustered filesystems: these get IT staff excited and get development money flowing, even when the scalability problems at hand could be easily and cheaply overcome by adding a Varnish cache or fixing some unnecessarily resource-hungry application code.
For example, one of our clients had issues with their Django servers constantly becoming overloaded and falling over. They had already gone from four to eight application servers, were still having to restart them regularly, and wanted to discuss doubling the number of application servers again, which would also have required scaling up the database server. Instead, we did some traffic analysis and discovered that most of the resource usage on the Django servers came from serving static images. We moved all of the static images to a content delivery network, and they were able to reduce their server count.
After a month of telling us why we "didn't understand the application", of course.
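The fix itself was mostly configuration. As a rough sketch of what it looks like in a Django settings file (the CDN hostname is invented, and the staticfiles details vary by Django version), pointing STATIC_URL at the CDN means that requests for images never touch the application servers at all:

    # settings.py (excerpt) -- assumes django.contrib.staticfiles (Django 1.3+)
    #
    # Before: every <img> URL pointed back at the Django servers.
    # STATIC_URL = '/static/'
    #
    # After: templates build static URLs against the CDN hostname instead,
    # so image requests never reach the application processes.
    STATIC_URL = 'https://cdn.example.com/static/'

    # In templates: <img src="{{ STATIC_URL }}img/logo.png">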
SPoF
"How are we load-balancing the connection from the middleware servers to the database servers?"
"Through a Zeus load-balancing cluster."
"From the web servers to the middleware servers?"
"The same Zeus cluster."
"Web servers to network file storage? VPN between data centers? SSH access?"
"Zeus."
"Does everything on this network go through Zeus?"
"Pretty much, yes."
"Uh-huh. Well, what could possibly go wrong?"
SPoF, of course, stands for Single Point of Failure. Specifically, it refers to a single component which will take down your entire infrastructure if it fails, no matter how much redundancy you have elsewhere. It's dismaying how many companies fail to remove SPoFs despite having lavished hardware and engineering time on making several levels of their stack highly available. Your availability is only as good as your least available component.
The company in the dialog above went down for most of a day only a few weeks after that conversation. A sysadmin had loaded a buggy configuration into Zeus, and instantly the whole network ceased to exist. The database servers, the web servers, and all of the other machines were still running, but not even the sysadmins could reach them.
Sometimes your SPoF is a person. For example, you might have a server or even a data center which needs to be failed over manually, and only one staff member has the knowledge or login to do so. More sinister SPoFs often lurk in your development or recovery processes. I once witnessed a company try to deploy a hot code fix in response to a DDoS attack, only to have their code repository (their only code repository) go down and refuse to come back up.
A "Cascading SPoF" is a SPoF which looks like it's redundant. Here's a simple math exercise: You have three application servers. Each of these servers is operating at 80% of their capacity. What happens when one of them fails and its traffic gets load balanced onto the other two?
A component doesn't have to be the only one of its kind to be a SPoF; it just has to be the case that its failure will take the application down. If all of the components at any level of your stack are operating near capacity, you have a problem, because the failure of even one server, or a modest increase in traffic, can result in cascading failure.
Cloud Addiction
"... so if you stay on AWS, we'll have to do major horizontal scaling, which will require a $40K consulting project. If you move to conventional hosting, you'll need around $10K of our services for the move, and get better application performance. Plus your cloud fees are costing you around three times what you would pay to rent racked servers."
"We can't discuss a move from until the next fiscal year."
"So, you'll be wanting the $40K contract then?"
Since I put together the Ignite talk early this year, I've increasingly seen a new anti-pattern we call "cloud addiction". Several of our clients are refusing to move off cloud hosting even when it is demonstrably killing their businesses. This problem is at its worst on Amazon Web Services (AWS), because Amazon offers no way to move off its cloud without leaving Amazon entirely, but I've seen it with other public clouds as well.
The advantage of cloud hosting is that it allows startups to get a new application running and serving real users without ever making an up-front investment in infrastructure. As a way to lower the barriers to innovation, cloud hosting is a tremendous asset.
The problem comes when the application has outgrown the resource limitations of cloud servers and has to move to a different platform. Usually a company discovers these limits in the form of timeouts, outages, and a rapidly escalating number of server instances which fail to improve application performance. By limitations, I'm referring to the restrictions on memory, processing power, storage throughput, and network configuration inherent in a large-scale public cloud, as well as the high cost of cloud instances that are busy around the clock. These are "good enough" for getting a project off the ground, but stop being good enough when you need to make serious performance demands on each node.
That's when you've reached scale fail on the cloud. At that point, the company has no experience managing infrastructure, no systems staff, and no migration budget. More critically, management doesn't have any process for making decisions about infrastructure. Advice that a change of hosting is required is met with blank stares or even panic. "Next fiscal year", in a startup, is a euphemism for "never".
Conclusion
Of course, these are not all the scalability anti-patterns out there. Personnel mismanagement, failure to anticipate demand spikes, lack of a deployment process, dependencies on unreliable third parties, and other issues can be just as damaging as the eight issues I've outlined in these two articles. There are probably as many ways to not scale as there are web companies. I can't cover everything.
Hopefully this article will help you recognize some of these "scale fail" patterns when they occur at your own company or at your clients. Every one of the issues I've outlined here comes down to poor decision-making rather than any technical limits in scalability. In my experience, technical issues rarely hold back the growth of a web business, while management mistakes frequently destroy it. If you recognize the anti-patterns, you may be able to make one less mistake.
[ Note about the author: to support his habit of hacking on the PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a database and applications consulting company which helps clients make their PostgreSQL applications more scalable, reliable, and secure. ]