User: Password:
Subscribe / Log in / New account


Scale Fail (part 2)

May 20, 2011

This article was contributed by Josh Berkus

In Part One of Scale Fail, I discussed some of the major issues which prevent web sites and applications from scaling. As was said there, most scalability issues are really management issues. The first article covered a few of the chronic bad decisions — or "anti-patterns" — which companies suffer from, including compulsive trendiness, lack of metrics, "barn door troubleshooting", and single-process programming. In Part Two, we'll explore some more general failures of technology management which lead to downtime.

No Caching

"Your query report shows that you're doing 7,000 read queries per second. Surely some of these could be cached?"

"We have memcached installed somewhere."

"How is it configured? What data are you caching? How are you invalidating data?"

"I'm ... not sure. We kind of leave it up to Django."

I'm often astonished at how much money web companies are willing to spend on faster hardware, and how little effort on simple things which would make their applications much faster in a relatively painless way. For example, if you're trying to scale a website, the first thing you should be asking yourself is: "where can I add more useful caching?"

While I mention memcached above, I'm not just talking about simple data caching. In any really scalable website you can add multiple levels of caching, each of them useful in their own way:

  • database connection, parse and plan caching
  • complex data caching and materialized views
  • simple data caching
  • object caching
  • server web page caching

Detailing the different types of caching and how to employ them would be an article series on its own. However, every form of caching shares the ability to bring data closer to the user and make it more stateless, reducing response times. More importantly, by reducing the amount of resources required by repetitive application requests, you improve the efficiency of your platform and thus make it more scalable.

Seems obvious, doesn't it? Yet I can count our clients who were employing an effective caching strategy before they hired us on one hand.

A common mistake we see clients making with data caching is to leave caching entirely up to the object-relational mapper (ORM). The problem with this is that out-of-the-box, the ORM is going to be very conservative about how it uses the cache, only retrieving cached data for a user request which is absolutely identical, and thus lowering the number of cache hits to nearly zero. For example, I have yet to see an ORM which dealt well with caching the data for a paginated application view on its own.

The worst case I've seen of this was an online auction application where every single thing the user did ... every click, every pagination, every mouse-over ... resulted in a query to the back-end PostgreSQL database. Plus the auction widget polled the database for auction updates 30 times a second per user. This meant that each active application user resulted in over 100 queries per second to the core transactional database.

As much as a lack of caching is a very common bad decision, it's really symptomatic of a more general anti-pattern I like to call:

Scaling the Impossible Things

"In Phase III, we will shard the database into 12 segments, dividing along the lines of the statistically most common user groupings. Any data which doesn't divide neatly will need to be duplicated on each shard, and I've invented a replication scheme to take care of that ..."

"Seems awfully complicated. Have you considered just caching the most common searches instead?"

"That wouldn't work. Our data is too dynamic."

"Are you sure? I did some query analysis, and 90% of your current database hits fall in one of these four patterns ..."

"I told you, it wouldn't work. Who's the CTO here, huh?"

Some things which you need for your application are very hard to scale, consuming large amounts of system resources, administration time, and staff creativity to get them to scale up or scale out. These include transactional databases, queues, shared filesystems, complex web frameworks (e.g. Django or Rails), and object-relational mappers (ORMs).

Other parts of your infrastructure are very easy to scale to many user requests, such as web servers, static content delivery, caches, local storage, and client-side software (e.g. javascript).

Basically, the more stateful, complex, and featureful a piece of infrastructure is, the more resources it's going to use per application user and the more prone it's going to be to locking — and thus the harder it's going to be to scale out. Given this, you would think that companies who are struggling with rapidly growing scalability problems would focus first on scaling out the easy things, and put off scaling the hard things for as long as possible.

You would be wrong.

Instead directors of development seem to be in love with trying to scale the most difficult item in their infrastructure first. Sharding plans, load-balancing master-slave-replication, forwarded transactional queues, 200-node clustered filesystems — these get IT staff excited and get development money flowing. Even when their scalability problems could be easily and cheaply overcome by adding a Varnish cache or fixing some unnecessarily resource-hungry application code.

For example, one of our clients had issues with their Django servers constantly becoming overloaded and falling over. They'd gone up from four to eight application servers, and were still having to restart them on a regular basis, and wanted to discuss doubling the number of application servers again, which also would require scaling up the database server. Instead, we did some traffic analysis and discovered that most of the resource usage on the Django servers was from serving static images. We moved all the static images to a content delivery network, and they were able to reduce their server count.

After a month of telling us why we "didn't understand the application", of course.


"How are we load-balancing the connection from the middleware servers to the database servers?"

"Through a Zeus load-balancing cluster."

"From the web servers to the middleware servers?"

"The same Zeus cluster."

"Web servers to network file storage? VPN between data centers? SSH access?"


"Does everything on this network go through Zeus?"

"Pretty much, yes."

"Uh-huh. Well, what could possibly go wrong?"

SPoF, of course, stands for Single Point of Failure. Specifically, it refers to a single component which will take down your entire infrastructure if it fails, no matter how much redundancy you have in other places. It's dismaying how many companies fail to remove SPoFs despite having lavished hardware and engineering time on making several levels of their stack high availability. Your availability is only as good as your least available component.

The company in the dialog above went down for most of a day only a few weeks after that conversation. A sysadmin had loaded a buggy configuration into Zeus, and instantly the whole network ceased to exist. The database servers, the web servers, the other servers were all still running, but not even the sysadmins could reach them.

Sometimes your SPoF is a person. For example, you might have a server or even a data center which needs to be failed over manually, and only one staff member has the knowledge or login to do so. More sinister SPoFs often lurk in your development or recovery processes. I once witnessed a company try to deploy a hot code fix in response to a DDOS attack, only to have their code repository — their only code repository — go down and refuse to come back up.

A "Cascading SPoF" is a SPoF which looks like it's redundant. Here's a simple math exercise: You have three application servers. Each of these servers is operating at 80% of their capacity. What happens when one of them fails and its traffic gets load balanced onto the other two?

A component doesn't have to be the only one of its kind to be a SPoF; it just has to be the case that its failure will take the application down. If all of the components at any level of your stack are operating at near-capacity, you have a problem, because a minority server failure or even a modest increase in traffic can result in cascading failure.

Cloud Addiction

"... so if you stay on AWS, we'll have to do major horizontal scaling, which will require a $40K consulting project. If you move to conventional hosting, you'll need around $10K of our services for the move, and get better application performance. Plus your cloud fees are costing you around three times what you would pay to rent racked servers."

"We can't discuss a move from until the next fiscal year."

"So, you'll be wanting the $40K contract then?"

Since I put together the Ignite talk early this year, I've increasingly seen a new anti-pattern we call "Cloud addiction". Several of our clients are refusing to move off of cloud hosting even when it is demonstrably killing their businesses. This problem is at its worst on Amazon Web Services (AWS) because Amazon has no way to move off their cloud without leaving Amazon entirely, but I've seen it with other public clouds as well.

The advantage of cloud hosting is that it allows startups to get a new application running and serving real users without ever making an up-front investment in infrastructure. As a way to lower the barriers to innovation, cloud hosting is a tremendous asset.

The problem comes when the application has outgrown the resource limitations of cloud servers and has to move to a different platform. Usually a company discovers these limits in the form of timeouts, outages, and rapidly escalating numbers of server instances which fail to improve application performance. By limitations, I'm referring to the restrictions on memory, processing power, storage throughput and network configuration inherent on a large scale public cloud, as well as the high cost of round-the-clock busy cloud instances. These are "good enough" for getting a project off the ground, but start failing when you need to make serious performance demands on each node.

That's when you've reached scale fail on the cloud. At that point, the company has no experience managing infrastructure, no systems staff, and no migration budget. More critically, management doesn't have any process for making decisions about infrastructure. Advice that a change of hosting is required are met with blank stares or even panic. "Next fiscal year", in a startup, is a euphemism for "never".


Of course, these are not all the scalability anti-patterns out there. Personnel mismanagement, failure to anticipate demand spikes, lack of deployment process, dependencies on unreliable third parties, or other issues can be just as damaging as the eight issues I've outlined above. There are probably as many ways to not scale as there are web companies. I can't cover everything.

Hopefully this article will help you recognize some of these "scale fail" patterns when they occur at your own company or at your clients. Every one of the issues I've outlined here comes down to poor decision-making rather than any technical limits in scalability. In my experience, technical issues rarely hold back the growth of a web business, while management mistakes frequently destroy it. If you recognize the anti-patterns, you may be able to make one less mistake.

[ Note about the author: to support his habit of hacking on the PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a database and applications consulting company which helps clients make their PostgreSQL applications more scalable, reliable, and secure. ]

Comments (19 posted)

Brief items

Quotes of the week

I'm not going to touch that. If someone else wants to break the long tradition of *fb code being undecipherable by humans, they can poke the wasps nest and try to avoid getting stung.
-- Alan Coopersmith

It is wise to leave most technical decisions to the people who will do the technical work, but this is not a rule, just a good default.
-- Richard Stallman

Comments (none posted)

0install 1.0 released

The 0install project describes itself as "a decentralised cross-distribution software installation system available under the LGPL. It allows software developers to publish programs directly from their own web-sites, while supporting features familiar from centralised distribution repositories such as shared libraries, automatic updates and digital signatures." The 1.0 release is out; see for more information.

Full Story (comments: 3)

The first Calligra snapshot release

The Calligra fork of KOffice has announced the availability of its first snapshot release in the hope of getting useful feedback from users. "Our goal is to provide the best application suite on all platforms based on open standards. That is no small goal. We feel now that we have improved the foundation enough to start working seriously on improving the user experience. For this, we need the help of our users!" There are two new applications (a diagram and flowchart editor and a note-taking tool), performance improvements, better text layout, and more.

Comments (6 posted)

SSL FalseStart Performance Results (The Chromium Blog)

Over at the Chromium blog, Google is reporting on the performance of SSL FalseStart, which is implemented in the Chromium browser. SSL FalseStart takes a seemingly legal (in a protocol sense) shortcut in the SSL handshake, which leads to 30% less latency in SSL startup time. In order to roll it out, Google also needed to determine which sites didn't support the feature: "To investigate the failing sites, we implemented a more robust check to understand how the failures occurred. We disregarded those sites that failed due to certificate failures or problems unrelated to FalseStart. Finally, we discovered that the sites which didn't support FalseStart were using only a handful of SSL vendors. We reported the problem to the vendors, and most have fixed it already, while the others have fixes in progress. The result is that today, we have a manageable, small list of domains where SSL FalseStart doesn't work, and we've added them to a list within Chrome where we simply won’t use FalseStart. This list is public and posted in the chromium source code. We are actively working to shrink the list and ultimately remove it." It seems likely that other browsers can take advantage of this work.

Comments (16 posted)

Libcloud becomes a top-level Apache project

The Apache Software Foundation has announced that Libcloud has graduated into a top-level project. "Apache Libcloud is an Open Source Python library that provides a vendor-neutral interface to cloud provider APIs. The current version of Apache Libcloud includes backend drivers for more than twenty leading providers including Amazon EC2, Rackspace Cloud, GoGrid and Linode."

Comments (none posted)

Jato v0.2 released

Jato is a JIT-only virtual machine for Java; the 0.2 release is now out. New features include Jython and JRuby support, annotation support, improved JNI support, and a lot of fixes.

Full Story (comments: none)

Pinpoint 0.1.0 released

Pinpoint is "a simple presentation tool that hopes to avoid audience death by bullet point and instead encourage presentations containing beautiful images and small amounts of concise text in slides." The first release (0.1.0) has been made with a basic set of presentation features.

Full Story (comments: 2)

Newsletters and articles

Development newsletters from the last week

Comments (none posted)

Modders Make Android Work the Way You Want (Wired)

Wired profiles the CyanogenMod team. CyanongenMod is, of course, the alternate firmware for Android devices built from the code that Google releases. "CyanogenMod expanded into a team of 35 different "device maintainers," who manage the code for the 32 different devices that the project supports. Like Google, the team publishes its code to an online repository and accepts online submissions for changes to the code from other developers. Seven core members decide which of the submitted changes make it into the next release of CyanogenMod, and which don't."

Comments (33 posted)

What Every C Programmer Should Know About Undefined Behavior #3/3

The final segment of the LLVM blog's series on undefined behavior is up. "In this article, we look at the challenges that compilers face in providing warnings about these gotchas, and talk about some of the features and tools that LLVM and Clang provide to help get the performance wins while taking away some of the surprise."

Comments (73 posted)

Rapid-release idea spreads to Firefox 5 beta (CNet)

Stephen Shankland takes a look at the beta for Firefox 5. "Firefox 5 beta's big new feature is support for CSS animations, which let Web developers add some pizzazz to actions such as making dialog boxes pop up or switching among photos. Also in the new version is canvas, JavaScript, memory, and networking, according to the release notes and bug-fix list."

Comments (10 posted)

Page editor: Jonathan Corbet
Next page: Announcements>>

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds