While one might ordinarily think of the PyPy project as an experiment in
implementing the Python runtime in Python itself, there is really more to
it than that. PyPy is, in a sense, a toolbox for the creation of
just-in-time compilers for dynamic languages; Python is just the start -
but it's an interesting start. It
has been almost exactly one year since LWN first looked at PyPy
and a few weeks since the 1.5 release,
so the time seemed right to actually play with this tool a
bit. The results were somewhat eye-opening.
LWN uses a lot of tools written in Python; one of them is the gitdm data
miner which is used to generate kernel development statistics. It is a
simple program which reads the output of "git log" and
generates a big in-memory data structure reflecting the relationships between
developers, their employers, and the patches they are somehow associated
with. There is very little that is done in the kernel, and there is no use
of extension modules written in C. These features make gitdm a
natural first test for PyPy; there is little to trip things up.
The test was to stash the git log output from the 2.6.36 kernel
release through the present - some 31,000 changes - in a file on a local
SSD. The file, while large, should still fit in memory with nothing else
running; I/O effects should, thus, not figure into the results. Gitdm was
run on the file using both the CPython 2.7.1 interpreter and PyPy.
When switching to an entirely different runtime for a non-trivial program,
it is natural to expect at least one glitch. In this case, there were
none; gitdm ran without complaint and produced identical output. There was
one significant difference, though: while the CPython runs took an average
of about 63 seconds, the PyPy runs completed in about
21 seconds. In other words, for the cost of changing the "#!" line at
the top of the program, the run time was cut to one third of its previous
value. One might conclude that the effort was justified; plans are to run
gitdm under PyPy from here on out.
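A comparison like the one above can be reproduced with a small timing helper. The following is a minimal sketch in Python 3, not the method used for the numbers in this article; the function name `average_runtime` and its `runs` parameter are invented here for illustration.

```python
import subprocess
import sys
import time

def average_runtime(command, runs=3):
    """Run `command` (a list of arguments) `runs` times.

    Returns the mean wall-clock time in seconds. The program's standard
    output is discarded so that terminal speed does not skew the result.
    """
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        total += time.perf_counter() - start
    return total / runs
```

To compare two runtimes, one would call the helper twice with command lists that differ only in the interpreter named first, then divide one mean by the other.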
To dig just a little deeper, the perf tool was used to generate a
few statistics of the differing runs:
As would be expected from the previous result, running with CPython took
about three times as many processor cycles as running with PyPy. On the
other hand, CPython reliably incurred less than 1/3 as many cache misses;
it would be hard to say why. Somehow, the code generated by the PyPy JIT
makes more widely spread-out memory references; that may be related to
garbage collection strategies. CPython uses reference counting, which can
improve cache locality, while PyPy does not.
One other interesting thing to note is that PyPy only
made half as many system calls.
That called for some investigation. Since gitdm is just
reading data and cranking on it, almost every system call it makes is
read(). Sure enough, the CPython runtime was issuing twice as
many read() calls. Understanding why would require digging into
the code; it could be as simple as PyPy using larger buffers in its file
I/O implementation.
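The buffer-size hypothesis is easy to illustrate in isolation. The sketch below (Python 3) counts how often a buffered reader goes back to its underlying raw stream for a given buffer size. It says nothing about what CPython 2.7 or PyPy actually do internally; `CountingRaw`, `count_raw_reads`, the sample data, and the two buffer sizes are all assumptions made for the illustration.

```python
import io

class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts requests from the buffered layer.

    Each readinto() call stands in for one read() system call on a real file.
    """

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.reads = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        self.reads += 1
        return self._source.readinto(buffer)

def count_raw_reads(data, buffer_size):
    """Read `data` line by line through a buffer of `buffer_size` bytes."""
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    for _line in reader:
        pass
    return raw.reads

# Made-up stand-in for a large "git log" dump.
sample = b"commit line of made-up log text\n" * 20000
small = count_raw_reads(sample, 4096)
large = count_raw_reads(sample, 8192)
print(small, large)
```

Doubling the buffer roughly halves the number of raw reads, which is the same shape as the difference described above.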
Given results like this, one might well wonder why PyPy is not much more
widely used. There may be numerous reasons, including a simple lack of
awareness of PyPy among Python developers and users of their programs. But
the biggest issue may be extension modules. Most non-trivial Python
programs will use one or more modules which have been written in C for
performance reasons, or because it's simply not possible to provide the
required functionality in pure Python. These modules do not just move over
to PyPy the way Python code does. There is a
short list of modules supported by PyPy, but it's insufficient for many
programs.
Fixing this problem would seem to be one of the most urgent tasks for the
PyPy developers if they want to increase their user base. In other ways,
PyPy is ready for prime time; it implements the (Python 2.x) language
faithfully, and it is fast. With better support for extensions,
PyPy could easily become the interpreter of choice for a lot of Python
programs. It is a nice piece of work.
Let me tell you a secret. I don't fix databases. I fix applications.
Companies hire me to "fix the database" because they think it's the source
of their performance and downtime problems. This is very rarely the case.
Failure to scale is almost always the result of poor management decisions
— often a series of them. In fact, these anti-scaling decisions are
so often repeated that they have become anti-patterns.
I did a little talk about these anti-patterns at the last MySQL Conference and Expo. Go watch it and then come on back.
Now that you've seen the five-minute version (and hopefully laughed at it), you're ready for some less sarcastic detail which explains how to recognize these anti-patterns and how to avoid them.
"Now, why are you migrating databases? You haven't had a downtime in three months, and we have a plan for the next two years of growth. A migration will cause outages and chaos."
"Well ... our CTO is the only one at the weekly CTOs' lunch who uses PostgreSQL. The other CTOs have been teasing him about it."
Does this sound like your CTO? It's a real conversation I had. It also describes more technical executives than I care to think about: more concerned with their personal image and career than they are with whether or not the site stays up or the company stays in business. If you start hearing any of the following words in your infrastructure meetings, you know you're in for some serious overtime: "hip", "hot", "cutting-edge", "latest tech", or "cool kids". References to magazine surveys or industry trends articles are also a bad sign.
Scaling an application is all about management of resources and administrative repeatability. This means using technology which your staff is extremely familiar with and which has been tested and proven to be reliable — and is designed to do the thing you want it to do. Hot new features are less important than consistent uptime without constant attention. More importantly, web technology usually makes big news while it's still brand new, which also means poorly documented, unstable, unable to integrate with other components, and full of bugs.
There's also another kind of trendiness to watch out for: the one which says, "If Google or Facebook does it, it must be the right choice." First, what's the right choice for them may not be the right choice for you, unless your applications and platform are very similar to theirs.
Second, not everything that Google and Facebook did with their infrastructures is something they would do again if they had to start over. Like everyone else, the top internet companies make bad decisions and get stuck with technology which is painful to use, but even more painful to migrate away from. So if you're going to copy something "the big boys" do, make sure you ask their staff what they think of that technology first.
"Have we actually checked the network latency?"
"I'm sure the problem is HBase."
"Yes, but have we checked?"
"I told you, we don't need to check. The problem is always HBase."
"Whatever. Hmmmmmm ... oh! I think something's wrong with the network ..."
Scaling an application is an arithmetic exercise. If one user consumes X
amount of CPU time on the web server, how many web servers do you need to support 100,000 simultaneous users? If the database is growing at Y per day, and Z% of the data is "active", how long until the active data outgrows RAM?
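The arithmetic above can be written down directly. This is a minimal sketch under stated assumptions: X is taken to be the fraction of one CPU core a single active user keeps busy, growth is linear, and every number in the example (cores per server, target utilization, data sizes) is invented for illustration rather than taken from any real site.

```python
import math

def web_servers_needed(users, cores_per_user, cores_per_server,
                       target_utilization=0.7):
    """Servers required so that CPU demand stays at or below the target.

    `cores_per_user` is X: the fraction of one core a single user consumes.
    """
    demand = users * cores_per_user
    usable_per_server = cores_per_server * target_utilization
    return math.ceil(demand / usable_per_server)

def days_until_active_outgrows_ram(current_gb, growth_gb_per_day,
                                   active_percent, ram_gb):
    """Days until Z% of a linearly growing database no longer fits in RAM."""
    # Total database size at which the active portion equals RAM.
    total_at_limit = ram_gb * 100 / active_percent
    return max(0.0, (total_at_limit - current_gb) / growth_gb_per_day)

# Hypothetical inputs: 100,000 users at 0.002 cores each on 8-core servers,
# and a 400GB database growing 2GB/day with 10% active on a 64GB machine.
servers = web_servers_needed(100_000, 0.002, 8)
days = days_until_active_outgrows_ram(400, 2, 10, 64)
print(servers, days)
```

The point is less the formulas than the inputs: none of X, Y, or Z can be filled in without measuring the running application.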
Despite this common-sense idea, a surprising number of our clients were doing nothing more sophisticated than Nagios alerts on their hardware. This meant that when a response time problem or outage occurred, they had no way to diagnose what caused it, and usually ended up fixing the wrong component.
Worse, if you don't have the math for what resources your application is
actually consuming, then you have no idea how many servers, and of what kind, you need in order to scale up your site. That means you will be massively overbuilding some components, while starving others, and spending twice as much money as you need to.
Given how many companies lack metrics, or ignore them, how do they make decisions? Well ...
Barn door decision making
"When I was at Amazon, we used a squid reverse proxy ..."
"Dan, you were an ad sales manager at Amazon."
In the absence of data, staff tend to troubleshoot problems according to their experience, which is usually wrong. Especially when an emergency occurs, there's a tendency to run to fix whatever broke last time. Of course, if they fixed the thing which broke last time, it's unlikely to be the cause of the current outage.
This sort of thinking gets worse when it comes time to plan for growth. I've seen plenty of IT staff purchase equipment, provision servers, configure hardware and software, and lay out networks according to what they did on their last project or even on their previous job. This means that the resources available for the current application are not at all matched to what that application needs, and either you over-provision dramatically or you go down.
Certainly you should learn from your experience. But you should learn appropriate lessons, like "don't depend on VPNs being constantly up". Don't misapply knowledge, like copying the caching strategy from a picture site to an online bank. Learning the wrong lesson is generally heralded by announcements in one or all of the following forms:
- "when I was at name_of_previous_employer ..."
- "when we encountered not_very_similar_problem before, we used random_software_or_technique ..."
- "name_of_very_different_project is using random_software_or_technique, so that's what we should use."
(For non-native English speakers, "barn door" refers to the expression "closing the barn door after the horses have run away".)
Now, it's time to actually get into application design.
"So, if I monkey-patch a common class in Rails, when do the changes affect concurrently running processes?"
"Instantly! It's like magic."
The parallel processing frame of mind is a challenge for most developers. Here's a story I've seen a hundred times: a developer writes his code single-threaded, he tests it with a single user and single process on his own laptop, then he deploys it to 200 servers, and the site goes down.
Single-threading is the enemy of scalability. Any portion of your application which blocks concurrent execution of the same code at the same time is going to limit you to the throughput of a single core on a single machine. I'm not just talking here about application code which takes a mutex, although that can be bad too. I'm talking about designs which block the entire application around waiting on one exclusively locked component.
For example, a popular beginning developer mistake is to put every single asynchronous task in a single non-forwarded queue, limiting the pace of the whole application to the rate at which messages can be pulled off that queue. Other popular mistakes are the frequently updated single-row "status" table, explicit locking of common resources, and total ignorance of which actions in one's programming language, framework, or database require exclusive locks on pages in memory.
One application I'm currently working on has a distributed data-processing cloud of 240 servers. However, assignment of chunks of data to servers for processing is done by a single-process daemon running on a single dispatch server, rate limiting the whole cloud to 4000 jobs/minute and leaving it 75% idle.
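The effect of a single dispatcher can be modelled in a few lines. This is a rough sketch that assumes throughput scales linearly with busy workers; the per-server rate below is not a measured figure but is inferred from the 4000 jobs/minute and 75% idle numbers above, and the function names are invented for the illustration.

```python
def cluster_throughput(dispatch_rate, workers, per_worker_rate):
    """Jobs per minute completed: the slower of dispatch and worker capacity."""
    return min(dispatch_rate, workers * per_worker_rate)

def idle_fraction(dispatch_rate, workers, per_worker_rate):
    """Share of total worker capacity left unused."""
    capacity = workers * per_worker_rate
    return 1 - cluster_throughput(dispatch_rate, workers, per_worker_rate) / capacity

# If 4000 jobs/minute keeps 240 servers only 25% busy, the servers could
# handle about 16000 jobs/minute between them (inferred, not measured).
PER_WORKER_RATE = 16000 / 240

print(idle_fraction(4000, 240, PER_WORKER_RATE))    # one dispatcher
print(cluster_throughput(4000, 480, PER_WORKER_RATE))  # double the servers
```

In this model, adding servers changes nothing until the dispatch rate itself is raised, for example by partitioning the work across several dispatchers.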
An even worse example was a popular sports web site we worked on. The site would update sports statistics by holding an exclusive lock on transactional database tables while waiting for a remote data service over the internet to respond. The client couldn't understand why adding more application servers to their infrastructure made the timeouts worse instead of better.
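The difference between holding a lock across a slow call and taking it only for the write can be simulated with threads. This is a sketch, not the site's code: `table_lock` stands in for the exclusive table lock, `fetch_remote` fakes the remote data service with a sleep, and the delay and key names are made up.

```python
import threading
import time

REMOTE_DELAY = 0.05            # hypothetical remote-service latency, seconds
table_lock = threading.Lock()  # stands in for an exclusive table lock
stats = {}                     # stands in for the statistics table

def fetch_remote(key):
    """Pretend to wait on a remote data service, then return a value."""
    time.sleep(REMOTE_DELAY)
    return len(key)

def update_holding_lock(key):
    """Anti-pattern: every other writer waits while this one is on the network."""
    with table_lock:
        stats[key] = fetch_remote(key)

def update_fetch_first(key):
    """Fetch without the lock, then hold it only for the brief write."""
    value = fetch_remote(key)
    with table_lock:
        stats[key] = value

def run_concurrently(update, keys):
    """Run one updater thread per key and return elapsed wall-clock seconds."""
    threads = [threading.Thread(target=update, args=(key,)) for key in keys]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

keys = ["team%d" % n for n in range(8)]
slow = run_concurrently(update_holding_lock, keys)
fast = run_concurrently(update_fetch_first, keys)
print("lock held during fetch: %.2fs, fetch first: %.2fs" % (slow, fast))
```

With the lock held during the fetch, the updates run one after another, so each additional concurrent client lengthens the queue. That is consistent with more application servers making the timeouts worse.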
Any time you design anything for your application which is supposed to scale, ask yourself "how would this work if 100 users were doing it simultaneously? 1000? 1,000,000?" And learn a functional language or map/reduce. They're good training for parallel thinking.
Coming in part 2
I'm sure you recognized at least one of the anti-patterns above in your own company, as most of the audience at the Ignite talk did. In part two of this article, I will cover component scaling, caching, and SPoFs, as well as the problem with The Cloud.
[ Note about the author: to support his habit of hacking on the
PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a
database and applications consulting company which helps clients make their
PostgreSQL applications more scalable, reliable, and secure. ]
The release of Puppet 2.7 brought one major change that has nothing to
do with its actual feature list — the license was changed
from the GPLv2 to the Apache License 2.0. This came as no surprise to
the Puppet contributor community, but it seems as if it might be part of a
trend towards more permissive licenses by companies working with open
source. Luke Kanies, the founder and CEO of Puppet Labs (the company that has grown up around Puppet), says that he has no political axe to grind with the decision; it's simply a matter of reducing "friction" when it comes to Puppet adoption.
In conversations about licensing, Kanies shows little passion for the
topic. But, when asked about the actual goals for Puppet, he exhibits a lot more interest in what Puppet (or something like Puppet) needs to accomplish: the ability to manage large-scale networks without needing to know the particulars of each device in the network.
Puppet is an "enterprise systems management platform" which started as a replacement for Cfengine. Kanies, who was a major contributor to Cfengine before starting Puppet, has a fairly modest goal for Puppet — ubiquity. "If we can get ubiquity, we can accomplish what we're trying to do... profitability is easy. What we're trying to do is [make it so] you don't need to know what OS you're running on."
According to Kanies, there was simply no good reason to remain with the
GPL. The license didn't do anything specifically to address his goals for
Puppet, and could actually hinder Puppet's ubiquity. Why? Kanies says that
"a number of companies," and two in particular, were
"quite afraid" of the GPL. One company, he says, avoids even
having Puppet in its infrastructure — to the point of having a
separate approval process for deploying GPL software. The other company
didn't have qualms about the use of GPL software, but did have concerns
about mixing GPL code with other code they ship.
It seems odd in 2011 to hear that companies still have "fears" about the
GPL, given its widespread adoption and endorsement by such a diverse
selection of companies — up to and including a giant like
IBM. However, Kanies says that plenty of companies (or perhaps more
accurately, their lawyers) have concerns about the GPL — and IBM is
perhaps a poor example:
You're right that IBM is comfortable with the GPL,
but there aren't many companies that can sue IBM. It's tough to scare IBM,
and IBM not being afraid is not a good indicator that everyone else should
not be afraid.
So what's the fear? Kanies says that it's the standard argument about
the GPL being untested in the courts, along with the fact that there's
disagreement and a lack of clarity about what "linking" means with regard to
dynamic languages, and whether that linking creates a derivative work. For the record, Kanies points out that he does not share the same fears about the GPL — but he also does not feel particularly strongly about the GPL, and certainly not enough to keep the license if it stands in the way of Puppet adoption.
As a single event, the change of one project's license from GPL to Apache is not particularly important (outside that project, of course). However, if it's part of a larger migration away from the GPL, then it may be worth noting.
Are projects moving away from the GPL? Not in droves, but there does seem to be a growing tendency for companies or projects without a strong philosophical bent to choose permissive licenses like the Apache, MIT, and BSD licenses. The 451 Group pointed to some evidence last year that companies were favoring more permissive licenses like the LGPL, BSD, Apache, and Eclipse Public Licenses. In January of this year, Stephen O'Grady noted that Black Duck Software's license figures showed a decline for the GPL overall.
The GPL still seems to be the dominant license, however. Black Duck Software tracks the adoption of GPLv3 versus GPLv2. The GPLv2 has dropped to well under 50% (45.42%), with the GPLv3 at nearly 7% of the projects it tracks. The Apache License 2.0 is at nearly 5%, and the MIT license is at just over 8%, as is the Artistic License. According to O'Grady, this is a 4% decline for GPLv2 since August 2009 (with an increase of only 1.34% for GPLv3) and a nearly 4% increase for the MIT license.
On Contributor License Agreements
The license change should not come as a surprise to anyone in the Puppet
community, though it has been greeted with some surprise in wider circles. Kanies says he has been asking the community about it for about two years, and has talked to "all of the major contributors" about the change. Kanies says that none of the contributors have raised a fuss, though he's gotten "one person that's said they're upset, and a couple who seem like they aren't that happy with the change and say 'I'd like to better understand this decision that you've made.'"
Since late 2009, Puppet has required a Contributor License
Agreement (CLA) in order to submit code to the project. Kanies says
it's similar to the Apache CLA, which basically provides the right to
relicense the software any way the project sees fit.
In the case of Puppet, there seems to be little real cause for
concern. Kanies provided ample time for the larger Puppet community to
comment on the license change, first raising
the issue in April, 2009 (when he thought he might go to the Affero
GPL), and announcing the planned change
to the Apache license five months later. It seems that few in the Puppet
community are upset by the change. Users receiving Puppet under the Apache
license are essentially in the same place they were before — able to
study, modify, use, and distribute Puppet freely. Contributors to Puppet
may not receive the same "protections" that the GPL affords, but it seems
that the contributor community to Puppet is not particularly concerned about this.
The Puppet change should serve as a reminder to other developers that CLAs are in place for a reason. When giving permission to a project or organization to re-license a work, it should be assumed the organization will exercise its rights at some point, perhaps in a way that is inoffensive, perhaps not. Absent a guarantee in the CLA to stick with a certain class of license, it should be at least considered that the program may be re-licensed in a way that is less friendly to its user and contributor community.
Page editor: Jonathan Corbet