LWN uses a lot of tools written in Python; one of them is the gitdm data miner which is used to generate kernel development statistics. It is a simple program which reads the output of "git log" and generates a big in-memory data structure reflecting the relationships between developers, their employers, and the patches they are somehow associated with. There is very little that is done in the kernel, and there is no use of extension modules written in C. These features make gitdm a natural first test for PyPy; there is little to trip things up.
The test was to stash the git log output from the 2.6.36 kernel release through the present - some 31,000 changes - in a file on a local SSD. The file, while large, should still fit in memory with nothing else running; I/O effects should, thus, not figure into the results. Gitdm was run on the file using both the CPython 2.7.1 interpreter and PyPy 1.5.
When switching to an entirely different runtime for a non-trivial program, it is natural to expect at least one glitch. In this case, there were none; gitdm ran without complaint and produced identical output. There was one significant difference, though: while the CPython runs took an average of about 63 seconds, the PyPy runs completed in about 21 seconds. In other words, for the cost of changing the "#!" line at the top of the program, the run time was cut to one third of its previous value. One might conclude that the effort was justified; plans are to run gitdm under PyPy from here on out.
To dig just a little deeper, the perf tool was used to generate a few statistics of the differing runs:
CPython PyPy Cycles 124B 42B Cache misses 14M 45M Syscalls 55,000 28,000
As would be expected from the previous result, running with CPython took about three times as many processor cycles as running with PyPy. On the other hand, CPython reliably incurred less than 1/3 as many cache misses; it would be hard to say why. Somehow, the code generated by the PyPy JIT generates more widely spread-out memory references; that may be related to garbage collection strategies. CPython uses reference counting, which can improve cache locality, while PyPy does not.
One other interesting thing to note is that PyPy only made half as many system calls. That called for some investigation. Since gitdm is just reading data and cranking on it, almost every system call it makes is read(). Sure enough, the CPython runtime was issuing twice as many read() calls. Understanding why would require digging into the code; it could be as simple as PyPy using larger buffers in its file I/O implementation.
Given results like this, one might well wonder why PyPy is not much more widely used. There may be numerous reasons, including a simple lack of awareness of PyPy among Python developers and users of their programs. But the biggest issue may be extension modules. Most non-trivial Python programs will use one or more modules which have been written in C for performance reasons, or because it's simply not possible to provide the required functionality in pure Python. These modules do not just move over to PyPy the way Python code does. There is a short list of modules supported by PyPy, but it's insufficient for many programs.
Fixing this problem would seem to be one of the most urgent tasks for the PyPy developers if they want to increase their user base. In other ways, PyPy is ready for prime time; it implements the (Python 2.x) language faithfully, and it is fast. With better support for extensions, PyPy could easily become the interpreter of choice for a lot of Python programs. It is a nice piece of work.
Let me tell you a secret. I don't fix databases. I fix applications.
Companies hire me to "fix the database" because they think it's the source of their performance and downtime problems. This is very rarely the case. Failure to scale is almost always the result of poor management decisions — often a series of them. In fact, these anti-scaling decisions are so often repeated that they have become anti-patterns.
I did a little talk about these anti-patterns at the last MySQL Conference and Expo. Go watch it and then come on back. Now that you've seen the five-minute version (and hopefully laughed at it), you're ready for some less sarcastic detail which explains how to recognize these anti-patterns and how to avoid them.
"Well ... our CTO is the only one at the weekly CTO's lunch who uses PostgreSQL. The other CTOs have been teasing him about it."
Does this sound like your CTO? It's a real conversation I had. It also describes more technical executives than I care to think about: more concerned with their personal image and career than they are with whether or not the site stays up or the company stays in business. If you start hearing any of the following words in your infrastructure meetings, you know you're in for some serious overtime: "hip", "hot", "cutting-edge", "latest tech", or "cool kids". References to magazine surveys or industry trends articles are also a bad sign.
Scaling an application is all about management of resources and administrative repeatability. This means using technology which your staff is extremely familiar with and which has been tested and proven to be reliable — and is designed to do the thing you want it to do. Hot new features are less important than consistent uptime without constant attention. More importantly, web technology usually makes big news while it's still brand new, which also means poorly documented, unstable, unable to integrate with other components, and full of bugs.
There's also another kind of trendiness to watch out for, it's the one which says, "If Google or Facebook does it, it must be the right choice." First, what's the right choice for them may not be the right choice for you, unless your applications and platform are very similar to theirs.
Second, not everything that Google and Facebook did with their infrastructures are things they would do again if they had to start over. Like everyone else, the top internet companies make bad decisions and get stuck with technology which is painful to use, but even more painful to migrate away from. So if you're going to copy something "the big boys" do, make sure you ask their staff what they think of that technology first.
Scaling an application is an arithmetic exercise. If one user consumes X amount of CPU time on the web server, how many web servers do you need to support 100,000 simultaneous users? If the database is growing at Y per day, and Z% of the data is "active" how long until the active data outgrows RAM?
Despite this common-sense idea, a surprising number of our clients were doing nothing more sophisticated than Nagios alerts on their hardware. This means that when a response time problem or outage occurs, they had no way to diagnose what caused it, and usually ended up fixing the wrong component.
Worse, if you don't have the math for what resources your application is actually consuming, then you have no idea how many servers, and of what kind, you need in order to scale up your site. That means you will be massively overbuilding some components, while starving others, and spending twice as much money as you need to.
Given how many companies lack metrics, or ignore them, how do they make decisions? Well ...
"Dan, you were an ad sales manager at Amazon."
In the absence of data, staff tend to troubleshoot problems according to their experience, which is usually wrong. Especially when an emergency occurs, there's a tendency to run to fix whatever broke last time. Of course, if they fixed the thing which broke last time, it's unlikely to be the cause of the current outage.
This sort of thinking gets worse when it comes time to plan for growth. I've seen plenty of IT staff purchase equipment, provision servers, configure hardware and software, and lay out networks according to what they did on their last project or even on their previous job. This means that the resources available for the current application are not at all matched to what that application needs, and either you over-provision dramatically or you go down.
Certainly you should learn from your experience. But you should learn appropriate lessons, like "don't depend on VPNs being constantly up". Don't misapply knowledge, like copying the caching strategy from a picture site to an online bank. Learning the wrong lesson is generally heralded by announcements in one or all of the following forms:
(For non-native English speakers, "barn door" refers to the expression "closing the barn door after the horses have run away")
Now, it's time to actually get into application design.
"Instantly! It's like magic."
The parallel processing frame of mind is a challenge for most developers. Here's a story I've seen a hundred times: a developer writes his code single-threaded, he tests it with a single user and single process on his own laptop, then he deploys it to 200 servers, and the site goes down.
Single-threading is the enemy of scalability. Any portion of your application which blocks concurrent execution of the same code at the same time is going to limit you to the throughput of a single core on a single machine. I'm not just talking here about application code which takes a mutex, although that can be bad too. I'm talking about designs which block the entire application around waiting on one exclusively locked component.
For example, a popular beginning developer mistake is to put every single asynchronous task in a single non-forwarded queue, limiting the pace of the whole application to the rate at which messages can be pulled off that queue. Other popular mistakes are the frequently updated single-row "status" table, explicit locking of common resources, and total ignorance of which actions in one's programming language, framework, or database require exclusive locks on pages in memory.
One application I'm currently working on has a distributed data-processing cloud of 240 servers. However, assignment of chunks of data to servers for processing is done by a single-process daemon running on a single dispatch server, rate limiting the whole cloud to 4000 jobs/minute and 75% idle.
An even worse example was a popular sports web site we worked on. The site would update sports statistics by holding an exclusive lock on transactional database tables while waiting for a remote data service over the internet to respond. The client couldn't understand why adding more application servers to their infrastructure made the timeouts worse instead of better.
Any time you design anything for your application which is supposed to scale, ask yourself "how would this work if 100 users were doing it simultaneously? 1000? 1,000,000?" And learn a functional language or map/reduce. They're good training for parallel thinking.
I'm sure you recognized at least one of the anti-patterns above in your own company, as most of the audience at the Ignite talk did. In part two of this article, I will cover component scaling, caching, and SPoFs, as well as the problem with The Cloud.
[ Note about the author: to support his habit of hacking on the PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a database and applications consulting company which helps clients make their PostgreSQL applications more scalable, reliable, and secure. ]
The release of Puppet 2.7 brought one major change that has nothing to do with its actual feature list — the license was changed from the GPLv2 to the Apache License 2.0. This came as no surprise to the Puppet contributor community, but it seems as if it might be part of a trend towards more permissive licenses by companies working with open source. Luke Kanies, the founder and CEO of Puppet Labs (the company that has grown up around Puppet) says that he has no political axe to grind with the decision — it's simply a matter of reducing "friction" when it comes to Puppet adoption.
In conversations about licensing, Kanies shows little passion for the topic. But, when asking about the actual goals for Puppet, he exhibits a lot more interest about what Puppet (or something like Puppet) needs to accomplish — the ability to manage large-scale networks without needing to know the particulars for each device in the network.
Puppet is an "enterprise systems management platform" which started as a replacement for Cfengine. Kanies, who was a major contributor to Cfengine before starting Puppet, has a fairly modest goal for Puppet — ubiquity. "If we can get ubiquity, we can accomplish what we're trying to do... profitability is easy. What we're trying to do is [make it so] you don't need to know what OS you're running on."
According to Kanies, there was simply no good reason to remain with the GPL. The license didn't do anything specifically to address his goals for Puppet, and could actually hinder Puppet's ubiquity. Why? Kanies says that "a number of companies," and two in particular, were "quite afraid" of the GPL. One company, he says, avoids even having Puppet in its infrastructure — to the point of having a separate approval process for deploying GPL software. The other company didn't have qualms about the use of GPL software, but did have concerns about mixing GPL code with other code they ship.
It seems odd in 2011 to hear that companies still have "fears" about the GPL, given its widespread adoption and endorsement by such a diverse selection of companies — up to and including a giant like IBM. However, Kanies says that plenty of companies (or perhaps more accurately, their lawyers) have concerns about the GPL — and IBM is perhaps a poor example:
So what's the fear? Kanies says that it's the standard argument about the GPL being untested in the courts, along with the fact that there's disagreement and a lack of clarity about what "linking" means with regards to dynamic languages, and whether that linking creates a derivative work. For the record, Kanies points out that he does not share the same fears about the GPL — but he also does not feel particularly strongly about the GPL, and certainly not enough to keep the license if it stands in the way of Puppet adoption.
As a single event, the change of one project's license from GPL to Apache is not particularly important (outside that project, of course). However, if it's part of a larger migration away from the GPL, then it may be worth noting.
Are projects moving away from the GPL? Not in droves, but there does seem to be less of a tendency for companies or projects without a strong philosophical bent to choosing permissive licenses like the Apache, MIT, and BSD licenses. The 451 Group pointed to some evidence last year that companies were favoring more permissive licenses like the LGPL, BSD, Apache, and Eclipse Public Licenses. In January of this year, Stephen O'Grady noted that Black Duck Software's license figures showed a decline for the GPL overall.
The GPL still seems to be the dominant license, however. Black Duck Software tracks the adoption of GPLv3 versus GPLv2. The GPLv2 has dropped to well under 50% (45.42%), with the GPLv3 at nearly 7% of the projects it tracks. The Apache License 2.0 is at nearly 5%, and the MIT license is at just over 8%, as is the Artistic License. According to O'Grady, this is a 4% decline for GPLv2 since August 2009 (with an increase of only 1.34% for GPLv3) and nearly 4% increase for the MIT license.
The license change should not come as a surprise to anyone in the Puppet community, though it has been greeted with some surprise in wider circles. Kanies says he has been asking the community for about two years, and has talked to "all of the major contributors" about the change. Kanies says that none of the contributors have raised a fuss, though he's gotten "one person that's said they're upset, and a couple who seem like they aren't that happy with the change and say 'I'd like to better understand this decision that you've made."
Since late 2009, Puppet has required a Contributor License Agreement (CLA) in order to submit code to the project. Kanies says it's similar to the Apache CLA, which basically provides the right to relicense the software any way the project sees fit.
In the case of Puppet, there seems to be little real cause for concern. Kanies provided ample time for the larger Puppet community to comment on the license change, first raising the issue in April, 2009 (when he thought he might go to the Affero GPL), and announcing the planned change to the Apache license five months later. It seems that few in the Puppet community are upset by the change. Users receiving Puppet under the Apache license are essentially in the same place they were before — able to study, modify, use, and distribute Puppet freely. Contributors to Puppet may not receive the same "protections" that the GPL affords, but it seems that the contributor community to Puppet is not particularly concerned about this.
The Puppet change should serve as a reminder to other developers that CLAs are in place for a reason. When giving permission to a project or organization to re-license a work, it should be assumed the organization will exercise its rights at some point — perhaps in a way that is unoffensive, perhaps not. Absent a guarantee in the CLA to stick with a certain class of license, it should be at least considered that the program may be re-licensed in a way that is less friendly to its user and contributor community.
Page editor: Jonathan Corbet
Next page: Security>>
Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds