By Jonathan Corbet
May 11, 2011
While one might ordinarily think of the PyPy project as an experiment in
implementing the Python runtime in Python itself, there is really more to
it than that. PyPy is, in a sense, a toolbox for the creation of
just-in-time compilers for dynamic languages; Python is just the start -
but it's an interesting start. It has been almost exactly one year since
LWN first looked at PyPy and a few weeks since the 1.5 release, so the time
seemed right to actually play with this tool a bit. The results were
somewhat eye-opening.
LWN uses a lot of tools written in Python; one of them is the gitdm data
miner which is used to generate kernel development statistics. It is a
simple program which reads the output of "git log" and
generates a big in-memory data structure reflecting the relationships between
developers, their employers, and the patches they are somehow associated
with. Very little of its work is done in the kernel, and it makes no use
of extension modules written in C. These features make gitdm a
natural first test for PyPy; there is little to trip things up.
The test was to stash the git log output from the 2.6.36 kernel
release through the present - some 31,000 changes - in a file on a local
SSD. The file, while large, should still fit in memory with nothing else
running; I/O effects should, thus, not figure into the results. Gitdm was
run on the file using both the CPython 2.7.1 interpreter and
PyPy 1.5.
When switching to an entirely different runtime for a non-trivial program,
it is natural to expect at least one glitch. In this case, there were
none; gitdm ran without complaint and produced identical output. There was
one significant difference, though: while the CPython runs took an average
of about 63 seconds, the PyPy runs completed in about
21 seconds. In other words, for the cost of changing the "#!" line at
the top of the program, the run time was cut to one third of its previous
value. One might conclude that the effort was justified; plans are to run
gitdm under PyPy from here on out.
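For those wanting to try a similar comparison on their own programs, a rough timing harness might look like the following sketch. It is not the procedure used for the numbers above; the function names are made up for this example, and the interpreter names, script path, and input file a user would pass in are assumptions about the local setup.

```python
import subprocess
import time


def average_run_time(argv, runs=3, stdin_path=None):
    """Run argv `runs` times; return the mean wall-clock time in seconds.

    If stdin_path is given, that file is fed to the program's standard
    input on every run (useful for tools that read a log from stdin).
    """
    total = 0.0
    for _ in range(runs):
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            start = time.time()
            subprocess.check_call(argv, stdin=stdin)
            total += time.time() - start
        finally:
            if stdin is not None:
                stdin.close()
    return total / runs


def compare_interpreters(interpreters, script_args, runs=3, stdin_path=None):
    """Return {interpreter: mean seconds} for the same script under each."""
    return {
        interp: average_run_time([interp] + list(script_args), runs, stdin_path)
        for interp in interpreters
    }
```

Something like compare_interpreters(["python", "pypy"], ["gitdm"], stdin_path="git-log.txt") would then yield one average per interpreter, assuming both are on the PATH and that the script and log file exist under those (hypothetical) names.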
To dig just a little deeper, the perf tool was used to generate a
few statistics of the differing runs:
| | CPython | PyPy |
| Cycles | 124B | 42B |
| Cache misses | 14M | 45M |
| Syscalls | 55,000 | 28,000 |
As would be expected from the previous result, running with CPython took
about three times as many processor cycles as running with PyPy. On the
other hand, CPython reliably incurred less than 1/3 as many cache misses;
it would be hard to say why. Somehow, the code produced by the PyPy JIT
makes more widely spread-out memory references; that may be related to
garbage collection strategies. CPython uses reference counting, which can
improve cache locality, while PyPy does not use reference counting.
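The ratios quoted here follow directly from the perf numbers in the table; a few lines of Python make the arithmetic explicit. The figures are the rounded values from the table, so the results are approximate, and the variable and function names are just for this example.

```python
# Rounded perf figures from the table above (B = billions, M = millions).
PERF = {
    "cycles": {"CPython": 124e9, "PyPy": 42e9},
    "cache_misses": {"CPython": 14e6, "PyPy": 45e6},
    "syscalls": {"CPython": 55000, "PyPy": 28000},
}


def cpython_to_pypy_ratio(metric):
    """Return the CPython value divided by the PyPy value for a metric."""
    row = PERF[metric]
    return row["CPython"] / row["PyPy"]


for metric in ("cycles", "cache_misses", "syscalls"):
    print("%-13s CPython/PyPy = %.2f" % (metric, cpython_to_pypy_ratio(metric)))
```

The cycle ratio comes out just under three, the cache-miss ratio a little under one third, and the system call ratio close to two.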
One other interesting thing to note is that PyPy made only about half as
many system calls. That called for some investigation. Since gitdm is just
reading data and cranking on it, almost every system call it makes is
read(). Sure enough, the CPython runtime was issuing twice as
many read() calls. Understanding why would require digging into
the code; it could be as simple as PyPy using larger buffers in its file
I/O implementation.
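The buffer-size hypothesis is easy to illustrate, if not to confirm. The sketch below reads a file with os.read(), which issues one read() system call per invocation, and counts the calls needed for a given buffer size. The buffer sizes actually used by CPython 2.7.1 and PyPy 1.5 were not checked, so any sizes passed in are purely illustrative, and the function name is invented for this example.

```python
import os


def count_read_calls(path, buffer_size):
    """Read the whole file in buffer_size chunks.

    Returns (calls, bytes_read), where calls includes the final read()
    that returns no data and so signals end of file.
    """
    calls = 0
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, buffer_size)  # one read() system call
            calls += 1
            if not chunk:  # empty result means end of file
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return calls, total
```

Doubling the buffer size roughly halves the number of calls needed to get through a large file, which would be consistent with the two-to-one difference seen here, though it does not prove that this is the actual cause.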
Given results like this, one might well wonder why PyPy is not much more
widely used. There may be numerous reasons, including a simple lack of
awareness of PyPy among Python developers and users of their programs. But
the biggest issue may be extension modules. Most non-trivial Python
programs will use one or more modules which have been written in C for
performance reasons, or because it's simply not possible to provide the
required functionality in pure Python. These modules do not just move over
to PyPy the way Python code does. There is a
short list of modules supported by PyPy, but it's insufficient for many
programs.
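Programs which can live without their C extensions sometimes cope by trying the fast module first and falling back to a pure-Python equivalent. A minimal sketch of that pattern follows; the helper names are invented for this example, and whether a usable pure-Python fallback exists at all depends entirely on the module in question.

```python
import importlib
import platform


def import_first_available(module_names):
    """Import and return the first module in module_names that loads.

    Intended use: list a C-accelerated module first and a pure-Python
    replacement after it, so the program still runs where the C module
    is unavailable.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of %r could be imported" % (list(module_names),))


def running_on_pypy():
    """Report whether the current interpreter identifies itself as PyPy."""
    return platform.python_implementation() == "PyPy"
```

A call such as import_first_available(["fastparser_c", "fastparser_py"]), with both module names hypothetical, would pick up the C version under CPython and quietly fall back elsewhere. That only helps when somebody has written the fallback, of course, which is the heart of the problem.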
Fixing this problem would seem to be one of the most urgent tasks for the
PyPy developers if they want to increase their user base. In other ways,
PyPy is ready for prime time; it implements the (Python 2.x) language
faithfully, and it is fast. With better support for extensions,
PyPy could easily become the interpreter of choice for a lot of Python
programs. It is a nice piece of work.