By Jake Edge
April 3, 2013
While the advertised
singing and dancing never materialized, Larry Hastings's
PyCon 2013 talk did look at a
largely overlooked topic: Python bytecodes. The Python virtual machine (VM)
operates on these bytecodes, and Hastings's talk explored what they are and
what they do, how they can be examined, and why folks might be
interested. But the
talk also featured some concrete uses of bytecodes, including introducing
the "simplest
possible VM" written in Python, a pure-Python assembler/disassembler, and a Forth
interpreter targeting the Python
VM.
Hastings is a Python core developer who is also serving as the release manager
for Python 3.4. That release is currently targeted for
February 2014.
Introduction
Bytecodes have been used in language implementations going
back to the early days of computing. The idea is to turn the input
language into a simpler form that can be run on some particular VM.
Python bytecodes are an implementation detail specific to CPython (the C
language version of Python, which is what
most people think of as "Python"); the talk looked at the bytecodes used by
CPython 3.3.
The ideas behind bytecode (though not the bytecodes themselves) are roughly
applicable to other versions of CPython or to
implementations like PyPy (a Python
interpreter written in
Python). PyPy has bytecodes that are similar but not the same as those of
CPython. Implementations like
Jython and IronPython have very different
bytecodes because the whole point of those is to target a different VM.
The concepts behind bytecode will be applicable to all of those, though.
Bytecode is the assembly language for the Python VM, which is
"both virtual and a machine", he said. It is virtual in the "sense that it
doesn't really exist", it is not something you could "go over and kick".
It is a machine, in the sense of a computer, because it has registers and a
stack, and bytecode is just the code that operates on that machine.
There are essentially four different kinds of bytecodes in the Python VM.
There is a data stack and a set of bytecodes to add and remove items from
that stack. There are flow-control bytecodes that allow you to manipulate
the instruction pointer to jump forward and backward in the code.
The standard arithmetic operations are another set of bytecodes. The final
type is the "Pythonic" bytecodes, which do things specific to the
language, like creating a tuple or a set.
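As a rough illustration (a constructed example, not from the talk), a
small function can exercise all four kinds at once; disassembling it under
CPython 3.3 shows the opcodes named in the comments:

import dis

def kinds(x):
    while x > 0:       # flow control: POP_JUMP_IF_FALSE, JUMP_ABSOLUTE
        x = x - 1      # stack: LOAD_FAST/STORE_FAST; arithmetic: BINARY_SUBTRACT
    return (x,)        # "Pythonic": BUILD_TUPLE

dis.dis(kinds)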
Any time that Python is running code, it is running bytecode. By
definition, everything that is expressible in Python can be turned into
bytecode, but the reverse is not true—there are bytecode sequences
that are not translatable into valid Python. The reason that Python uses
bytecode is to "manage complexity" in implementing the language. It is a
common way to implement an interpreted language. Ruby has bytecodes, as
does PHP ("although they're crazy ... seriously").
If you are going to be a core contributor, it makes sense to study
bytecode, though there are areas of the core where that knowledge is not
required at all.
Hastings is somewhat skeptical of the other commonly cited
reasons—"so I'm really
wasting your time"—but the presentation made the subject interesting,
even if only as an academic exercise.
For
example, understanding what's "really going on" in the interpreter is one
reason for understanding bytecode, but that's a bit
dubious because there is more to it than that. Understanding the bytecode means
understanding C, which means
understanding assembly language, microcode, transistors, and, eventually, quantum
mechanics. You can get really far with Python programming
without knowing anything about quantum mechanics, he said.
Hand optimizing bytecode for performance is another
commonly cited reason, but there are much better ways to optimize your
program and bytecode is somewhat fragile—bytecode can change between
Python releases.
If you are a pure Python developer, there is one reason to know about
bytecode: it is the granularity at which the global interpreter lock (GIL)
operates. The interpreter can switch to a different thread between
bytecodes, but not while executing a given bytecode operation; individual
bytecodes are executed atomically.
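That granularity is easy to see with dis (a constructed example, not from
the talk): x += 1 compiles to several bytecodes, so a thread switch can
land between the load and the store, which is why the statement is not
thread-safe even with the GIL. Under CPython 3.3 the disassembly looks
roughly like:

>>> import dis
>>> def incr():
...     global x
...     x += 1
...
>>> dis.dis(incr)
  3           0 LOAD_GLOBAL              0 (x)
              3 LOAD_CONST               1 (1)
              6 INPLACE_ADD
              7 STORE_GLOBAL             0 (x)
             10 LOAD_CONST               0 (None)
             13 RETURN_VALUE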
Disassembling "gunk"
Hastings presented a function called "gunk" that "doesn't do anything
useful", but that illustrates some of the attributes of bytecode.
def gunk(a=1, *args, b=3):
    print(args)
    c = None
    return (a + b, c)
It is a Python-3-only function due to the argument list (and
print as a function). It has a positional argument a with a default of 1;
args holds all the rest of the positional arguments; and b is a
keyword-only argument with a default of 3. The function prints the
args tuple, sets the local variable c, and returns a tuple. Its bytecode
can be examined using the dis (disassembler) module:
>>> import dis
>>> dis.dis(gunk)
  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_FAST                2 (args)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP

  3          10 LOAD_CONST               0 (None)
             13 STORE_FAST               3 (c)

  4          16 LOAD_FAST                0 (a)
             19 LOAD_FAST                1 (b)
             22 BINARY_ADD
             23 LOAD_FAST                3 (c)
             26 BUILD_TUPLE              2
             29 RETURN_VALUE
Bytecode consists of a single-byte operation code (opcode) followed by an
optional two-byte parameter. Thus, bytecodes are either one or three bytes
in length.
The first column in the disassembly shows the line number in the function,
the second is the address of the opcode (i.e. the sum of the sizes of the
previous bytecodes), the third is the opcode itself, the fourth is the
opcode's parameter, and the last is an interpretation of the parameter.
The output from dis leaves something to be desired, he said. For
example "0" turns into "print" somehow, but dis doesn't explain
it. Nor does it show the defaults of the arguments, or make a distinction
between arguments and local variables. All of that
led Hastings to write his own assembler/disassembler, which he described at
the end of the talk.
There are 101 different opcodes in Python 3.3. The
dis.HAVE_ARGUMENT constant (currently 90) governs whether a given
opcode has an argument or not. If the opcode is less than that value, it
has no argument, and is one byte in size.
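That split is easy to poke at from the dis module's own tables (a
constructed example; the exact opcode numbers are specific to CPython 3.3):

>>> import dis
>>> dis.HAVE_ARGUMENT
90
>>> dis.opname[23]    # below 90: no argument, one byte
'BINARY_ADD'
>>> dis.opname[100]   # 90 or above: a two-byte argument follows
'LOAD_CONST'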
Bytecode requires a runtime environment, which is provided by the Python
VM. The VM has three types of data associated with it, each of which can
be manipulated by various opcodes. The first is the instruction pointer,
which can be
changed with JUMP_* opcodes. The stack is another piece of data
the VM keeps track of. Various opcodes like LOAD_* or
STORE_* directly add or remove things from the stack, but there are
also opcodes to do rotations and other manipulations. The final data
storage location for the VM is the "fast locals array", which stores
arguments and local variables; the LOAD_FAST and
STORE_FAST opcodes are used to access items in that array.
The stack is used to store arbitrary Python objects. The LOAD_*
opcodes push something onto the stack, while the STORE_* opcodes
remove things from the top. The
BINARY_ADD opcode takes the top two objects off the stack, "adds"
them (which might not mean addition, depending on the types), and pushes
the result back onto the stack.
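That stack discipline is visible in even the smallest function (a
constructed example; this is the output under CPython 3.3):

>>> import dis
>>> dis.dis(lambda a, b: a + b)
  1           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 RETURN_VALUE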
There are six different variable types that are used in the VM and which
can be moved to or from the stack with the load and store opcodes. Globals and
builtins are loaded with LOAD_GLOBAL, while the fast locals array is
accessed with LOAD_FAST. There is a "slow locals" dictionary,
which was the original mechanism used for arguments and local variables,
but is now used for class and module variables with the LOAD_NAME
opcode. Constants use LOAD_CONST, while object attributes use
LOAD_ATTR. For all except CONST, there is a
corresponding STORE_* opcode.
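A constructed sketch that touches most of these (under CPython 3.3, the
commented opcodes show up in the disassembly; LOAD_NAME would appear
instead if the same code ran in a class or module body):

import dis

SCALE = 10

def f(obj):
    local = SCALE + 1         # LOAD_GLOBAL (SCALE), LOAD_CONST (1), STORE_FAST
    return obj.attr + local   # LOAD_FAST (obj), LOAD_ATTR (attr), LOAD_FAST (local)

dis.dis(f)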
The last variable type is "cell variables" that use the LOAD_DEREF
opcode. Cell variables are local variables that are referenced by a nested
function, so that function must look for them in an enclosing scope. In the
following, a is a cell variable in foo() (and a "free"
variable in bar()).
def foo():
    a = 1
    def bar():
        print(a)
It is not clear to Hastings why cell variables are handled that way, but
they are.
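The cell plumbing can be seen on the code objects themselves (a
constructed example using the foo() above): foo() records a as a cell
variable, and the nested code object for bar(), which lives in foo()'s
constants, records it as a free variable:

>>> foo.__code__.co_cellvars
('a',)
>>> bar_code = [c for c in foo.__code__.co_consts
...             if hasattr(c, 'co_freevars')][0]
>>> bar_code.co_freevars
('a',)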
Rooting around in code objects
Looking deeper inside the gunk() object shows that its type is
types.FunctionType, but that it contains a code object
(gunk.__code__, which is types.CodeType).
One thing to note: when talking about various double-underscore names (like
__code__), it is evidently traditional in the Python community to
use "dunder" to describe
them. So while viewing the video
of Hastings's talk, or others from PyCon, you will hear things like "dunder
code" for __code__.
But if every function object contains a code object, "why have both?",
Hastings asked. Code objects must be able to be handled by the marshal
binary serialization format, which is conceptually similar to pickle. Marshal is used
to create .pyc files. In fact, a .pyc is just three
four-byte integers (magic number, datestamp and size of the .py
file) followed by a marshaled code object. But function objects contain
lots of "extras" that cannot be marshaled (and may not even be
pickle-able).
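The split is easy to demonstrate (a minimal sketch, not from the talk):
a code object survives a marshal round trip, and a new function can be
wrapped around the result by hand:

import marshal, types

def add(a, b):
    return a + b

data = marshal.dumps(add.__code__)          # what a .pyc stores
add2 = types.FunctionType(marshal.loads(data), globals())
print(add2(1, 2))                           # 3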
While all of that is true, Hastings asked Python creator Guido van Rossum
about the reason for having both kinds of objects a few days before the
talk. Van Rossum said that the decision predates marshal and .pyc files,
and was originally done to support nested functions so that the code
object could be shared by multiple instantiations of the enclosing
function. That saves time and memory.
To get to the actual bytecode of gunk(), you need to look at the
co_code attribute:
>>> gunk.__code__.co_code
b't\x00\x00|\x02\x00\x83\x01\x00\x01d\x00\x00}\x03\x00|\x00\x00|\x01\x00\x17|\x03\x00f\x02\x00S'
or, slightly more readably:
>>> [x for x in gunk.__code__.co_code]
[116, 0, 0, 124, 2, 0, 131, 1, 0, 1, 100, 0, 0, 125, 3, 0, 124, 0, 0, 124,
1, 0, 23, 124, 3, 0, 102, 2, 0, 83]
He showed the "simplest useful disassembler", which fit on a single
slide. It
would simply turn the opcode byte into a name and print its argument (if
any) as an integer. Running it on itself gives "almost readable" output,
he said,
which is 80% of what
dis.dis() does but in only 15 lines of code.
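His slide is not reproduced here, but a sketch along the same lines (a
constructed example, assuming CPython 3.3's one- or three-byte instruction
format; the simple_dis name is made up) fits in about a dozen lines:

import dis

def simple_dis(func):
    code = func.__code__.co_code
    i = 0
    while i < len(code):
        op = code[i]                   # the opcode byte (an int in Python 3)
        if op >= dis.HAVE_ARGUMENT:
            arg = code[i+1] | (code[i+2] << 8)   # little-endian two-byte argument
            print(i, dis.opname[op], arg)
            i += 3
        else:
            print(i, dis.opname[op])
            i += 1

simple_dis(simple_dis)   # run it on itself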
Digging around in the code object further, Hastings showed other
co_* attributes on gunk.__code__. There are counts of
various kinds of arguments (positional, keyword) to the function and a count
of the fast locals along with an array of their names ('a', 'b', 'args', and
'c', for gunk()). It also contains the arrays referred to in the
parameters to opcodes. For example, gunk.__code__.co_names contains the
global names used in the function (just 'print' here; None is a constant
in Python 3, not a name), while co_consts has a tuple with the constants
(None). Beyond that,
there is information on line numbers, filename, module name, maximum stack
depth, and on and on.
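Poking at those attributes interactively (a constructed example under
CPython 3.3):

>>> gunk.__code__.co_argcount        # positional parameters
1
>>> gunk.__code__.co_kwonlyargcount  # keyword-only parameters
1
>>> gunk.__code__.co_varnames        # fast locals, parameters first
('a', 'b', 'args', 'c')
>>> gunk.__code__.co_names
('print',)
>>> gunk.__code__.co_consts
(None,)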
Modules and classes
A function is just a def statement followed by some code (which
gets turned into a code object). A module in Python is just a bunch of
code without the def; it also gets turned into a code object.
There are no arguments to the module and it effectively ends with a
"return None" since all code objects must return.
Classes are a bit more interesting, Hastings said. The "slow" locals
dictionary is passed as the argument to the classname callable object.
That is new in Python 3; it allows a metaclass's __prepare__()
method to substitute its own dictionary-like object in place of the
__locals__ dictionary. That is all done in support of metaclass
handling for Python.
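A minimal illustration of that hook (a constructed example, not from the
talk; the Meta and C names are made up):

class Meta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwds):
        print("preparing the namespace for", name)
        return {}          # any dict-like object can be substituted here

class C(metaclass=Meta):   # prints: preparing the namespace for C
    x = 1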
Creating a function by hand is easy to do, with only four lines of code,
Hastings said, "but they're four lines that you're not going to like".
There is a lot of magic in that slide that creates an add()
function, but the three slides following show how to do so in a more
readable form. But, even that is not "terribly readable", which is why he
wrote his own assembler.
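In that more readable spirit, one way to conjure a function out of nothing
but source text (a sketch in the same vein, not the slides' actual code)
is to compile a module body, execute its code object in a scratch
namespace, and pluck the function out:

source = "def add(a, b):\n    return a + b\n"
code = compile(source, "<hand-made>", "exec")   # a module-level code object
ns = {}
exec(code, ns)                                  # running it defines add() in ns
add = ns["add"]
print(add(2, 3))                                # 5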
The holy hand grenade
His assembler project,
named "Maynard"
after the keeper of the "holy hand
grenade" ("the most effective disassembler known to man or beast"),
includes both an assembler and disassembler. The Maynard repository also
contains the sample code
from his talk. Its output
is "far more readable" when applied to
gunk() for example.
His eventual goal with Maynard is to make it "round-trippable", so that you
could disassemble a function, then assemble the result, and get the
function back again. It's close, but not there yet.
Hastings then presented "another terrible idea of mine", which is Perth—a
Forth interpreter on top of the CPython VM. It is "really a toy" that does
not (yet) implement recursion. That will change, as he will not consider
it done until it can run the Fibonacci number
algorithm. "You don't have a real language until you can run Fibonacci",
he said. The biggest hurdle he has faced is that the Python stack
operations do not allow deep stack manipulations, which leads to some
terribly inefficient workarounds for a stack-based language like Forth.
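The limitation is visible in the opcode table itself (a constructed
example under CPython 3.3): the pure stack-shuffling opcodes reach at most
three items deep, with nothing like Forth's PICK or ROLL for arbitrary
depths:

>>> import dis
>>> sorted(n for n in dis.opmap if n.startswith(('ROT', 'DUP')))
['DUP_TOP', 'DUP_TOP_TWO', 'ROT_THREE', 'ROT_TWO']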
Finally, for his last trick, Hastings presented a five-slide
"simplest possible VM" that could nevertheless run fib(). It
is a Python VM written completely in Python, which is not "very robust"; it
works for that particular function "and nothing else".
If you play with bytecode, Hastings said, there is "something you should
expect to see a lot":
segmentation fault (core dumped)
The Python VM makes no guarantees that it will be robust in the face of
badly formed bytecode—it is "very easy to crash". For those wanting
more information, the
ceval.c file in the Python source tree is
the canonical location for the bytecodes, but requires knowledge of C.
Hastings also pointed
to Ned Batchelder's
byterun, which is a pure
Python implementation of a bytecode-executing VM.
Interpreted languages provide introspection opportunities that go well
beyond just the data structures and objects in the program. There is
something rather interesting in being able to root around inside the
execution environment of a language like Python. One practical benefit of
learning about Python's bytecodes that Hastings didn't mention is that it
provides a view into the implementation of an interpreted language. That's
certainly of more than simply academic interest.