PyCon: Peering in on bytecodes
While the advertised singing and dancing never materialized, Larry Hastings's PyCon 2013 talk did look at a largely overlooked topic: Python bytecodes. The Python virtual machine (VM) operates on these bytecodes, and Hastings's talk explored what they are and do, how they can be examined, and why folks might be interested. But the talk also featured some concrete uses of bytecodes, including introducing the "simplest possible VM" written in Python, a pure-Python assembler/disassembler, and a Forth interpreter targeting the Python VM.
Hastings is a Python core developer who is also serving as the release manager for Python 3.4. That release is currently targeted for February 2014.
Introduction
![Larry Hastings](https://static.lwn.net/images/2013/pycon-hastings-sm.jpg)
Bytecodes have been used in language implementations going back to the early days of computing. The idea is to turn the input language into a simpler form that can be run on some particular VM. Python bytecodes are an implementation detail specific to CPython (the C language version of Python, which is what most people think of as "Python"); the talk looked at the bytecodes used by CPython 3.3.
The ideas behind bytecode (though not the bytecodes themselves) are roughly applicable to other versions of CPython and to implementations like PyPy (a Python interpreter written in Python). PyPy has bytecodes that are similar to, but not the same as, those of CPython. Implementations like Jython and IronPython have very different bytecodes because the whole point of those is to target a different VM. The concepts behind bytecode apply to all of them, though.
Bytecode is the assembly language for the Python VM, which is "both virtual and a machine", he said. It is virtual in the "sense that it doesn't really exist"; it is not something you could "go over and kick". It is a machine, in the sense of a computer, because it has registers and a stack, and bytecode is just the code that operates on that machine.
There are essentially four kinds of bytecodes in the Python VM. There is a data stack and a set of bytecodes to add and remove items from that stack. There are flow-control bytecodes that allow you to manipulate the instruction pointer to jump forward and backward in the code. The standard arithmetic operations are another set of bytecodes. The final kind is the "Pythonic" bytecodes, which do things specific to the language, like creating a tuple or a set.
Any time that Python is running code, it is running bytecode. By definition, everything that is expressible in Python can be turned into bytecode, but the reverse is not true—there are bytecode sequences that are not translatable into valid Python. The reason that Python uses bytecode is to "manage complexity" in implementing the language. It is a common way to implement an interpreted language. Ruby has bytecodes, as does PHP ("although they're crazy ... seriously").
If you are going to be a core contributor, it makes sense to study bytecode, though there are areas of the core where that knowledge is not required at all. Hastings is somewhat skeptical of the other commonly cited reasons—"so I'm really wasting your time"—but the presentation made the subject interesting, even if only as an academic exercise. For example, understanding what's "really going on" in the interpreter is one reason for understanding bytecode, but that's a bit dubious because there is more to it than that. Understanding the bytecode means understanding C, which means understanding assembly language, microcode, transistors, and, eventually, quantum mechanics. You can get really far with Python programming without knowing anything about quantum mechanics, he said.
Hand optimizing bytecode for performance is another commonly cited reason, but there are much better ways to optimize your program and bytecode is somewhat fragile—bytecode can change between Python releases. If you are a pure Python developer, there is one reason to know about bytecode: it is the granularity at which the global interpreter lock (GIL) operates. The GIL can switch to a different thread between bytecodes, but not while executing a given bytecode operation, as the interpreter atomically executes individual bytecodes.
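To see why that granularity matters, consider how many bytecodes even a simple augmented assignment turns into. The snippet below (an illustration, not from the talk) disassembles a tiny function; the GIL can hand control to another thread between any two of the listed instructions, which is why an operation like this is not thread-safe without a lock.

```python
# Illustrative example (not from the talk): "n += 1" is several
# bytecodes, and a thread switch can happen between any two of them.
import dis

def bump(n):
    n += 1
    return n

dis.dis(bump)   # LOAD_FAST, LOAD_CONST, INPLACE_ADD, STORE_FAST, ...
```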
Disassembling "gunk"
Hastings presented a function called "gunk" that "doesn't do anything useful" but will illustrate some of the attributes of bytecode.
```python
def gunk(a=1, *args, b=3):
    print(args)
    c = None
    return (a + b, c)
```

It is a Python-3-only function due to the argument list (and print as a function). It has a positional argument a, with a default of 1; args holds all the rest of the positional arguments; and b is a keyword-only argument with a default of 3. The function prints args, sets local variable c, and returns a tuple. Its bytecode can be examined using the dis (disassembler) module:
```
>>> import dis
>>> dis.dis(gunk)
  2           0 LOAD_GLOBAL              0 (print)
              3 LOAD_FAST                2 (args)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP

  3          10 LOAD_CONST               0 (None)
             13 STORE_FAST               3 (c)

  4          16 LOAD_FAST                0 (a)
             19 LOAD_FAST                1 (b)
             22 BINARY_ADD
             23 LOAD_FAST                3 (c)
             26 BUILD_TUPLE              2
             29 RETURN_VALUE
```

Bytecode consists of a single-byte operation code (opcode) followed by an optional two-byte parameter. Thus, bytecodes are either one or three bytes in length. The first column in the disassembly shows the line number in the function, the second is the address of the opcode (i.e. the sum of the sizes of the previous bytecodes), the third is the opcode itself, the fourth is the opcode's parameter, and the last is an interpretation of the parameter.
The output from dis leaves something to be desired, he said. For example "0" turns into "print" somehow, but dis doesn't explain it. Nor does it show the defaults of the arguments, or make a distinction between arguments and local variables. All of that led Hastings to write his own assembler/disassembler, which he described at the end of the talk.
There are 101 different opcodes in Python 3.3. The dis.HAVE_ARGUMENT constant (currently 90) governs whether a given opcode has an argument or not. If the opcode is less than that value, it has no argument, and is one byte in size.
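The dis module exposes enough of the opcode table to check this interactively. A quick sketch (values shown are for CPython 3.3 and change between releases):

```python
import dis

print(len(dis.opmap))      # number of defined opcodes (101 in 3.3, per the talk)
print(dis.HAVE_ARGUMENT)   # 90; opcodes at or above this take a two-byte argument
print(dis.opname[100])     # 'LOAD_CONST', which therefore has an argument
print(dis.opname[23])      # 'BINARY_ADD', below 90, so it has none
```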
Bytecode requires a runtime environment, which is provided by the Python VM. The VM has three types of data associated with it, each of which can be manipulated by various opcodes. The first is the instruction pointer, which can be changed with JUMP_* opcodes. The stack is another piece of data the VM keeps track of. Various opcodes like LOAD_* or STORE_* directly add or remove things from the stack, but there are also opcodes to do rotations and other manipulations. The final data storage location for the VM is the "fast locals array", which stores arguments and local variables; the LOAD_FAST and STORE_FAST opcodes are used to access items in that array.
The stack is used to store arbitrary Python objects. The LOAD_* opcodes push something onto the stack, while the STORE_* opcodes remove things from the top. The BINARY_ADD opcode takes the top two objects off the stack, "adds" them (which might not mean addition depending on the types) and loads the result back onto the stack.
There are six different variable types that are used in the VM and which can be moved to or from the stack with the load and store opcodes. Globals and builtins are loaded with LOAD_GLOBAL, while the fast locals array is accessed with LOAD_FAST. There is a "slow locals" dictionary, which was the original mechanism used for arguments and local variables, but is now used for class and module variables with the LOAD_NAME opcode. Constants use LOAD_CONST, while object attributes use LOAD_ATTR. For all except CONST, there is a corresponding STORE_* opcode.
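A single small function can exercise several of those variable kinds at once. The sketch below (names are illustrative, not from the talk) shows LOAD_GLOBAL, STORE_FAST, LOAD_FAST, and LOAD_ATTR side by side; LOAD_NAME, by contrast, only shows up in module- and class-level code.

```python
import dis

SCALE = 10                       # module-level global

def kinds(obj):
    local = SCALE                # LOAD_GLOBAL (SCALE), then STORE_FAST (local)
    return obj.attr * local      # LOAD_FAST (obj), LOAD_ATTR (attr),
                                 # LOAD_FAST (local), BINARY_MULTIPLY

dis.dis(kinds)
```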
The last variable type is "cell variables" that use the LOAD_DEREF opcode. Cell variables are local variables that are referenced by a nested function, so that function must look for them in an enclosing scope. In the following, a is a cell variable in foo() (and a "free" variable in bar()).
```python
def foo():
    a = 1
    def bar():
        print(a)
```

It is not clear to Hastings why cell variables are handled that way, but they are.
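A slight variant of that example (it returns bar() so the nested function can be inspected; this is an illustration, not Hastings's slide) shows where the cell actually appears:

```python
import dis

def foo():
    a = 1
    def bar():
        print(a)      # "a" lives in a cell shared with foo()
    return bar

print(foo.__code__.co_cellvars)    # ('a',)  -- cell variable in foo()
print(foo().__code__.co_freevars)  # ('a',)  -- free variable in bar()
dis.dis(foo())                     # the print(a) uses LOAD_DEREF 0 (a)
```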
Rooting around in code objects
Looking deeper inside the gunk() object shows that its type is types.FunctionType, but that it contains a code object (gunk.__code__, which is types.CodeType). One thing to note: when talking about various double-underscore names (like __code__), it is evidently traditional in the Python community to use "dunder" to describe them. So while viewing the video of Hastings's talk, or others from PyCon, you will hear things like "dunder code" for __code__.
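Those types are easy to confirm from the interpreter prompt (using the gunk() function defined earlier):

```python
import types

print(type(gunk) is types.FunctionType)        # True
print(type(gunk.__code__) is types.CodeType)   # True
```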
But if every function object contains a code object, "why have both?", Hastings asked. Code objects must be able to be handled by the marshal binary serialization format, which is conceptually similar to pickle. Marshal is used to create .pyc files. In fact, a .pyc is just three four-byte integers (magic number, datestamp and size of the .py file) followed by a marshaled code object. But function objects contain lots of "extras" that cannot be marshaled (and may not even be pickle-able).
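That layout is easy to verify by hand. The sketch below is a minimal reader for a CPython 3.3-era .pyc; the 12-byte header of magic number, timestamp, and source size is version-specific, and the path in the usage comment is hypothetical.

```python
import dis
import marshal
import struct

def load_pyc(path):
    """Return (magic, mtime, source_size, code) from a 3.3-era .pyc file."""
    with open(path, 'rb') as f:
        magic = f.read(4)                              # version-specific magic number
        mtime, size = struct.unpack('<II', f.read(8))  # datestamp, .py file size
        code = marshal.loads(f.read())                 # the module's code object
    return magic, mtime, size, code

# Hypothetical usage:
#   magic, mtime, size, code = load_pyc('__pycache__/gunk.cpython-33.pyc')
#   dis.dis(code)
```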
While all of that is true, Hastings asked Python creator Guido van Rossum about the reason for having both kinds of objects a few days before the talk. Van Rossum said that the decision predates marshal and .pyc files, and was originally done to support nested functions so that the code object could be shared by multiple instantiations of the enclosing function. That saves time and memory.
To get to the actual bytecode of gunk(), you need to look at the co_code attribute:
```
>>> gunk.__code__.co_code
b't\x00\x00|\x02\x00\x83\x01\x00\x01d\x00\x00}\x03\x00|\x00\x00|\x01\x00\x17|\x03\x00f\x02\x00S'
```

or, slightly more readably:
```
>>> [x for x in gunk.__code__.co_code]
[116, 0, 0, 124, 2, 0, 131, 1, 0, 1, 100, 0, 0, 125, 3, 0, 124, 0,
 0, 124, 1, 0, 23, 124, 3, 0, 102, 2, 0, 83]
```

He showed the "simplest useful disassembler", which fit on a single slide. It would simply turn the opcode byte into a name and print its argument (if any) as an integer. Running it on itself gives "almost readable" output, he said, which is 80% of what dis.dis() does but in only 15 lines of code.
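The slide itself is not reproduced here, but a sketch along the same lines, assuming the 3.3-era one- and three-byte instruction format, might look like this (illustrative code, not Hastings's):

```python
import dis

def simple_dis(func):
    code = func.__code__.co_code
    i = 0
    while i < len(code):
        op = code[i]                                 # the opcode byte
        if op >= dis.HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)   # little-endian 16-bit argument
            print(i, dis.opname[op], arg)
            i += 3
        else:
            print(i, dis.opname[op])
            i += 1

# simple_dis(gunk) prints lines like "0 LOAD_GLOBAL 0", "3 LOAD_FAST 2", ...
```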
Digging around in the code object further, Hastings showed other co_* attributes on gunk.__code__. There are counts of the various kinds of arguments (positional, keyword-only) to the function and a count of the fast locals, along with an array of their names ('a', 'b', 'args', and 'c' for gunk()). It also contains the arrays referred to by the parameters to opcodes. For example, gunk.__code__.co_names contains the global names used in the function (just 'print'), while co_consts has a tuple with the constants (just None). Beyond that, there is information on line numbers, filename, module name, maximum stack depth, and on and on.
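A few of those attributes, pulled out of gunk() at the prompt, make the list above concrete:

```python
code = gunk.__code__
print(code.co_argcount, code.co_kwonlyargcount)  # 1 positional, 1 keyword-only
print(code.co_varnames)    # ('a', 'b', 'args', 'c')  -- the fast locals
print(code.co_names)       # ('print',)               -- global/builtin names used
print(code.co_consts)      # (None,)                  -- constants
print(code.co_stacksize)   # maximum stack depth the function needs
```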
Modules and classes
A function is just a def statement followed by some code (which gets turned into a code object). A module in Python is just a bunch of code without the def; it also gets turned into a code object. There are no arguments to the module and it effectively ends with a "return None" since all code objects must return.
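Compiling a snippet of module-level source by hand shows both points: the whole module becomes one code object, and its bytecode ends by returning None. (The source string here is just an illustration.)

```python
import dis

module_source = "x = 1\nprint(x)\n"
module_code = compile(module_source, "<example>", "exec")
dis.dis(module_code)   # the listing ends with LOAD_CONST (None), RETURN_VALUE
```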
Classes are a bit more interesting, Hastings said. The "slow" locals dictionary is passed as the argument to the classname callable object. That is something that is new with Python 3 and it allows the __prepare__() method to substitute its own dictionary-like object in place of the __locals__ dictionary. That is all done in support of metaclass handling for Python.
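A short sketch of what __prepare__ makes possible (this example is not from the talk): the metaclass supplies the mapping used as the class body's namespace, so it can, for instance, record the order in which attributes were defined.

```python
from collections import OrderedDict

class OrderedMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwds):
        return OrderedDict()                 # becomes the class-body namespace
    def __new__(mcls, name, bases, namespace, **kwds):
        cls = super().__new__(mcls, name, bases, dict(namespace))
        cls._member_order = [k for k in namespace if not k.startswith('__')]
        return cls

class Example(metaclass=OrderedMeta):
    first = 1
    second = 2

print(Example._member_order)   # ['first', 'second']
```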
Creating a function by hand is easy to do, with only four lines of code, Hastings said, "but they're four lines that you're not going to like". There is a lot of magic in that slide that creates an add() function, but the three slides following show how to do so in a more readable form. But, even that is not "terribly readable", which is why he wrote his own assembler.
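The four magic lines from the slide are not reproduced here, but the general idea can be sketched with types.FunctionType, which builds a function object around an existing code object (an illustration, not the slide):

```python
import types

def template(a, b):
    return a + b

# Wrap template's code object in a brand-new function object by hand.
add = types.FunctionType(template.__code__, globals(), "add")
print(add(2, 3))   # 5
```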
The holy hand grenade
His assembler project, named "Maynard" after the keeper of the "holy hand grenade" ("the most effective disassembler known to man or beast"), includes both an assembler and disassembler. The Maynard repository also contains the sample code from his talk. Its output is "far more readable" when applied to gunk() for example. His eventual goal with Maynard is to make it "round-trippable", so that you could disassemble a function, then assemble the result, and get the function back again. It's close, but not there yet.
Hastings then presented "another terrible idea of mine", which is Perth—a Forth interpreter on top of the CPython VM. It is "really a toy" that does not (yet) implement recursion. That will change, though, since he will not consider it done until it can run the Fibonacci number algorithm. "You don't have a real language until you can run Fibonacci", he said. The biggest hurdle he has faced is that the Python stack operations do not allow deep stack manipulations, which leads to some terribly inefficient workarounds for a stack-based language like Forth.
Finally, for his last trick, Hastings presented a five-slide "simplest possible VM" that could nevertheless run fib(). It is a Python VM written completely in Python, which is not "very robust"; it works for that particular function "and nothing else".
If you play with bytecode, Hastings said, there is "something you should expect to see a lot":
```
segmentation fault (core dumped)
```

The Python VM makes no guarantees that it will be robust in the face of badly formed bytecode—it is "very easy to crash". For those wanting more information, the ceval.c file in the Python source tree is the canonical location for the bytecodes, but requires knowledge of C. Hastings also pointed to Ned Batchelder's byterun, which is a pure Python implementation of a bytecode-executing VM.
Interpreted languages provide introspection opportunities that go well
beyond just the data structures and objects in the program. There is
something rather interesting in being able to root around inside the
execution environment of a language like Python. One practical benefit of
learning about Python's bytecodes that Hastings didn't mention is that it
provides a view into the implementation of an interpreted language. That's
certainly of more than simply academic interest.
| Index entries for this article | |
|---|---|
| Conference | PyCon/2013 |
Comments

PyCon: Peering in on bytecodes
Posted Apr 4, 2013 6:52 UTC (Thu) by madhatter (subscriber, #4665)

Fixed
Posted Apr 4, 2013 13:14 UTC (Thu) by corbet (editor, #1)

Wow, don't know how that slipped through. Apologies for the confusion.

PyCon: Peering in on bytecodes
Posted Apr 4, 2013 12:00 UTC (Thu) by rfrancoise (subscriber, #15508)

No singing or dancing in this talk, however Larry did sing at PyCon... at the opening of the Friday evening lightning talks session.

PyCon: Peering in on bytecodes
Posted Apr 6, 2013 22:47 UTC (Sat) by Tara_Li (guest, #26706)

PyCon: Peering in on bytecodes
Posted Apr 8, 2013 13:48 UTC (Mon) by intgr (subscriber, #39733)

But it's not useful to generate assembly code for dynamic languages -- most of the time you don't know what the data types are, so variable references will result in branches depending on the data type. There is some benefit from removing the interpreter dispatch overhead, but it seems not significant enough that languages would want to adopt it. It may indeed explode your cache footprint if all those branches become inlined rather than being in the interpreter loop.

And depending on what level the compilation is done at, you may lose much of the flexibility and introspection that dynamic languages give you.

That's why all fast dynamic language implementations use a JIT rather than a static compiler -- they benefit from knowing what the types are at run time and can fall back to the interpreter in un-optimized cases.

PyCon: Peering in on bytecodes
Posted Apr 11, 2013 16:55 UTC (Thu) by shadow-penguin (guest, #38135)

> That's why all fast dynamic language implementations use a JIT rather
> than a static compiler -

Ever heard of Common Lisp? You get full machine code compilation without having to give up any dynamic features. And there _is_ a substantial speed difference between interpreted (in the REPL [1]) and compiled code. One _can_ declare types for those rare instances where speed is more important than dynamic type dispatch.

Cheers, RalfD

[1] some CL implementations don't bother with bytecode and compile expressions entered in the REPL on the fly.

PyCon: Peering in on bytecodes
Posted Apr 11, 2013 17:52 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

PyCon: Peering in on bytecodes
Posted Apr 11, 2013 20:09 UTC (Thu) by shadow-penguin (guest, #38135)

dispatch) or compiled optimized dynamic code. And if you (the programmer) are confident that you don't need the dynamic type dispatch you can inform the compiler about this - and get much faster code:

```lisp
;; Generic version: needs to typecheck arguments and calls a
;; generic function
(defun my-add (a-number another-one)
  (+ a-number another-one))

; disassembly for MY-ADD
; MOV RDX, [RBP-8]        ; no-arg-parsing entry point
; MOV RDI, [RBP-16]
; LEA R11, [#x200001E0]   ; GENERIC-+
; CALL R11
; RSP, RBX
; RSP, RBP
; CLC
; POP RBP
; RET
; BREAK 10                ; error trap
; BYTE #X02
; BYTE #X18               ; INVALID-ARG-COUNT-ERROR
; BYTE #X54               ; RCX

;; Optimized for fixnums (machine integers): fast integer addition
(defun my-fast-add (a-number another-one)
  (declare (fixnum a-number another-one)
           (optimize (speed 3) (safety 0)))
  (the fixnum (+ a-number another-one)))

; disassembly for MY-FAST-ADD
; ADD RDX, RDI             ; no-arg-parsing entry point
; MOV RSP, RBP
; CLC
; POP RBP
; RET
```

HTH RalfD

PyCon: Peering in on bytecodes
Posted Apr 11, 2013 20:11 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)