
Pickles are for delis

By Jake Edge
April 23, 2014

PyCon 2014

Alex Gaynor likes pickles, but not of the Python variety. He spoke at PyCon 2014 in Montréal, Canada, to explain the problems he sees with the Python pickle object serialization mechanism. He demonstrated some of the things that can happen with pickles—long-lived pickles in particular—and pointed out some alternatives.

[Alex Gaynor]

Pickle introduction

He began by noting that he is a fan of delis, pickles, and, sometimes, software, but that some of those things—software and the Python pickle module—were also among his least favorite things. The idea behind pickle serialization is simple enough: hand the dump() function an object, get back a byte string. That byte string can then be handed to the pickle module's load() function at a later time to recreate the object. Two of the use cases for pickles are to send objects between two Python processes or to store arbitrary Python objects in a database.

The pickle.dumps() (dump to a string) function returns "random nonsense", he said, and demonstrated that with the following code:

    >>> pickle.dumps([1, "a", None])
    "(lp0\nI1\naS'a'\np1\naNa."

By using the pickletools module, which is not well-known, he said, one can peer inside the nonsense:

    >>> pickletools.dis("(lp0\nI1\naS'a'\np1\naNa.")
        0: (    MARK
        1: l        LIST       (MARK at 0)
        2: p    PUT        0
        5: I    INT        1
        8: a    APPEND
        9: S    STRING     'a'
       14: p    PUT        1
       17: a    APPEND
       18: N    NONE
       19: a    APPEND
       20: .    STOP

The pickle format is a simple stack-based language, similar in some ways to the bytecode used by the Python interpreter. The pickle is just a list of instructions to build up the object, followed by a STOP opcode to return what it has built so far.
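That stack machine can be driven by hand. A minimal sketch (using the protocol-0 stream from the example above, written as bytes because modern Python 3 pickles are byte strings):

```python
import pickle
import pickletools

# The protocol-0 stream from the example above, as bytes.
stream = b"(lp0\nI1\naS'a'\np1\naNa."

# Disassembling shows the same opcode listing as in the article.
pickletools.dis(stream)

# Loading replays the instructions to rebuild the original list.
print(pickle.loads(stream))  # → [1, 'a', None]
```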

In principle, dumping data to the pickle format is straightforward: determine the object's type, find the dump function for that type, and call it. Each of the built-in types (like list, int, or string) would have a function that can produce the pickle format for that type.

But, what happens for user-defined objects? Pickle maintains a table of functions for the built-in types, but it can't do that for user-defined classes. It turns out that it uses the __reduce__() method, which returns a function and arguments used to recreate the object. That function and its arguments are put into the pickle, so that the function can be called (with those arguments) at unpickling time. Using the built-in object() type, he showed how that information is stored in the pickle (the output was edited by Gaynor for brevity):

    >>> pickletools.dis(pickle.dumps(object()))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c        GLOBAL     '__builtin__ object'
       55: N        NONE
       56: t        TUPLE
       60: R    REDUCE
       64: .    STOP

The _reconstructor() function from the copy_reg module is used to reconstruct its argument, which is the object type from the __builtin__ module. Similarly, for a user-defined class (again, output has been simplified):

    >>> class X(object):
    ...  def __init__(self):
    ...   self.my_cool_attr = 3
    ...
    >>> x = X()
    >>> pickletools.dis(pickle.dumps(x))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c        GLOBAL     '__main__ X'
       44: c        GLOBAL     '__builtin__ object'
       67: N        NONE
       68: t        TUPLE
       72: R    REDUCE
       77: d        DICT
       81: S    STRING     'my_cool_attr'
      100: I    INT        3
      103: s    SETITEM
      104: b    BUILD
      105: .    STOP

The pickle refers to the class X by name, as well as to the my_cool_attr attribute. By default, Python pickles all of the entries in x.__dict__, which stores the attributes of the object.
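A short round trip (a modern Python 3 sketch; the class mirrors the example above) shows that whatever lands in an instance's __dict__ travels with the pickle, whether or not __init__ put it there:

```python
import pickle

class X:
    def __init__(self):
        self.my_cool_attr = 3

x = X()
x.extra = "added later"   # any attribute in x.__dict__ is pickled too

y = pickle.loads(pickle.dumps(x))
print(y.my_cool_attr)     # → 3
print(y.extra)            # → added later
```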

A class can define its own unique pickling behavior by defining the __reduce__() method. If it contains something that cannot be pickled (a file object, for example), some kind of custom pickling solution must be used. __reduce__() needs to return a function and arguments to be called at unpickling time, for example:

    >>> class FunkyPickle(object):
    ...  def __reduce__(self):
    ...   return (str, ('abc',),)
    ...
    >>> pickle.loads(pickle.dumps(FunkyPickle()))
    'abc'

Unpickling is "shockingly simple", Gaynor said. If we look at the first example again (i.e. [1, 'a', None]), the commands in the pickle are pretty straightforward (ignoring some of the extraneous bits). LIST creates an empty list on the stack, INT 1 puts the integer 1 on the stack, and APPEND appends it to the list. The string 'a' and None are handled similarly.

Pickle woes

But, as we've seen, pickles can cause calls to any function available to the program (built-ins, imported modules, or those present in the code). Using that, a crafted pickle can cause all kinds of problems—from information disclosure to a complete compromise of the user account that is unpickling the crafted data. It is not a purely theoretical problem, either, as several applications or frameworks have been compromised because they unpickled user-supplied data. "You cannot safely unpickle data that you do not trust", he said, pointing to a blog post that shows how to exploit unpickling.
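The attack shape is easy to demonstrate harmlessly (the class below is invented for the demo): __reduce__() lets a pickle record a call to any importable function, and a real exploit would simply substitute something like os.system for the benign os.getcwd used here.

```python
import os
import pickle

class Innocent:
    def __reduce__(self):
        # The pickle records: "call os.getcwd() at load time".
        # An attacker would return (os.system, ("some command",)) instead.
        return (os.getcwd, ())

blob = pickle.dumps(Innocent())
# Loading calls os.getcwd(); no Innocent instance is ever created.
print(pickle.loads(blob))
```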

But, if the data is trusted, perhaps because we are storing and retrieving it from our own database, are there other problems with pickle? He put up a quote from the E programming language web site (scroll down a ways) that pointed to the problem:

Do you, Programmer, take this Object to be part of the persistent state of your application, to have and to hold, through maintenance and iterations, for past and future versions, as long as the application shall live?

- Erm, can I get back to you on that?

He then related a story that happened "many, many maintenance iterations ago, in a code base far, far away". Someone put a pickle into a database, then no one touched it for eighteen months or so. He needed to migrate the table to a different format in order to optimize the storage of some of the other fields. About halfway through the migration of this 1.6-million-row table, he got an obscure exception: "module has no attribute".

As he mentioned earlier, pickle stores the name of the pickled class. What if that class no longer exists in the application? In that case, Python throws an AttributeError exception, because the "'module' object has no attribute 'X'" (where X is the name of the class). In Gaynor's case, he was able to go back into the Git repository, find the class in question, and add it back into the code.
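The failure is easy to reproduce (a sketch; in modern Python 3 the message reads "Can't get attribute ..." but the exception type is the same AttributeError):

```python
import pickle

class Gone:
    pass

blob = pickle.dumps(Gone())   # the pickle stores only the name __main__.Gone

del Gone                      # simulate a later revision that removed the class

try:
    pickle.loads(blob)
except AttributeError as e:
    print("unpickling failed:", e)
```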

A similar problem can occur if the name of an attribute in the class should change. The name is "hardcoded" in the pickle itself. In both of his examples, someone was doing maintenance on the code, made some seemingly innocuous changes, but didn't realize that there was still a reference to the old names in a stored pickle somewhere. In Gaynor's mind, this is the worst problem with pickles.
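The attribute-rename variant is sneakier because nothing fails at load time; the stale name silently reappears. A sketch (class and attribute names invented):

```python
import pickle

class Config:
    def __init__(self):
        self.size = 9           # original attribute name

old_blob = pickle.dumps(Config())

# Months later, maintenance renames the attribute...
class Config:
    def __init__(self):
        self.width = 3

c = pickle.loads(old_blob)      # loads without complaint
print(hasattr(c, "width"))      # → False: __init__ is never called
print(c.size)                   # → 9: the old name is back
```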

Alternatives

But if pickling is not a good way to serialize Python objects, what is the alternative? He said that he advocated writing your own dump() functions for objects that need them. He demonstrated a class that had a single size attribute, along with a JSON representation that was returned from dump():

    def dump(self):
        return json.dumps({
            "version": 1,
            "size": self.size
        })

The version field is the key to making it all work as maintenance proceeds. If, at some point, size is changed to width and height, the dump() function can be changed to emit "version": 2. One can then create a load() function that deals with both versions. It can derive the new width and height attributes from size (perhaps using sqrt() if size was the area of a square table as in his example).
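A matching load() might look like this (a sketch following his square-table example; the class and field names are invented):

```python
import json
import math

class Table:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def dump(self):
        return json.dumps({
            "version": 2,
            "width": self.width,
            "height": self.height,
        })

    @classmethod
    def load(cls, data):
        fields = json.loads(data)
        if fields["version"] == 1:
            # Version 1 stored only the area of a square table.
            side = math.sqrt(fields["size"])
            return cls(side, side)
        return cls(fields["width"], fields["height"])
```

With that in place, `Table.load('{"version": 1, "size": 9}')` yields a 3.0-by-3.0 table, while current objects round-trip through the version-2 format unchanged.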

Writing your own dump() and load() functions is more testable, simpler, and more auditable, Gaynor said. It can be tested more easily because the serialization doesn't take place inside an opaque framework. The simplicity also comes from the fact that the code is completely under your control; pickle gives you all the tools needed to handle these maintenance problems (using __reduce__() and a few other special methods), but it takes a lot more code to do so. Custom serialization is more auditable because one must write dump() and load() for each class that will be dumped, rather than pickle's approach which simply serializes everything, recursively. If some attribute got pickled improperly, it won't be known until the pickle.load() operation fails somewhere down the road.

His example used JSON, but there are other mechanisms that could be used. JSON has an advantage that it is readable and supported by essentially every language out there. If speed is an issue, though, MessagePack is probably worth a look. It is a binary format and supports lots of languages, though perhaps somewhat fewer than JSON.

He concluded his talk by saying that "pickle is unsafe at any speed" due to the security issues, but, more importantly, the maintenance issues. Pickles are still great at delis, however.

An audience member wondered about using pickles for sessions, which is common in the Python web-framework world. Gaynor acknowledged pickle's attraction, saying that being able to toss any object into the session and get it back later is convenient, but it is also the biggest source of maintenance problems in his experience. The cookies that are used as session keys (or signed cookies that actually contain the pickle data) can pop up at any time, often many months later, after the code has changed. He recommends either only putting simple types that JSON can handle directly into sessions or creating custom dump() and load() functions for things JSON can't handle.

There are ways to make pickle handle code updates cleanly, but they require that lots of code be written to do so. Pickle is "unsafe by default" and it makes more sense to write your own code rather than to try to make pickle safe, he said. One thing that JSON does not handle, but pickle does, is cyclic references. Gaynor believes that YAML does handle cyclic references, though he cautioned that the safe_load() function in the most popular Python implementation must be used rather than the basic load() function (though he didn't elaborate). Cyclic references are one area that makes pickle look nice, he said.
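The cyclic-reference difference is easy to see in a minimal sketch: pickle's memo mechanism preserves the cycle, while json.dumps() refuses outright.

```python
import json
import pickle

lst = [1, 2]
lst.append(lst)            # the list now contains itself

copy = pickle.loads(pickle.dumps(lst))
print(copy[2] is copy)     # → True: pickle preserves the cycle

try:
    json.dumps(lst)
except ValueError as e:
    print("json.dumps failed:", e)   # "Circular reference detected"
```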

One of the biggest lessons he has learned when looking at serialization is that there is no single serialization mechanism that is good for all use cases. Pickle may be a reasonable choice for multi-processing programs where processes are sending pickles to each other. In that case, the programmer controls both ends of the conversation and classes are not usually going to disappear during the lifetime of those processes. But the clear sense was that, even in that case, Gaynor would look at a solution other than pickle.

The video of the talk is at pyvideo.org (along with many other PyCon videos). The slides are available at Speaker Deck.

Index entries for this article
Conference: PyCon/2014
Python: Pickles



Pickles are for delis

Posted Apr 24, 2014 1:27 UTC (Thu) by ewen (subscriber, #4772) [Link]

FTR, with Python YAML you need to use safe_load() on anything but the most trusted, super trusted, could never have been user supplied, data, because YAML load() can be tricked into executing arbitrary python. Unintended execution of arbitrary python is generally Unfortunate (tm). (At least the YAML documentation warns about this now, although not with quite as much detail or horror as one might wish.)

Ewen (who fondly remembers the days when data was just data)

Pickles are for delis

Posted Apr 24, 2014 7:40 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

I think a better idea is to not store binary blobs in your database, but to use the database as it was intended. Any good DBA should apply the cluestick firmly to programmers putting Python pickled objects into database fields.

Pickles are for delis

Posted Apr 24, 2014 11:43 UTC (Thu) by robbe (guest, #16131) [Link]

Maybe the trick is to consider pickles as /code/, not data. Storing code in the DB is hardly ever useful.

Pickles are for delis

Posted Apr 25, 2014 3:20 UTC (Fri) by kweidner (guest, #6483) [Link]

I'd recommend taking a look at protocol buffers which are specifically designed to enable forwards and backwards compatibility and extensibility. Or Cap'n Proto which is a new implementation of the same concept.

Pickles are for delis

Posted Apr 25, 2014 17:24 UTC (Fri) by dashesy (guest, #74652) [Link]

Writing your own dump() will have the same problems that pickle has if no porting code is available; with standard pickle one can still use a version. I use __setstate__ and port from older models depending on the version. One should look at a pickled object as frozen code: sometimes it is expensive to reach an object's state (a model with many internal variables set), so it makes sense to pickle it for future use.

Objects Considered Dangerous

Posted Apr 27, 2014 20:52 UTC (Sun) by utoddl (guest, #1232) [Link]

Ever since computer code started sharing the same storage as data, there has been this tension between the two. The comments above by ewen and robbe hint strongly of this. If you read this article with a python-neutral filter, i.e. ignoring the python specifics, it's an excellent indictment of the whole OO charade. Objects -- the bits we pretend are objects -- are a co-mingling of code and data. It should come as no surprise that reusing old stored and unmaintained code would cause problems. The problem is by no means limited to python, though.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds