Development

Moving to Python 3

February 9, 2011

This article was contributed by Ian Ward

Python 3.0 was released at the end of 2008, but so far only a relatively small number of packages have been updated to support the latest release; the majority of Python software still only supports Python 2. Python 3 introduced changes to Unicode and string handling, module importing, integer representation and division, print statements, and a number of other differences. This article will cover some of the changes that cause the most problems when porting code from Python 2 to Python 3, and will present some strategies for managing a single code base that supports both major versions.

The changes that made it into Python 3 were originally part of a plan called "Python 3000" as sort of a joke about language changes that could only be done in the distant future. The changes made up a laundry list of inconsistencies and inconvenient designs in the Python language that would have been really nice to fix, but had to wait because fixing them meant breaking all existing Python code. Eventually the weight of all the changes led the Python developers to decide to just fix the problems with a real stable release, and accept the fact that it will take a few years for most packages and users to make the switch.

So what's the big deal?

The biggest change is to how strings are handled in Python 3. Python 2 has 8-bit strings and Unicode text, whereas Python 3 has Unicode text and binary data. In Python 2 you can play fast and loose with strings and Unicode text: either type can be passed as a parameter, and conversion is automatic when necessary. That's great until you get some 8-bit data in a string and some function (anywhere, in your code or deep in some library you're using) needs Unicode text. Then it all falls apart. Python 2 tries to decode strings as 7-bit ASCII to get Unicode text, leaving the developer, or worse yet the end user, with one of these:

    Traceback (most recent call last):
    ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 3: \
                        ordinal not in range(128)

In Python 3 there are no more automatic conversions, and the default is Unicode text almost everywhere. While Python 2 treats 'all\xf4' as an 8-bit string with four bytes, Python 3 treats the same literal as Unicode text with U+00F4 as the fourth character.
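
The lack of automatic conversion can be seen in a short session. This is only a sketch, assuming Python 3; the variable names and the choice of Latin-1 and UTF-8 as encodings are mine, picked for illustration:

```python
# Python 3: text (str) and binary data (bytes) never mix implicitly.
text = 'all\xf4'                  # four characters of Unicode text
data = text.encode('latin-1')     # four bytes: b'all\xf4'

try:
    text + data                   # Python 2 would attempt an ASCII decode here
except TypeError as exc:
    mixing_error = exc            # Python 3 refuses to guess an encoding

# Conversions have to be spelled out, with an explicit encoding.
round_trip = data.decode('latin-1')
utf8_data = text.encode('utf-8')  # U+00F4 takes two bytes in UTF-8
```

The failure now happens at the point where the two types meet, rather than only when non-ASCII data happens to show up.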

Files opened in text mode (the default, including for sys.stdin, sys.stdout, and sys.stderr) in Python 3 return Unicode text from read() and expect Unicode text to be passed to write(). Files opened in binary mode operate on binary data only. This change affects Python users in Linux and other Unix-like operating systems more than Windows and Mac users — files in Python 2 on Linux that are opened in binary mode are almost indistinguishable from files opened in text mode, while Windows and Mac users have been used to Python at least munging their line breaks when in text mode.
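
As a rough illustration of the two file modes, the following sketch writes text and reads it back both ways. It assumes Python 2.6 or later (for io.open, which behaves like the Python 3 open()) and uses a scratch file in a temporary directory; the path and the UTF-8 encoding are my own choices:

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Text mode: write() takes Unicode text and encodes it for us.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'all\xf4')

# Binary mode: read() returns the raw encoded bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Text mode again: read() decodes the bytes back to Unicode text.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
```

Passing the encoding explicitly avoids depending on the platform default, which is one of the places where "works on my machine" bugs tend to come from.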

This means that much code that used to "work" (where work is defined for uses with ASCII text only) is now broken. But once that code is updated to properly account for which inputs and outputs are encoded text and which are binary, it can then be used comfortably by people whose native languages or names don't fit in ASCII. That's a pretty nice result.

Python 3's bytes type for binary data is quite different from Python 2's 8-bit strings. Python 2.6 and later have defined bytes to be the same as the str type, which is a little strange because the interface has changed significantly:

    >>> bytes([2,3,4]) # Python 2
    '[2, 3, 4]'
    >>> [x for x in 'abc']
    ['a', 'b', 'c']

In Python 3 b'' is used for byte literals:

    >>> bytes([2,3,4]) # Python 3
    b'\x02\x03\x04'
    >>> [x for x in b'abc']
    [97, 98, 99]

Python 3's bytes type can be treated like an immutable list of integers with values between 0 and 255. That's convenient for doing bit arithmetic and other numeric operations common to dealing with binary data, but it's quite different from the string-of-length-1 that Python 2 programmers expect when indexing or iterating.
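
A small sketch of what that looks like in practice, assuming Python 3 (the XOR checksum and the mask are arbitrary examples of mine, not taken from any particular protocol):

```python
data = b'\x02\x03\x04'

first = data[0]          # indexing yields an integer: 2
first_slice = data[0:1]  # slicing still yields bytes: b'\x02'

# Iterating yields integers, so bit arithmetic needs no ord() calls.
checksum = 0
for value in data:
    checksum ^= value

# A new bytes object can be built from any iterable of integers 0-255.
masked = bytes(value & 0x03 for value in data)
```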

Integers have changed as well. There is no distinction between long integers and normal integers and sys.maxint is gone. Integer division has changed too. Anyone with a background in Python (or C) will tell you that:

    >>> 1/2
    0
    >>> 1.0/2
    0.5

But no longer. Python 3 returns 0.5 for both expressions. Fortunately Python 2.2 and later have an operator for floor division (//). Use it and you can be certain of a floored result in both versions (an integer when both operands are integers).
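
The difference between the two operators, sketched under Python 3. Python 2.2 and later give the same results for / if the module starts with "from __future__ import division"; the variable names here are only for illustration:

```python
# Python 3 semantics; see the note above for getting them in Python 2.
true_quotient = 1 / 2       # true division: 0.5
floor_quotient = 1 // 2     # floor division: 0
negative_floor = -1 // 2    # floors toward negative infinity: -1
float_floor = 1.0 // 2      # floored, but still a float: 0.0

# divmod() returns the floor quotient and the remainder together.
quotient, remainder = divmod(7, 2)
```

Note that // floors rather than truncates, so negative results differ from what C programmers might expect.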

The last big change I'll point out is to comparisons. In Python 2 comparisons (<, <=, >=, >) are always defined between all objects. When no explicit ordering is defined then all the objects of one type will either be arbitrarily considered greater or less than all the objects of another type. So you could take a list with a mix of types, sort it, and all the different types will be grouped together. Most of the time though, you really don't want to order different types of objects and this feature just hides some nasty bugs.

Python 3 now raises a TypeError any time you compare objects with incompatible types, as it should. Note that equality (==, !=) is still defined for all types.
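
Code that really does need to sort mixed types can say so explicitly. This is a minimal sketch assuming Python 3; grouping by type name is just one possible ordering that I picked for the example:

```python
mixed = [3, 'b', 1, 'a']

try:
    sorted(mixed)               # Python 2 groups by type; Python 3 refuses
except TypeError as exc:
    ordering_error = exc

# Equality is still defined between unrelated types.
still_comparable = (1 == 'a')   # False, but no exception

# An explicit key states how the mixed values should be ordered.
# Here: group by type name, then by value within each type.
grouped = sorted(mixed, key=lambda v: (type(v).__name__, v))
```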

Module importing has changed. In Python 2 the directory containing the source file is searched first when importing (called a "relative import"), then the directories in the system path are tried in order. In Python 3 relative imports must be made explicit:

    from . import my_utils

The print statement has become a function in Python 3. This Python 2 code that prints a string to sys.stderr with a space instead of a newline at the end:

    import sys
    print >>sys.stderr, 'something bad happened:',

becomes:

    import sys
    print('something bad happened:', end=' ', file=sys.stderr)
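
The keyword arguments can be tried out without touching the real sys.stderr. The sketch below assumes Python 3 and uses an in-memory io.StringIO as a stand-in for the stream; on Python 2.6 and later the same print() calls are available after "from __future__ import print_function" at the top of the module:

```python
import io

# Any object with a write() method can stand in for sys.stderr.
stream = io.StringIO()

print('something bad happened:', end=' ', file=stream)
print('disk', 'full', sep='-', file=stream)

captured = stream.getvalue()
```

Here end replaces the old trailing comma, sep controls what goes between the arguments, and file replaces the ">>" redirection.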

These are just some of the biggest changes. The complete list is here.

That list is huge. How do I deal with all that?

Fortunately a large number of the little incompatibilities are taken care of by the 2to3 tool that ships with Python. 2to3 takes Python 2 source code and performs some automated replacements to prepare the code to run in Python 3. Print statements become functions, Unicode text literals drop their "u" prefix, relative imports are made explicit, and so on.

Unfortunately the rest of the changes need to be made by hand.

It is reasonable to maintain a single code base that works across Python 2 and Python 3 with the help of 2to3. In the case of my library "Urwid" I am targeting Python 2.4 and up, and this is part of the compatibility code I use. When you really have to write code that takes different paths for Python 2 and Python 3 it's nice to be clear with an "if PYTHON3:" statement:

    import sys
    PYTHON3 = sys.version_info >= (3, 0)

    try: # define bytes for Python 2.4, 2.5
        bytes = bytes
    except NameError:
        bytes = str
    
    if PYTHON3: # for creating byte strings
        B = lambda x: x.encode('latin1')
    else:
        B = lambda x: x

String handling and literal strings are the most common areas that need to be updated. Some guidelines:

  • Use Unicode literals (u'') for all literal text in your source. That way your intention is clear and behaviour will be the same in Python 3 (2to3 will turn these into normal text strings).

  • Use byte literals (b'') for all literal byte strings or the B() function above if you are supporting versions of Python earlier than 2.6. B() uses the fact that the first 256 code points in Unicode map to Latin-1 to create a binary string from Unicode text.

  • Use normal strings ('') only in cases where 8-bit strings are expected in Python 2 but Unicode text is expected in Python 3. These cases include attribute names, identifiers, docstrings, and __repr__ return values.

  • Document whether your functions accept bytes or Unicode text and guard against the wrong type being passed in (e.g. assert isinstance(var, unicode)), or convert to Unicode text immediately if you must accept both types.

Clearly labeling text as text and binary as binary in your source serves as documentation and may prevent you from writing code that will fail when run under Python 3.
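
For functions that must accept both types, the "convert immediately" guideline might look something like this. It is a sketch, not code from Urwid: to_text and text_type are hypothetical names, UTF-8 is an assumed default encoding, and Python 2.6 or later is assumed so that bytes exists:

```python
import sys

PYTHON3 = sys.version_info >= (3, 0)

# The Unicode text type: str on Python 3, unicode on Python 2.
text_type = str if PYTHON3 else unicode  # "unicode" is only evaluated on Python 2

def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding byte strings."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    raise TypeError('expected bytes or text, got %r' % type(value))
```

Calling a helper like this at the boundary of a function means the rest of the body only ever deals with one type.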

Handling binary data across Python versions can be done a few ways. If you replace all individual byte accesses such as data[i] with data[i:i+1] then you will get a byte-string-of-length-1 in both Python 2 and Python 3. However, I prefer to follow the Python 3 convention of treating byte strings as lists of integers with some more compatibility code:

    if PYTHON3: # for operating on bytes
        ord2 = lambda x: x
        chr2 = lambda x: bytes([x])
    else:
        ord2 = ord
        chr2 = chr

ord2 returns the ordinal value of a byte in Python 2 or Python 3 (where it's a no-op) and chr2 converts back to a byte string. Depending on how you are processing your binary data, it might be noticeably faster to operate on the integer ordinal values instead of byte-strings-of-length-1.
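
To show how those helpers get used, here is a small self-contained sketch that repeats the definitions and applies them. xor_mask and last_byte are hypothetical functions invented for this example, and the b'' literal assumes Python 2.6 or later:

```python
import sys

PYTHON3 = sys.version_info >= (3, 0)

if PYTHON3:  # for operating on bytes
    ord2 = lambda x: x
    chr2 = lambda x: bytes([x])
else:
    ord2 = ord
    chr2 = chr

def xor_mask(data, mask):
    """XOR every byte of data with the integer mask (0-255)."""
    out = []
    for b in data:  # b is an int on Python 3, a str of length 1 on Python 2
        out.append(chr2(ord2(b) ^ mask))
    return b''.join(out)

def last_byte(data):
    """Return the final byte as a byte string of length 1."""
    return data[-1:]  # slicing behaves the same on both versions
```

For example, xor_mask(b'abc', 0x20) flips the case bit of each ASCII letter and returns b'ABC' on either version.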

Python "doctests" are snippets of test code that appear in function, class and module documentation text. The test code resembles an interactive Python session and includes the code run and its output. For simple functions this sort of testing is often enough, and it's good documentation. Doctests create a challenge for supporting Python 2 and Python 3 from the same code base, however.

2to3 can convert doctest code in the same way as the rest of the source, but it doesn't touch the expected output. Python 2 will put an "L" at the end of a long integer output and a "u" in front of Unicode strings that won't be present in Python 3, but printing the value will always work the same. Make sure that other code run from doctests outputs the same text all the time, and if you can't, you might be able to use the ELLIPSIS flag and ... in your output to paper over small differences.
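
Both tricks can be sketched in a couple of doctests. The functions square and words are hypothetical examples of mine; the u'' literal in the second test needs Python 2 or Python 3.3 and later:

```python
import doctest

def square(n):
    """Return n squared.

    print() hides the "L" suffix Python 2 adds to the repr of a long:

    >>> print(square(2 ** 40))
    1208925819614629174706176
    """
    return n * n

def words(text):
    """Split Unicode text into a list of words.

    With ELLIPSIS, "..." matches the "u" prefix that only Python 2 shows:

    >>> words(u'moving to python')  # doctest: +ELLIPSIS
    [...'moving', ...'to', ...'python']
    """
    return text.split()

if __name__ == '__main__':
    doctest.testmod()
```

One thing to watch for: an expected-output line cannot begin with "...", because doctest reads that as a continuation prompt. That is why the ellipses here sit inside the list brackets.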

There are a number of easy changes you need to make as well, including:

  • Use // everywhere you want floor division (mentioned above).

  • Derive exception classes from BaseException (normally by subclassing Exception).

  • Use k in my_dict instead of my_dict.has_key(k).

  • Use my_list.sort(key=custom_key_fn) instead of my_list.sort(custom_sort).

  • Use distribute instead of Setuptools.
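
A few of these easy changes, sketched together. The names (PortingError, inventory, compare_length) are invented for the example; functools.cmp_to_key is only available in Python 2.7 and 3.2 or later, so it is an option for old comparison functions that are hard to rewrite rather than something that works on every version mentioned above:

```python
from functools import cmp_to_key  # Python 2.7 and 3.2 or later

class PortingError(Exception):
    """Hypothetical error class; Exception itself derives from BaseException."""

inventory = {'apples': 3, 'pears': 0}
has_apples = 'apples' in inventory      # instead of inventory.has_key('apples')

names = ['urwid', 'Psycopg', 'distribute']
names.sort(key=lambda s: s.lower())     # instead of names.sort(custom_sort)

def compare_length(a, b):
    """Old-style comparison function: negative, zero, or positive."""
    return len(a) - len(b)

# Wrap an existing comparison function rather than rewriting it as a key.
by_length = sorted(names, key=cmp_to_key(compare_length))
```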

There are two additional resources that may be helpful: Porting Python Code to 3.0 and Writing Forwards Compatible Python Code.

So if I do all that, what's in it for me?

Python 3 is unarguably a better language than Python 2. Many people new to the language are starting with Python 3, particularly users of proprietary operating systems. Many more current Python 2 users are interested in Python 3 but are held back by the code or a library they are using.

By adding Python 3 support to an application or library you help:

  • make it available to the new users just starting with Python 3

  • encourage existing users to adopt it, knowing it won't stop them from switching to Python 3 later

  • clean up ambiguous use of text and binary data and find related bugs

And as a little bonus, that software can then be listed among the packages with Python 3 support in the Python Package Index, one click from the front page.

Many popular Python packages haven't yet made the switch, but it's certainly on everyone's radar. In my case I was lucky. Members of the community already did most of the hard work porting my library to Python 3; I only had to update my tests and find ways to make the changes work with old versions of Python as well.

There is currently a divide in the Python community because of the significant differences between Python 2 and Python 3. But with some work, that divide can be bridged. It's worth the effort.

Brief items

Quotes of the week

Hence, PyPy 50% faster than C on this carefully crafted example. The reason is obvious - static compiler can't inline across file boundaries. In C, you can somehow circumvent that, however, it wouldn't anyway work with shared libraries. In Python however, even when the whole import system is completely dynamic, the JIT can dynamically find out what can be inlined.
-- Maciej Fijalkowski

PostgreSQL does not have query hints because we are not a for-profit company
-- Josh Berkus

GNU Octave 3.4.0

Version 3.4.0 of the GNU Octave "quite similar to Matlab" language interpreter has been released. The list of changes and new features appears to be quite long; see the NEWS file for the details.

OpenSSH 5.8 released

OpenSSH 5.8 is available. This version fixes a vulnerability in legacy certificate signing and some bugs in Portable OpenSSH.

Psycopg 2.4 beta1 released

Psycopg is a popular PostgreSQL adapter for Python. The first 2.4 beta release is out; it adds support for PostgreSQL composite types, but the biggest news is probably that Python 3 is now supported. Now is probably a good time for people with Python 3-compatible programs to test it out.

ulatencyd 0.4.0 released

Ulatencyd is "a scriptable daemon which constantly optimises the Linux kernel for best user experience." In particular, it uses a set of rules written in Lua to dynamically put processes into scheduler groups with the idea of increasing desktop interactivity. The 0.4.0 release adds a D-Bus interface, improved task grouping, some GNOME and KDE workarounds, and more.

Newsletters and articles

Development newsletters from the last week

Neary: Drawing up a roadmap

Dave Neary looks at the importance of roadmaps. "The end result of a good roadmap process is that your users know where they stand, more or less, at any given time. Your developers know where you want to take the project, and can see opportunities to contribute. Your core team knows what the release criteria for the next release are, and you have agreed together mid-term and long-term goals for the project that express your common vision. As maintainer, you have a powerful tool to explain your decisions and align your community around your ideas. A good roadmap is the fertile soil on which your developer community will grow."

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds