February 9, 2011
This article was contributed by Ian Ward
Python 3.0 was released at the end of 2008, but so far only a relatively small
number of packages have been updated to support the latest release;
the majority of Python software still only supports Python 2. Python 3
introduced changes to Unicode and string handling, module importing,
integer representation and division, print statements, and a number of other
differences. This article will cover some of the changes that cause the
most problems when porting code from Python 2 to Python 3, and will
present some strategies for managing a single code base that supports both
major versions.
The changes that made it into Python 3 were originally part of a plan
called "Python 3000" as sort of a joke about language changes
that could only be done in the distant future. The changes made up a
laundry list of inconsistencies and inconvenient designs in the Python
language that would have been really nice to fix, but had to wait because
fixing them meant breaking all existing Python code. Eventually the weight
of all the changes led the Python developers to
decide to just fix the problems with a real stable release, and accept
the fact that it will take a few years for most packages and users to make
the switch.
So what's the big deal?
The biggest change is to how strings are handled in Python 3. Python 2
has 8-bit strings and Unicode text, whereas Python 3 has Unicode text and
binary data. In Python 2 you can play fast and loose with strings and
Unicode text, using either type for parameters and conversion is automatic
when necessary. That's great until you get some 8-bit data in a string and
some function (anywhere — in your code or deep in some library you're
using) needs Unicode text. Then it all falls apart. Python 2 tries to
decode strings as 7-bit ASCII to get Unicode text, leaving the developer, or
worse yet the end user, with one of these:
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 3: ordinal not in range(128)
In Python 3 there are no more automatic conversions, and the default is
Unicode text almost everywhere. While Python 2 treats 'all\xf4' as an
8-bit string with four bytes, Python 3 treats the same literal as Unicode
text with U+00F4 as the fourth character.
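As a minimal sketch of what "no more automatic conversions" means in practice (Python 3 only; the variable names are just for illustration), converting between text and binary data becomes an explicit encode or decode step, and mixing the two types fails immediately instead of triggering a hidden ASCII decode:

```python
# Python 3: text (str) and binary data (bytes) are separate types.
text = 'all\xf4'               # four characters, the last one is U+00F4
data = text.encode('utf-8')    # explicit conversion from text to bytes

print(len(text))               # 4
print(len(data))               # 5, because U+00F4 needs two bytes in UTF-8
print(data.decode('utf-8') == text)   # True

# Mixing the two types is an error rather than a silent ASCII decode.
try:
    text + data
except TypeError as exc:
    print('TypeError:', exc)
```

The choice of UTF-8 here is an assumption; the point is only that the encoding is named explicitly at the boundary.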
Files opened in text mode (the default, including for
sys.stdin, sys.stdout, and sys.stderr) in Python
3 return Unicode text from read() and expect Unicode text to be
passed to write(). Files opened in binary mode operate on binary
data only. This change affects Python users in Linux and other Unix-like
operating systems more than Windows and Mac users — files in Python 2
on Linux that are opened in binary mode are almost indistinguishable from
files opened in text mode, while Windows and Mac users have been used to
Python at least munging their line breaks when in text mode.
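A small sketch of the difference between the two modes in Python 3 follows. It writes a scratch file created with tempfile, and the UTF-8 encoding is an assumption made for the example rather than something Python picks for you on every platform:

```python
import os
import tempfile

# Create a scratch file; its name and contents are arbitrary.
fd, path = tempfile.mkstemp()
os.close(fd)

# Text mode: write() takes Unicode text. Naming the encoding avoids
# depending on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write('all\xf4\n')

# Binary mode: read() returns the encoded bytes untouched.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode: read() decodes the bytes back to Unicode text.
with open(path, 'r', encoding='utf-8') as f:
    decoded = f.read()

os.remove(path)
print(type(raw).__name__, type(decoded).__name__)   # bytes str
```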
This means that much code that used to "work" (where work is
defined for uses with ASCII text only) is now broken. But once that code
is updated to properly account for which inputs and outputs are encoded
text and which are binary, it can then be used comfortably by people whose
native languages or names don't fit in ASCII. That's a pretty nice
result.
Python 3's bytes type for binary data is quite different from
Python 2's 8-bit strings. Python 2.6 and later have defined bytes
to be the same as the str type, which is a little strange because the
interface has changed significantly:
>>> bytes([2,3,4]) # Python 2
'[2, 3, 4]'
>>> [x for x in 'abc']
['a', 'b', 'c']
In Python 3 b'' is used for byte literals:
>>> bytes([2,3,4]) # Python 3
b'\x02\x03\x04'
>>> [x for x in b'abc']
[97, 98, 99]
Python 3's bytes type can be treated like an immutable list of integer
values between 0 and 255. That's convenient for doing bit arithmetic and
other numeric operations common to dealing with binary data, but it's quite
different from the string-of-length-1 Python 2 programmers expect.
Integers have changed as well. There is no longer a distinction between long
integers and normal integers, and sys.maxint is gone. Integer
division has changed too. Anyone with a background in Python (or C) will
tell you that:
>>> 1/2
0
>>> 1.0/2
0.5
But no longer. Python 3 returns 0.5 for both expressions.
Fortunately Python 2.2 and later have an operator for floor division
(//). Use it and you can be certain of a floored result (an integer
whenever both operands are integers).
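To make the behaviour concrete, here is a short sketch with comments describing the Python 3 results. Python 2.2 and later can opt in to the same true division with "from __future__ import division" at the top of a module. Note that with a float operand the result of // is floored but remains a float:

```python
print(7 / 2)       # 3.5 on Python 3 (3 on Python 2 without the __future__ import)
print(7 // 2)      # 3 on both versions
print(-7 // 2)     # -4: floor division rounds toward negative infinity
print(7.0 // 2)    # 3.0: floored, but still a float

# divmod() returns the floored quotient and the remainder together.
quotient, remainder = divmod(7, 2)
print(quotient, remainder)   # 3 1
```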
The last big change I'll point out is to comparisons. In Python 2
comparisons (<, <=, >=, >)
are always defined between all objects. When no explicit ordering is
defined then all the objects of one type will either be arbitrarily
considered greater or less than all the objects of another type. So you
could take a list with a mix of types, sort it, and all the different types
will be grouped together. Most of the time though, you really don't want
to order different types of objects and this feature just hides some nasty
bugs.
Python 3 now raises a TypeError any time you compare objects
with incompatible types, as it should. Note that equality (==,
!=) is still defined for all types.
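If you really do need to sort a list of mixed types, one possible approach (a sketch, not the only way) is to pass sort an explicit key that says how the types relate. Here the values are grouped by type name, which is an arbitrary choice for the example:

```python
mixed = [3, 'b', 1, 'a']

# Python 3 refuses to order an int against a str.
try:
    sorted(mixed)
    ordering_allowed = True      # Python 2 ends up here
except TypeError:
    ordering_allowed = False     # Python 3 ends up here

# An explicit key restores a well-defined order: type name first, then value.
grouped = sorted(mixed, key=lambda v: (type(v).__name__, v))
print(grouped)                   # [1, 3, 'a', 'b']

# Equality between different types is still fine.
print(1 == 'a')                  # False
```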
Module importing has changed. In Python 2, a module inside a package
searches its own package directory first when importing (an implicit
"relative import"), then tries the directories in the system path in order.
In Python 3 relative imports must be made explicit:
from . import my_utils
The print statement has become a function in Python 3. This
Python 2 code that prints a string to sys.stderr with a space
instead of a newline at the end:
import sys
print >>sys.stderr, 'something bad happened:',
becomes:
import sys
print('something bad happened:', end=' ', file=sys.stderr)
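The end and file keyword arguments are easy to experiment with by printing into an in-memory buffer instead of sys.stderr. This sketch runs on Python 3, and the 'disk full' message is made up for the example:

```python
import io

buf = io.StringIO()

# end=' ' replaces the trailing newline with a space, like the trailing
# comma did in the Python 2 print statement.
print('something bad happened:', end=' ', file=buf)
print('disk full', file=buf)

message = buf.getvalue()
print(message, end='')   # something bad happened: disk full
```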
These are just some of the biggest changes. The complete list
is here.
That list is huge. How do I deal with all that?
Fortunately a large number of the little incompatibilities are taken
care of by the 2to3 tool that ships with Python. 2to3
takes Python 2 source code and performs some automated replacements to
prepare the code to run in Python 3. Print statements become functions,
Unicode text literals drop their "u" prefix, relative imports are
made explicit, and so on.
Unfortunately the rest of the changes need to be made by hand.
It is reasonable to maintain a single code base that works across
Python 2 and Python 3 with the help of 2to3. In the
case of my library
"Urwid" I am targeting Python 2.4
and up, and this is part of the compatibility code I use. When you really
have to write code that takes different paths for Python 2 and Python 3
it's nice to be clear with an "if PYTHON3:" statement:
import sys
PYTHON3 = sys.version_info >= (3, 0)
try: # define bytes for Python 2.4, 2.5
    bytes = bytes
except NameError:
    bytes = str
if PYTHON3: # for creating byte strings
    B = lambda x: x.encode('latin1')
else:
    B = lambda x: x
String handling and literal strings are the most common areas that need to be
updated. Some guidelines:
Use Unicode literals (u'') for all literal text in your source. That way
your intention is clear and behaviour will be the same in Python 3 (2to3
will turn these into normal text strings).
Use byte literals (b'') for all literal byte strings or the B() function
above if you are supporting versions of Python earlier than 2.6. B() uses
the fact that the first 256 code points in Unicode map to Latin-1 to create
a binary string from Unicode text.
Use normal strings ('') only in cases where 8-bit strings are expected in
Python 2 but Unicode text is expected in Python 3. These cases include attribute
names, identifiers, docstrings, and __repr__ return values.
Document whether your functions accept bytes or Unicode text and guard against
the wrong type being passed in (e.g. assert isinstance(var, unicode)), or convert
to Unicode text immediately if you must accept both types.
Clearly labeling text as text and binary as binary in your source
serves as documentation and may prevent you from writing code that will
fail when run under Python 3.
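The last guideline, converting to Unicode text immediately when both types must be accepted, might look something like the sketch below. The names text_type and to_text are hypothetical helpers invented for this example, the UTF-8 default is an assumption, and the code as written needs Python 2.6 or later because it relies on the bytes name:

```python
import sys

PYTHON3 = sys.version_info >= (3, 0)

# The text type is str on Python 3 and unicode on Python 2. The name
# "unicode" is only evaluated when running under Python 2.
text_type = str if PYTHON3 else unicode  # noqa: F821


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    raise TypeError('expected text or bytes, got %r' % type(value))


print(to_text(b'all\xc3\xb4') == to_text('all\xf4'))   # True on Python 3
```

Calling a helper like this at the edges of a library keeps everything inside it working on a single type.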
Handling binary data across Python versions can be done a few ways. If
you replace all individual byte accesses such as data[i] with
data[i:i+1] then you will get a byte-string-of-length-1 in both
Python 2 and Python 3. However, I prefer to follow the Python 3 convention
of treating byte strings as lists of integers with some more compatibility
code:
if PYTHON3: # for operating on bytes
    ord2 = lambda x: x
    chr2 = lambda x: bytes([x])
else:
    ord2 = ord
    chr2 = chr
ord2 returns the ordinal value of a byte in Python 2 or Python 3
(where it's a no-op) and chr2 converts back to a byte string.
Depending on how you are processing your binary data, it might be noticeably
faster to operate on the integer ordinal values instead of byte-strings-of-length-1.
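As a usage sketch, here are both approaches applied to a toy problem. The xor_bytes function is a hypothetical example and not part of Urwid, and the b'' literals mean this version needs Python 2.6 or later:

```python
import sys

PYTHON3 = sys.version_info >= (3, 0)

if PYTHON3:  # for operating on bytes
    ord2 = lambda x: x
    chr2 = lambda x: bytes([x])
else:
    ord2 = ord
    chr2 = chr


def xor_bytes(data, key):
    """XOR every byte of data with the integer key (0-255)."""
    return b''.join([chr2(ord2(c) ^ key) for c in data])


masked = xor_bytes(b'abc', 0x20)
print(masked)                    # b'ABC' on Python 3
print(xor_bytes(masked, 0x20))   # b'abc' on Python 3

# Slicing yields a byte string of length 1 on both versions.
print(b'abc'[0:1] == b'a')       # True
```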
Python "doctests" are snippets of test code that appear in
function, class and module documentation text. The test code resembles an
interactive Python session and includes the code run and its output. For
simple functions this sort of testing is often enough, and it's good
documentation. Doctests create a challenge for supporting Python 2 and
Python 3 from the same code base, however.
2to3 can convert doctest code in the same way as the rest of
the source, but it doesn't touch the expected output. Python 2 puts an
"L" at the end of a long integer's repr and a "u" in front of a
Unicode string's repr, neither of which is present in Python 3, but
printing the value produces the same text in both. Make sure that
other code run from doctests outputs the same text all the time, and if you
can't you might be able to use the ELLIPSIS flag and ...
in your output to paper over small differences.
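Here is a sketch of the ELLIPSIS approach. The describe function is a made-up example: the repr of a type reads "<type 'int'>" in Python 2 and "<class 'int'>" in Python 3, so the expected output uses ... to match either. The doctest is run through DocTestParser and DocTestRunner only so that the example is self-contained:

```python
import doctest


def describe(value):
    """Return a short description of value.

    >>> print(describe(5))  # doctest: +ELLIPSIS
    5 is <... 'int'>
    """
    return '%r is %r' % (value, type(value))


# Build and run the doctest from the docstring directly.
parser = doctest.DocTestParser()
test = parser.get_doctest(describe.__doc__, {'describe': describe},
                          'describe', None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)   # 0 1
```

In a real project the usual doctest.testmod() call does the same job for a whole module.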
There are a number of easy changes you need to make as well, including:
Use // everywhere you want floor division (mentioned above).
Derive exception classes from Exception, which itself derives from BaseException in Python 2.5 and later.
Use k in my_dict instead of my_dict.has_key(k).
Use my_list.sort(key=custom_key_fn) instead of my_list.sort(custom_sort).
Use distribute instead of Setuptools.
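A few of these changes, sketched together. All of the names here (ConfigError, settings, words, by_length) are made up for the example. The exception class derives from Exception, which is itself a subclass of BaseException in Python 2.5 and later. functools.cmp_to_key, available in Python 2.7 and 3.2 and later, is one way to keep an old comparison function when writing a key function by hand isn't practical:

```python
from functools import cmp_to_key


class ConfigError(Exception):
    """Hypothetical application error with a proper base class."""


settings = {'colour': 'blue'}
has_colour = 'colour' in settings    # instead of settings.has_key('colour')

words = ['pear', 'Apple', 'fig']
words.sort(key=str.lower)            # case-insensitive sort with a key function
print(words)                         # ['Apple', 'fig', 'pear']


def by_length(a, b):
    """Old-style comparison function: negative, zero or positive."""
    return (len(a) > len(b)) - (len(a) < len(b))


# Wrap the comparison function so it can be passed as a key.
by_len = sorted(words, key=cmp_to_key(by_length))
print(by_len)                        # ['fig', 'pear', 'Apple']
```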
There are two additional resources that may be helpful: Porting Python Code
to 3.0 and Writing
Forwards Compatible Python Code.
So if I do all that, what's in it for me?
Python 3 is unarguably a better language than Python 2. Many people new to the
language are starting with Python 3, particularly users of proprietary operating
systems. Many more current Python 2 users are interested in
Python 3 but are held back by the code or a library they are using.
By adding Python 3 support to an application or library you help:
make it available to the new users just starting with Python 3
encourage existing users to adopt it, knowing it won't stop them from switching to
Python 3 later
clean up ambiguous use of text and binary data and find related bugs
And as a little bonus, that software can then be listed among the
packages with Python 3 support in the Python Package Index, one click
from the front page.
Many popular Python packages haven't yet made the switch, but it's
certainly on everyone's radar. In my case I was lucky: members of the
community had already done most of the hard work of porting my library to
Python 3, so I only had to update my tests and find ways to make the changes
work with old versions of Python as well.
There is currently a divide in the Python community because of the
significant differences between Python 2 and Python 3. But with some
work, that divide can be bridged. It's worth the effort.