Changes ahead for Python

By Jake Edge
September 12, 2007

With its first alpha just released, Python 3.0 (aka Python 3000 or Py3k) is making progress, though a final release is still a year off. Py3k overhauls the language core, removing inconsistencies and other "warts", without maintaining compatibility with the 2.x version. Various standard Python idioms go by the wayside and it will take some getting used to.

One of the driving forces for Py3k is to handle Unicode strings in a uniform way. In the 2.x series, Unicode handling is a frequent source of bugs, especially when mixing encoded and unencoded text. The Py3k solution is to separate strings, which contain decoded text, and byte-strings which are binary data into two distinct types, str and bytes. Those types cannot be combined without first converting one to the other via the encode() or decode() method. The drawback to this change is explained in the "What's New in Python 3.0" document:

This means that pretty much all code that uses Unicode, encodings or binary data in any way has to change.
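
As a rough illustration of the str/bytes split, the following sketch shows the explicit conversions that are needed. It assumes the behavior of released Python 3 interpreters (details may have differed in the first alpha), and the sample text and variable names are invented for the example.

```python
# Illustrative only: the sample text and variable names are made up.
text = "na\u00efve caf\u00e9"        # str: decoded (Unicode) text
data = text.encode("utf-8")          # bytes: the encoded, binary form

# Decoding with the same encoding round-trips back to the original text.
round_trip = data.decode("utf-8")

# Mixing the two types without an explicit conversion raises TypeError.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The two non-ASCII characters each take two bytes in UTF-8, so the bytes object is longer than the string it was encoded from.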

This also leads to a distinction that needs to be made when handling files. Files are either binary or text files, with text files requiring an encoding to be specified when they are opened. If the wrong type or encoding is given, I/O to the file may fail.
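
A minimal sketch of the text/binary distinction follows. It assumes a released Python 3 interpreter, where the encoding argument to open() is optional and falls back to a platform default; it is passed explicitly here. The scratch file name and contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file; the name and contents are only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text mode: str in, str out, with an explicit encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: the same file yields raw bytes.
with open(path, "rb") as f:
    as_bytes = f.read()

# Reading with the wrong encoding can fail outright.
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
```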

One very visible change, perhaps the most controversial, is eliminating the print statement and replacing it with a function. The change is being made mostly for consistency, as there is no other language statement like print, but it also adds features. One can now specify a separator, line ending, and output file directly. There is no need for the print >>sys.stderr, "error" syntax; instead, that becomes print("error", file=sys.stderr). As the "What's new" document points out:

Initially, you'll be finding yourself typing the old print x a lot in interactive mode. Time to retrain your fingers to type print(x) instead!
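
The new keyword arguments might be used as in this short sketch; the in-memory buffer is only there to make the result easy to inspect.

```python
import io
import sys

# sep and end replace the default space and newline; file redirects output.
buf = io.StringIO()
print("a", "b", "c", sep="-", end="!\n", file=buf)
joined = buf.getvalue()

# The old  print >>sys.stderr, "error"  becomes a keyword argument.
print("error", file=sys.stderr)
```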

Another area that has changed significantly is the dict methods. The keys(), items(), and values() methods no longer return lists, so code that treats them that way will fail. They now return something called a "view" that references the dict directly, producing values as they are needed, much like an iterator. In addition, the has_key() boolean method has been removed; the in operator should be used instead.
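
A small sketch of what that means in practice, assuming a released Python 3 interpreter; the dictionary contents are made up for the example.

```python
# Sample data is made up for illustration.
ages = {"ada": 36, "guido": 51}

keys = ages.keys()               # a view, not a list
ages["grace"] = 85               # the view reflects later changes to the dict
view_sees_new_key = "grace" in keys

# Views cannot be indexed; build a list explicitly when one is needed.
try:
    keys[0]
    indexable = True
except TypeError:
    indexable = False
key_list = sorted(ages)          # or list(ages.keys()) if order does not matter

# has_key() is gone; use the in operator instead.
has_ada = "ada" in ages
```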

There are lots of smaller changes that will catch the unwary. Many of the features removed have been deprecated for some time but, for programmers who don't follow Python language development closely, they may come as a surprise. The raise statement has different syntax; integer division no longer truncates, returning a float instead (with // used to get the old behavior); xrange() has been removed; and so on. It adds up to a substantial pile of things to deal with when moving existing code to Python 3.
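
The three changes named above look roughly like this in a released Python 3 interpreter; the values are arbitrary.

```python
# Integer division: / returns a float, // gives the old integer result.
true_quotient = 7 / 2
floor_quotient = 7 // 2

# xrange() is gone; range() is itself lazy in Python 3.
lazy = range(10**9)              # no billion-element list is built
first_three = list(range(3))

# raise takes an exception instance; the 2.x form
#   raise ValueError, "bad value"
# is a syntax error in Python 3.
try:
    raise ValueError("bad value")
except ValueError as exc:        # "as" replaces the old comma form
    message = str(exc)
```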

The migration from 2.x is being assisted by the development of Python 2.6, which is slated for release in April 2008. It will provide a Py3k warnings mode that complains at runtime when a feature is being used in a way that is incompatible with 3.0. It will also have many of the new features enabled, either as __future__ imports or just added into the language if they don't conflict with 2.x syntax. The 2to3 tool is also being developed to translate 2.6 constructs into their 3.0 equivalents. The Python Enhancement Proposal (PEP) governing the Py3k plan (PEP 3000) gives an overview of how code can be maintained to run on both 2.6 and 3.0. It sounds somewhat painful, but incompatible language changes are never easy.

There is still plenty of work to be done; the final release of 3.0 is currently scheduled for August 2008. One of the bigger remaining chunks is a reorganization of the standard library namespace. PEP 3108 lays out the changes to be made, including removing older, unsupported, or rarely used modules, renaming modules to conform to the naming standard, and merging the C and Python implementations of modules (e.g. cPickle goes away and is replaced with pickle). It cleans up what had become a bit of a mess over time.
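
For code that has to cope with the cPickle merge on both sides, one common compatibility idiom is sketched below. It is not something PEP 3108 prescribes, and the sample payload is invented.

```python
# A common compatibility idiom, not something PEP 3108 prescribes.
try:
    import cPickle as pickle     # Python 2.x: the separate C implementation
except ImportError:
    import pickle                # Python 3: a single module

payload = {"version": 3, "modules": ["pickle"]}   # made-up sample data
restored = pickle.loads(pickle.dumps(payload))
```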

These changes have not come about without objections, both from those who think another incompatible "upgrade" is not warranted and from those who think Py3k doesn't go far enough. One area that is not being changed, but is a source of frustration for some, is the "global interpreter lock" (GIL), which only allows one thread at a time to operate on any Python objects or call out to C language extensions. Especially with the advent of multi-core and multi-CPU systems, the lock is very restrictive, serializing most of the core language processing.

Guido van Rossum, Benevolent Dictator for Life (BDFL) of the Python language, has been very open about addressing these concerns on his All Things Pythonic weblog. That doesn't mean he plans to change things, especially with regard to the GIL, but he puts together a well-reasoned defense, mostly concerning the performance of the language with finer-grained locks. He is clearly not much of a fan of multi-threaded programming with its attendant race conditions, deadlocks, and other issues, but he is not opposed to efforts to remove the GIL either. As he points out, it is not inherent in the Python language, but is an attribute of the current language implementation; other implementations (Jython, IronPython) do not have the GIL.

Python 3 makes fundamental changes, so it will be interesting to see how quickly it is adopted after being released. People learning Python won't need to learn Py3k for another two years or so, according to van Rossum, and should, instead, concentrate on 2.x (which means 2.5 until April). The Unicode handling rework will probably be enough to get the increasing number of localized programs updated, but the rest of the changes are not terribly compelling. It is likely that there will be Python 2.x programs around for a long time to come.


Changes ahead for Python

Posted Sep 13, 2007 5:06 UTC (Thu) by bfields (subscriber, #19510) [Link]

Jake Edge is a little over-fond of comma splices, I find it distracting.

(Fine article otherwise, thanks!)

Changes ahead for Python

Posted Sep 13, 2007 9:30 UTC (Thu) by gypsumfantastic (guest, #31134) [Link]

> One area that is not being changed, but is a source of frustration for some, is the "global interpreter lock" (GIL), which only allows one thread at a time to operate on any Python objects or call out to C language extensions.

Wait... It's not possible to write multi-threaded code in Py3k?

Changes ahead for Python

Posted Sep 13, 2007 10:06 UTC (Thu) by gouyou (guest, #30290) [Link]

You can write multi-threaded code; it just gets serialized during execution and runs on a single processor, pretty much like Java's green threads. And, as stated in the article, it's only in CPython; Jython and IronPython do not have this limitation.

It's not that bad.

Posted Sep 13, 2007 19:14 UTC (Thu) by dw (subscriber, #12017) [Link]

Only executing bytecode is "serialized": the GIL is generally released when a thread is invoking a system call, or by C extension modules when they know they don't need any interpreter state. Basically, you can have 100,000 threads concurrently blocking on read(); it's only when one of those calls returns that the lock needs to be taken again.

As repeated many zillions of times throughout the Internet, this is pretty acceptable for almost any application except heavily compute-bound programs, in which case the author would probably be wanting to use a language more suitable for high performance computational work to begin with.
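
The effect is easy to see with a small sketch, assuming CPython and using time.sleep() as a stand-in for a blocking system call such as read(). The function names, thread count, and sleep length are arbitrary choices for the illustration.

```python
import threading
import time

def blocking_worker(seconds):
    # time.sleep() stands in for a blocking system call such as read();
    # CPython releases the GIL while the thread waits.
    time.sleep(seconds)

def run_workers(count, seconds):
    # Start `count` threads that each block for `seconds`, then time the batch.
    threads = [threading.Thread(target=blocking_worker, args=(seconds,))
               for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_workers(count=8, seconds=0.2)
```

If the waits were serialized, the batch would take about 8 x 0.2 = 1.6 seconds; because the threads block concurrently, the elapsed time should stay close to a single 0.2-second wait.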

It's not that bad.

Posted Sep 13, 2007 20:23 UTC (Thu) by oak (guest, #2786) [Link]

> this is pretty acceptable for almost any application except heavily compute-bound programs, in which case the author would probably be wanting to use a language more suitable for high performance computational work

Or separate processes (and shared memory e.g. via mmap if more data needs to be communicated)?

It's not that bad.

Posted Sep 15, 2007 20:19 UTC (Sat) by moxfyre (guest, #13847) [Link]

> As repeated many zillions of times throughout the Internet, this is pretty acceptable for almost any application except heavily compute-bound programs, in which case the author would probably be wanting to use a language more suitable for high performance computational work to begin with.

Actually, if you're doing numerical work, like decomposing big matrices and vectors and such... Python is just great!!

That's because of the NumPy extension. It is a C extension, and when doing heavy computations it will happily use multiple threads. For example, I can make a huge random matrix and compute its eigenvalues and eigenvectors:

import numpy as np
big = np.random.random([1000, 1000])   # matrix of random values in the [0, 1) interval
eigval, eigvec = np.linalg.eig(big)

So you can write your algorithm in Python, but all the heavy-duty number crunching will happen in multi-threaded code in C.

If your Python program is CPU-bound but not because of number crunching, then I can see the GIL being more vexing. The case of CPU-bound web applications is probably very frustrating, since there's no easy way to factor out the CPU-bound stuff into an easily parallelizable C extension :-( Anybody have any thoughts on that?

Changes ahead for Python

Posted Sep 15, 2007 17:51 UTC (Sat) by hazmat (subscriber, #668) [Link]

cpython threads map to os level threads. however, due to the gil, only one thread per process may be executing python code. python c extensions may release the gil before invoking apis; a database adapter will, for example, release the lock before executing a query, to allow other threads a chance to execute. the internal management of the gil allows fine tuning of how often the threads are switched via the sys.setcheckinterval() api.

Changes ahead for Python

Posted Sep 13, 2007 14:20 UTC (Thu) by walters (subscriber, #7396) [Link]

"The Py3k solution is to separate strings, which contain decoded text, and byte-strings which are binary data into two distinct types, str and bytes. "

This is wrong as far as I can see; quoting from the page:

"There is only one string type; its name is str but its behavior and implementation are more like unicode in 2.x.
PEP 358: There is a new type, bytes, to represent binary data"

Which makes sense, because what the heck would a byte-string be? There are strings which contain Unicode, and byte *arrays* (calling them strings just leads to confusion).

Changes ahead for Python

Posted Sep 13, 2007 21:32 UTC (Thu) by xorbe (guest, #3165) [Link]

right, the byte arrays are not strings.

Changes ahead for Python

Posted Sep 14, 2007 18:39 UTC (Fri) by jordanb (guest, #45668) [Link]

If you think of it as "Character String" vs. "Byte String" I don't think there's an issue. A character string is a string of characters, which is to say codes that form valid characters in some predetermined encoding, whereas a 'byte string' is a string of completely arbitrary bytes.

Changes ahead for Python

Posted Sep 20, 2007 17:19 UTC (Thu) by larryr (guest, #4030) [Link]

> The Py3k solution is to separate strings [...] and byte-strings [...] str and bytes
> This is wrong as far as I can see; quoting from the page:
> There is only one string type; its name is str but its behavior and implementation are more like unicode in 2.x. PEP 358: There is a new type, bytes, to represent binary data
> Which makes sense, because what the heck would a byte-string be? There are strings which contain Unicode, and byte *arrays* (calling them strings just leads to confusion).

To me, the array in this case comprises a series of 8-bit characters, which to me is a string, and to me a "byte string" is a reasonable and useful thing to call it; it is a string whose (character) elements each have a size of one byte.

Larry


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds