Proposal: Moratorium on Python language changes

From:  Guido van Rossum <guido-+ZN9ApsXKcEdnm+yROfE0A-AT-public.gmane.org>
To:  Python-Ideas <python-ideas-+ZN9ApsXKcEdnm+yROfE0A-AT-public.gmane.org>
Subject:  Proposal: Moratorium on Python language changes
Date:  Wed, 21 Oct 2009 09:42:01 -0700

I propose a moratorium on language changes. This would be a period of
several years during which no changes to Python's grammar or language
semantics will be accepted. The reason is that frequent changes to the
language cause pain for implementors of alternate implementations
(Jython, IronPython, PyPy, and others probably already in the wings)
at little or no benefit to the average user (who won't see the changes
for years to come and might not be in a position to upgrade to the
latest version for years after).

The main goal of the Python development community at this point should
be to get widespread acceptance of Python 3000. There is tons of work
to be done before we can be comfortable about Python 3.x, mostly in
creating solid ports of those 3rd party libraries that must be ported
to Py3k before other libraries and applications can be ported. (Other
work related to Py3k acceptance might be tools to help porting, tools
to help maintaining multiple versions of a codebase, documentation
about porting to Python 3, and so on. Also, work like that going on in
the distutils-sig is very relevant.)

Note, the moratorium would only cover the language itself plus
built-in functions, not the standard library. Development in the
standard library is valuable and much less likely to be a stumbling
block for alternate language implementations. I also want to exclude
details of the CPython implementation, including the C API from being
completely frozen -- for example, if someone came up with (otherwise
acceptable) changes to get rid of the GIL I wouldn't object.

But the moratorium would clearly apply to proposals for anonymous
blocks, "yield from" (PEP 380), changes to decorator syntax, and the
like. (I'm sure it won't stop *discussion* of those proposals, and
that's not the purpose of the moratorium; but at least it will stop
worries elsewhere that such proposals might actually be *accepted* any
time soon.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Proposal: Moratorium on Python language changes

Posted Oct 21, 2009 22:12 UTC (Wed) by sergey (guest, #31763) [Link]

The language includes its grammar/semantics, execution engine, and run-time library. Only
the first of the three is what really motivates Jython, IronPython, and others. They build their
own execution engine on top of some host (JVM, .Net CLR) and try to implement most of
the run-time library by mapping to what's available through the host (JRE, .Net BCL). It
seems that Guido's suggestion will backfire and make Python less interesting for these same
people he's claiming need help. Those who use "the Python," yours truly included, will also
be disadvantaged because new features generally make our lives a little easier. I liked "with"
for example and eagerly jumped through hoops to use it until general availability without
annoying references to the __future__. New features don't harm current users, benefit them
all in the long term, and bring new users into the community.

I see no upside even for Python hackers. They enjoy adding features I am sure, otherwise
there would be no moratorium needed. After a few years (!) of bland life without that aspect,
they will most likely leave to do something interesting, and nobody would be left after the
moratorium is lifted.

I guess I don't get it...

Proposal: Moratorium on Python language changes

Posted Oct 21, 2009 22:29 UTC (Wed) by zuki (subscriber, #41808) [Link]

I don't either. Occasional change to the grammar is not THE reason why
Jython and others lag behind. And the occasional changes are the reason
why Python is so pleasant to work with. Removing the little annoyances,
adding language support for features recognized as important, and so on.

Python 3 support is of course important, but freezing language features
doesn't seem to be the way.

Proposal: Moratorium on Python language changes

Posted Oct 21, 2009 22:29 UTC (Wed) by hazmat (subscriber, #668) [Link]

I've used python for 12 years.. with a lot of different organizations.. fact of the matter afaics, is that
most organizations still just use the feature set from python 2.3/2.4 core, a lot of the new features
are infrequently used. decorators (func and class) saw large uptake, but i only see a handful of
references to context managers (with). i can't say i've seen a single conditional expression. so i
think guido's moratorium has good value.. given that the users adopting new features seem to be a
small percentage of the overall user base.

anyways.. i'm just as happy to see work on additional language implementations, as well as speed
improvements ( unladen-swallow for example ). if a moratorium can help bring that forward,
sounds good to me.

Proposal: Moratorium on Python language changes

Posted Oct 22, 2009 16:13 UTC (Thu) by nevyn (subscriber, #33129) [Link]

> fact of the matter afaics, is that most organizations still just use the feature set from python 2.3/2.4 core

2.4 is what is in RHEL-5, so unless you want to do a lot of work you are stuck with that for most production work.

> a lot of the new features are infrequently used. decorators (func and class) saw large uptake

They work fine in 2.4, and are pretty useful.

> but i only see a handful of references to context managers (with)

2.4 didn't have with, and even 2.5 has to have it specially enabled. And you have to wait for 2.7 (or 3.1) before you can do multiple expressions per 'with'. I'm also not convinced that everyone will want to make all their objects context managers.

> i can't say i've seen a single conditional expression.

Even if that was in 2.3, I'd never use it ... has to be the most ugly implementation of that feature.

From 2.6.x the new string.format crack seems complete insanity, preferring one DSL over another for no sane reason that I can see. My guess is that java/.net programmers will use it, and C/old-python programmers will ignore it ... but maybe I'm wrong.

On the other side I'm shocked it's taken this long for "0b" to get into the language, and I'm still amazed/disappointed that they haven't added perl's _ number separator (e.g. 1_000_000 is a valid number in perl). Having ** work on any mapping in 2.6.x will certainly stop some of my cursing, ditto keyword args after *args (although I hit that much less). Adding more tweaks like these would be a much bigger benefit than loss, IMO.
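
The small tweaks mentioned above can be seen side by side in a short sketch. It is written for a current Python 3 interpreter (an assumption; the thread is about 2.6), and the `Settings` class and `connect` function are made up for illustration. The underscore digit separator at the end only arrived much later, in Python 3.6.

```python
from collections.abc import Mapping


class Settings(Mapping):
    """Hypothetical read-only mapping that is not a dict subclass."""

    def __init__(self, **values):
        self._values = dict(values)

    def __getitem__(self, key):
        return self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


def connect(host, port=80, *extra, **options):
    """Hypothetical function that just reports what it was given."""
    return (host, port, extra, options)


# Binary literals with the "0b" prefix (2.6 and later).
flags = 0b1010

# ** unpacking accepts any mapping, not just dict (2.6 and later).
by_mapping = connect(**Settings(host="example.org", port=8080))

# Keyword arguments may follow *args in a call (2.6 and later).
positional = ("example.org", 8080, "a", "b")
by_args = connect(*positional, timeout=3)

# Underscore digit separators, Perl style (Python 3.6 and later).
million = 1_000_000
```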

But in general given that it looks like RHEL-6 is going to be based on a 2.6.x variant, not changing the language too much from that for a few years seems like a good idea.

Proposal: Moratorium on Python language changes

Posted Oct 22, 2009 16:49 UTC (Thu) by sergey (guest, #31763) [Link]

I use with and conditional expressions all the time and write context managers very often. It makes my life easier that open/close semantics is part of the language. I don't even have to document how to grab a resource and give it back when done -- it's obvious. Both features in my opinion are easy to read, elegant, and extremely useful. But difference of opinion on each individual feature is not the topic of this discussion I believe. Compare this, I don't know, with NUMA support in Linux: I've never had access to a machine that would require it, but I know it's being used, have nothing against it, and benefits from it trickle down even to a relatively simple kernel build on my dual-core laptop.

All these arguments above are actually in favor of keeping the language fluid: there doesn't seem to be a single case where new features harm anybody.

> But in general given that it looks like RHEL-6 is going to be based on a 2.6.x variant, not changing the language too much from that for a few years seems like a good idea.

I think RHEL or any other distribution's choice is much less relevant than it seems. I used to work for a large financial services organization, the department in charge of that stuff used RHEL 4 at the time but built their own Python packages, along with Apache, Perl, proprietary products like RogueWave libraries, and lots of other stuff. If people need these features, they'll find a way to get them "for most production work." A stronger argument would be that those on shared hosts are stuck, but I use Webfaction for example and they have all versions of Python available (on RHEL boxes btw), so even this is not generally true.

Proposal: Moratorium on Python language changes

Posted Oct 22, 2009 19:04 UTC (Thu) by nevyn (subscriber, #33129) [Link]

> I use with and conditional expressions all the time and write context managers very often. It makes my life easier that open/close semantics is part of the language. I don't even have to document how to grab a resource and give it back when done -- it's obvious.

I agree conditional expressions are mostly opinion but for 'with' ... can you give some examples? I guess my biggest confusion is why you need it over just using __del__ and reference counting? (what I do with files). It also seems weird because now you can't easily change code to move things out of blocks, because the end of the block explicitly closes one or more resources (generally what I need to do with dbconnections).

> All these arguments above are actually in favor of keeping the language fluid: there doesn't seem to be a single case where new features harm anybody.

I agree, I think with the move to py3k breaking so much old code and generally causing lots of pain ... people have probably voiced that they want more stability, but I don't think it's the new features which were the problem (Eg. I moved to 2.6.x almost immediately, I'm still dreading moving to py3k).

> I think RHEL or any other distribution's choice is much less relevant than it seems.

I guess it depends on who you are, and it's possible that a "lot" of large companies will move to internal versions of py3k before RHEL-6 is dead ... but my code gets deployed on other people's RHEL-x so I have no choice, and I know of a lot of people who own their production servers who would have to have a very big reason to maintain their own version of a big critical component like python.

Proposal: Moratorium on Python language changes

Posted Oct 22, 2009 20:46 UTC (Thu) by tan2 (subscriber, #42953) [Link]

> I agree conditional expressions are mostly opinion but for 'with' ... can you give some examples? I guess my biggest confusion is why you need it over just using __del__ and reference counting? (what I do with files).

Because it's hard to control who holds the reference. See this post on comp.lang.python for an example.
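
A minimal sketch of that point, assuming a made-up `Resource` class: as long as some other object still holds a reference, `__del__` is not called and the resource stays open, while a `with` block closes it at a known point no matter who else holds a reference.

```python
class Resource:
    """Hypothetical resource that records whether it has been closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __del__(self):
        # Runs only once the last reference goes away (and, outside
        # CPython, possibly much later than that).
        self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions


cache = []  # stands in for any code that quietly keeps a reference

# Relying on __del__: dropping our name does not close the resource,
# because the cache still refers to it.
leaky = Resource()
cache.append(leaky)
del leaky
still_open = not cache[0].closed

# Using with: the resource is closed when the block ends, even though
# the cache still refers to it.
with Resource() as managed:
    cache.append(managed)
closed_after_block = cache[1].closed
```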

Proposal: Moratorium on Python language changes

Posted Oct 23, 2009 15:04 UTC (Fri) by sergey (guest, #31763) [Link]

> I agree conditional expressions are mostly opinion but for 'with' ... can you give some examples?

You can get in and out of "with" blocks repeatedly with the same context manager instance. Locking and all kinds of transactions are a good example, I'd be surprised if those are not mentioned in a PEP for "with." As for less conventional stuff... I used it once to control output indentation level in a console tool. A primitive bug tracker web application wrapper I wrote for work must use SOAP so suds (https://fedorahosted.org/suds/) is my friend. I'd rather not keep recreating all the SOAP metadata, so I keep the same suds SOAP "factory" inside the wrapper class and use context manager pattern to log into/out of the bug tracker.
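
The indentation use case could look roughly like the sketch below. It is not the tool described above; the `Indenter` class and its method names are invented here to show one context manager instance being entered several times, nested and in sequence.

```python
class Indenter:
    """Hypothetical helper: each active with block indents output one level."""

    def __init__(self, step="  "):
        self.step = step
        self.level = 0
        self.lines = []

    def __enter__(self):
        self.level += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.level -= 1
        return False  # let exceptions propagate

    def write(self, text):
        self.lines.append(self.step * self.level + text)


out = Indenter()
out.write("build")
with out:                  # first entry
    out.write("compile")
    with out:              # same instance, entered again while active
        out.write("main.c")
with out:                  # and entered once more after leaving
    out.write("link")

print("\n".join(out.lines))
```

Because the nesting depth lives in the instance, the caller never has to restore the previous indentation by hand, even when an exception leaves a block early.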

Proposal: Moratorium on Python language changes

Posted Oct 25, 2009 22:13 UTC (Sun) by SEJeff (subscriber, #51588) [Link]

If you're feeling froggy, you can take a F9/F10 python and add this to the
patchlist at the end.

http://www.digitalprognosis.com/opensource/patches/python...
backport-to-db4-4.3.patch

With a little munging of %_prefix in the spec file, you can have python2.4
and python2.5 installed side by side on production RHEL5 boxen.

Proposal: Moratorium on Python language changes

Posted Oct 25, 2009 22:34 UTC (Sun) by SEJeff (subscriber, #51588) [Link]

http://tinyurl.com/yjfem6n This might work better as the LWN comment system
ate the URL.

Proposal: Moratorium on Python language changes

Posted Oct 26, 2009 4:20 UTC (Mon) by nevyn (subscriber, #33129) [Link]

There's an open BZ that the python maintainer is looking at, which is trying to put python3.x in Fedora alongside python2.6 ... but I don't recommend that, for the same reasons I don't recommend 2.4 + 2.5 in RHEL-5. Doing it via rpms is likely to cause you pain, I've written up the major bits of pain you'll hit:

http://illiterat.livejournal.com/7660.html

Proposal: Moratorium on Python language changes

Posted Oct 26, 2009 5:50 UTC (Mon) by SEJeff (subscriber, #51588) [Link]

Great blog post btw but I've had this in production (python2.4 and
python2.5) for almost a full year with no problems. We just have our
applications explicitly request python2.4 or python2.5 in the shebang. We
also did something similar for another application using the alternatives
system with %ifarch magic and %preun/%post scripts in the spec. It is
doable. Even managing /usr/bin/python using alternatives is *doable* if you
decide it is the right thing for your use case.

However for a general purpose distro with finite resources I agree with you
completely. With production code that is tested in a limited environment
(read %packages --nobase in the ks) you should use whatever makes the most
sense. For my stuff, python2.4 lacks needed features. Each environment is
unique and neither is wrong.

Red Hat not making python > 2.4 available in RHEL has caused some
environments to move away from RHEL/CentOS. It is one of the disturbing
reasons you'll see some of the ruby shops or newer django ponies websites
be powered by Ubuntu, Debian, or even Fedora.

Needed: Gentle transition path

Posted Oct 21, 2009 22:33 UTC (Wed) by dwheeler (guest, #1216) [Link]

I agree, a freeze is neither necessary nor a solution to the real problem. The real problem is that there's no gentle transition path for applications that use libraries. If I write a Python app that uses 20 libraries, which transitively include 100 more, then all 120 libraries have to switch to Python 3 before *I* can use Python 3. Nuts. That's a "flag day" on an unworkable scale.

What we need is a Python implementation that can take BOTH the old AND new syntax/semantics, and allow each library to say if they're using the old or new one. Then, each library can switch on different timelines.

Needed: Gentle transition path

Posted Oct 22, 2009 11:04 UTC (Thu) by epa (subscriber, #39769) [Link]

Isn't the idea that you port your code to Python 2.6, which will give you warnings when you use Python-3-unsafe constructs? Once it runs cleanly under 2.6 you should be able to switch over to 3 when you want. (I'm not a Python hacker so please correct me if I got the wrong impression.)

Needed: Gentle transition path

Posted Oct 22, 2009 14:20 UTC (Thu) by foom (subscriber, #14868) [Link]

That's pretty much the idea. But there are a couple of problems:
  • Once you're on 2.6, without warnings, you need to run your code through the "2to3" converter. It's a great idea, but it's not *perfect*. There's still a lot of manual work. Especially because mixed in with all the syntax and other removals/deprecations, the fundamental string model changed.

    As the Python 3 "What's New" document says: "Everything you thought you knew about binary data and Unicode has changed." So, basically every usage of python2 "str" (8bit) strings needs to be checked by a human to see if it was intended to be a raw bytestring, or a textual string.

  • Many projects want to stay compatible with some version of Python < 2.6 for a year or so more, since 2.6 just came out recently, and most users don't have it installed yet. E.g. RHEL 5 comes with python 2.4, and the last Debian release comes with 2.5.

    Python 2.6 doesn't do it (yet)

    Posted Oct 22, 2009 16:50 UTC (Thu) by dwheeler (guest, #1216) [Link]

    Yes, in theory, you can use Python 2.6 and "from __future__ import" to get a Python 3.0-like language, and use a Python subset so that 2to3 can catch the rest. But it isn't easy. And that's the advantage of Python: Usually, stuff is easy.
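
As a rough illustration of that mechanism: a 2.6 module would begin with a line such as `from __future__ import print_function, unicode_literals, division, absolute_import`. The sketch below does not use those imports itself. Instead it asks the standard `__future__` module when each feature became optional and when it became mandatory. The `PY3_STYLE` tuple and the `release_info` helper are names chosen for this example.

```python
import __future__

# Features a 2.6 module could switch on to behave more like Python 3.
PY3_STYLE = ("print_function", "unicode_literals", "division", "absolute_import")


def release_info(name):
    """Return ((major, minor) optional, (major, minor) mandatory) for a feature."""
    feature = getattr(__future__, name)
    return feature.getOptionalRelease()[:2], feature.getMandatoryRelease()[:2]


for name in PY3_STYLE:
    optional, mandatory = release_info(name)
    print("%-17s optional from %d.%d, mandatory from %d.%d"
          % ((name,) + optional + mandatory))
```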

    Proposal: Moratorium on Python language changes

    Posted Oct 21, 2009 22:41 UTC (Wed) by mhelsley (subscriber, #11324) [Link]

    How does the "grammar/semantics" motivate Jython/IronPython/etc? I thought they were mainly looking to capitalize on the performance work going into highly-tuned VMs. The language differences seemed a slightly-unfortunate side-effect of their implementation choices rather than a motivating factor.

    How does constantly changing the language make things easier for these folks? If anything I would think they'd be happy that the goal posts won't move for a while.

    How many language features of C have been deprecated over how many years? How does python compare?

    Proposal: Moratorium on Python language changes

    Posted Oct 21, 2009 23:39 UTC (Wed) by cortana (subscriber, #24596) [Link]

    It's not like C had many features to begin with. Did C99 deprecate anything that was in C89?

    I'm reluctantly coming to the conclusion that Python 3 should have remained 100% compatible with Python 2 code. Today, I write python3 scripts for my own stuff, but any serious work involves third-party libraries, none of which work with python 3... oh well! :(

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 0:40 UTC (Thu) by foom (subscriber, #14868) [Link]

    Python had a perfectly reasonable deprecation policy. It would have been much better IMO to keep using it, and slowly deprecating, replacing, and eventually deleting undesirable features, over a period of time, as has been the previous policy.

    But, that's not what happened. Instead, a ton of new things were added, old things were deleted, and significant parts of the language incompatibly changed, all at once. So, the result is certainly different. But is it better? Perhaps. There was very little time to review all the changes that happened, and I'm afraid there may be just as many new misfeatures as there were old misfeatures that were removed.

    And, as things stand now, not even all of the standard library in Py3k works properly yet -- never mind 3rd party software! Hopefully by 3.2, at least the stdlib will be finished being converted.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 17:44 UTC (Thu) by drag (subscriber, #31333) [Link]

    I think that it reached a point where it was cleaner and easier to do an ABI
    break in one version than to try the gradual upgrade.

    I think that it has a lot to do with getting rid of the heavily overloaded
    use of strings and stuff like that. One of the irritating things with python
    I run into is that strings are used to represent everything, yet they can't
    really cleanly support things like UTF-8, which makes handling binary data
    irritating.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 17:58 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

    it's always easier for the developers of a tool/language to do a clean break rather than a gradual upgrade.

    but it's always easier for the users of that tool or language to do a gradual upgrade.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 20:54 UTC (Thu) by drag (subscriber, #31333) [Link]

    Well the problem is there may NOT be a way to upgrade. For all intents and
    purposes Python3 is a new language, not merely an upgrade of an old one.

    Like I mentioned before string types in Python 2.x are heavily overloaded.
    They are used for everything and holding all sorts of information and are
    a fundamental part of a lot of modules.

    But with 3.0 they got rid of strings completely. Well.. they are still
    called strings, but they are very different. Now all strings are encoded in
    Unicode and they introduced a new datatype "bytes".
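
A small sketch of that split as it looks in Python 3 (the sample text is arbitrary): `str` holds Unicode code points, `bytes` holds raw octets, and moving between the two is always an explicit encode or decode step.

```python
text = "na\u00efve caf\u00e9"   # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: the same text as raw octets

# The two types differ in length and in what indexing returns.
chars, octets = len(text), len(data)
first_char, first_byte = text[0], data[0]

# Mixing them is an error in Python 3 rather than an implicit conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

round_trip = data.decode("utf-8")
```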

    For what I do this is a _massive_ improvement. In Python 2.x every time you
    touch any sort of data it must be translated to text first.

    So you end up trying to do lots of binary operations on text-encoded binary
    strings. Or you are forced to do things like read text-encoded binary
    strings from a file or datastream, convert it to byte arrays (thin wrapper
    around C arrays), perform your data manipulation, convert it back to text-
    encoded binaries and then output that to a file or data stream which
    triggers yet another string to binary conversion.

    Also operating on C data types like arrays is often slower than other more
    pythonic data types because of the overhead that occurs when you access the
    datatype and it processes the variables into something that python can
    garbage collect and that sort of thing.

    And this spreads itself out to other things like dealing with Unicode text
    which can't be handled by regular strings... instead it must be manipulated
    like binary data; a text-encoded binary format. Of course lots of methods
    and such take care of most of the pain of dealing with that, but it's still
    a lot of overhead and a PITA.

    So while the syntax is very similar I don't think that there is any nice
    way to co-mingle Python 2.x with Python 3.x. If you try you'd just end up
    with hell for everybody.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 16:01 UTC (Thu) by tjc (subscriber, #137) [Link]

    > Did C99 deprecate anything that was in C89?

    Implicit int and adjacent string concatenation were declared obsolescent at some point.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 16:39 UTC (Thu) by nye (guest, #51576) [Link]

    Are you sure about the latter? I'm fairly certain string literal concatenation is still considered kosher (I've certainly seen it used without generating warnings, let alone errors).

    Proposal: Moratorium on Python language changes

    Posted Nov 1, 2009 23:05 UTC (Sun) by vonbrand (subscriber, #4458) [Link]

    String concatenation in C is recent, it wasn't in K&R. And I doubt it will go away (it is way too useful, the previous way using '\' at the end of the line was bletcherous)

    Proposal: Moratorium on Python language changes

    Posted Nov 2, 2009 0:25 UTC (Mon) by nix (subscriber, #2304) [Link]

    Adjacent string concatenation via comments, e.g.

    #define PASTE(a,b) a/**/b

    PASTE("foo","bar")

    was thankfully broken when cpp was defined in terms of tokens rather than
    in terms of text-stream transformation. But that was C89...

    Proposal: Moratorium on Python language changes

    Posted Oct 23, 2009 14:16 UTC (Fri) by wingo (subscriber, #26929) [Link]

    > Did C99 deprecate anything that was in C89?

    Pointer aliasing.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 3:44 UTC (Thu) by sergey (guest, #31763) [Link]

    > I thought they were mainly looking to capitalize on the performance work going into highly-tuned VMs.

    Right now that doesn't seem to be the case: both Jython and IronPython appear to be slower (I'm certain about Jython, IronPython data is harder to find). I think the motivation is to access an underlying run-time library and some VM features, like memory management and threading support, using a modern and more fully-featured language. Java and C# are catching up and growing the desired features, C# 4.0 is a good example, but they're seriously behind.

    > How does constantly changing the language make things easier for these folks?

    They focus on a specific version (e.g. IronPython 2.6 aims at parity with Python 2.6), and /that/ target, once set, is hardly moving. Moreover, existing features are not really changing that much (Python has good backward compatibility discipline within major versions), but new features are being added, which is a different thing.

    > How many language features of C have been deprecated over how many years? How does python compare?

    I didn't use Python 1.x so I can't tell from personal experience, but I know projects that baselined their code on 1.5 and it worked well into 2.2 and later.

    Proposal: Moratorium on Python language changes

    Posted Oct 23, 2009 12:26 UTC (Fri) by tseaver (subscriber, #1544) [Link]

    Nearly everything I learned on Python 1.4 still works in Python 2.6, with the exception of raising strings as exceptions.

    I don't recognize any of the Jython / IronPython folks as posting here, but they seem to be quite pleased by the idea of the moratorium: in fact, I see no significant opposition, with the exception of a plea or two for exceptions for already-implemented features.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 0:33 UTC (Thu) by AdHoc (subscriber, #1115) [Link]

    > The language includes its grammar/semantics, execution engine, and run-time library.

    The proposal only applies to the first part. Guido specifically mentions the run-time library and the execution engine as open to changes.

    Proposal: Moratorium on Python language changes

    Posted Oct 21, 2009 22:44 UTC (Wed) by atai (subscriber, #10977) [Link]

    Perl does need the time to get its next version ready for the competition.

    Counter-Proposal to Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 6:49 UTC (Thu) by cmot (subscriber, #53097) [Link]

    (I'm in no way involved with Python language hackers, so YMMV):

    ONE: Never again introduce a "Python 3" thingy. A possible way for language changes in the future:

    • introduce the new syntax, allow both in parallel (no matter how ugly the parser gets...)
    • after 2 years: introduce a warning if the old syntax is used.
    • after another 2 years: require some kind of option ("#pragma" style? It is allowed to be ugly so people want to get rid of it...) that needs to be set for the deprecated feature to be used.
    • then drop the old feature after another 2 years.
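
The staged schedule in the list above can be sketched in a few lines. This is only an illustration of the idea, not how CPython handles syntax changes: the stage names, the `use_old_feature` function and the `opted_in` flag are all invented for the example, with the standard `warnings` module standing in for the interpreter's own diagnostics.

```python
import warnings

# Stages from the list above; the names are made up for this sketch.
PARALLEL, WARN, OPT_IN, REMOVED = range(4)


def use_old_feature(stage, opted_in=False):
    """Return True if the deprecated feature may still be used at `stage`."""
    if stage == PARALLEL:
        return True
    if stage == WARN:
        warnings.warn("old syntax is deprecated", DeprecationWarning, stacklevel=2)
        return True
    if stage == OPT_IN:
        if not opted_in:
            raise SyntaxError("old syntax needs an explicit opt-in")
        warnings.warn("old syntax will be removed", DeprecationWarning, stacklevel=2)
        return True
    raise SyntaxError("old syntax has been removed")


# Usage: during the warning stage the old form still works, but is reported.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    allowed = use_old_feature(WARN)
warned = [w.category for w in caught]
```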

    Yes, Py 3 would have taken 6 years to be fully introduced. But most of the new stuff could have been used immediately by new apps even if libraries are not migrated yet. Changes where existing syntax changes behaviour significantly (do we really want them?) will obviously require a flag day, but a similar scheme can be used; like it was done for some things with the __future__ stuff.

    TWO: Run a Python 3 sprint now. Many projects run "sprints" or "hackfests" or whatever they call it. Why not run a python-world-wide Python 3 sprint? End goal: all important Python frameworks are migrated to Py3. Choose a group of heavily involved Pythonists (those that are not only working on one framework but see the larger picture), have them analyze library dependencies and announce a proposed roadmap. Then get a "one library per week" schedule going until all of the widely used frameworks work with py3. That way, it's clear for all projects that the critical mass of py3 code will soon be out there.

    Yes, projects will still have to support py2 and py3 versions in parallel, but this can't be avoided. But at least the "we'll not port to py3 until others move to py3, too" deadlock is solved.

    Counter-Proposal to Proposal: Moratorium on Python language changes

    Posted Nov 1, 2009 9:33 UTC (Sun) by bockman (guest, #3650) [Link]

    IMO, the big problem with python 3 is not that they changed python language, as much as they changed the C API. I read that this was out of necessity, because some of the planned changes could not have been done without breaking it. But doing so, they deprived python - hopefully temporarily - of the huge set of external modules that make python fly. Until they catch up - if they do - python3 is just a "nice language, but you can't do much with it".

    Counter-Proposal to Proposal: Moratorium on Python language changes

    Posted Nov 1, 2009 17:15 UTC (Sun) by foom (subscriber, #14868) [Link]

    Nah, the changes to the C API were minimal. It seems to me actually a lot easier to support
    the slightly different C API than the new Python language/API. But, most interesting python
    modules have a large part of them written in Python...

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 6:57 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

    Oh no! What if there's some amazing new programming language feature introduced in C++1x or Java 1.8, and Python 3.1 doesn't immediately adopt an equivalent? The whole world will drop Python and switch to something else! It will be an unmitigated disaster! In fact, we'd better switch to another language right now, just to be on the safe side!

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 10:38 UTC (Thu) by euske (subscriber, #9300) [Link]

    As one who's developing a code analyzer for Python, this seems a good (non-)move. This is good news for book writers and teachers who use Python as teaching material, too, because they wouldn't be compelled to cover a cutting-edge version of the language.

    Aside from developers' point of view, I guess it is good for marketing also, because it gives people more feeling of stability. Python is already rock solid stable, but the current pace of development would give a somewhat different impression to those who're starting to try a new language.

    Proposal: Moratorium on Python language changes

    Posted Oct 22, 2009 14:57 UTC (Thu) by pboddie (subscriber, #50784) [Link]

    In highlighting these groups of people who are most sensitive to language change - those having to otherwise chase language development and to cater for every new shiny feature added to the language and immediately used by "early adopters", those hoping that their recently published book isn't already obsolete, and those trying to learn Python who are confused by the two incompatible versions and two kinds of advice - you've managed to identify a number of groups who were apparently ignored when Python 3 was being planned.

    But what irritates me is this "we must get everyone onto Python 3" attitude. Aside from the marginal benefits of Python 3 (which I've ranted about before - the most disliked elements in CPython remain things like performance and the global interpreter lock), the standard library has been neglected for years aside from sporadic attempts to shoehorn new stuff into it. Maybe this is the BDFL's way of telling the core developers to get working on renovating the standard library or improving the internals, but I imagine that what will really happen is that everyone will go and play in a sandbox, and then the hard lobbying for "exceptions" to any moratorium will begin.

    Or maybe the plan is to have the PyPy project offer a more interesting alternative for Python 2.x developers: a six-fold or better speed increase versus the hard work of migrating to Python 3 for negligible performance benefits (if not an actual performance regression). Suddenly, Python 3 doesn't seem to have the air of inevitability any more.

    Proposal: Moratorium on Python language changes

    Posted Oct 23, 2009 15:23 UTC (Fri) by sergey (guest, #31763) [Link]

    Python 3 needed a clean break from 2 for multiple reasons. The way I saw it was that modern requirements (Unicode support, "professional open source") seriously challenged some of the original design principles and conventions. I don't doubt that new users will benefit from Python 3's changes. As for book writers, they've got a reason to publish updated or completely new books, what's not to like?

    > the standard library has been neglected for years aside from sporadic attempts to shoehorn new stuff into it

    I don't disagree with your main point, but in fairness, the standard library did receive a big face lift in Python 3. Many people get turned off by the fact that, unlike Java or C#, Python 1 and 2 didn't enforce a uniform naming convention in the library. Those who try to increase adoption of the language in the "enterprise" often have a hard time because this absence of a single style is deemed unprofessional.

    And then imagine a feat of remaking the entire library to deal with bytes vs Unicode strings.

    Proposal: Moratorium on Python language changes

    Posted Oct 23, 2009 18:01 UTC (Fri) by pboddie (subscriber, #50784) [Link]

    Python 3 needed a clean break from 2 for multiple reasons. The way I saw it was that modern requirements (Unicode support, "professional open source") seriously challenged some of the original design principles and conventions.

    Python 2 supported Unicode fairly well. The only thing it didn't do was confront programmers with Unicode and good text handling practices immediately, which is what making Unicode objects the main string type seems to achieve, albeit with various caveats. I recall some cases (although not the details) where this can be counterproductive - generally where system interfaces, input/output and filesystems are involved - and Python 3's approach doesn't necessarily solve all the problems. You'll have to elaborate on what "professional open source" means - it can't have too much to do with things like better support for binary compatibility (although some work has been done on this) or cross-compilation (another area where Python sees a lot of action without any support from the core developers), that are arguably big "professional" areas.

    I don't doubt that new users will benefit from Python 3's changes. As for book writers, they've got a reason to publish updated or completely new books; what's not to like?

    Not a great deal unless you have a book on Python 2.x out. It's also somewhat embarrassing for all the books already out there to suddenly be incompatible with the "mainstream" edition of the language: a great way to dilute the "brand". Obviously, people can argue that PHP and a variety of other technologies do this kind of thing all the time, but it gives ammunition to people like managers who think that everyone should just write Java programs and not use "this open source experimental stuff".

    As for the standard library, I confess to having had an interest in advocating work on renovating it, and the response was more or less a brush-off. With stuff like PyPy and Unladen Swallow (and lots of other work, mostly hovering around the Python 2.x language definition), had the standard library been prioritised instead of Python 3, I don't think many people would really miss an absent Python 3 at all.

    "Unicode"

    Posted Oct 29, 2009 19:52 UTC (Thu) by spitzak (guest, #4593) [Link]

    Python is doing "Unicode" wrong, and Python 3.0 is making things worse.

    Strings should and MUST be bytes. In the real world UTF-8 can have invalid bytes in it, and UTF-16 can have invalid sequences. The Python implementers are living in a fantasy land where they believe that declaring data to be "UTF-8" somehow magically causes all the invalid sequences to disappear, as though the laws of quantum physics make them impossible.

    Here is what really happens: the very first moment that some programmer gets an error when reading invalid UTF-8, they will "fix" it by changing the encoding to ISO-8859-1. They are replacing a total Denial of Service failure with a minor defect that all the non-ASCII characters are mangled. This is a no-brainer choice for the programmers and pretending they will act otherwise is stupid.
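
    As a rough illustration of the trade-off described above, in present-day Python 3 (this is not code from the thread, and the sample bytes are made up): strict UTF-8 decoding raises on a single invalid byte, while ISO-8859-1 decoding never raises but mangles every non-ASCII character.

```python
# Illustrative sketch only; the sample data below is invented for this example.
data = "naïve café".encode("utf-8") + b"\xff"  # valid UTF-8 plus one invalid byte


def decode_strict(raw: bytes) -> str:
    """Strict UTF-8 decoding: raises UnicodeDecodeError on any invalid byte."""
    return raw.decode("utf-8")


def decode_latin1(raw: bytes) -> str:
    """ISO-8859-1 maps every byte to one code point, so this never raises."""
    return raw.decode("iso-8859-1")


try:
    decode_strict(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# No exception here, but the accented letters come out as two-character mojibake.
mangled = decode_latin1(data)
```

    The ISO-8859-1 result still round-trips to the original bytes, which is part of why the "fix" is tempting even though the text shown to the user is wrong.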

    Ballmer and the Microsoft programmers are laughing themselves silly right now as they watch FOSS swiftly sabotage any ability to handle Unicode with these moronic ideas and relegate Linux to an ISO-8859-1-only ghetto. This is shameful especially when you consider that UTF-8 was invented by Ken Thompson and Rob Pike for Plan 9, a Unix derivative.

    I should be able to reliably take ANY byte array and turn it into "unicode" and back with NO errors and NO loss. Any other design makes Python useless and forces me to use "bytes" for all data and to set the encoding to ISO-8859-1 so those APIs that want "Unicode" will not throw errors and crash my program.

    Instead I believe the amount of Python code that would fail if string[x] returned the x'th byte from the UTF-8 encoding is minuscule. All such code is searching for ASCII and will still work.

    String constants should be bytes. "\uXXXX" means the bytes for the UTF-8 encoding. "\xNN" means exactly that byte, even if the result is invalid UTF-8.

    If you want Unicode code points, have "for x in string" return them, but as a special new object where encoding errors are unique different values that can be tested and don't compare equal to any character. Converting to "Unicode" would simply add a flag to the string object that indicates that "for x in string" acts this way. There could be other flags for all the different codecs, and it could also do canonical composition/decomposition and other Unicode actions.

    "Unicode"

    Posted Oct 30, 2009 20:31 UTC (Fri) by nix (subscriber, #2304) [Link]

    Here is what really happens: the very first moment that some programmer gets an error when reading invalid UTF-8, they will "fix" it by changing the encoding to ISO-8859-1. They are replacing a total Denial of Service failure with a minor defect that all the non-ASCII characters are mangled. This is a no-brainer choice for the programmers and pretending they will act otherwise is stupid.
    Maybe in the US (hell, there they'd probably just go back to 7-bit ASCII!); maybe in parts of Europe. But a lot of people want stuff to work in the Far East now...

    "Unicode"

    Posted Oct 30, 2009 21:59 UTC (Fri) by spitzak (guest, #4593) [Link]

    hell, there they'd probably just go back to 7-bit ASCII!

    Indeed they have. I have personally encountered software that "fixed" encoding problems by masking the high bit, by removing all bytes with the high bit set, and by replacing all bytes with the high bit set with "\xNN" sequences. So claiming that they would even preserve ISO-8859-1 was perhaps being too kind. In fact we are regressing to earlier than the 1980's by going ASCII-only.

    What is happening in the Far East is that Asian text is getting stored in UCS-2 (though they may claim it is UTF-16), or in non-error-throwing encodings such as the older JP multibyte, while all other text is in ISO-8859-1 or ASCII (they may claim it is UTF-8). Thus text is relegated to two different file types, the exact thing Unicode was supposed to fix!

    "Unicode"

    Posted Oct 31, 2009 0:24 UTC (Sat) by nix (subscriber, #2304) [Link]

    In fact we are regressing to earlier than the 1980's by going ASCII-only.
    I don't know who 'we' is, but it doesn't describe any software development shop I know of. Everyone is more i18n-aware than they used to be, not less.

    As for text being relegated to multiple encodings, well, Unicode is rapidly conquering over there, as well. Yes, you have to distinguish between UCS-2 and UTF-8, but you've had to do that for ages, and there are pretty accurate heuristics now. Needing heuristics to detect encodings is nothing new, either: we've always needed them for EBCDIC-versus-ASCII, even before ISO-8859 was heard of.

    And this problem of illegal UTF-8 characters which you claim is so catastrophic? I've never once seen them outside fuzz tests, attempted attacks, and while debugging a heuristic charset detector. They just don't occur in normal use of a system, at all. Catastrophe? No. The security implications are interesting, but not as significant as problems with equality-comparing UTF-8 characters without considering that they may not be in canonical form -- a problem you didn't mention.
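
    The canonical-form problem mentioned above can be sketched with Python 3's standard unicodedata module (an illustration, not code from the thread; the helper name canonical_equal is hypothetical):

```python
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison sees two different code point sequences.
naive_equal = composed == decomposed
# After normalization the two spellings of "é" compare equal.
normalized_equal = canonical_equal(composed, decomposed)
```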

    "Unicode"

    Posted Oct 31, 2009 1:47 UTC (Sat) by spitzak (guest, #4593) [Link]

    The most obvious catastrophe is the inability to access data that is not valid UTF-8, even to fix it.
    You can't fix incorrect UTF-8 if your editor refuses to load the file. For a more obvious example,
    you cannot correct an incorrect UTF-8 filename if your filesystem API refuses to provide a way to
    identify that file in the rename call.

    If you have not seen a junior programmer "fix" these by treating the UTF-8 as ISO-8859-1
    (sometimes done by "double encoding UTF-8" but the result is the same) then I don't think you
    have worked very much with teams of programmers. This is destroying I18N on Linux and in many
    internet standards. On Windows it is destroying UTF-16 but it is less of a problem as only non-BMP
    characters are lost.

    I think changing to iterators is the first step to correctly handling canonical forms and all the other
    Unicode problems. This insistence on changing it to a fixed size array and ignoring patterns is
    actually a deterrent to correct comparisons.

    "Unicode"

    Posted Oct 31, 2009 12:37 UTC (Sat) by nix (subscriber, #2304) [Link]

    How often do you *see* allegedly-UTF-8 data that isn't valid UTF-8? In my
    experience it's vanishingly rare, much less common than encountering
    Unicode mapping to unmapped codepoints. What's more, both are dealt with
    the same way: the latter is shown using a square box glyph (losing
    information about precisely which character it is, but you rarely care);
    the former is dealt with by transforming it into a convenient valid
    character, often a form of ? or the replacement character, or a graphical
    box containing the invalid bytes (you sometimes lose information about
    precisely what the invalid string was, but you rarely care). (Noncanonical
    UTF-8 is generally quietly canonicalized.)

    You are making a mountain out of a very, very small and already-levelled
    molehill: Python's behaviour is known bad and will almost certainly be
    fixed, that's why there was such a lot of noise over it. To claim that
    it's 'destroying' UTF-8 is utterly laughable.

    As for filenames in UTF-8, well, that's why POSIX considers filenames to
    be a byte string. So should interfaces to POSIX. This is unlikely to
    affect anything but a language that nobody much uses yet and an OS
    (Windows/NTFS) that has taken considerable pain over this (especially
    combined with case-insensitivity) and which thankfully is not an OS this
    site is about.

    "Unicode"

    Posted Nov 2, 2009 18:37 UTC (Mon) by spitzak (guest, #4593) [Link]

    I agree it is very rare, but it just takes ONE failure to make a programmer say "forget that, I'll treat it as ISO-8859-1 because I don't give a s**t about Chinese..."

    You would like errors to turn into boxes, but the majority of software does not do that; instead it throws exceptions, which in most cases is equivalent to a Denial of Service if in fact there is no other way to convey the string to the back end. Particularly nasty for me are Python's conversions to "Unicode", Qt strings, Qt's HTML renderer, and the XRender "draw this UTF-8 string" API. I am sure there are many, many other examples.

    In my ideal solution, conversion is deferred until as late as possible, probably as part of the glyph layout code (ie Pango, etc). At this point it is harmless to make a lossy conversion (since layout is lossy anyway, doing canonicalization), and I would convert the error bytes to the matching characters in the Microsoft CP1252 character set. This has the advantage that accidental non-UTF-8 is readable by the users. Believe me they really don't want to see boxes!

    The Python "solution" of turning errors into 0xDCxx sort of works, but has the nasty problem that you must track the original source of a string to properly convert back to UTF-8 or UTF-16. If you don't, you either make it impossible to produce all possible UTF-16 strings (very bad because you will be unable to name all files on Windows), or you make it possible for a malicious invalid UTF-16 string to turn into a valid UTF-8 string. For this reason I don't think this solution is going to work, and that keeping the strings as UTF-8 (and converting UTF-16 TO UTF-8, which is lossless) is the only way to go.
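
    For reference, a minimal sketch of how this behaves in present-day Python 3, assuming the mechanism under discussion is the surrogateescape error handler from PEP 383, which maps each undecodable byte in the range 0x80-0xFF to a lone surrogate U+DC80-U+DCFF. The filename bytes are invented for the example.

```python
raw = b"caf\xe9.txt"  # made-up name in ISO-8859-1; 0xE9 is not valid UTF-8 here

# Decoding does not raise: the invalid byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns U+DCE9 back into the original byte.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the escaped string cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

    The round trip only works if the encoding side knows to use the same handler, which is the "track the original source" concern raised above.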

    "POSIX considers filenames to be byte strings": this sort of statement is the problem. Of course it is byte strings. What you are really saying is "I will pretend the problem does not exist by declaring anything that might contain errors to be "not UTF-8"". The problem is that at some point somebody wants to look at what the byte string means to the user, they will indeed have to say "oh yes this *is* UTF-8". Or worse, they might say "oh this is ISO-8859-1 because then I know my program won't throw a damn exception". Statements like this are exactly the problem I am hoping can be fixed.

    In reality, of course filenames are "byte strings", but this is because UTF-8 is a byte string. ALL BYTE STRINGS ARE UTF-8. They are also ASCII and ISO-8859-1 or JP encoded or random binary garbage! They can have invalid UTF-8 sequences in them. They can also have misspelled words, control characters, or they can have French words in them while the program thinks they are English. They can have invalid Unicode glyph sequences such as misplaced combining accents. They can spell out a false math proof, or a political opinion that you disagree with. There are billions of errors that can be in the string. Deal with it correctly, instead of declaring that some tiny ill-defined subset of possible errors make the string be "not UTF-8".

    "Unicode"

    Posted Nov 3, 2009 6:47 UTC (Tue) by Cato (subscriber, #7643) [Link]

    I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
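
    Python 3's codecs offer a comparable kind of configurability through the errors argument. The following is only a sketch with invented sample data, not a claim about how Perl's Encode works:

```python
bad_utf8 = b"abc\xffdef"  # made-up input with one invalid byte

# Substitute U+FFFD REPLACEMENT CHARACTER for the invalid byte.
replaced = bad_utf8.decode("utf-8", errors="replace")

# Silently drop the invalid byte.
ignored = bad_utf8.decode("utf-8", errors="ignore")

# Keep the byte visible as a literal backslash escape (decoding support: Python 3.5+).
escaped = bad_utf8.decode("utf-8", errors="backslashreplace")

# On the encoding side, unencodable characters can become XML character references.
as_entities = "caf\u00e9".encode("ascii", errors="xmlcharrefreplace")
```

    The default, errors="strict", is the mode that raises an exception.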

    Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ expect valid UTF-8 (HFS+ uses UTF-16 internally, and it must also be decomposed Unicode i.e. NFD).

    "Unicode"

    Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593) [Link]

    > filenames are not always byte strings, unfortunately - every filesystem has various illegal characters

    That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.

    The only non-byte-stream filename api that is used at all is UTF-16. However UTF-16 (including invalid UTF-16) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore all filesystems in existence can be controlled by a byte stream API, using UTF-8 as the encoding.
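
    One way to sketch this lossless round trip in present-day Python 3 is the surrogatepass error handler, which lets unpaired surrogates through the UTF-8 and UTF-16 codecs. This is an illustration under that assumption, with an invented name, and the resulting bytes are not strictly valid UTF-8.

```python
# Made-up UTF-16-LE name: "a", an unpaired low surrogate U+DC00, then "b".
utf16_name = b"a\x00\x00\xdcb\x00"

# A strict decode rejects the unpaired surrogate...
try:
    utf16_name.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...but surrogatepass carries it through as a lone surrogate code point.
as_str = utf16_name.decode("utf-16-le", errors="surrogatepass")

# The surrogate is written as the three-byte sequence ED B0 80.
as_utf8 = as_str.encode("utf-8", errors="surrogatepass")

# Reversing both steps reproduces the original UTF-16 bytes exactly.
back_to_utf16 = as_utf8.decode("utf-8", errors="surrogatepass").encode(
    "utf-16-le", errors="surrogatepass"
)
```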

    It is true that there are UTF-8 streams that cannot be turned into UTF-16, these would be "illegal characters" for the filenames. If the filesystem does not have a byte api then this can be replicated by turning all errors into "illegal characters" in UTF-16 so that an equivalent error is thrown.

    "Unicode"

    Posted Nov 2, 2009 9:40 UTC (Mon) by njs (guest, #40338) [Link]

    > This is destroying I18N on Linux and in many internet standards.

    I think if you want us to take such catastrophic declarations seriously you should perhaps name some examples of specific free software or internet standards that have had their I18N "destroyed" (or even negatively affected).

    "Unicode"

    Posted Oct 30, 2009 22:38 UTC (Fri) by foom (subscriber, #14868) [Link]

    I believe from reading some of the Apocalypsen years ago that Perl 6 is going to do something like this. Seems like a nice plan.

    Decoding all your bytes into UCS4 is wasteful in time and memory, and 99% of the time unnecessary. Who ever wants to talk about codepoints, anyways? If you want more than bytes, graphemes are the more useful concept, and Python's unicode strings do nothing for you there.

    People (especially people who design programming languages) need to get over this obsession with constant-time access to arbitrary numbered codepoints. It's really *not* an important or useful feature to have in your API!

    "Unicode"

    Posted Oct 31, 2009 0:27 UTC (Sat) by nix (subscriber, #2304) [Link]

    O(1) *anything* is a good property to retain, IMO, especially with regard
    to things as fundamental as characters.

    Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
    stuff on strings goes to O(n^2) and so on: not remotely good.

    "Unicode"

    Posted Oct 31, 2009 0:50 UTC (Sat) by foom (subscriber, #14868) [Link]

    > Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
    > stuff on strings goes to O(n^2) and so on: not remotely good.

    That clearly only happens if your language doesn't have such a thing as iterators.
    "increment(character_iterator)" is still O(1) even if your underlying representation is
    UTF-8. The need to access an arbitrary numbered unicode codepoint in a string in
    constant time isn't really all that useful.

    Unicode codepoints don't really correspond to anything humans care about...splitting
    a string in the middle of a Grapheme is really just as bad as splitting it in the middle of
    a UTF-8 codepoint-sequence.
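
    A minimal sketch of such an iterator over UTF-8 bytes, written in Python for illustration. The function names are hypothetical, and the code assumes the buffer is valid UTF-8 and does no error checking.

```python
from typing import Iterator, Tuple


def next_codepoint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode the code point starting at byte offset pos.

    Returns (codepoint, offset_of_next_codepoint). Assumes buf is valid
    UTF-8 and pos is on a code point boundary. Each call touches at most
    four bytes, so advancing the iterator is O(1).
    """
    first = buf[pos]
    if first < 0x80:            # 0xxxxxxx: one byte (ASCII)
        return first, pos + 1
    if first >> 5 == 0b110:     # 110xxxxx: two bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:  # 1110xxxx: three bytes
        length, cp = 3, first & 0x0F
    else:                       # 11110xxx: four bytes
        length, cp = 4, first & 0x07
    for b in buf[pos + 1 : pos + length]:
        cp = (cp << 6) | (b & 0x3F)  # append six payload bits per continuation byte
    return cp, pos + length


def iter_codepoints(buf: bytes) -> Iterator[int]:
    """Yield code points in order; the iterator state is just a byte offset."""
    pos = 0
    while pos < len(buf):
        cp, pos = next_codepoint(buf, pos)
        yield cp


sample = "a\u00e9\u20ac\U0001f600".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences
decoded = list(iter_codepoints(sample))
```

    A full pass over the string is still O(n) in total, because the iterator never has to count code points from the start.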

    "Unicode"

    Posted Oct 31, 2009 1:11 UTC (Sat) by spitzak (guest, #4593) [Link]

    You are exactly the sort of misguided person who is destroying UTF-8.

    Please explain EXACTLY where the "N" comes from that you are passing to your "go to the Nth UTF-16 code point" function. Answer: it is calculated by looking at all the preceding N-1 "characters". It is therefore a misguided attempt to store an iterator in an integer, and it can be trivially replaced by a real iterator that uses a byte offset or pointer.

    "Unicode"

    Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

    You're just repeating what foom said in different words, I think.

    "Unicode"

    Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

    Sorry, I misinterpreted you. Of course it's more complicated to iterate
    over strings now, but really not much more, and UTF-8 (unlike the
    fixed-width multibyte encodings) is easy to resync to if you start from an
    arbitrary byte, so things like binary searches in long strings are still
    possible with a tiny bit of extra tweaking.

    And, agreed, the ability to treat a string as a fixed-width array is
    really quite unimportant: generally people iterate over strings rather
    than leaping to position N. (You meant 'position' or 'offset', though,
    not 'codepoint', which is entirely different. Codepoint 'access' isn't
    even a particularly meaningful concept: what does it mean to 'access'
    ASCII codepoint 65? Codepoints just *are*.)

    "Unicode"

    Posted Nov 2, 2009 9:46 UTC (Mon) by njs (guest, #40338) [Link]

    You can store text in UTF-8 and still have O(log n) random access (and insertion!) by storing your string in a character-offset-indexed tree. No-one seems to bother implementing this, though, presumably because it's complicated, and the overhead only amortizes out when you need to do random access/insertion on large chunks of text, which turns out to be a fairly rare need in practice. Also, converting from UCS-4 to UTF-8 is a pessimization for CJK.

    "Unicode"

    Posted Nov 2, 2009 18:49 UTC (Mon) by spitzak (guest, #4593) [Link]

    UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.
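
    The self-synchronizing property can be sketched in a few lines of Python (illustrative only; sync_to_boundary is a hypothetical helper and assumes valid UTF-8). Continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset the start of the enclosing code point is at most three bytes back.

```python
def sync_to_boundary(buf: bytes, pos: int) -> int:
    """Return the offset of the first byte of the code point containing buf[pos].

    Steps backward over continuation bytes (10xxxxxx). For valid UTF-8 this
    loop runs at most three times, so the cost does not depend on len(buf).
    """
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


buf = "a\u20acb".encode("utf-8")  # bytes: 61 | E2 82 AC | 62
boundaries = [sync_to_boundary(buf, i) for i in range(len(buf))]
```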

    What you can't do is define the insertion point as "after N repetitions of this regexp" which is really what is wanted when people say "characters". But no other data structure would dare to require this as the basic iterator. For some reason though text makes otherwise intelligent programmers into morons and they are just blind to the obvious solution.

    As for CJK, I think you meant UTF-16, not UCS-4. Converting from UCS-4 to anything will save memory, as it uses 4 bytes per Unicode code point. UTF-16 does use only 2 bytes for 0x800-0xFFFF while UTF-8 uses 3. However UTF-8 uses one byte for 0x00-0x7F and UTF-16 uses 2, so in fact the text will be smaller if there are more of these. In real CJK texts (ie not just single words) there are, because the ASCII range has spaces, newlines, numbers, all English quoted words, and all the XML markup, while the 3-byte ones are generally one-per-word and thus outnumbered. I believe however that some lengthy east-Indian texts, which use a phonetic alphabet that unfortunately requires 3 bytes per letter in UTF-8, are less efficient in UTF-8 than in UTF-16.
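
    The per-range byte counts are easy to check in Python. The sample strings below are made up for illustration and say nothing about the ratio found in real documents.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int, int]:
    """Return the byte length of text in (UTF-8, UTF-16, UTF-32), without a BOM."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )


ascii_only = encoded_sizes("A")                   # one code point in 0x00-0x7F
cjk_only = encoded_sizes("\u4e2d")                # one code point in 0x800-0xFFFF
cjk_in_markup = encoded_sizes("<p>\u4e2d\u6587</p>")  # 7 ASCII + 2 CJK code points
devanagari = encoded_sizes("\u0928\u092e\u0938\u094d\u0924\u0947")  # 6 code points
```

    In the markup sample the ASCII tags outweigh the two CJK characters, so UTF-8 comes out smaller; in the pure Devanagari sample UTF-16 does.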

    In any case talking about compression is silly, as you can use ZIP to turn any large document into far less than 1 byte per Unicode code point.

    "Unicode"

    Posted Nov 2, 2009 20:31 UTC (Mon) by njs (guest, #40338) [Link]

    > UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

    Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.

    I don't really understand what point you're making about regexps -- all the utf-8 apis I know provide character iterators. I am, though, skeptical that the authors are really all morons, and not sure that claiming they are really adds anything to the conversation.

    Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are identical in terms of the hassle of doing random access indexing, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.

    I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.

    "Unicode"

    Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

    If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory versus code tradeoff is generally not considered worth it, though.

    Proposal: Moratorium on Python language changes

    Posted Oct 24, 2009 6:22 UTC (Sat) by foom (subscriber, #14868) [Link]

    > And then imagine a feat of remaking the entire library to deal with bytes vs Unicode strings.

    That will indeed have been quite a feat...whenever it's actually done.

    Proposal: Moratorium on Python language changes

    Posted Oct 23, 2009 0:32 UTC (Fri) by hozelda (guest, #19341) [Link]

    Careful that a closed source python implementation not seize this opportunity to move ahead with features, forcing others to follow along in a way that is difficult to do (because of closed source).

    I don't know if IronPython is closed or not, but Microsoft is hypercompetitive and leverages their monopolies to create instant market share in new markets. Then they use closed source extensions to create a situation where they take the lead as users find that it's best to use their language impl with extensions instead of a language impl without the proprietary extensions.

    Slowing things down would allow the market to be seized by a closed source vendor.

    Of course, if there is a real technical reason not to want to add features, then that is another story.

    I'd prefer Python as an ISO standard.

    Posted Oct 23, 2009 13:06 UTC (Fri) by Markon (guest, #58619) [Link]

    I don't understand why Python hasn't been proposed as an ISO standard, now that it's stable.

    This would simplify things such as "freezing" for a long or short period.
    Do you have new features? Propose them. If they're accepted, well and good.

    Copyright © 2009, Eklektix, Inc.
    Comments and public postings are copyrighted by their creators.
    Linux is a registered trademark of Linus Torvalds