
Python 2.8?

By Jake Edge
January 11, 2017

The appearance of a "Python 2.8" got the attention of the Python core developers in early December. It is based on Python 2.7, with features backported from Python 3.x. In general, there was little support for the effort—core developers tend to clearly see Python 3 as the way forward—but no opposition to it either. The Python license makes it clear that these kinds of efforts are legal and even encouraged—any real opposition to the project lies in its name.

Larry Hastings alerted the python-dev mailing list about the Python 2.8 project (which has since been renamed to "Placeholder" until another name can be found). It is a fork of Python 2.7.12 with features like function annotations, yield from, async/await, and the matrix multiplication operator ported from Python 3. It is meant to be a drop-in replacement for Python 2.7, so it won't have features that are incompatible with it. It is aimed at those who are not ready (or willing) to make the jump to Python 3, but want some of the features from it.
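For readers who have not used them, the backported features named above look roughly like this in stock Python 3. This is only an illustrative sketch of the Python 3 syntax, not code taken from the fork; the names `inner`, `outer`, `Scale`, and `answer` are invented for the example, and `asyncio.run()` requires Python 3.7 or later.

```python
import asyncio
from typing import Iterator


def inner() -> Iterator[int]:  # "-> Iterator[int]" is a function annotation
    yield 1
    yield 2


def outer() -> Iterator[int]:
    # "yield from" delegates to a sub-generator (Python 3.3+).
    yield from inner()
    yield 3


class Scale:
    """Toy class that supports the @ operator via __matmul__ (Python 3.5+)."""

    def __init__(self, factor: int) -> None:
        self.factor = factor

    def __matmul__(self, other: "Scale") -> "Scale":
        return Scale(self.factor * other.factor)


async def answer() -> int:
    # async/await syntax (Python 3.5+).
    await asyncio.sleep(0)
    return 42


delegated = list(outer())
product = (Scale(2) @ Scale(3)).factor
result = asyncio.run(answer())
```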

The name "Python 2.8" implies a level of support, though; it also uses a name (Python) that is trademarked by the Python Software Foundation. Steven D'Aprano recalled discussions at the time of the decision to stop Python 2.x development at 2.7:

I seem to recall that when we discussed the future of Python 2.x, and the decision that 2.7 would be the final version and there would be no 2.8, we reached a consensus that if anyone did backport Python 3 features to a Python 2 fork, they should not call it Python 2.8 as that could mislead people into thinking it was officially supported.

He and others called for the project to be renamed. An issue was filed for the project suggesting a rename. As it turns out, the owner of the project, Naftali Harris, is amenable to the change, which simplifies things greatly. Had that not been the case, though, it is not entirely clear that the PSF Trademark Usage Policy precludes using the name "Python" that way.

David Mertz, who is a member of the PSF Trademarks committee, believes that "Python 2.8" would be a misuse of the trademark and referred it to the committee. Terry Reedy agreed, saying that the project was a "derived work" and that clause 7 of the Python License does not automatically allow the use of the PSF trademarks.

But Marc-Andre Lemburg noted that the trademark policy is seemingly written to allow for uses like this. The policy says:

[...] stating accurately that software is written in the Python programming language, that it is compatible with the Python programming language, or that it contains the Python programming language, is always allowed. In those cases, you may use the word "Python" or the unaltered logos to indicate this, without our prior approval.

He pointed out that the project also fulfilled the license requirements by listing the differences from 2.7.12 as is required in clause 3. But he agreed that a name change should be requested. For his part, Guido van Rossum is not particularly concerned by the existence of the project:

While I think the name is misleading and in violation of PSF policy and/or license, I am not too worried about this. I expect it will be tough to port libraries from Python 3 reliably because it is not true Python 3 (e.g. str/bytes). So then it's just a toy. Who cares about having 'async def' if there's no backport of asyncio?

Mertz, however, is not so sure. The existence of a "Python 2.8" may "serve as a pretext for managers to drag their feet further on migration plans", which will be detrimental to organizations where that happens. PEP 404 (the "Python 2.8 Un-release Schedule") makes it quite clear that the core development team (and, presumably, the PSF) is resolute that there will be no 2.8 release: "There never will be an official Python 2.8 release. It is an ex-release."

But there are various other projects that have "Python" in their names (IronPython, ActivePython, MicroPython, etc.) as well as projects with names that are suggestive of Python without directly using the name (Jython, PyPy, Cython, Mython, and so on). Where is the line to be drawn? As with all trademark questions, though, it comes down to a question of user confusion: will users expect that something called "Python 2.8" is officially endorsed and supported by the PSF? The answer would seem to clearly be "yes".

Luckily, everyone is being fairly reasonable—no legal action has been needed or even really considered. The fact that Harris was willing to change the name obviated any need to resort to legal remedies. The GitHub issue thread is full of suggestions for alternate names, replete with Monty Python references—our communities love to bikeshed about names. There are also some snide comments about Python 3 and the like, but overall the thread was constructive.

As far as new names go, an early favorite was "Pythonesque", but calling the binary "pesque" reminded some of the word "pesky", which is not quite what Harris is after (though "pyesque" might work). He renamed the project to "Placeholder" on December 12 "while we find a good permanent name that I like and that works for the PSF". The current leader appears to be Pyvergent (since Mython already exists and one might guess that Harris is not entirely serious about Placeholder). In any case, he said, the decision does not need to be made immediately.

At this point, Placeholder appears to largely be a one-developer project. Its GitHub history starts in October 2016 and some real progress has seemingly been made; quite a few features have been ported from Python 3. The issues list shows some ambitious plans that might make it less of a "toy" than Van Rossum envisioned. If it ends up being popular and attracting more of a community, it could perhaps become a strong player in the Python world.

There is a balance to be struck on trademark policies for free-software projects. As we saw in the Debian-Mozilla trademark conflict, which resulted in the "Iceweasel" browser and was resolved early last year, distributions and others want to be able to make changes to projects while still being able to use the trademarks. As Nick Coghlan pointed out, for Python, Linux distributions are likely pushing the envelope the furthest:

Linux distros probably diverge the furthest out of anyone distributing binaries that are still recognised as a third party build of CPython, such that the Linux system Python releases are more properly called "<distro> Python" rather than just Python. However, distro packaging formats are also generally designed to clearly distinguish between the unmodified upstream source code and any distro-specific patches, so the likelihood of confusion is low (in a legal sense).

It would seem that the PSF might want to tighten its policy slightly such that it retains control over "Python x.y" and similar trademarks, while still allowing the Python name to appear in the names of other related projects (like MicroPython). That way, if legal action is actually needed at some point (which no one wants to see, of course) it will be clear that the intent and the policy line up. Fragmentation is a clear possibility given the "forkable" nature of free-software projects, but it is certainly not unreasonable for the parent project to retain a measure of control to reduce confusion—that is precisely what trademarks are for.




Python 2.8?

Posted Jan 11, 2017 20:52 UTC (Wed) by karkhaz (subscriber, #99844) [Link] (1 responses)

> As it turns out, the owner of the project, Naftali Harris, is amenable to the change

That was a charming and amusing thread, especially in the face of that one troll. It pains me to say this, but it's commendable for everyone involved to have kept up a cheerful and whimsical attitude on such a bikeshedworthy topic, and also one with such a contentious history.

I'd be super-interested in reading some kind of analysis on what makes these discussions work, or not. There was a nice BoF in Debconf'15 about engineering discussions so that they proceed civilly, though I don't know if anybody pursued that further. It would be nice to have a collection of shining examples of discussions going constructively, with an analysis of why so that we can learn by example. And a collection of the opposite, discussions that disintegrate into emotionally-draining flamewars, and an analysis of what catalysed that and how it could have been prevented.

Python 2.8?

Posted Jan 20, 2017 13:51 UTC (Fri) by pboddie (guest, #50784) [Link]

That was a charming and amusing thread, especially in the face of that one troll.

You missed the wink gesture to everyone else in the know here, I think. Who are the trolls here? The ones voicing their concern at the hostility towards anyone prolonging the life of Python 2, or the ones who suggest names like "Norwegian Blue" to passive-aggressively suggest that Python 2 should cease to be? The ones voicing their concern about threats of trademark litigation, or the ones calling for the lawyers to get involved?

I guess that Python 2 users will be excluded from presenting at Python conferences and events in future if they don't commit to using Python 3. That's the sort of mindset creeping in here. Never mind that at those conferences there will undoubtedly be talks about JavaScript and who knows what else, just as there always has been.

Python 2.8?

Posted Jan 12, 2017 1:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (36 responses)

> I expect it will be tough to port libraries from Python 3 reliably because it is not true Python 3 (e.g. str/bytes).
Wow! A new sane version of Python. Finally!

I hoped that the PathLike fiasco taught Python 3 developers a lesson, but no....

> serve as a pretext for managers to drag their feet further on migration plans
Seriously. That's the main reason why Google is rewriting their code in Go now.

Python 2.8?

Posted Jan 12, 2017 14:41 UTC (Thu) by rbrito (guest, #66188) [Link] (35 responses)

Could you please (kindly) comment on this fiasco thing? I just skimmed (very briefly) about this os.PathLike protocol that you mentioned and it seemed to be a good thing. Is it something bad?

Python 2.8?

Posted Jan 12, 2017 19:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (34 responses)

The need for this protocol is a bad thing. Python 2.7 just uses byte strings and it works perfectly fine.

Python 3 developers were bitten by a rabid weasel and decided that they need "Unicode". The rationale was that: "We need Unicode for foreign languages because Python. And why are you asking, are you some kind of foreign-language hating American imperialist?"

So magic Unicode was added. It existed before with u"" strings in Py2.7, but in Py3k it was cranked all the way up to 11. So as a result, first Python 3 versions couldn't even print formatted byte strings (that was fixed in Python 3.2, I believe).

And of course, "Unicode" in Python means some magic butterfly unicorny thingie with an unspecified internal implementation, so you don't even get O(1) character addressing within a string. At least Perl 6 uses the normalized grapheme form which provides it.

The problem here is that the world around is not really Unicode. File paths can contain arbitrary byte strings but it would be awkward to admit that, so earlier versions of Python simply skipped files with "incorrect" names in Unicode filesystem operations (not joking here). And there's no easy way to fix that in Python 3 - even a simple print() statement would likely mix the pure abstract world of Unicode unicorns with dirty reality around it.

So there's a whole new protocol to make filesystem paths behave both as strings and as unicode entities. And now I'm eagerly waiting for HttpHeaderLike, StringReadFromFileLike, UserInputOnCommandLineLike and so on.

Oh, and Python 3 also switched defaults for strings, so all un-annotated strings are Unicode by default. And don't you dare to try to concatenate them with data that has impure thoughts:
>>> a = "asdfasfd"
>>> b = b'\x01\x02\x00'
>>> a+b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly

In Python 2.7 that just works:
>>> a = "asdfasfd"
>>> b = b'\x01\x02\x00'
>>> a+b
'asdfasfd\x01\x02\x00'

I can continue, but that should be enough for one post.
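For comparison, here is a minimal sketch of how the same concatenation is usually spelled in Python 3, where one operand has to be converted explicitly. The choice of ASCII and Latin-1 is an assumption made for this illustration; real code has to pick whichever encoding matches its data.

```python
text = "asdfasfd"
raw = b"\x01\x02\x00"

# Option 1: move the text into the bytes domain and concatenate bytes.
as_bytes = text.encode("ascii") + raw

# Option 2: move the bytes into the str domain. Latin-1 maps every byte
# value 0-255 to the code point with the same number, so decoding never
# fails and the original bytes can be recovered by encoding again.
as_text = text + raw.decode("latin-1")
```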

Python 2.8?

Posted Jan 13, 2017 5:38 UTC (Fri) by ddevault (subscriber, #99589) [Link] (19 responses)

You're making the fundamental mistake of thinking of Python 3's bytes as a string at all. It's not. It's an array of octets. It is NOT a string. If you ever try to do string manipulation on a bytes, you're Doing It Wrong. Bytes _may_ be an encoded string, but to use it you of course have to decode it.

It's fair to argue that Python 3's choice to break backwards compatibility was a mistake, but it's not fair imo to criticise the current design. It's completely sane, much more so than Python 2's design.

Python 2.8?

Posted Jan 13, 2017 6:48 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (16 responses)

> You're making the fundamental mistake of thinking of Python 3's bytes as a string at all. It's not. It's an array of octets.
I don't care what you call them. The real world out there is not unicode, it is composed of strings of bytes.

Python 3 made working with them extremely cumbersome.

> It's fair to argue that Python 3's choice to break backwards compatibility was a mistake, but it's not fair imo to criticise the current design.
Yes it is fair. Python 3 unicode design is totally braindead. We now have several examples of GOOD unicode-enabled design, from both parts of the spectrum:

1) Perl 6 dove deep and encodes strings in the NFG, thus making string operations intuitively correct. It won't split words between combining characters, for example. Python 3, of course, is completely clueless about that.

2) Go treats strings as arrays of octets. Always. It also has handy functions to manipulate arrays of octets that happen to be UTF-8 encoded texts.

Sorry. But Py3 is not well-designed, it's a great example of a "second system effect". The fact that companies are migrating from Py27 to Golang should give Py3 developers some clue.

Example from this thread:
1) Set up:
>>> open('test-\ud800.txt', 'w').close()

2) Python 2.7 (works as expected):
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]

3) Python 3.5 (craps its pants):
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
FileNotFoundError: [Errno 2] No such file or directory: b'test-\xed\xa0\x80.txt'

So, why should I migrate to Py3 if it's full of stuff like this? What's the point?

Python 2.8?

Posted Jan 13, 2017 7:01 UTC (Fri) by ddevault (subscriber, #99589) [Link] (5 responses)

>I don't care what you call them. The real world out there is not unicode, it is composed of strings of bytes.

_Strings_ are NOT bytes. They are characters. Python 3 strings are arrays of characters, and bytes are just that - bytes. Bytes are not strings. Bytes are bytes. A "string of bytes" isn't a phrase that makes sense. Bytes in Python 3 is a byte array. That's all that it is. Period. End of discussion. It could be a byte array whose contents is an _encoded_ string, which you might decode into an array of characters (aka a string). But it's NOT a string.

>Python 3 made working with them extremely cumbersome.

No, it made it wonderfully simple. You're just using it wrong.

Your example is also broken.

>open('test-\ud800.txt', 'w').close()

You're already off to a bad start. Did you even try your example in Python 3?

>>> open('test-\ud800.txt', 'w').close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 5: surrogates not allowed

Let me show you what you meant to do:

>>> open(b'test-\ud800.txt', 'w').close()
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]

Note how it doesn't crap out. You need to get your head out of Python 2 and learn Python 3 properly.

Python 2.8?

Posted Jan 13, 2017 7:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> _Strings_ are NOT bytes. They are characters. Python 3 strings are arrays of characters
No, they aren't. They are NOT arrays of characters, and anybody who is familiar with Unicode can tell you that. Python strings are arrays of Unicode _codepoints_ - a Unicode character can _easily_ consist of multiple codepoints.
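The code point versus character distinction is easy to see in a short sketch. It uses only the standard `unicodedata` module, and the sample string is chosen purely for illustration.

```python
import unicodedata

# One user-perceived character ("e" with an acute accent) written as two
# code points: LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# len() and indexing count code points, not user-perceived characters.
decomposed_length = len(decomposed)
composed_length = len(composed)

# Slicing by code point can separate the base letter from its accent.
base_only = decomposed[:1]
```

Both strings render the same way, yet they compare unequal unless they are normalized first.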

> But it's NOT a string.
I don't CARE about strings. Py3 "strings" are an exception in the REAL world out here. Real world operates on byte sequences - they are in HTTP headers, in CSV files, in filenames and so on.

Why should "real strings" be special?

> No, it made it wonderfully simple. You're just using it wrong.
"You're holding it wrong" - exactly.

> You're already off to a bad start. Did you even try your example in Python 3?
Yes, I did. It's an example from: https://lwn.net/Articles/711492/ - on Windows, though I think I've used 'test\x01\x02.txt'.

> Note how it doesn't crap out. You need to get your head out of Python 2 and learn Python 3 properly.
Thanks, but I don't like Koolaid.

If I jump through all the hoops - what do I gain as a result? Nothing but headache. Unicode support in Python3 provides no usefulness whatsoever.

Python 2.8?

Posted Jan 13, 2017 7:26 UTC (Fri) by ddevault (subscriber, #99589) [Link] (3 responses)

You simply don't understand how it works and you're too stubborn to learn so you write ignorant flames about it. The problems you claim Python 3 has don't exist. _You_ and those like you are the only problem with Python 3.

Python 2.8?

Posted Jan 13, 2017 7:39 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

I know perfectly well how to work with Unicode. I know how Py3 is supposed to be used - it assumes that everything can be losslessly converted to "real strings" on the external border.

BUT THIS IS NOT TRUE!

Real world out here is full of misencoded stuff that can only be represented by byte arrays. It can NOT be encoded into strings without losses or quoting. Thus "PathLike" class with dual nature to represent them is needed for Py3.

Py3 makes working with such entities a total pain. Py27 is entirely a breeze - it just works.

And please, do explain what I gain from following all the Py3 rules of unicode strings? What are the advantages over Py27 with utf-8 strings?

Python 2.8?

Posted Jan 13, 2017 17:42 UTC (Fri) by sfeam (subscriber, #2841) [Link]

"Development would be so easy if it weren't for those pesky users."

Python 2.8?

Posted Jan 14, 2017 15:36 UTC (Sat) by bronson (subscriber, #4806) [Link]

> _You_ and those like you are the only problem with Python 3.

You're saying that there are lots of problems with Python 3?

Python 2.8?

Posted Jan 19, 2017 10:28 UTC (Thu) by chojrak11 (guest, #52056) [Link] (1 responses)

> The fact that companies are migrating from Py27 to Golang should give Py3 developers some clue.

You're grossly wrong. There are many rationales for the migration, none of them being wrong Unicode design in Python. Some of them are:

1) Poor Python performance, caused by runtime type reconciliation because of
2) Dynamic typing, which then requires
3) Lots of unit tests required to keep those dynamic types in shape.

Go is fast, because it's a statically typed language. There's no risk that a method will be called with an unplanned data type, because the compiler does the hard work with its type system. Thanks to that, less unit testing is required and that, in turn, allows for faster feature delivery.

Python 2.8?

Posted Jan 20, 2017 16:49 UTC (Fri) by NAR (subscriber, #1313) [Link]

You need unit tests anyway. When I first started to code in Erlang (after C++) I was afraid that there would be too many type errors not found by the compiler. I was wrong. The C++ compilation (that is supposed to find the type errors) takes more time than the Erlang compilation + unit test run, so I actually find the type errors sooner (and also have some ideas about correctness). There's a type checker tool for Erlang, I think it found something like 20 bugs over 4 years, all of them either in dead code or hard-to-trigger error handling (one of the bugs would only occur if the code runs before 2000 - we found this bug in 2012). I presume there's not much compilation difference between Python and Go, still I do think omitting unit tests for static type checking is a really bad idea.

Python 2.8?

Posted Jan 19, 2017 14:45 UTC (Thu) by anselm (subscriber, #2796) [Link] (4 responses)

That's not what happens on my machine (which is running Debian Stretch):

$ python3.5
Python 3.5.2+ (default, Dec 13 2016, 14:16:35) 
[GCC 6.2.1 20161124] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]
>>> os.listdir(b'.')
[b'test-\xed\xa0\x80.txt']

Looks reasonable to me. In any case Python 3.5 doesn't seem to have a problem dealing with file names that are byte strings. (Trying to retrieve the file name as a (Unicode) string doesn't work but then again the file name isn't valid UTF-8-encoded Unicode to begin with, which I guess was part of the original point.)

Generally I've been quite happy with Python 3 so far. I have yet to run into something that would tempt me to go back to Python 2.7, or for that matter a hypothetical Python 2.8.

Python 2.8?

Posted Jan 19, 2017 16:49 UTC (Thu) by excors (subscriber, #95769) [Link] (3 responses)

The problem with that example occurs on Windows (https://lwn.net/Articles/711492/). Paths on Windows are sequences of arbitrary 16-bit symbols (which are conventionally expected to be UTF-16, but aren't guaranteed to be), so you need to be very careful about lossless conversion to/from strings of 8-bit symbols, and Python 3.6 doesn't appear to be careful enough (it looks like io.open() passes the 8-bit string directly into some CRT function, which decodes it with the current ANSI code page, so it ends up with the wrong 16-bit symbols).

Paths on Linux are sequences of arbitrary 8-bit symbols, so byte strings work trivially there (as long as you don't assume they're e.g. UTF-8). And Python's Unicode strings work okay for paths on Windows since no conversion is needed. But it's a pain when you're trying to write code to run on both platforms - you either pick one type of string and suffer with error-prone conversions on the other platform, or you create a new opaque native-path type with explicitly lossy conversions to the standard string type (like Rust's OsString).

Python 2.8?

Posted Jan 19, 2017 23:22 UTC (Thu) by anselm (subscriber, #2796) [Link] (2 responses)

Am I glad I don't have anything to do with Windows. Having to use Windows must really suck.

Python 2.8?

Posted Jan 22, 2017 2:03 UTC (Sun) by flussence (guest, #85566) [Link] (1 responses)

OS X is worse, from what I've heard: the filesystem is case-insensitive, case-preserving, and it mangles everything to one Unicode normalisation form. So you can write one filename, get something entirely different back, and the differences are invisible unless you hexdump them.

Python 2.8?

Posted Jan 22, 2017 10:25 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

That's HFS+. Their new one (AppleFS?) isn't silly like that (well, Unicode normalization might still be a thing, but I hope not). Apparently they'd have ZFS already, but licensing issues made the deal fall through.

Python 2.8?

Posted Jan 27, 2017 1:26 UTC (Fri) by ras (subscriber, #33059) [Link] (2 responses)

I have some sympathy for your position. Python 3.0's implementation of str==unicode was very rough. To their credit they have mostly knocked off the pain points as time has gone on. Granted it took what seemed like ages. But they have done it, so I was surprised when you said this doesn't work in Python 3:

> >>> open('test-\ud800.txt', 'w').close()
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 5: surrogates not allowed

Well you were right it doesn't work in 3.5, but it does so in an entirely reasonable way. This was not true in earlier versions.

> The real world out there is not unicode, it is composed of strings of bytes.

Actually, the real world has both strings and bytes. Strings are things you can display to a human. You can not reliably display bytes. A type system that distinguishes between the two reduces the odds of displaying unintelligible crap to users.

The distinction was unnecessary when ASCII was the sole encoding. Even after that it wasn't a pressing issue when the only people likely to see your strings were in the same office. But then the internet came along, and the solution we had adopted was to represent printable strings as the tuple (encoding, bytes) - and there were 100's of encodings. Agreeing on an encoding when you are all in the same office is one thing, but negotiating a common encoding between two remote systems so they can exchange strings was unworkable.

Unicode was meant to fix it, but initially it didn't. Amazingly, it appears they failed to understand the problem because they gave us multiple encodings for Unicode. Fortunately a C programmer came along, gave us UTF-8 which the world has settled on. Somewhat less fortunately early systems (Java, Windows) have adopted the failed earlier attempts at encoding Unicode. Python 3 has avoided that mess - its internal representation looks pretty good to me.

Python 3's biggest mistake, which as you point out they still haven't recovered from, is they didn't abandon making "open('name', 'rt')" the default. The core problem is what 't' means isn't well defined on Unix (whereas "open(b'name', 'rb')" is always well defined). In fact it was never well defined - line endings have always varied between systems. But extending 't' to define the encoding broke it completely because it no longer just affected line endings - it was the whole document, and the file names that pointed to it. Worse, assuming 't' on file handles that transport bytes (like many pipes and sockets) meant they failed to work at all whereas before they did.

But the solution is easy enough. In python3, get into the habit of casting all file names to bytes, opening all files in binary, and doing the conversions yourself - which is how it should be done. You will find this hugely annoying, as there will be many times you won't know what encoding to use. But in that case you will be able to put the blame where it belongs - which is on whoever wrote the data, not python3.

For files that should be printable ASCII (eg, most config files in /etc) I find "open('...', encoding='ascii', errors='ignore')" a reasonable way to pretend the world hasn't changed in 30 years.
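A small sketch of what that `errors='ignore'` call does, next to the binary-mode alternative described above. The file name and contents are made up for the example and the file is written to a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.conf")  # hypothetical config file

    # Write raw bytes that include a UTF-8 encoded "é" (0xC3 0xA9).
    with open(path, "wb") as f:
        f.write(b"name = caf\xc3\xa9\n")

    # Text mode, ASCII only: bytes outside ASCII are silently dropped.
    with open(path, encoding="ascii", errors="ignore") as f:
        lossy = f.read()

    # Binary mode: the exact bytes come back; decoding is left to the caller.
    with open(path, "rb") as f:
        exact = f.read()
```

The trade-off is that `errors='ignore'` never raises but loses data, while binary mode keeps every byte and puts the choice of encoding in the caller's hands.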

Python 2.8?

Posted Jan 27, 2017 23:40 UTC (Fri) by lsl (subscriber, #86508) [Link]

> Strings are things you can display to a human.

That's certainly how Python 3 approaches it but is not something universally agreed upon in the field. I like to think of strings as just that, a sequence of …things (bits, bytes, "characters", whatever kind of symbol you can come up with).

Ok, so the Py3 string type doesn't represent a generic string but is a type tailored to the use-case of displaying text to users. That's fine. What's not fine is how it permeates all kind of APIs in the Python standard library that don't have anything to do with displaying text to users and might even be incompatible with it, requiring lots of weird workarounds for things that just work with Python 2.

That's kind of what was meant by Python 3 throwing system programmers under the bus for the benefit of high-level "app" developers. That's certainly a decision one can make but there shouldn't be any surprise that not everyone likes it.

Python 2.8?

Posted Jan 28, 2017 0:09 UTC (Sat) by lsl (subscriber, #86508) [Link]

> Python 3's biggest mistake, which as you point out they still haven't recovered from, is they didn't abandon making "open('name', 'rt')" the default.

In my opinion, the actual mistake is the inclusion of such "text" mode for file I/O in the first place. It doesn't have a place in a modern language designed for modern computers.

Even decades ago it reeked of design-by-committee, probably driven by mainframe vendors who (from today's perspective) did ultra-weird things with text vs. non-text (and "file" I/O in general). Back then it was probably a reasonable argument that a general-purpose language should support stuff like this. But today? Virtually all operating systems settled on the same "bag of bytes" file abstraction. Even those who didn't either run Python in a POSIX-like environment or don't run it at all (and probably never will).

There's no reason left to justify the complication of file I/O with this "text vs. binary" stuff. Just as a modern file transfer protocol wouldn't include a "tenex" mode-switching command.

Python 2.8?

Posted Jan 14, 2017 5:45 UTC (Sat) by lsl (subscriber, #86508) [Link] (1 responses)

> If you ever try to do string manipulation on a bytes, you're Doing It Wrong. Bytes _may_ be an encoded string, but to use it you of course have to decode it.

You can redefine terms all you want but there are operations I can reasonably do on bytes that the literature calls "string manipulation". I'm talking about things like splitting on 0x0A ('\n') or 0x2F ('/') bytes. Those are reasonable things to do if whatever you're working on defines them as a reasonable thing to do. I don't have to somehow "decode" the byte string before I can manipulate it. In fact, I cannot possibly decode it as I have no idea (nor a desire to know) what any of these bytes are supposed to mean. The only thing I need to know is that I can legitimately split them upon encountering a '/' byte.

Super simple stuff, until you bring Python 3 into the mix with its desire to enforce specific encodings where none were agreed upon.

split() works on bytestrings now

Posted Jan 14, 2017 10:54 UTC (Sat) by rschroev (subscriber, #4164) [Link]

Nowadays split() does work on bytestrings. In Python 3.4:

>>> b'abc/def/ghi'.split(b'/')
[b'abc', b'def', b'ghi']
>>> b'abc def ghi'.split()
[b'abc', b'def', b'ghi']

I think this didn't work in 3.0; I don't know when that changed. IIRC at the same time other string manipulations were implemented for bytestrings.

Python 2.8?

Posted Jan 14, 2017 0:15 UTC (Sat) by vstinner (subscriber, #42675) [Link]

> The need for this protocol is a bad thing. Python 2.7 just uses byte strings and it works perfectly fine.

Python 3 supports Unicode and bytes, both work perfectly fine. Using Unicode on Linux, you *can* handle any filename, including filenames not decodable from the locale encoding. Just try (ex: non-ASCII filename but POSIX locale, LC_ALL=C), you will see (os.listdir() doesn't fail, but print(filename) can fail). On Windows, using bytes is more weird, but it should "just work" since Python 3.6.
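The mechanism behind this on POSIX systems is the "surrogateescape" error handler from PEP 383, which `os.fsdecode()` and `os.fsencode()` apply with the filesystem encoding. The sketch below calls the handler directly with UTF-8 so that it behaves the same on any platform; the file name is the one from the earlier example in this thread.

```python
raw_name = b"test-\xed\xa0\x80.txt"  # not valid UTF-8

# Each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")

# A strict encode, which is what printing to a UTF-8 terminal does, fails.
try:
    name.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
```

That is why listing and opening such a file works while `print(filename)` can still raise.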

But you missed the point. The "fspath" protocol is unrelated to "bytes vs Unicode", it's only an enhancement to support custom objects like pathlib.Path, accept them in functions expecting a filename like open(). See the What's New in Python 3.6 for an example:
https://docs.python.org/dev/whatsnew/3.6.html#pep-519-add...

You are free to not use pathlib. It seems like some users prefer pathlib over handling "manually" paths using os.path.join() for example.
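As a rough sketch of the protocol (Python 3.6 or later): any object that defines `__fspath__()` is accepted wherever a path is expected. The `ConfigFile` class and its arguments are invented for this illustration; `os.fspath()` and `os.path.basename()` are the real standard-library functions.

```python
import os


class ConfigFile:
    """Hypothetical path-like object; not part of any library."""

    def __init__(self, directory: str, name: str) -> None:
        self.directory = directory
        self.name = name

    def __fspath__(self) -> str:
        # Return the str (or bytes) path this object represents.
        return os.path.join(self.directory, self.name)


cfg = ConfigFile("etc", "app.conf")

# os.fspath() returns str and bytes unchanged and calls __fspath__() on
# anything else, which is what open() and os.path do internally.
as_str = os.fspath(cfg)
unchanged_bytes = os.fspath(b"raw-name")
base = os.path.basename(cfg)
```

Note that bytes paths pass through untouched, so the protocol itself does not force a choice between bytes and Unicode.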

Python 2.8?

Posted Jan 19, 2017 9:34 UTC (Thu) by breamoreboy (guest, #113635) [Link] (12 responses)

I suggest that you try discussing Python 3 unicode with this gentleman https://2016.pycon.ca/en/schedule/116-stephen-j-turnbull/ and see how far you get.

Python 2.8?

Posted Jan 19, 2017 22:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

Uhm... Appeals to authority?

I do use Unicode extensively as I speak 4 languages, every one of them written with a different script. I can tell you stories about bad codepages for _hours_. Like how way back then I used to have an "encoding decryptor" that used character and digraph frequency analysis to determine the sequence of bad re-codings. Or about the magically disappearing letter "Н" in messages. Or about pseudographic characters appearing on printers instead of text.

Python 2.8?

Posted Jan 20, 2017 6:55 UTC (Fri) by ssokolow (guest, #94568) [Link] (10 responses)

You wouldn't still happen to have the source for that sitting around somewhere, would you?

I've been meaning to write something similar for some old stuff where things like ½ in recipes got mangled.

Python 2.8?

Posted Jan 20, 2017 7:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

It was written with Russian encodings in mind as there were so many of them in active use: DOS, Win1251, KOI-8, GOST, etc. with misencodings being common.
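
To illustrate the kind of problem such a tool solves, here is a rough sketch, not the original program. It only enumerates the single-step repairs that are possible among three of the encodings named above, using Python's codec names for them. The frequency analysis that would rank the candidates is left out, and the helper name `candidate_repairs` is made up for this example:

```python
from itertools import product

# Python codec names for DOS (cp866), Windows (cp1251) and KOI8-R.
CODECS = ("cp866", "cp1251", "koi8_r")


def candidate_repairs(garbled):
    """Map (decoded_as, really_was) to the text each single-step repair yields."""
    repairs = {}
    for decoded_as, really_was in product(CODECS, repeat=2):
        if decoded_as == really_was:
            continue
        try:
            # Undo the wrong decode, then decode with the candidate codec.
            repairs[(decoded_as, really_was)] = (
                garbled.encode(decoded_as).decode(really_was)
            )
        except UnicodeError:
            pass  # this pair cannot explain the garbled text
    return repairs


original = "Привет"
# Simulate the damage: cp1251 bytes read as if they were KOI8-R.
garbled = original.encode("cp1251").decode("koi8_r")
repairs = candidate_repairs(garbled)
```

A scoring step based on character and digraph frequencies would then pick the most plausible entry from `repairs`; chains of several bad re-codings would need the search repeated on each candidate.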

I might be able to dig it up, but it should probably be rewritten for today's world and for other languages anyway.

Python 2.8?

Posted Jan 20, 2017 9:40 UTC (Fri) by ssokolow (guest, #94568) [Link] (8 responses)

I'd probably rewrite it in Rust or Python anyway... it'd just be nice to have a known-to-be-effective example to learn from.

Python 2.8?

Posted Jan 23, 2017 5:52 UTC (Mon) by adobriyan (subscriber, #30858) [Link] (7 responses)

> rewrite it in Rust

Ironically, Rust assumes everything is valid UTF-8 including Unix filenames.

Python 2.8?

Posted Jan 23, 2017 7:03 UTC (Mon) by liw (subscriber, #6379) [Link] (4 responses)

Rust doesn't have a concept of binary data? That would be so weird I have trouble believing it. It would mean Rust can't handle, say, a JPEG.

Python 2.8?

Posted Jan 23, 2017 11:44 UTC (Mon) by ssokolow (guest, #94568) [Link] (3 responses)

Rust has three types of "strings". From least to most "binary", they are:
  1. std::string::String
    • Guaranteed to be valid UTF-8 but uses a Vec<u8> internally.
    • May contain NULL bytes.
    • Codepoint-oriented APIs
    • Can be iterated byte-wise or codepoint-wise. Grapheme clusters or words available via the unicode-segmentation crate.
  2. std::ffi::OsString
    • Can represent any byte/codepoint sequence returned by OS-native APIs.
    • Can represent NULL bytes even if the OS-native form can't
    • Designed to be producible from String without conversion cost.
      • On POSIX platforms, In-memory representation is identical to String.
      • On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.
  3. Vec<u8>
    • Exactly what it looks like. A vector of unsigned 8-bit integers
    • May contain NULL bytes
    • Serves the same role as Python's "bytestrings".
    • Usable as a "string" type because Vec has "sequence of items" versions of most APIs you'd think of as being for string manipulation.
See also: std::path::{Path,PathBuf} (OsString internally)

Python 2.8?

Posted Jan 23, 2017 12:20 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

Um, I know it was mentioned below, but I want the record straight on this subthread too.

> On POSIX platforms, In-memory representation is identical to String.

No, it can represent arbitrary bytes.

> On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.

It actually uses WTF-16 internally, a superset of UTF-16.

Right from the docs at the OsString link you gave:

On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.

On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.

In Rust, strings are always valid UTF-8, but may contain zeros.

Python 2.8?

Posted Jan 23, 2017 12:51 UTC (Mon) by ssokolow (guest, #94568) [Link] (1 responses)

> No, [OsString] can represent arbitrary bytes.

I never said it couldn't. On POSIX platforms, String (1) and OsString (1, 2) both rely on an internal Vec<u8> to store the raw bytes... String just enforces extra invariants and presents a different set of methods.

That's what I meant when I said they had the same in-memory representation. You can even use the to_str method to get an &str pointing at the contents of the OsString as long as it's valid UTF-8.

> It actually uses WTF-16 internally, a superset of UTF-16.

I looked at the up-to-date source to the standard library before I made my post.

OsString uses an inner type called sys::os_str::Buf to actually store the string and, on Windows, that's an API wrapper around sys_common::wtf8::Wtf8Buf which, again, uses a Vec<u8> internally.

(The key detail is that, when Windows UTF-16 is invalid UTF-8, it's not because it encodes codepoints that UTF-8 can't represent, it's that it violates higher-level rules about surrogates never occurring in isolation or out of order. After the transformation process is complete, UTF-16 and UTF-8 are simply two different ways to map 21-bit numbers into sequences of bytes, so it's trivial for WTF-8 to represent an arbitrary sequence of numbers 21-bits or narrower just by omitting some of the rules checks that a UTF-8 codec enforces.)
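
The point about omitting rule checks can be sketched in a few lines of Python (used here only because it is the language of the article; this is not how the Rust standard library is written). The function name is hypothetical, and the sketch handles one code point at a time; real WTF-8 additionally requires well-formed surrogate pairs to be combined into a single supplementary code point first:

```python
def encode_generalized_utf8(cp):
    """Pack one code point into UTF-8-style bytes without rejecting surrogates."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # A strict UTF-8 encoder would refuse 0xD800..0xDFFF here.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# A lone high surrogate, as Windows filenames may contain.
lone_surrogate = encode_generalized_utf8(0xD800)
```

For every non-surrogate code point the result is identical to ordinary UTF-8; the only difference is that the surrogate range is not rejected.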

> Right from the docs at the OsString link you gave

I actually started out with that exact quote but edited it out for brevity.

Python 2.8?

Posted Jan 23, 2017 20:34 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Ah, yeah about the "backing store"; I shouldn't comment too early in the morning.

The WTF-8 versus WTF-16 thing is news to me. Maybe it was communicated poorly when I first heard about it or it has changed since then.

Thanks for the clarifications.

Python 2.8?

Posted Jan 23, 2017 11:12 UTC (Mon) by Jonno (subscriber, #49613) [Link] (1 responses)

> Ironically, Rust assumes everything is valid UTF-8 including Unix filenames.

Rust does not. While Rust Strings are always UTF-8, Rust OsStrings are not. They are "arbitrary sequences of non-zero bytes" (on Unix) or "arbitrary sequences of non-zero 16-bit values" (on Windows).

Directory listings use OsStrings, not Strings, for filename components, and File::open() will accept anything from which Rust knows how to build a path, including both Strings *and* OsStrings.

There are convenience methods to convert an OsString to a String (which will fail if the OsString does not contain valid Unicode), as well as to convert a String to an OsString (which will fail if the String contains any "U+0000 NULL" characters), but there is no requirement that you use them.

In fact, in most circumstances you should not. Keep the OsString for path manipulations, and if you need a pretty UTF-8 string to show the user, use the heavier OsString::to_string_lossy() method to get a string with any invalid Unicode sequences replaced with "U+FFFD REPLACEMENT CHARACTER".
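
For comparison with the article's main subject, a rough Python analogue of that lossy conversion is the `replace` error handler. This is only an analogy, not a description of the Rust method, and the filename is made up:

```python
# Hypothetical filename containing a byte that is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Keep raw_name for file operations; build a display-only string in which
# the invalid byte is replaced by U+FFFD REPLACEMENT CHARACTER.
display_name = raw_name.decode("utf-8", errors="replace")
```

As in Rust, the lossy form is for showing to the user only; it cannot be turned back into the original name.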

Python 2.8?

Posted Jan 23, 2017 12:22 UTC (Mon) by ssokolow (guest, #94568) [Link]

Actually, OsString is a superset of String and whatever the OS offers. It'll carry NULL characters just fine.

Here's a Rust Playground link demonstrating that.

Python 2.8?

Posted Jan 12, 2017 1:47 UTC (Thu) by biergaizi (subscriber, #92498) [Link] (4 responses)

Being unsatisfied that programs written for Python 2 are not compatible with Python 3, someone makes a Python "2.8", and now your "python" program works on neither Python 2 nor Python 3? Good for you.

Python 2.8?

Posted Jan 12, 2017 4:05 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

I just tested a large system with it and it worked perfectly. Doubtlessly there are going to be small inconsistencies but who cares, it's not a total rewrite.

Python 2.8?

Posted Jan 12, 2017 14:15 UTC (Thu) by cesarb (subscriber, #6266) [Link] (2 responses)

I believe what biergaizi means, is that when someone writes a program for this "python", said program will work neither on Python 2 nor on Python 3, due to use of new features (which are not on Python 2) while depending on the old string semantics (which are not on Python 3).

Python 2.8?

Posted Jan 13, 2017 19:17 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

And what's the problem? You could say the same about program written for Python 2.4 in a world where Python 2.2 is the norm. People would just upgrade.

I'm kinda surprised that we needed to wait that long for that effort to start. Hopefully it'll go the way of Perl6: new language for the ones who want to play with bizarre pseudo-Unicode strings (see above) and old language + new features for people who want to see work being done.

Python 2.8?

Posted Jan 14, 2017 15:46 UTC (Sat) by bronson (subscriber, #4806) [Link]

Python 3 for the lab, Python 2 for the dirty real world?

That's interesting... This does appear to be how it's going at a previous client. The ML guys use 3.latest, the ops guys are still mostly 2.7.

Python 2.8?

Posted Jan 12, 2017 10:21 UTC (Thu) by epa (subscriber, #39769) [Link] (14 responses)

So how about a Python 3.-1 which is essentially Python 3 with the most troublesome features (from a backwards compatibility point of view) 'un-back-ported'? Maybe allowing implicit conversions between str and bytes using a set of rules that roughly reproduces the Python 2.x semantics?

Python 2.8?

Posted Jan 12, 2017 16:11 UTC (Thu) by JFlorian (guest, #49650) [Link] (10 responses)

Or how about we put that effort into completing the port to Python 3 and kick Python 2 to the curb where it belongs? I don't mean that in a snide way, but if an organization is really tied to something ancient like RHEL5, it must be because they really want no change... at all.

Python 2.8?

Posted Jan 12, 2017 16:42 UTC (Thu) by xnox (guest, #63320) [Link]

My understanding is that people who worked on Py3k migration for OpenStack are no longer in full-time employment @ HPE... Hence OpenStack will be stuck at 2.7. Who knows if it is still a viable project or not, given that most/largest public clouds are proprietary (AWS, Azure, GCE, RackSpace (has additions, semi-openstack but not quite, at least that's my perception of RackSpace))

Python 2.8?

Posted Jan 12, 2017 23:19 UTC (Thu) by lsl (subscriber, #86508) [Link] (8 responses)

Some people actually prefer Python 2 as a language. I tend to agree with that. The Unicode thing in Python 3 is just too much of a mess.

Even Nick Coghlan's notes on Python 3[1] hint at the fact that people writing systems or networking code were thrown under the bus for the alleged benefit of "high-level application code" and supposedly better Windows integration.

Except that the former is way too fuzzy a concept to be useful (what program doesn't use the file system at all?) and the latter is just plain wrong. Look at Go for how to do it right. It has excellent Unicode support on Windows, yet manages to present a sane interface to programmers, converting to UTF-16 only when calling into the system.

If the obvious way to open a file specified in argv is broken, your language is doing it wrong. If I have to re-open stdin/stdout in some kind of weird "binary" mode just so that my program works, your language is, again, wrong.
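
For context, the "binary mode" workaround in Python 3 usually means going through the byte-oriented layer under the text streams, which is exposed as `sys.stdin.buffer` and `sys.stdout.buffer`. The sketch below is an illustration under that assumption; the `copy_bytes` helper is hypothetical, and in-memory streams stand in for the real ones so the example is self-contained:

```python
import io


def copy_bytes(src, dst, chunk_size=64 * 1024):
    """Copy a binary stream to another without any decoding step."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In a real filter these would be sys.stdin.buffer and sys.stdout.buffer.
source = io.BytesIO(b"\xff\xfe raw \x00 data")
sink = io.BytesIO()
copied = copy_bytes(source, sink)
```

Whether having to reach for `.buffer` counts as "wrong" is the opinion being argued here; the sketch only shows what the workaround looks like.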

[1] http://python-notes.curiousefficiency.org/en/latest/pytho...

Python 2.8?

Posted Jan 13, 2017 0:55 UTC (Fri) by foom (subscriber, #14868) [Link] (7 responses)

Ironically, Python 3.6 has actually, finally, fixed the way it deals with Windows path APIs when given non-unicode strings -- it now uses the *W APIs, and recodes into utf-8.

https://www.python.org/dev/peps/pep-0529/

So, now, finally, Python supports a sane API to access files: use byte strings on all platforms. Too bad it came too late...That'd be a real good candidate for importing into Python 2.8, though!

Python 2.8?

Posted Jan 13, 2017 3:19 UTC (Fri) by excors (subscriber, #95769) [Link] (6 responses)

It seems to convert to something that's nearly but not quite UTF-8:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32

>>> import os

>>> open('test-\ud800.txt', 'w').close()

>>> os.listdir('.')
['test-\ud800.txt']

>>> os.listdir(b'.')
[b'test-\xed\xa0\x80.txt']

>>> [f.decode('utf-8') for f in os.listdir(b'.')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5: invalid continuation byte

So you need to treat them as opaque byte strings, not as encoded Unicode even on Windows.

Hmm, but how are you meant to use byte strings with open()? I would have thought this should work, but it doesn't:

>>> [open(f, 'r').close() for f in os.listdir(b'.')]
FileNotFoundError: [Errno 2] No such file or directory: b'test-\xed\xa0\x80.txt'

Python 2.8?

Posted Jan 13, 2017 19:12 UTC (Fri) by foom (subscriber, #14868) [Link] (2 responses)

Regarding [f.decode('utf-8') for f in os.listdir(b'.')]: Apparently python no longer allows surrogates in its utf-8 codec by default; you need to use: .decode('utf-8', errors='surrogatepass'), instead.
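
A short sketch of that difference, using the byte string from the listing above rather than a real directory:

```python
# The name as returned by os.listdir(b'.') in the session above.
listed = b"test-\xed\xa0\x80.txt"

# The default (strict) UTF-8 codec rejects the encoded surrogate.
try:
    listed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With surrogatepass the lone surrogate is let through.
name = listed.decode("utf-8", errors="surrogatepass")
```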

Regarding [open(f, 'r').close() for f in os.listdir(b'.')]: That sounds like a bug, at least to me.

Python 2.8?

Posted Jan 13, 2017 19:53 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

The open() issue is not restricted to weird surrogate cases - it fails with pretty much any non-ASCII filename, like 'test-\u00c0.txt' ("No such file or directory: b'test-\xc3\x80.txt'"). Looks like it actually tries to open 'test-\u00c3\u20ac.txt', i.e. open() is always decoding the filename as Win-1252, which doesn't seem especially helpful. (This is with Python 3.6.0 on English-language Windows 7.)
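
The misreading described above can be reproduced without touching the filesystem. This sketch only replays the encode and decode steps; it does not call the Windows APIs involved:

```python
intended = "test-\u00c0.txt"

# What os.listdir(b'.') hands back: the name encoded as UTF-8.
as_bytes = intended.encode("utf-8")

# What results if those bytes are then interpreted as Windows-1252.
misread = as_bytes.decode("cp1252")
```

The two UTF-8 bytes of U+00C0 turn into the two characters U+00C3 and U+20AC, which is the name that could not be found.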

os.open() seems to do the right thing with byte string filenames, but I guess it would be nice if open() did too. So I think this claim:

> Python 3.6 has actually, finally, fixed the way it deals with Windows path APIs

is unfortunately a bit premature.

Python 2.8?

Posted Jan 13, 2017 23:04 UTC (Fri) by foom (subscriber, #14868) [Link]

D'oh.. Guess that's what I get for not testing before praising it... 😐

Python 2.8?

Posted Jan 14, 2017 0:08 UTC (Sat) by vstinner (subscriber, #42675) [Link] (2 responses)

> It seems to convert to something that's nearly but not quite UTF-8:

Python 3.6 now uses UTF-8/surrogatepass on Windows in os.fsdecode() / os.fsencode(). Hopefully, these functions are almost never used on Windows, since Windows has a native support for Unicode. For example, command line arguments, list filenames in a directory, get the hostname, ... : Windows return data directly as Unicode.

The surrogatepass error handler is required to support the same character set as Windows. Windows does allow surrogate characters in filenames. It's really weird and does not conform to Unicode standards, which forbid these characters in UTF-* encodings.

Python 3 respects Unicode standards: surrogate characters are not allowed in the UTF-8 codec, for example. That made it possible to implement nice new error handlers for UTF-8: surrogateescape, surrogatepass, etc. By the way, Python 3.6 has a new interesting "namereplace" error handler.
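
As a brief illustration of the "namereplace" handler, with an arbitrary sample string and the older "backslashreplace" handler shown next to it for comparison:

```python
text = "caf\u00e9"

# namereplace substitutes the Unicode character name for anything
# the target encoding cannot represent.
named = text.encode("ascii", errors="namereplace")

# backslashreplace uses a numeric escape instead.
escaped = text.encode("ascii", errors="backslashreplace")
```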

Python 2.8?

Posted Jan 15, 2017 21:11 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

Rust handles this by having an "OsStr" type at system call boundaries related to paths (and, IIRC, environment variables and process arguments). Strings can be cast to them easily (they implement AsRef<OsStr>), but getting them back out requires an explicit from_utf8 call (which can fail) or a _lossy version (which uses replacement characters for unconvertible sequences). On Windows, there is a purely internal "WTF-16" encoding which is UTF-16 with allowances for Windows specific exceptions. This allows the encodings to not get mixed up and allows the real POSIX policy of "filenames are sequences of nonzero, non-/ characters" gracefully without having LANG screw up your code because assumptions are made based on it.

But Python isn't a fan of these kinds of type safeties (implicit casting with exceptions would probably be fine though and would have better error messages than "file from readdir does not exist" errors).

Python 2.8?

Posted Jan 20, 2017 4:17 UTC (Fri) by foom (subscriber, #14868) [Link]

> Windows return data directly as Unicode

I know you clarified later in your comment, but I'd just like to emphasize (esp since a lot of people seem to say that without clarification): Windows APIs absolutely *do not* return "Unicode". Instead, they deal with arrays of 16-bit values, which, when you're lucky, can be decoded via UTF-16 into a unicode string.

And, just like decoding Linux "UTF-8" paths to a unicode string might fail due to the path not actually being UTF8, decoding a Windows "UTF-16" path might fail due to the path not actually being valid UTF16.

In both OSes, in order to avoid errors, you'll want to tell the unicode decoder/encoder to allow the invalid input bytestrings and transform to something nonsensical but reversible. Or, alternatively, avoid decoding the paths into unicode at all, and just leave them in their native 8/16-bit bytestring representations.

Python 2.8?

Posted Jan 14, 2017 0:20 UTC (Sat) by vstinner (subscriber, #42675) [Link] (1 responses)

I am working on the new PEP 540, which would basically avoid almost all common encoding errors by default when using the POSIX locale or when my new UTF-8 mode is enabled explicitly. In this mode, mojibake is preferred over hard failure.
https://www.python.org/dev/peps/pep-0540/

It should be at least as good as Python 2's "I don't care about encodings". IMHO it's even better since, even if you can get mojibake, in most cases you will get perfectly valid Unicode and so will benefit from advanced Unicode features to handle any language and not just English in ASCII.

My hope is that the UTF-8 mode would give you the best of the two worlds (Python 2 bytes, Python 3 Unicode).

Python 2.8?

Posted Jan 14, 2017 9:15 UTC (Sat) by rghetta (subscriber, #39444) [Link]

It seems a very welcome improvement, thank you. But please make it the default in any condition, not only with a POSIX locale. I could have a UTF-8 locale but still need to handle legacy files.

Python 2.8?

Posted Jan 22, 2017 23:22 UTC (Sun) by nas (subscriber, #17) [Link]

A Python 3 minus exists, it is here:

https://github.com/nascheme/ppython

It coerces between str and bytes but will generate warnings. I found it to be very useful when porting code from Python 2 to 3. You run 2to3 on your code-base and that gets you most of the way done. Run your application and fix all the bytes/str warnings you get.
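
For readers who have not hit this, the sketch below shows the kind of spot those warnings point at, as it behaves on stock Python 3 (which raises instead of coercing). The variable names are illustrative only:

```python
payload = b"status: "
label = "ok"

# Stock Python 3 refuses to mix bytes and str.
try:
    payload + label
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The ported code makes the conversion explicit.
message = payload.decode("ascii") + label
```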

Python 2.8?

Posted Jan 13, 2017 1:37 UTC (Fri) by timrichardson (subscriber, #72836) [Link]

They should call it Bavaria. There are fossilised pythons there https://www.thelocal.de/20111018/38277

Python 2.8?

Posted Jan 13, 2017 8:55 UTC (Fri) by SiB (subscriber, #4048) [Link]

We use Python 2.7 to control our scientific data acquisition hardware, including space flight hardware. (Mostly via python-usb and python-serial.) This is more scripting than application programming. Python 3 is a pretty useless language in this setting. I welcome any effort to keep Python 2 alive.

Python 2.8?

Posted Jan 19, 2017 9:26 UTC (Thu) by ballombe (subscriber, #9523) [Link] (3 responses)

A software project that uses trademarks to coerce people into migrating to the next major version is not free software.
The right to run and support abandoned releases without being subject to harassment is a fundamental right.

It is understandable that the PSF is not interested in supporting python 2 further. However interfering with effort to support it by others is not.

Python 2.8?

Posted Jan 19, 2017 13:05 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

> The right of run and support abandoned releases without being subject to harassment is a fundamental right.

There is no fundamental right to the name. It is a trademark.

Python 2.8?

Posted Jan 19, 2017 14:39 UTC (Thu) by Wol (subscriber, #4433) [Link]

> It is understandable that the PSF is not interested in supporting python 2 further. However interfering with effort to support it by others is not.

This attitude unfortunately is extremely dangerous. As pointed out, "Python" is a mark of authenticity. If your attitude holds, how do we know that "Python 2.9" isn't some malicious rootkit?

This, unfortunately, is why many FLOSS projects have had to clamp down - hard - on trademarks. There's far too much malware out there, and if it can ride on the coat-tails of a successful project, it will.

Trademarks need to be controlled. Strictly. For the benefit of everyone, and especially for the protection of lusers. Sadly, we need lusers, or there's no point having a successful project (that is, if you can call a project without them "successful").

Cheers,
Wol

Python 2.8?

Posted Jan 19, 2017 19:42 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

A requirement that projects be renamed if modified is explicitly permitted by the DFSG.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.