Python 2.8?
Posted Jan 12, 2017 1:32 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
Parent article: Python 2.8?
Wow! A new sane version of Python. Finally!
I hoped that the PathLike fiasco taught Python 3 developers a lesson, but no....
> serve as a pretext for managers to drag their feet further on migration plans
Seriously. That's the main reason why Google is rewriting their code in Go now.
Posted Jan 12, 2017 14:41 UTC (Thu)
by rbrito (guest, #66188)
[Link] (35 responses)
Posted Jan 12, 2017 19:03 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (34 responses)
Python 3 developers were bitten by a rabid weasel and decided that they need "Unicode". The rationale was: "We need Unicode for foreign languages because Python. And why are you asking, are you some kind of foreign-language hating American imperialist?"
So magic Unicode was added. It existed before with u"" strings in Py2.7, but in Py3k it was cranked all the way up to 11. So as a result, first Python 3 versions couldn't even print formatted byte strings (that was fixed in Python 3.2, I believe).
And of course, "Unicode" in Python means some magic butterfly unicorny thingie with an unspecified internal implementation, so you don't even get O(1) character addressing within a string. At least Perl 6 uses the normalized grapheme form which provides it.
The problem here is that the world around is not really Unicode. File paths can contain arbitrary byte strings but it would be awkward to admit that, so earlier versions of Python simply skipped files with "incorrect" names in Unicode filesystem operations (not joking here). And there's no easy way to fix that in Python 3 - even a simple print() statement would likely mix the pure abstract world of Unicode unicorns with dirty reality around it.
So there's a whole new protocol to make filesystem paths behave both as strings and as unicode entities. And now I'm eagerly waiting for HttpHeaderLike, StringReadFromFileLike, UserInputOnCommandLineLike and so on.
Oh, and Python 3 also switched defaults for strings, so all un-annotated strings are Unicode by default. And don't you dare to try to concatenate them with data that has impure thoughts:
In Python 2.7 that just works:
I can continue, but that should be enough for one post.
Posted Jan 13, 2017 5:38 UTC (Fri)
by ddevault (subscriber, #99589)
[Link] (19 responses)
It's fair to argue that Python 3's choice to break backwards compatibility was a mistake, but it's not fair imo to criticise the current design. It's completely sane, much more so than Python 2's design.
Posted Jan 13, 2017 6:48 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (16 responses)
Python 3 made working with them extremely cumbersome.
> It's fair to argue that Python 3's choice to break backwards compatibility was a mistake, but it's not fair imo to criticise the current design.
1) Perl 6 dove deep and encodes strings in the NFG, thus making string operations intuitively correct. It won't split words between combining characters, for example. Python 3, of course, is completely clueless about that.
2) Go treats strings as arrays of octets. Always. It also has handy functions to manipulate arrays of octets that happen to be UTF-8 encoded texts.
Sorry. But Py3 is not well-designed, it's a great example of a "second system effect". The fact that companies are migrating from Py27 to Golang should give Py3 developers some clue.
Example from this thread:
2) Python 2.7 (works as expected):
3) Python3.5 (craps its pants):
So, why should I migrate to Py3 if it's full of stuff like this? What's the point?
Posted Jan 13, 2017 7:01 UTC (Fri)
by ddevault (subscriber, #99589)
[Link] (5 responses)
_Strings_ are NOT bytes. They are characters. Python 3 strings are arrays of characters, and bytes are just that - bytes. Bytes are not strings. Bytes are bytes. A "string of bytes" isn't a phrase that makes sense. Bytes in Python 3 is a byte array. That's all that it is. Period. End of discussion. It could be a byte array whose contents is an _encoded_ string, which you might decode into an array of characters (aka a string). But it's NOT a string.
>Python 3 made working with them extremely cumbersome.
No, it made it wonderfully simple. You're just using it wrong.
Your example is also broken.
>open('test-\ud800.txt', 'w').close()
You're already off to a bad start. Did you even try your example in Python 3?
>>> open('test-\ud800.txt', 'w').close()
Let me show you what you meant to do:
>>> open(b'test-\ud800.txt', 'w').close()
Note how it doesn't crap out. You need to get your head out of Python 2 and learn Python 3 properly.
Posted Jan 13, 2017 7:20 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
> But it's NOT a string.
Why should "real strings" be special?
> No, it made it wonderfully simple. You're just using it wrong.
> You're already off to a bad start. Did you even try your example in Python 3?
> Note how it doesn't crap out. You need to get your head out of Python 2 and learn Python 3 properly.
If I jump through all the hoops - what do I gain as a result? Nothing but headache. Unicode support in Python3 provides no usefulness whatsoever.
Posted Jan 13, 2017 7:26 UTC (Fri)
by ddevault (subscriber, #99589)
[Link] (3 responses)
Posted Jan 13, 2017 7:39 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
BUT THIS IS NOT TRUE!
Real world out here is full of misencoded stuff that can only be represented by byte arrays. It can NOT be encoded into strings without losses or quoting. Thus "PathLike" class with dual nature to represent them is needed for Py3.
Py3 makes working with such entities a total pain. Py27 is entirely a breeze - it just works.
And please, do explain what I gain from following all the Py3 rules of unicode strings? What are the advantages over Py27 with utf-8 strings?
Posted Jan 13, 2017 17:42 UTC (Fri)
by sfeam (subscriber, #2841)
[Link]
Posted Jan 14, 2017 15:36 UTC (Sat)
by bronson (subscriber, #4806)
[Link]
You're saying that there are lots of problems with Python 3?
Posted Jan 19, 2017 10:28 UTC (Thu)
by chojrak11 (guest, #52056)
[Link] (1 responses)
You're grossly wrong. There are many rationales for the migration, none of them being wrong Unicode design in Python. Some of them are:
1) Poor Python performance, caused by runtime type reconciliation because of
Go is fast, because it's a statically typed language. There's no risk that a method will be called with an unplanned data type, because the compiler does the hard work with its type system. Thanks to that, less unit testing is required and that, in turn, allows for faster feature delivery.
Posted Jan 20, 2017 16:49 UTC (Fri)
by NAR (subscriber, #1313)
[Link]
Posted Jan 19, 2017 14:45 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (4 responses)
That's not what happens on my machine (which is running Debian Stretch):
Looks reasonable to me. In any case Python 3.5 doesn't seem to have a problem dealing with file names that are byte strings. (Trying to retrieve the file name as a (Unicode) string doesn't work but then again the file name isn't valid UTF-8-encoded Unicode to begin with, which I guess was part of the original point.)
Generally I've been quite happy with Python 3 so far. I have yet to run into something that would tempt me to go back to Python 2.7, or for that matter a hypothetical Python 2.8.
Posted Jan 19, 2017 16:49 UTC (Thu)
by excors (subscriber, #95769)
[Link] (3 responses)
Paths on Linux are sequences of arbitrary 8-bit symbols, so byte strings work trivially there (as long as you don't assume they're e.g. UTF-8). And Python's Unicode strings work okay for paths on Windows since no conversion is needed. But it's a pain when you're trying to write code to run on both platforms - you either pick one type of string and suffer with error-prone conversions on the other platform, or you create a new opaque native-path type with explicitly lossy conversions to the standard string type (like Rust's OsString).
Posted Jan 19, 2017 23:22 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (2 responses)
Am I glad I don't have anything to do with Windows. Having to use Windows must really suck.
Posted Jan 22, 2017 2:03 UTC (Sun)
by flussence (guest, #85566)
[Link] (1 responses)
Posted Jan 22, 2017 10:25 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 27, 2017 1:26 UTC (Fri)
by ras (subscriber, #33059)
[Link] (2 responses)
> >>> open('test-\ud800.txt', 'w').close()
Well you were right it doesn't work in 3.5, but it does so in an entirely reasonable way. This was not true in earlier versions.
> The real world out there is not unicode, it is composed of strings of bytes.
Actually, the real world has both strings and bytes. Strings are things you can display to a human. You can not reliably display bytes. A type system that distinguishes between the two reduces the odds of displaying unintelligible crap to users.
The distinction was unnecessary when ASCII was the sole encoding. Even after that it wasn't a pressing issue when the only people likely to see your strings were in the same office. But then the internet came along, and the solution we had adopted was to represent printable strings as the tuple (encoding, bytes) - and there were 100's of encodings. Agreeing on an encoding when you are all in the same office is one thing, but negotiating a common encoding between two remote systems so they can exchange strings was unworkable.
Unicode was meant to fix it, but initially it didn't. Amazingly, it appears they failed to understand the problem because they gave us multiple encodings for Unicode. Fortunately a C programmer came along and gave us UTF-8, which the world has settled on. Somewhat less fortunately, early systems (Java, Windows) adopted the failed earlier attempts at encoding Unicode. Python 3 has avoided that mess - its internal representation looks pretty good to me.
Python 3's biggest mistake, which as you point out they still haven't recovered from, is they didn't abandon making "open('name', 'rt')" the default. The core problem is that what 't' means isn't well defined on Unix (whereas "open(b'name', 'rb')" is always well defined). In fact it was never well defined - line endings have always varied between systems. But extending 't' to define the encoding broke it completely because it no longer just affected line endings - it was the whole document, and the file names that pointed to it. Worse, assuming 't' on file handles that transport bytes (like many pipes and sockets) meant they failed to work at all whereas before they did.
But the solution is easy enough. In python3, get into the habit of casting all file names to bytes, opening all files in binary, and doing the conversions yourself - which is how it should be done. You will find this hugely annoying, as there will be many times you won't know what encoding to use. But in that case you will be able to put the blame where it belongs - which is on whoever wrote the data, not python3.
For files that should be printable ASCII (eg, most config files in /etc) I find "open('...', encoding='ascii', errors='ignore')" a reasonable way to pretend the world hasn't changed in 30 years.
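For what it's worth, the "open in binary and convert yourself" habit described above can be sketched roughly like this. The file name and its contents are made up for the example, and the choice of latin-1 as a fallback is only an assumption, not something the comment prescribes:

```python
import os
import tempfile

# Hypothetical config file content: ASCII text with one stray Latin-1 byte.
payload = b"name=caf\xe9\nport=8080\n"

with tempfile.TemporaryDirectory() as tmp:
    # Keep the file name as bytes, as suggested above.
    path = os.path.join(os.fsencode(tmp), b"example.conf")
    with open(path, "wb") as f:
        f.write(payload)

    # Read raw bytes first; decoding is a separate, explicit decision.
    with open(path, "rb") as f:
        raw = f.read()

    # The shortcut from the comment: let open() silently drop non-ASCII bytes.
    with open(path, encoding="ascii", errors="ignore") as f:
        lossy_text = f.read()

# A strict ASCII decode refuses the stray byte instead of guessing.
strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# latin-1 maps every byte to exactly one code point, so it never fails.
as_latin1 = raw.decode("latin-1")
# errors="replace" keeps going and marks each bad byte with U+FFFD.
as_replaced = raw.decode("ascii", errors="replace")
```

The point of the sketch is only that the lossy step becomes visible in the code: you pick "ignore", "replace" or a concrete encoding yourself instead of inheriting whatever the locale says.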
Posted Jan 27, 2017 23:40 UTC (Fri)
by lsl (subscriber, #86508)
[Link]
That's certainly how Python 3 approaches it but is not something universally agreed upon in the field. I like to think of strings as just that, a sequence of …things (bits, bytes, "characters", whatever kind of symbol you can come up with).
Ok, so the Py3 string type doesn't represent a generic string but is a type tailored to the use-case of displaying text to users. That's fine. What's not fine is how it permeates all kinds of APIs in the Python standard library that don't have anything to do with displaying text to users and might even be incompatible with it, requiring lots of weird workarounds for things that just work with Python 2.
That's kind of what was meant by Python 3 throwing system programmers under the bus for the benefit of high-level "app" developers. That's certainly a decision one can make but there shouldn't be any surprise that not everyone likes it.
Posted Jan 28, 2017 0:09 UTC (Sat)
by lsl (subscriber, #86508)
[Link]
In my opinion, the actual mistake is the inclusion of such "text" mode for file I/O in the first place. It doesn't have a place in a modern language designed for modern computers.
Even decades ago it reeked of design-by-committee, probably driven by mainframe vendors who (from today's perspective) did ultra-weird things with text vs. non-text (and "file" I/O in general). Back then it was probably a reasonable argument that a general-purpose language should support stuff like this. But today? Virtually all operating systems settled on the same "bag of bytes" file abstraction. Even those who didn't either run Python in a POSIX-like environment or don't run it at all (and probably never will).
There's no reason left to justify the complication of file I/O with this "text vs. binary" stuff. Just as a modern file transfer protocol wouldn't include a "tenex" mode-switching command.
Posted Jan 14, 2017 5:45 UTC (Sat)
by lsl (subscriber, #86508)
[Link] (1 responses)
You can redefine terms all you want but there are operations I can reasonably do on bytes that the literature calls "string manipulation". I'm talking about things like splitting on 0x0A ('\n') or 0x2F ('/') bytes. Those are reasonable things to do if whatever you're working on defines them as a reasonable thing to do. I don't have to somehow "decode" the byte string before I can manipulate it. In fact, I cannot possibly decode it as I have no idea (nor a desire to know) what any of these bytes are supposed to mean. The only thing I need to know is that I can legitimately split them upon encountering a '/' byte.
Super simple stuff, until you bring Python 3 into the mix with its desire to enforce specific encodings where none were agreed upon.
Posted Jan 14, 2017 10:54 UTC (Sat)
by rschroev (subscriber, #4164)
[Link]
>>> b'abc/def/ghi'.split(b'/')
I think this didn't work in 3.0; I don't know when that changed. IIRC at the same time other string manipulations were implemented for bytestrings.
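To make the "split on '/' without decoding" case a bit more concrete, here is a small sketch using only bytes. The path is invented for the example and deliberately contains bytes that are not valid UTF-8; posixpath is used so that it behaves the same on any platform:

```python
import posixpath

# Hypothetical path with a component that is not valid UTF-8.
raw_path = b"/srv/data/\xff\xfe-report/summary.txt"

# Plain bytes methods: no encoding is ever consulted.
components = raw_path.split(b"/")

# The path helpers accept bytes and return bytes.
basename = posixpath.basename(raw_path)
parent = posixpath.dirname(raw_path)
rejoined = posixpath.join(parent, b"other.txt")

# Splitting a buffer on newline bytes works the same way.
lines = b"first\nsecond\xff\nthird".split(b"\n")
```

None of these calls needs to know what the bytes mean, only where the separators are.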
Posted Jan 14, 2017 0:15 UTC (Sat)
by vstinner (subscriber, #42675)
[Link]
Python 3 supports Unicode and bytes, both work perfectly fine. Using Unicode on Linux, you *can* handle any filename, including filenames not decodable from the locale encoding. Just try (ex: non-ASCII filename but POSIX locale, LC_ALL=C), you will see (os.listdir() doesn't fail, but print(filename) can fail). On Windows, using bytes is more weird, but it should "just work" since Python 3.6.
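A minimal sketch of the mechanism behind this, assuming a UTF-8 filesystem encoding: undecodable bytes are mapped to lone surrogates by the "surrogateescape" error handler and mapped back on the way out. The file name is made up, and the codec is spelled out explicitly here so the result does not depend on the locale:

```python
# A file name containing a byte that is not valid UTF-8 (hypothetical example).
raw_name = b"test-\xff.txt"

# Decode the way Python 3 treats file names on POSIX: undecodable bytes
# become lone surrogates (U+DC80..U+DCFF) instead of raising an error.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# A strict encode, roughly what print() needs on a strict UTF-8 stdout,
# rejects the lone surrogate.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

That is why os.listdir() can hand back such a name as str and open() can still use it, while printing it may fail.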
But you missed the point. The "fspath" protocol is unrelated to "bytes vs Unicode", it's only an enhancement to support custom objects like pathlib.Path, accept them in functions expecting a filename like open(). See the What's New in Python 3.6 for an example:
You are free to not use pathlib. It seems like some users prefer pathlib over handling "manually" paths using os.path.join() for example.
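As a rough illustration of the protocol (not the example from the What's New page), any object can opt in by defining __fspath__. The ProjectFile class below is hypothetical; os.fspath() and pathlib are the real Python 3.6+ APIs:

```python
import os
import pathlib


class ProjectFile:
    """Hypothetical class that represents a file inside a project directory."""

    def __init__(self, root, name):
        self.root = root
        self.name = name

    def __fspath__(self):
        # Called by os.fspath(), and by open() and os.path functions (3.6+).
        return self.root + "/" + self.name


config = ProjectFile("project", "settings.ini")

# A custom object is converted through its __fspath__ method.
as_text = os.fspath(config)
# str and bytes pass through unchanged, so bytes paths keep working.
as_bytes = os.fspath(b"project/settings.ini")
# pathlib objects implement the same protocol.
from_pathlib = os.fspath(pathlib.PurePosixPath("project") / "settings.ini")
```

Note that bytes go through untouched, which fits the point that the protocol is about path-like objects rather than about bytes versus Unicode.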
Posted Jan 19, 2017 9:34 UTC (Thu)
by breamoreboy (guest, #113635)
[Link] (12 responses)
Posted Jan 19, 2017 22:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (11 responses)
I do use Unicode extensively as I speak 4 languages, every one of them written with a different script. I can tell you stories about bad codepages for _hours_. Like how way back then I used to have an "encoding decryptor" that used character and digraph frequency analysis to determine the sequence of bad re-codings. Or about the magically disappearing letter "Н" in messages. Or about pseudographic characters appearing on printers instead of text.
Posted Jan 20, 2017 6:55 UTC (Fri)
by ssokolow (guest, #94568)
[Link] (10 responses)
I've been meaning to write something similar for some old stuff where things like ½ in recipes got mangled.
Posted Jan 20, 2017 7:18 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
I might be able to dig it up, but it should probably be rewritten for today's world and for other languages anyway.
Posted Jan 20, 2017 9:40 UTC (Fri)
by ssokolow (guest, #94568)
[Link] (8 responses)
Posted Jan 23, 2017 5:52 UTC (Mon)
by adobriyan (subscriber, #30858)
[Link] (7 responses)
Ironically, Rust assumes everything is valid UTF-8 including Unix filenames.
Posted Jan 23, 2017 7:03 UTC (Mon)
by liw (subscriber, #6379)
[Link] (4 responses)
Posted Jan 23, 2017 11:44 UTC (Mon)
by ssokolow (guest, #94568)
[Link] (3 responses)
Posted Jan 23, 2017 12:20 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
> On POSIX platforms, In-memory representation is identical to String.
No, it can represent arbitrary bytes.
> On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.
It actually uses WTF-16 internally, a superset of UTF-16.
Right from the docs at the OsString link you gave:
On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.
On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.
In Rust, strings are always valid UTF-8, but may contain zeros.
Posted Jan 23, 2017 12:51 UTC (Mon)
by ssokolow (guest, #94568)
[Link] (1 responses)
I never said it couldn't. That's what I meant when I said they had the same in-memory representation on POSIX platforms.
I looked at the up-to-date source to the standard library before I made my post. (The key detail is that, when Windows UTF-16 is invalid UTF-8, it's not because it encodes codepoints that UTF-8 can't represent, it's that it violates higher-level rules about surrogates never occurring in isolation or out of order. After the transformation process is complete, UTF-16 and UTF-8 are simply two different ways to map 21-bit numbers into sequences of bytes, so it's trivial for WTF-8 to represent an arbitrary sequence of numbers 21-bits or narrower just by omitting some of the rules checks that a UTF-8 codec enforces.)
I actually started out with that exact quote but edited it out for brevity.
Posted Jan 23, 2017 20:34 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
The WTF-8 versus WTF-16 thing is news to me. Maybe it was communicated poorly when I first heard about it or it has changed since then.
Thanks for the clarifications.
Posted Jan 23, 2017 11:12 UTC (Mon)
by Jonno (subscriber, #49613)
[Link] (1 responses)
Rust does not. While Rust Strings are always UTF-8, Rust OsStrings are not. They are "arbitrary sequences of non-zero bytes" (on Unix) or "arbitrary sequences of non-zero 16-bit values" (on Windows).
Directory listings use OsStrings, not Strings, for filename components, and File::open() will accept anything from which Rust knows how to build a path, including both Strings *and* OsStrings.
There are convenience methods to convert an OsString to a String (which will fail if the OsString does not contain valid Unicode), as well as to convert a String to an OsString (which will fail if the String contains any "U+0000 NULL" characters), but there is no requirement that you use them.
In fact, in most circumstances you should not. Keep the OsString for path manipulations, and if you need a pretty UTF-8 string to show the user, use the heavier OsString::to_string_lossy() method to get a string with any invalid Unicode sequences replaced with "U+FFFD REPLACEMENT CHARACTER".
Posted Jan 23, 2017 12:22 UTC (Mon)
by ssokolow (guest, #94568)
[Link]
Actually, here's a Rust Playground link demonstrating that.
>>> a = "asdfasfd"
>>> b = b'\x01\x02\x00'
>>> a+b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>> a = "asdfasfd"
>>> b = b'\x01\x02\x00'
>>> a+b
'asdfasfd\x01\x02\x00'
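For comparison, a sketch of how the same concatenation can be written in Python 3 once you decide which side of the fence the result lives on. Using latin-1 for the text variant is an assumption made for the example (it maps each byte to one code point), not a general recommendation:

```python
a = "asdfasfd"
b = b"\x01\x02\x00"

# Option 1: treat everything as bytes.
joined_bytes = a.encode("utf-8") + b

# Option 2: treat everything as text; latin-1 maps each byte to one code point.
joined_text = a + b.decode("latin-1")

# Mixing the two types directly raises TypeError in Python 3.
try:
    a + b
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Whether that extra step is a feature or a nuisance is, of course, what the argument here is about.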
I don't care what you call them. The real world out there is not unicode, it is composed of strings of bytes.
Yes it is fair. Python 3 unicode design is totally braindead. We now have several examples of GOOD unicode-enabled design, from both parts of the spectrum:
1) Set up:
>>> open('test-\ud800.txt', 'w').close()
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
FileNotFoundError: [Errno 2] No such file or directory: b'test-\xed\xa0\x80.txt'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 5: surrogates not allowed
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]
No, they aren't. They are NOT arrays of characters, and anybody who is familiar with Unicode can tell you that. Python strings are arrays of Unicode _codepoints_ - a Unicode character can _easily_ consist of multiple codepoints.
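A short sketch of the code point versus character distinction, using a combining accent as the example. Only the standard unicodedata module is used; the sample strings are made up:

```python
import unicodedata

# "e" followed by a combining acute accent: one visible character, two code points.
decomposed = "e\u0301"
# NFC normalization turns it into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

lengths = (len(decomposed), len(composed))

# Slicing works on code points, so it can separate the accent from its letter.
word = "cafe\u0301"
first_four = word[:4]
```

Here len() reports 2 for something a reader sees as one character, and the slice silently drops the accent.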
I don't CARE about strings. Py3 "strings" are an exception in the REAL world out here. Real world operates on byte sequences - they are in HTTP headers, in CSV files, in filenames and so on.
"You're holding it wrong" - exactly.
Yes, I did. It's an example from: https://lwn.net/Articles/711492/ - on Windows, though I think I've used 'test\x01\x02.txt'.
Thanks, but I don't like Koolaid.
"Development would be so easy if it weren't for those pesky users."
2) Dynamic typing, which then requires
3) Lots of unit tests required to keep that dynamic types in shape.
$ python3.5
Python 3.5.2+ (default, Dec 13 2016, 14:16:35)
[GCC 6.2.1 20161124] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> [open(f, 'r').close() for f in os.listdir(b'.')]
[None]
>>> os.listdir(b'.')
[b'test-\xed\xa0\x80.txt']
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 5: surrogates not allowed
split() works on bytestrings now
[b'abc', b'def', b'ghi']
>>> b'abc def ghi'.split()
[b'abc', b'def', b'ghi']
https://docs.python.org/dev/whatsnew/3.6.html#pep-519-add...
Rust has three types of "strings". From least to most "binary", they are:
1) std::string::String: uses a Vec<u8> internally. Can contain NULL bytes.
2) std::ffi::OsString: can contain NULL bytes even if the OS-native form can't. Can be built from a String without conversion cost. On POSIX platforms, the in-memory representation is identical to String.
3) Vec<u8>: can contain NULL bytes. Vec has "sequence of items" versions of most APIs you'd think of as being for string manipulation.
See also: std::path::{Path,PathBuf} (OsString internally)
> No, [OsString] can represent arbitrary bytes.
String (1) and OsString (1, 2) both rely on an internal Vec<u8> to store the raw bytes... String just enforces extra invariants and presents a different set of methods.
The to_str method gets you an &str pointing at the contents of the OsString as long as it's valid UTF-8.
OsString uses an inner type called sys::os_str::Buf to actually store the string and, on Windows, that's an API wrapper around sys_common::wtf8::Wtf8Buf which, again, uses a Vec<u8> internally.
OsString is a superset of String and whatever the OS offers. It'll carry NULL characters just fine.