Python 2.8?

Posted Jan 13, 2017 3:19 UTC (Fri) by excors (subscriber, #95769)
In reply to: Python 2.8? by foom
Parent article: Python 2.8?

It seems to convert to something that's nearly but not quite UTF-8:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32

>>> import os

>>> open('test-\ud800.txt', 'w').close()

>>> os.listdir('.')
['test-\ud800.txt']

>>> os.listdir(b'.')
[b'test-\xed\xa0\x80.txt']

>>> [f.decode('utf-8') for f in os.listdir(b'.')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5: invalid continuation byte

So you need to treat them as opaque byte strings, not as encoded Unicode, even on Windows.

Hmm, but how are you meant to use byte strings with open()? I would have thought this should work, but it doesn't:

>>> [open(f, 'r').close() for f in os.listdir(b'.')]
FileNotFoundError: [Errno 2] No such file or directory: b'test-\xed\xa0\x80.txt'

Posted Jan 13, 2017 19:12 UTC (Fri) by foom (subscriber, #14868)

Regarding [f.decode('utf-8') for f in os.listdir(b'.')]: Apparently Python no longer allows surrogates in its UTF-8 codec by default; you need to use .decode('utf-8', errors='surrogatepass') instead.
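A minimal sketch of the surrogatepass suggestion, using the byte string from the os.listdir(b'.') output above. The helper name decode_filename is made up for illustration; only the codec and error handler names come from the thread.

```python
# The bytes os.listdir(b'.') returned for the file named 'test-\ud800.txt'.
raw_name = b'test-\xed\xa0\x80.txt'


def decode_filename(raw):
    """Hypothetical helper: decode UTF-8 bytes, letting encoded lone surrogates through."""
    return raw.decode('utf-8', errors='surrogatepass')


# The strict UTF-8 codec rejects the encoded surrogate.
try:
    raw_name.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogatepass the name decodes, and encoding it again restores the bytes.
name = decode_filename(raw_name)
round_trip = name.encode('utf-8', errors='surrogatepass')
```

The same error handler has to be passed when encoding, otherwise the lone surrogate in the resulting string raises UnicodeEncodeError.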

Regarding [open(f, 'r').close() for f in os.listdir(b'.')]: That sounds like a bug, at least to me.

Posted Jan 13, 2017 19:53 UTC (Fri) by excors (subscriber, #95769)

The open() issue is not restricted to weird surrogate cases - it fails with pretty much any non-ASCII filename, like 'test-\u00c0.txt' ("No such file or directory: b'test-\xc3\x80.txt'"). Looks like it actually tries to open 'test-\u00c3\u20ac.txt', i.e. open() is always decoding the filename as Win-1252, which doesn't seem especially helpful. (This is with Python 3.6.0 on English-language Windows 7.)
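The mis-decoding described above can be reproduced without touching the filesystem. This is only a sketch of the suspected mechanism (UTF-8 bytes read back as Windows-1252), not a trace of what open() does internally; the variable names are illustrative.

```python
# The filename as the user typed it.
original = 'test-\u00c0.txt'

# The UTF-8 bytes that os.listdir(b'.') reports for it.
utf8_bytes = original.encode('utf-8')

# Reading those bytes as Windows-1252 yields the name that open() appeared to look for.
misread = utf8_bytes.decode('cp1252')

print(utf8_bytes)      # b'test-\xc3\x80.txt'
print(ascii(misread))  # 'test-\xc3\u20ac.txt'
```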

os.open() seems to do the right thing with byte string filenames, but I guess it would be nice if open() did too. So I think this claim:

> Python 3.6 has actually, finally, fixed the way it deals with Windows path APIs

is unfortunately a bit premature.
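If os.open() does handle byte string filenames correctly, as observed above, one possible workaround is to open the file descriptor with os.open() and wrap it with os.fdopen(). This is a sketch under that assumption; open_bytes_path and read_listed_files are hypothetical helpers, and the sketch is read-only.

```python
import os


def open_bytes_path(path, mode='r'):
    """Hypothetical helper: open a bytes path via os.open(), then wrap the descriptor."""
    fd = os.open(path, os.O_RDONLY)
    return os.fdopen(fd, mode)


def read_listed_files(directory):
    """Read every file in a directory given as bytes, never decoding the names."""
    contents = {}
    for name in os.listdir(directory):
        # os.path.join() accepts bytes as long as all components are bytes.
        with open_bytes_path(os.path.join(directory, name)) as f:
            contents[name] = f.read()
    return contents
```

Calling read_listed_files(b'.') keeps every name as the byte string that os.listdir() returned, so no codec is involved on the Python side.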

Posted Jan 13, 2017 23:04 UTC (Fri) by foom (subscriber, #14868)

D'oh.. Guess that's what I get for not testing before praising it... 😐

Posted Jan 14, 2017 0:08 UTC (Sat) by vstinner (subscriber, #42675)

> It seems to convert to something that's nearly but not quite UTF-8:

Python 3.6 now uses UTF-8/surrogatepass on Windows in os.fsdecode() / os.fsencode(). Fortunately, these functions are almost never needed on Windows, since Windows has native support for Unicode. For example, for command-line arguments, directory listings, the hostname, and so on, Windows returns data directly as Unicode.
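A small sketch of the round trip through os.fsdecode() / os.fsencode(), using the byte string from earlier in the thread. The decoded form depends on the platform (UTF-8/surrogatepass on Windows with Python 3.6+, typically the locale encoding with surrogateescape elsewhere), so the sketch only relies on the round trip, not on what the intermediate string looks like.

```python
import os

# Bytes as returned by os.listdir(b'.') for 'test-\ud800.txt' on Windows.
raw = b'test-\xed\xa0\x80.txt'

# Decode with the filesystem encoding and its platform-specific error handler.
text = os.fsdecode(raw)

# Encoding again with the same settings is expected to restore the original bytes.
back = os.fsencode(text)

print(ascii(text))
print(back == raw)
```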

The surrogatepass error handler is required to support the same character set as Windows. Windows does allow lone surrogate characters in filenames. That is really weird and does not conform to the Unicode standard, which forbids these characters in UTF-* encodings.

Python 3 respects the Unicode standard: surrogate characters are not allowed in the UTF-8 codec, for example. That made it possible to implement useful new error handlers for UTF-8: surrogateescape, surrogatepass, etc. By the way, Python 3.5 added an interesting new "namereplace" error handler.
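For readers who have not seen namereplace, a brief sketch: when encoding, it replaces characters the target codec cannot represent with \N{...} escapes. The filename is the one used elsewhere in this thread; the variable names are illustrative.

```python
name = 'test-\u00c0.txt'

# Characters that ASCII cannot represent become \N{...} escape sequences.
escaped = name.encode('ascii', errors='namereplace')

print(escaped)  # b'test-\\N{LATIN CAPITAL LETTER A WITH GRAVE}.txt'
```

Note that namereplace only applies to encoding, and unlike surrogateescape or surrogatepass it is not meant to be reversed by a matching decode step.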

Posted Jan 15, 2017 21:11 UTC (Sun) by mathstuf (subscriber, #69389)

Rust handles this by having an "OsStr" type at system call boundaries related to paths (and, IIRC, environment variables and process arguments). Strings can be converted to them easily (they implement AsRef<OsStr>), but getting them back out requires an explicit to_str call (which can fail) or the _lossy version, to_string_lossy (which uses replacement characters for unconvertible sequences). On Windows, there is a purely internal "WTF-16" encoding which is UTF-16 with allowances for Windows-specific exceptions. This keeps the encodings from getting mixed up and supports the real POSIX policy of "filenames are sequences of nonzero, non-/ characters" gracefully, without LANG breaking your code because of assumptions made based on it.

But Python isn't a fan of this kind of type safety (implicit casting with exceptions would probably be fine, though, and would give better error messages than "file from readdir does not exist").

Posted Jan 20, 2017 4:17 UTC (Fri) by foom (subscriber, #14868)

> Windows returns data directly as Unicode

I know you clarified later in your comment, but I'd just like to emphasize (especially since a lot of people seem to say that without clarification): Windows APIs absolutely *do not* return "Unicode". Instead, they deal with arrays of 16-bit values, which, when you're lucky, can be decoded via UTF-16 into a Unicode string.

And, just like decoding Linux "UTF-8" paths to a Unicode string might fail due to the path not actually being UTF-8, decoding a Windows "UTF-16" path might fail due to the path not actually being valid UTF-16.

In both OSes, in order to avoid errors, you'll want to tell the Unicode decoder/encoder to allow the invalid input bytestrings and transform them into something nonsensical but reversible. Or, alternatively, avoid decoding the paths into Unicode at all, and just leave them in their native 8/16-bit bytestring representations.
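A sketch of the "nonsensical but reversible" approach for both cases, using error handlers that exist in Python 3: surrogateescape for 8-bit paths that are not valid UTF-8, and surrogatepass for 16-bit paths that are not valid UTF-16. The sample byte strings are made up for illustration.

```python
# Linux-style path: Latin-1 bytes, which are not valid UTF-8.
posix_raw = b'caf\xe9.txt'
posix_text = posix_raw.decode('utf-8', errors='surrogateescape')
posix_back = posix_text.encode('utf-8', errors='surrogateescape')

# Windows-style path: 16-bit units 't', a lone high surrogate U+D800, '.'.
windows_raw = b't\x00\x00\xd8.\x00'
windows_text = windows_raw.decode('utf-16-le', errors='surrogatepass')
windows_back = windows_text.encode('utf-16-le', errors='surrogatepass')

# Both strings contain lone surrogates, so they are not valid Unicode text,
# but each one encodes back to exactly the bytes it came from.
print(ascii(posix_text), posix_back == posix_raw)
print(ascii(windows_text), windows_back == windows_raw)
```

The decoded strings are only safe to pass back to the same encoder with the same error handler; printing them or encoding them with a strict codec will raise UnicodeEncodeError.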