Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 10:18 UTC (Tue) by smurf (subscriber, #17840)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by roc
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

Yeah, but that's an implementation problem. Conceptually, the surrogateescape route is bidirectionally lossless and thus solves the problem.
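
To make "bidirectionally lossless" concrete, here is a minimal sketch of the round trip on the POSIX side. The filename bytes are a made-up example, not something from the article: any byte that is not valid UTF-8 is smuggled into the str as a lone low surrogate and restored on encode.

```python
# Sketch: undecodable bytes -> str -> the same bytes, via surrogateescape.
raw = b'caf\xe9.txt'  # hypothetical Latin-1 filename; not valid UTF-8

# The stray byte 0xE9 is mapped to the lone low surrogate U+DCE9.
name = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9.
restored = name.encode('utf-8', errors='surrogateescape')

# ascii() is used because a str containing a surrogate may not be printable.
print(ascii(name), restored == raw)
```

Note that this only covers escaped bytes 0x80-0xFF (U+DC80 through U+DCFF); it says nothing about surrogates that come from somewhere else.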

Posted Jan 14, 2020 12:02 UTC (Tue) by roc (subscriber, #30627)

Read the comments more carefully. surrogateescape does not give you a way to encode a Windows filename containing a lone surrogate as a Python Unicode string.

Posted Jan 14, 2020 17:02 UTC (Tue) by NYKevin (subscriber, #129325)

My read of https://docs.microsoft.com/en-us/windows/win32/fileio/nam... is that those filenames are illegal anyway:

> Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:

> [various exceptions]

In this context, "Unicode" means "UTF-16," because Microsoft. A lone surrogate is certainly not a "UTF-16 character." It's half a character.

I don't know if the various *W interfaces actually check for lone surrogates and error out (the documentation for CreateFileW does not explicitly call this case out), but they probably should.

(Microsoft's error checking of filenames is kinda terrible anyway, so I would not be too surprised if you could get lone surrogates through the API. For example, it says that you can't create a file whose name ends in a dot, but if you prefix the path with the \\?\ prefix that they also discuss on the same page, then that check is bypassed. And then your sysadmin has to figure out how to delete the damned thing, because none of the standard tools will even recognize that it exists. See also: https://bugs.python.org/issue22299)

Posted Jan 14, 2020 19:39 UTC (Tue) by foom (subscriber, #14868)

No, it doesn't really mean UTF-16, it really means "16-bit Unicode". This wasn't some mistake -- the Windows Unicode support was defined back when surrogates didn't exist, and "Unicode characters" simply *were* only 16 bits wide, because people were REALLY REALLY trying to pretend that all the characters could be encoded in 65536 code points. It's the same in Java and JavaScript, too, fwiw...

So, yes, you can perfectly well put lone halves of a surrogate pair in Windows Unicode strings and filesystems.

Posted Jan 14, 2020 21:46 UTC (Tue) by NYKevin (subscriber, #129325)

That was true once upon a couple of decades ago, but now "Unicode" means either "UTF-16" or "UTF-16LE" depending on context. See for example https://docs.microsoft.com/en-us/windows/win32/intl/surro..., which explicitly states:

> Standalone surrogate code points have either a high surrogate without an adjacent low surrogate, or vice versa. These code points are invalid and are not supported. Their behavior is undefined.

Posted Jan 14, 2020 22:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Yes, we know that. But there are real Windows filesystems that don't.

Posted Jan 14, 2020 19:47 UTC (Tue) by foom (subscriber, #14868)

You don't need a special way to store a lone surrogate in a Python string -- Python Unicode strings are fine with holding lone surrogates.

However, you do need a UTF-16 decoder/encoder variant that allows lone surrogates to be decoded/encoded without throwing an error. Python has the "surrogatepass" error handler for that. E.g.:

b'\x00\xD8'.decode('utf-16le', errors='surrogatepass').encode('utf-16le', errors='surrogatepass')
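
Spelled out a bit more, as a sketch rather than anything Mercurial or CPython actually does internally: the name below is a hypothetical Windows-style filename with a lone high surrogate in the middle. The two try blocks show that the strict UTF-16 codec rejects it, and that surrogateescape does not help either, since U+D800 lies outside the U+DC80..U+DCFF range that handler owns.

```python
# Sketch: a str can hold a lone surrogate; only encoding needs special care.
name = 'a\ud800b'  # hypothetical filename with a lone high surrogate

# surrogatepass lets the UTF-16 codec emit and accept the lone surrogate.
wire = name.encode('utf-16-le', errors='surrogatepass')
back = wire.decode('utf-16-le', errors='surrogatepass')

# The default (strict) handler refuses to encode it.
try:
    name.encode('utf-16-le')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogateescape only understands U+DC80..U+DCFF, so U+D800 fails too.
try:
    name.encode('utf-8', errors='surrogateescape')
    escape_ok = True
except UnicodeEncodeError:
    escape_ok = False

print(wire, back == name, strict_ok, escape_ok)
```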

Posted Jan 14, 2020 21:55 UTC (Tue) by roc (subscriber, #30627)

Good point. That's interesting. So you can use the "lone surrogate" (mis)feature to represent both Linux and Windows filenames as Python3 "Unicode" strings... it's just that the method is different for Linux and Windows.

But do APIs like os.listdir() do that automatically on Windows like they do on Linux?

Posted Jan 15, 2020 4:18 UTC (Wed) by NYKevin (subscriber, #129325)

According to (a comment in) the source code: https://github.com/python/cpython/blob/master/Modules/pos...

> On Windows, if we get a (Unicode) string we extract the wchar_t * and return it; if we get bytes we decode to wchar_t * and return that.

I believe this means that the Windows version of that module will never try to encode Unicode strings into UTF-16LE (or any other encoding), meaning it won't "catch" invalid surrogates. They should pass straight through to the Windows *W APIs.

This is also supported by the os.path docs, which say the following:

> Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

Since they only call out bytes objects, I think the implication is that str objects are not a problem on Windows. But the docs mention neither os.fsencode() and os.fsdecode() nor the surrogateescape handler, which makes me suspect that this documentation is no longer up to date.
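
One way to check what a given interpreter actually does is to ask it. The sketch below assumes Python 3.6 or later (for sys.getfilesystemencodeerrors()); the helper name fs_round_trips and both sample names are made up for illustration. As far as I know, POSIX builds report surrogateescape and Windows builds report surrogatepass with UTF-8 unless legacy mode is enabled, but the code does not rely on that and simply branches on what is reported.

```python
import os
import sys


def fs_round_trips():
    """Round-trip a sample name through os.fsencode()/os.fsdecode().

    Returns True or False for the two handlers this sketch knows about,
    and None for any other handler.
    """
    errors = sys.getfilesystemencodeerrors()  # Python 3.6+
    if errors == 'surrogateescape':
        # Escaped bytes: start from bytes that may not be decodable.
        raw = b'caf\xe9.txt'  # hypothetical sample
        return os.fsencode(os.fsdecode(raw)) == raw
    if errors == 'surrogatepass':
        # Lone surrogates: start from a str holding one.
        name = 'a\ud800b'  # hypothetical sample
        return os.fsdecode(os.fsencode(name)) == name
    return None


print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(fs_round_trips())
```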