Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 14, 2020 10:18 UTC (Tue) by smurf (subscriber, #17840)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by roc
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 14, 2020 12:02 UTC (Tue)
by roc (subscriber, #30627)
[Link] (7 responses)
Posted Jan 14, 2020 17:02 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
> Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
> [various exceptions]
In this context, "Unicode" means "UTF-16," because Microsoft. A lone surrogate is certainly not a "UTF-16 character." It's half a character.
I don't know if the various *W interfaces actually check for lone surrogates and error out (the documentation for CreateFileW does not explicitly call this case out), but they probably should.
(Microsoft's error checking of filenames is kinda terrible anyway, so I would not be too surprised if you could get lone surrogates through the API. For example, it says that you can't create a file whose name ends in a dot, but if you prefix the path with the \\?\ prefix that they also discuss on the same page, then that check is bypassed. And then your sysadmin has to figure out how to delete the damned thing, because none of the standard tools will even recognize that it exists. See also: https://bugs.python.org/issue22299)
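To make "half a character" concrete, here is a rough sketch of what a lone-surrogate check over raw UTF-16LE data could look like. The function name lone_surrogate_offsets is made up for illustration; this is not what any Windows API does, just the pairing rule written out in Python.

```python
import struct


def lone_surrogate_offsets(utf16le: bytes) -> list:
    """Return the indexes (in 16-bit code units) of unpaired surrogates.

    Hypothetical helper for illustration only.
    """
    if len(utf16le) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Unpack the bytes as little-endian unsigned 16-bit code units.
    units = struct.unpack(f"<{len(utf16le) // 2}H", utf16le)
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no high surrogate in front of it.
            lone.append(i)
        i += 1
    return lone
```

A well-formed string such as "a\U0001F600b".encode("utf-16le") yields an empty list, while b"\x00\xd8" (a bare U+D800) is reported at index 0.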
Posted Jan 14, 2020 19:39 UTC (Tue)
by foom (subscriber, #14868)
[Link] (2 responses)
So, yes, you can perfectly well put lone halves of a surrogate pair in Windows Unicode strings and filesystems.
Posted Jan 14, 2020 21:46 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
> Standalone surrogate code points have either a high surrogate without an adjacent low surrogate, or vice versa. These code points are invalid and are not supported. Their behavior is undefined.
Posted Jan 14, 2020 22:45 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 14, 2020 19:47 UTC (Tue)
by foom (subscriber, #14868)
[Link] (2 responses)
However, you do need a UTF-16 decoder/encoder variant that allows lone surrogates to be decoded and encoded without throwing an error. Python has the "surrogatepass" error handler for that. E.g.:
b'\x00\xD8'.decode('utf-16le', errors='surrogatepass').encode('utf-16le', errors='surrogatepass')
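Spelled out a little more, as a small sketch (the variable names are just for illustration): the default strict handler rejects the lone surrogate in both directions, while "surrogatepass" round-trips it.

```python
raw = b"\x00\xd8"  # U+D800 (a lone high surrogate) as one UTF-16LE code unit

# The default "strict" handler refuses to decode a lone surrogate.
try:
    raw.decode("utf-16le")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# "surrogatepass" lets the lone surrogate through as a str code point...
text = raw.decode("utf-16le", errors="surrogatepass")

# ...which the strict encoder would also refuse.
try:
    text.encode("utf-16le")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# With "surrogatepass" the original bytes come back unchanged.
round_tripped = text.encode("utf-16le", errors="surrogatepass")
```

Here text is the one-character string "\ud800", and round_tripped equals raw.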
Posted Jan 14, 2020 21:55 UTC (Tue)
by roc (subscriber, #30627)
[Link] (1 responses)
But do APIs like os.listdir() do that automatically on Windows like they do on Linux?
Posted Jan 15, 2020 4:18 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
> On Windows, if we get a (Unicode) string we extract the wchar_t * and return it; if we get bytes we decode to wchar_t * and return that.
I believe this means that the Windows version of that module will never try to encode Unicode strings into UTF-16LE (or any other encoding), meaning it won't "catch" invalid surrogates. They should pass straight through to the Windows *W APIs.
This is also supported by the os.path docs, which say the following:
> Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.
Since they only call out bytes objects, I think the implication is that str objects are not a problem on Windows. But the fact that it makes no mention of os.fsencode() and os.fsdecode(), nor the surrogateescape handler, makes me suspicious of whether this documentation is still up-to-date.
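For reference, the surrogateescape handler mentioned above is what makes arbitrary Unix file names survive a trip through str. On POSIX, os.fsdecode() and os.fsencode() use it with the filesystem encoding; the sketch below calls the codec directly with UTF-8 so that it behaves the same on any platform. The file name is made up for illustration.

```python
# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding fails on the undecodable byte.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone
# low surrogate U+DCNN instead of raising an error.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = name.encode("utf-8", errors="surrogateescape")
```

Here name is "caf\udce9.txt" and back equals raw_name. Note that this is a different handler from surrogatepass: surrogateescape smuggles undecodable bytes through str, whereas surrogatepass lets lone surrogates through the codec as encoded code points.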