Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 13, 2020 22:16 UTC (Mon) by excors (subscriber, #95769) In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by vstinner
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3
The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode. (There's PEP 383 but that produces strings that are not quite Unicode, e.g. it becomes impossible to encode them as UTF-16, so that's not good). And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode. For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.
(The article also mentions the solution, as implemented in Rust: filenames are a platform-specific string type, with lossy conversions to Unicode if you really want that (e.g. to display to users).)
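A rough Python analogue of that design, as a sketch only (this is not how Rust or Mercurial implement it): keep the name as the raw bytes the OS returned and convert to text only for display. `RawName` and `display` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawName:
    """Hypothetical wrapper: a POSIX filename kept as the exact bytes from the OS."""

    raw: bytes

    def display(self) -> str:
        # Lossy on purpose: undecodable bytes become U+FFFD, so the result is
        # always valid Unicode but cannot be turned back into the original name.
        return self.raw.decode("utf-8", errors="replace")


name = RawName(b"caf\xe9.txt")  # Latin-1 bytes, not valid UTF-8
print(ascii(name.display()))
```

All file operations would use `name.raw`; only the user interface would ever see `display()`.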
Posted Jan 13, 2020 23:19 UTC (Mon)
by vstinner (subscriber, #42675)
[Link] (12 responses)
On Python 3, there is a good practical solution for that: Python uses the surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.
Read my articles https://vstinner.github.io/python30-listdir-undecodable-f... and https://vstinner.github.io/pep-383.html for the history of Unicode usage for filenames in the early days of Python 3 (Python 3.0 and Python 3.1).
The problem is that the UTF-8 codec of Python 2 doesn't respect the Unicode standard: it does encode surrogate characters. The Python 3 codec doesn't encode them, which makes it possible to use the surrogateescape error handler with UTF-8.
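A minimal sketch of that round trip. It names the codec and error handler explicitly instead of relying on the platform's filesystem encoding; the sample filename is made up.

```python
raw = b"report-\xcc.txt"  # 0xCC is not valid UTF-8 in this position

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```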
> And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
I'm not sure of which problem you're talking about.
If you care about getting the same character on Windows and Linux (e.g. the letter é = U+00E9), you should encode the filename differently. Storing the filename as Unicode in the application is a convenient way to do that. That's why Python prefers Unicode for filenames. But it also accepts filenames as bytes.
> For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.
Well, that is where I gave up :-)
Posted Jan 13, 2020 23:29 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Posted Jan 13, 2020 23:37 UTC (Mon)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Jan 14, 2020 2:22 UTC (Tue)
by excors (subscriber, #95769)
[Link] (2 responses)
You can't create a string like '\U00123456' (SyntaxError) or chr(0x123456) (ValueError); it's limited to the 21-bit range. But you *can* create a string like '\udccc' and Python will happily process it, at least until you try to encode it. '\udccc'.encode('utf-8') throws UnicodeEncodeError.
If you use the special decoding mode, b'\xcc'.decode('utf-8', 'surrogateescape') gives '\udccc'. If you (or some library) do that, your application is now tainted with not-really-Unicode strings, and I think that if you ever try to encode without surrogateescape you'll risk getting an exception.
If you tried to decode Windows filenames as round-trippable UCS-2, like
>>> import struct
>>> ''.join(chr(c) for c, in struct.iter_unpack(b'>H', b'\xd8\x08\xdf\x45'))
then you'd be introducing a third type of string (after Unicode and Unicode-plus-surrogate-escapes) which seems likely to make things even worse.
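One way to see where such "tainted" strings have spread is to test for them explicitly. This is only a sketch; `has_lone_surrogates` is a hypothetical helper, not a standard-library function.

```python
def has_lone_surrogates(s: str) -> bool:
    """Return True if s contains any code point in U+D800..U+DFFF."""
    # In a Python 3 str, a character outside the BMP is one code point,
    # so any surrogate found here is unpaired by definition.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


escaped = b"\xcc".decode("utf-8", errors="surrogateescape")
print(has_lone_surrogates(escaped))        # surrogate-escaped byte
print(has_lone_surrogates("\U00012345"))   # an ordinary astral character

try:
    escaped.encode("utf-8")  # strict encoding, no surrogateescape
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```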
Posted Jan 14, 2020 2:44 UTC (Tue)
by excors (subscriber, #95769)
[Link]
Incidentally, that seems to include the default encoding performed by print() (at least in Python 3.6 on my system):
>>> import os
>>> for f in os.listdir('.'): print(f)
os.listdir() will surrogateescape-decode and functions like open() will surrogateescape-encode the filenames, but that doesn't help if you've got e.g. logging code that touches the filenames too.
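For logging or printing, one possible workaround is to make the string safe before it reaches a strict encoder. The sketch below uses the standard `backslashreplace` error handler; `loggable` is a hypothetical helper name, and the result is for display only because it no longer maps back to the original bytes.

```python
def loggable(name: str) -> str:
    """Return a copy of name that strict UTF-8 encoding can handle."""
    # backslashreplace turns each lone surrogate into a visible escape
    # such as \udccc instead of raising UnicodeEncodeError.
    return name.encode("utf-8", errors="backslashreplace").decode("utf-8")


name = b"log-\xcc".decode("utf-8", errors="surrogateescape")
print(loggable(name))
```

The original `name` should still be the one passed to open() and similar functions.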
Posted Jan 14, 2020 4:47 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jan 16, 2020 8:08 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Yet all VCS provide some sort of auto.crlf insanity, go figure.
Just in case someone wants to use Notepad-- from the last decade.
Posted Jan 13, 2020 23:32 UTC (Mon)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jan 16, 2020 17:40 UTC (Thu)
by kjp (guest, #39639)
[Link]
Python: It's a [unstable] scripting language. NOT a systems or application language.
Posted Jan 14, 2020 1:37 UTC (Tue)
by excors (subscriber, #95769)
[Link] (2 responses)
But then you end up with a "Unicode" string in memory which can't be safely encoded as UTF-8 or UTF-16, so it's not really a Unicode string at all. (As far as I can see, the specifications are very clear that UTF-* can't encode U+D800..U+DFFF. An implementation that does encode/decode them is wrong or is not Unicode.)
That means Python applications that assume 'str' is Unicode are liable to get random exceptions when encoding properly (i.e. without surrogateescape).
> > And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
Windows (with NTFS) lets you create a file whose name is e.g. "\ud800". The APIs all handle filenames as strings of wchar_t (equivalent to uint16_t), so they're perfectly happy with that file. But it's clearly not a valid string of UTF-16 code units (because it would be an unpaired surrogate) so it can't be decoded, and it's not a valid string of Unicode scalar values so it can't be directly encoded as UTF-8 or UTF-16. It's simply not Unicode.
In practice most native Windows applications and APIs treat filenames as effectively UCS-2, and they never try to encode or decode so they don't care about surrogates, though the text rendering APIs try to decode as UTF-16 and go a bit weird if that fails. Python strings aren't UCS-2 so it has to convert to 'str' somehow, but there's no correct way to do that conversion.
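The same problem can be sketched in Python without touching a real filesystem. The byte string below stands in for the 16-bit units Windows would hand back for a hypothetical file named "\ud800.txt".

```python
# 0xD800 as one little-endian 16-bit unit, followed by ".txt"
units = b"\x00\xd8" + ".txt".encode("utf-16-le")

try:
    units.decode("utf-16-le")  # strict UTF-16 rejects the unpaired surrogate
except UnicodeDecodeError as exc:
    print("strict UTF-16 decode failed:", exc.reason)

# surrogatepass lets the lone surrogate through, which yields a str that
# once again cannot be encoded strictly.
name = units.decode("utf-16-le", errors="surrogatepass")
assert name.encode("utf-16-le", errors="surrogatepass") == units
print(ascii(name))
```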
Posted Jan 14, 2020 6:04 UTC (Tue)
by ssmith32 (subscriber, #72404)
[Link] (1 responses)
https://docs.microsoft.com/en-us/windows/win32/fileio/nam...
Also, whatever your complaints are about whatever language, with respect to filenames, the win32 api is worse.
It's amazingly inconsistent. The level of insanity is just astonishing, especially if you're going across files created with the win api, and the .net libs.
You *have* to P/Invoke to read some files, and use the long file path prefix, which doesn't support relative paths. And that's just the start.
Admittedly, I haven't touched it for almost a decade in any serious fashion, but, based on the docs linked above, it doesn't seem much has changed.
It's remarkable how easy they make it to write files that are quite hard to open..
Posted Jan 14, 2020 15:35 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 15, 2020 0:26 UTC (Wed)
by gracinet (guest, #89400)
[Link]
Don't forget that Mercurial has to cope with filenames in its history that are 25 years old. Yes, that predates Mercurial, but some of the older repos have had a long history as CVS and then SVN.
Factor in the very strong stability requirements and the need to avoid any risk of changing a hash value, and it's no wonder a VCS is one of the last to take the plunge. It's really not a matter of the size of the codebase in this case.
Note: I wasn't directly involved in Mercurial at the time you were engaging with the project about that, I hope some good came out of it anyway.
Posted Jan 14, 2020 2:18 UTC (Tue)
by flussence (guest, #85566)
[Link]
Posted Jan 14, 2020 7:57 UTC (Tue)
by epa (subscriber, #39769)
[Link] (2 responses)
If you do want a truly arbitrary ‘bag of bytes’ not just for file contents but for names too, I have the feeling you’d probably be using a different tool anyway.
Posted Jan 14, 2020 15:39 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
Losing the ability to read history of when the tool did not have such a restriction would not be a good thing. Losing the ability to manipulate those files (even to rename them to something valid) would also be tricky if it failed up front about bad filenames.
Posted Jan 15, 2020 18:58 UTC (Wed)
by hkario (subscriber, #94864)
[Link]
Just unzip a file from a non-UTF-8 system and you're almost guaranteed to get mojibake as a result; then blindly commit the files to the VCS and bam, you're set.
Posted Jan 14, 2020 11:35 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (5 responses)
Posted Jan 14, 2020 11:59 UTC (Tue)
by roc (subscriber, #30627)
[Link] (4 responses)
You can accept all filenames and make repositories portable between Windows and Unix if they have valid Unicode filenames. AFAIK that's what Mercurial does, and I hope it's what git does.
Posted Jan 14, 2020 12:33 UTC (Tue)
by dezgeg (subscriber, #92243)
[Link] (3 responses)
Posted Jan 14, 2020 13:21 UTC (Tue)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jan 14, 2020 15:51 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
They had a load of grief with mixed Windows/Linux repos, so there's now a switch that says "convert cr/lf on checkout/checkin".
Add a switch that says "enforce valid utf-8/utf-16/Apple filenames, and sort out the mess at checkout/checkin".
If that's off by default, or on by default for new repos, or whatever, then at least NEW stuff will be sane, even if older stuff isn't.
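A sketch of what such a switch might check at checkin. `check_names` and its `enforce` flag are hypothetical; no existing VCS option is implied, and only the UTF-8 case is covered here.

```python
def check_names(names, enforce=True):
    """Hypothetical checkin-time check: return the names that are not valid UTF-8."""
    if not enforce:
        return []
    bad = []
    for raw in names:
        try:
            raw.decode("utf-8")  # strict: rejects stray bytes and encoded surrogates
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


print(check_names([b"ok.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt"]))
```

Old history would stay readable because the check only runs on new names, and only when the switch is on.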
Cheers,
Posted Jan 14, 2020 15:42 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
A VCS must be able to round-trip files on the same FS. Even if they are not encoded correctly.
'\ud808\udf45'
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 4: surrogates not allowed
>
> I'm not sure of which problem you're talking about.
Just wait until you see POSIX!
Wol