Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 22:16 UTC (Mon) by excors (subscriber, #95769)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by vstinner
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

> Well, I wanted to use Unicode for filenames, they didn't want to hear this idea.

The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode. (There's PEP 383 but that produces strings that are not quite Unicode, e.g. it becomes impossible to encode them as UTF-16, so that's not good). And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode. For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.

(The article also mentions the solution, as implemented in Rust: filenames are a platform-specific string type, with lossy conversions to Unicode if you really want that (e.g. to display to users).)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:19 UTC (Mon) by vstinner (subscriber, #42675)

> The article mentions that issue: POSIX filenames are arbitrary byte strings. There is simply no good lossless way to decode them to Unicode.

On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.
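A minimal sketch of that round trip, using the codec directly rather than any particular filesystem call. The sample byte string is made up for illustration:

```python
# Hypothetical filename containing a byte (0xCC) that is not valid UTF-8 here.
raw = b"report-\xcc.txt"

# Decoding with surrogateescape maps each undecodable byte 0xNN to U+DCNN.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name))
print(restored == raw)
```

The round trip only holds if both directions use the same error handler.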

Read my articles https://vstinner.github.io/python30-listdir-undecodable-f... and https://vstinner.github.io/pep-383.html for the history of Unicode usage for filenames in the early days of Python 3 (Python 3.0 and Python 3.1).

The problem is that the UTF-8 codec of Python 2 doesn't respect the Unicode standard: it does encode surrogate characters. The Python 3 codec doesn't encode them, which makes it possible to use the surrogateescape error handler with UTF-8.

> And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.

I'm not sure of which problem you're talking about.

If you care about getting the same character on Windows and Linux (e.g. the letter é = U+00E9), you have to encode the filename differently on each platform. Storing the filename as Unicode in the application is a convenient way to do that. That's why Python prefers Unicode for filenames. But it also accepts filenames as bytes.

> For an application whose job is to manage user-created files, it's not safe to make assumptions about filenames; it has to be robust to whatever the user throws at it.

Well, it is where I gave up :-)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

> I'm not sure of which problem you're talking about.
A VCS must be able to round-trip files on the same FS. Even if they are not encoded correctly.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:37 UTC (Mon) by roc (subscriber, #30627)

It sounds to me like on Windows you can round-trip arbitrary native filenames through Python "Unicode" strings because in both systems the strings are simply a list of 16-bit code units (which are normally interpreted as UTF-16 but may not be valid UTF-16). So maybe that 'surrogateescape' hack is enough. (But only because Python3 Unicode strings don't have to be valid Unicode after all.)

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:22 UTC (Tue) by excors (subscriber, #95769)

Python strings aren't 16-bit code units. b'\xf0\x92\x8d\x85'.decode('utf-8') is '\U00012345' with length 1, which is sensible.

You can't create a string like '\U00123456' (SyntaxError) or chr(0x123456) (ValueError); it's limited to the 21-bit range. But you *can* create a string like '\udccc' and Python will happily process it, at least until you try to encode it. '\udccc'.encode('utf-8') throws UnicodeEncodeError.

If you use the special decoding mode, b'\xcc'.decode('utf-8', 'surrogateescape') gives '\udccc'. If you (or some library) do that, now your application is tainted with not-really-Unicode strings, and I think if you ever try to encode without surrogateescape then you'll risk getting an exception.
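One way to notice such tainted strings before they reach a strict encoder is to scan for surrogate code points. This is only a sketch; has_surrogates is a hypothetical helper name, not something from the standard library:

```python
def has_surrogates(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF.

    A Python str stores non-BMP characters as single code points, so any
    surrogate found here is unpaired and will fail a strict UTF-8 encode.
    """
    return any("\ud800" <= ch <= "\udfff" for ch in text)


clean = b"\xf0\x92\x8d\x85".decode("utf-8")            # '\U00012345'
tainted = b"\xcc".decode("utf-8", "surrogateescape")   # '\udccc'

print(has_surrogates(clean))
print(has_surrogates(tainted))
```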

If you tried to decode Windows filenames as round-trippable UCS-2, like

>>> import struct
>>> ''.join(chr(c) for c, in struct.iter_unpack(b'>H', b'\xd8\x08\xdf\x45'))
'\ud808\udf45'

then you'd be introducing a third type of string (after Unicode and Unicode-plus-surrogate-escapes) which seems likely to make things even worse.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:44 UTC (Tue) by excors (subscriber, #95769)

> I think if you ever try to encode without surrogateescape then you'll risk getting an exception

Incidentally, that seems to include the default encoding performed by print() (at least in Python 3.6 on my system):

>>> import os
>>> for f in os.listdir('.'): print(f)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 4: surrogates not allowed

os.listdir() will surrogateescape-decode and functions like open() will surrogateescape-encode the filenames, but that doesn't help if you've got e.g. logging code that touches the filenames too.
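For the logging case, one possible workaround is to convert the name to a display-only form before printing it. A rough sketch, where printable_name is a hypothetical helper and the result is lossy (it must not be passed back to open()):

```python
def printable_name(name: str) -> str:
    """Return a copy of name that is safe to encode as strict UTF-8.

    Lone surrogates are rewritten as literal backslash escapes such as
    '\\udccc'; every other character is left unchanged.
    """
    return name.encode("utf-8", "backslashreplace").decode("utf-8")


# Hypothetical name, as os.listdir() might return for an undecodable byte.
escaped = "file\udccc"

print(printable_name(escaped))
```

The original str still has to be kept around for any actual filesystem call.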

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 4:47 UTC (Tue) by roc (subscriber, #30627)

Thanks for clearing that up.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:08 UTC (Thu) by marcH (subscriber, #57642)

> A VCS must be able to round-trip files on the same FS

Yet all VCS provide some sort of auto.crlf insanity, go figure.

Just in case someone wants to use Notepad-- from the last decade.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 13, 2020 23:32 UTC (Mon) by roc (subscriber, #30627)

Huh, so Python3 "Unicode" strings aren't even necessarily valid Unicode :-(.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:40 UTC (Thu) by kjp (guest, #39639)

And the zen of python was forgotten long ago. Explicit is better than implicit? Errors should not pass silently? Nah. Let's just add math operators to dictionaries. Python has no direction, no stewardship, and I think it's been taken over by windows and perl folks.

Python: It's a [unstable] scripting language. NOT a systems or application language.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 1:37 UTC (Tue) by excors (subscriber, #95769)

> On Python 3, there is a good practical solution for that: Python uses surrogateescape error handler (PEP 383) by default for filenames. It escapes undecodable bytes as Unicode surrogate characters.

But then you end up with a "Unicode" string in memory which can't be safely encoded as UTF-8 or UTF-16, so it's not really a Unicode string at all. (As far as I can see, the specifications are very clear that UTF-* can't encode U+D800..U+DFFF. An implementation that does encode/decode them is wrong or is not Unicode.)

That means Python applications that assume 'str' is Unicode are liable to get random exceptions when encoding properly (i.e. without surrogateescape).

> > And Windows filenames are arbitrary uint16_t strings, with no good lossless way to decode them to Unicode.
>
> I'm not sure of which problem you're talking about.

Windows (with NTFS) lets you create a file whose name is e.g. "\ud800". The APIs all handle filenames as strings of wchar_t (equivalent to uint16_t), so they're perfectly happy with that file. But it's clearly not a valid string of UTF-16 code units (because it would be an unpaired surrogate) so it can't be decoded, and it's not a valid string of Unicode scalar values so it can't be directly encoded as UTF-8 or UTF-16. It's simply not Unicode.
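To illustrate at the codec level only (this says nothing about how any particular API handles such names), here is a sketch with a made-up name containing an unpaired surrogate. The strict UTF-16 codec refuses it, while the 'surrogatepass' handler passes the 16-bit unit through unchanged:

```python
# Hypothetical name with an unpaired high surrogate in the middle.
name = "a\ud800b"

# Strict UTF-16 encoding refuses the lone surrogate.
try:
    name.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' emits the surrogate as a raw 16-bit code unit instead.
units = name.encode("utf-16-le", "surrogatepass")
back = units.decode("utf-16-le", "surrogatepass")

print(strict_ok)
print(units)
print(back == name)
```

The bytes produced this way are a sequence of uint16_t values, but they are not well-formed UTF-16.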

In practice most native Windows applications and APIs treat filenames as effectively UCS-2, and they never try to encode or decode so they don't care about surrogates, though the text rendering APIs try to decode as UTF-16 and go a bit weird if that fails. Python strings aren't UCS-2 so it has to convert to 'str' somehow, but there's no correct way to do that conversion.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 6:04 UTC (Tue) by ssmith32 (subscriber, #72404)

Microsoft refers to it as an "extended character set":

https://docs.microsoft.com/en-us/windows/win32/fileio/nam...

Also, whatever your complaints are about whatever language, with respect to filenames, the win32 api is worse.

It's amazingly inconsistent. The level of insanity is just astonishing, especially if you're going across files created with the win api, and the .net libs.

You *have* to p/invoke to read some files, and use the long filepath prefix, which doesn't support relative paths. And that's just the start.

Admittedly, I haven't touched it for almost a decade in any serious fashion, but, based on the docs linked above, it doesn't seem much has changed.

It's remarkable how easy they make it to write files that are quite hard to open..

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

> It's amazingly inconsistent. The level of insanity is just astonishing
Just wait until you see POSIX!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 0:26 UTC (Wed) by gracinet (guest, #89400)

Hey Victor,

don't forget that Mercurial has to cope with filenames in its history that are 25 years old. Yes, that predates Mercurial but some of the older repos have had a long history as CVS then SVN.

Factor in the very strong stability requirements and the fact that any risk of changing a hash value is to be avoided, and it's no wonder a VCS is one of the last to take the plunge. It's really not a matter of the size of the codebase in this case.

Note: I wasn't directly involved in Mercurial at the time you were engaging with the project about that, I hope some good came out of it anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 2:18 UTC (Tue) by flussence (guest, #85566)

This was a sore point in Perl 6 too for many years due to its over-eagerness to destructively normalise everything on read. It fixed it eventually by adding a special encoding, similar to how Java has Modified UTF-8. It's not perfect, but without mandating a charset and normalisation at the filesystem level (something only Apple's dared to do) nothing is.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 7:57 UTC (Tue) by epa (subscriber, #39769)

How many Mercurial users store non-Unicode file names in a repository? Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames, their port to Python 3 would have gone much smoother.

If you do want a truly arbitrary ‘bag of bytes’ not just for file contents but for names too, I have the feeling you’d probably be using a different tool anyway.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:39 UTC (Tue) by mathstuf (subscriber, #69389)

> Perhaps if the Mercurial developers had declared that from now on hg requires Unicode-clean filenames

Losing the ability to read history of when the tool did not have such a restriction would not be a good thing. Losing the ability to manipulate those files (even to rename them to something valid) would also be tricky if it failed up front about bad filenames.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 18:58 UTC (Wed) by hkario (subscriber, #94864)

it's easy to end up with malformed names in file system

just unzip a file from non-UTF-8 system, you're almost guaranteed to get mojibake as a result; then blindly commit files to the VCS and bam, you're set

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:35 UTC (Tue) by dvdeug (guest, #10998)

Which means there's no way to reliably share a Mercurial repository between Windows and Unix. You can either accept all filenames or make repositories portable between Windows and Unix, not both. Note that even pretending that you can support both systems ignores the whole "arbitrary byte strings" and "arbitrary uint16_t strings" issues. I'd certainly feel comfortable with Mercurial and other tools rejecting junk file names, though I can see where people with old 8-bit charset filenames in their history could have problems.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 11:59 UTC (Tue) by roc (subscriber, #30627)

> You can either accept all filenames or make repositories portable between Windows and Unix, not both.

You can accept all filenames and make repositories portable between Windows and Unix if they have valid Unicode filenames. AFAIK that's what Mercurial does, and I hope it's what git does.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 12:33 UTC (Tue) by dezgeg (subscriber, #92243)

Not quite enough... Let's not forget the portability troubles of Mac, where the filesystem does Unicode (de)normalization behind the application's back.
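A small sketch of why normalization matters for name comparison. It uses only the standard unicodedata module, makes no claim about which form any given filesystem applies, and same_name is a hypothetical helper:

```python
import unicodedata

composed = "\u00e9"                                   # é as one code point
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + U+0301


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # raw comparison sees two names
print(len(composed), len(decomposed))
print(same_name(composed, decomposed))   # normalized comparison sees one
```

A tool that stores the name it was given and later reads back the other form will see what looks like a rename.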

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 13:21 UTC (Tue) by roc (subscriber, #30627)

OK sure. The point is: you can preserve native filenames, and also ensure that repos are portable to any OS/filesystem that can represent the repo filenames correctly. That's what I want any VCS to do.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:51 UTC (Tue) by Wol (subscriber, #4433)

Do what git does with line endings, maybe?

They had a load of grief with mixed Windows/linux repos, so there's now a switch that says "convert cr/lf on checkout/checkin".

Add a switch that says "enforce valid utf-8/utf-16/Apple filenames, and sort out the mess at checkout/checkin".

If that's off by default, or on by default for new repos, or whatever, then at least NEW stuff will be sane, even if older stuff isn't.

Cheers,
Wol

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 15:42 UTC (Tue) by mathstuf (subscriber, #69389)

There are also the invalid path components on Windows. Other than the reserved names and characters, space and `.` are not allowed to be the end of a path component. All the gritty details: https://docs.microsoft.com/en-us/windows/win32/fileio/nam...