Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:17 UTC (Thu) by roc (subscriber, #30627)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by togga
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

To be fair, that's an unsolvable problem, because there's no way to know how any given program will unescape file names.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:17 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (6 responses)

It’s only an unsolvable problem if you let your code write malformed filenames.

Defining standard ways to process filenames (text) is the whole point of the unicode standard. Remove standard compliance, and you remove the ability to safely process the result.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:51 UTC (Thu) by roc (subscriber, #30627) [Link] (5 responses)

It would be great if we could impose a requirement that all filenames are valid Unicode, but unfortunately that cat left the building a long time ago. Operating systems don't enforce that, and non-Unicode filenames exist.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:46 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (4 responses)

So what? Operating systems will fail enforcing against all kinds of brokeness and malware, that’s no reason to write broken files or malware.

Own up to the things your code does, do not hide behind lack of OS enforcement.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:46 UTC (Thu) by smurf (subscriber, #17840) [Link]

The assumption is that some people created source archives with Latin1-or-worse-encoded file names. They're embedded in Mercurial archives, and you need to be able to reproduce them exactly when checking them out. Replacing the file name with its UTF-8 equivalent is not an option if you want to 1:1 reproduce these files.

That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 14:54 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools. Imagine malware hiding behind a filename of `\xff` on your filesystem. Should my Python tools be blind to it or accept the reality that, in general, filenames suck and the status quo at least needs support (though not necessarily be encouraged).

System behaviors are dictated by the platform. Any tools doing an ostrich impression related to "broken" or "malformed" filenames loses a lot of usability in system recovery and introspection.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:17 UTC (Thu) by anselm (subscriber, #2796) [Link] (1 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools.

This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.

In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:32 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Well, surrogateescape and the other modes should really only happen at the display boundary. Internally, storing paths as a bag of bytes is the way it should be done (and 16bit units on Windows). It's only at the display side that things need munged for safety. Of course, sometimes you have the display (stdout) as the communication medium and an escaping strategy needs to exist there. Flags such as `-print0` and the like bypass that, but not everything likes to communicate with that.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:40 UTC (Sun) by togga (subscriber, #53103) [Link] (2 responses)

Yet still Python2 solved this by treating everything as a bag of bytes. I didn't even know this was a problem until I tried Python3 for the first time. I'd say that the problem itself is imposed to users by Python3. This, and many more similar problems is why Python3 became such a pain.

Like Python2 (at least previous versions) let developers choose which transforms are valid to them and provide them in the included batteries.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 11:28 UTC (Sun) by smurf (subscriber, #17840) [Link] (1 responses)

Well, you obviously never added some bytes to a string in some of your code (or worse had a library do it) which then splatted you with Mojibake in some other – almost, but not quite, entirely unrelated – procedure.

Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.

It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 18:36 UTC (Sun) by anselm (subscriber, #2796) [Link]

We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.

The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.