Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 9:01 UTC (Wed) by roc (subscriber, #30627)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by NYKevin
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

You probably don't want *every* filename to be quoted, you'd need some wrapper that quotes only "irregular" filenames.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 22:25 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (15 responses)

So use one of the other options from https://docs.python.org/3.8/library/codecs.html#error-han...

str.encode(..., errors='replace') # Replaces bad chars with U+FFFD.
str.encode(..., errors='ignore') # Silently drops bad chars.
str.encode(..., errors='xmlcharrefreplace') # Replaces with XML &-encoding
str.encode(..., errors='backslashreplace') # Replaces with \u.... syntax.
Or write your own and hook it into the standard system with https://docs.python.org/3.8/library/codecs.html#codecs.re...

Python is, after all, a "batteries included" language. This is a solved problem.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 15, 2020 23:44 UTC (Wed) by togga (subscriber, #53103) [Link] (13 responses)

Of which none gives back the original pathname when passed to another program that tries to find the file?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:17 UTC (Thu) by roc (subscriber, #30627) [Link] (10 responses)

To be fair, that's an unsolvable problem, because there's no way to know how any given program will unescape file names.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 9:17 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (6 responses)

It’s only an unsolvable problem if you let your code write malformed filenames.

Defining standard ways to process filenames (text) is the whole point of the unicode standard. Remove standard compliance, and you remove the ability to safely process the result.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:51 UTC (Thu) by roc (subscriber, #30627) [Link] (5 responses)

It would be great if we could impose a requirement that all filenames are valid Unicode, but unfortunately that cat left the building a long time ago. Operating systems don't enforce that, and non-Unicode filenames exist.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:46 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (4 responses)

So what? Operating systems will fail enforcing against all kinds of brokeness and malware, that’s no reason to write broken files or malware.

Own up to the things your code does, do not hide behind lack of OS enforcement.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:46 UTC (Thu) by smurf (subscriber, #17840) [Link]

The assumption is that some people created source archives with Latin1-or-worse-encoded file names. They're embedded in Mercurial archives, and you need to be able to reproduce them exactly when checking them out. Replacing the file name with its UTF-8 equivalent is not an option if you want to 1:1 reproduce these files.

That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 14:54 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools. Imagine malware hiding behind a filename of `\xff` on your filesystem. Should my Python tools be blind to it or accept the reality that, in general, filenames suck and the status quo at least needs support (though not necessarily be encouraged).

System behaviors are dictated by the platform. Any tools doing an ostrich impression related to "broken" or "malformed" filenames loses a lot of usability in system recovery and introspection.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:17 UTC (Thu) by anselm (subscriber, #2796) [Link] (1 responses)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools.

This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.

In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 15:32 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Well, surrogateescape and the other modes should really only happen at the display boundary. Internally, storing paths as a bag of bytes is the way it should be done (and 16bit units on Windows). It's only at the display side that things need munged for safety. Of course, sometimes you have the display (stdout) as the communication medium and an escaping strategy needs to exist there. Flags such as `-print0` and the like bypass that, but not everything likes to communicate with that.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:40 UTC (Sun) by togga (subscriber, #53103) [Link] (2 responses)

Yet still Python2 solved this by treating everything as a bag of bytes. I didn't even know this was a problem until I tried Python3 for the first time. I'd say that the problem itself is imposed to users by Python3. This, and many more similar problems is why Python3 became such a pain.

Like Python2 (at least previous versions) let developers choose which transforms are valid to them and provide them in the included batteries.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 11:28 UTC (Sun) by smurf (subscriber, #17840) [Link] (1 responses)

Well, you obviously never added some bytes to a string in some of your code (or worse had a library do it) which then splatted you with Mojibake in some other – almost, but not quite, entirely unrelated – procedure.

Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.

It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 18:36 UTC (Sun) by anselm (subscriber, #2796) [Link]

We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.

The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 6:35 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

That is not a meaningful question, because encode() returns an object of a different type (str -> bytes), so you can't use the result with the same API that you started with, unless it's a polymorphic API.* You have to call str.decode() to undo the operation, and I don't think there's a built-in way to automatically reverse those error handlers (other than surrogateescape which was specifically designed for round-tripping). But you don't need to do that in the first place, because both str and bytes are immutable classes, so you still have the original object. If you lose the original object, I don't think you have the right to complain that it "got lost." Just don't lose it, and then you don't need to worry about reversing the operation.

(Regardless, xmlcharrefreplace and backslashreplace are both *mostly* lossless, and in the appropriate context, can be fully lossless, so long as you escape all ampersands or backslashes respectively. If you are outputting XML or HTML, then of course you have to escape ampersands anyway, which is obviously what xmlcharrefreplace was intended for. Similarly, backslash replacement is not a very sensible thing to do, unless you are working in a context where backslashes are normally escaped.)

* For example, Python's filesystem API calls os.fsencode() and os.fsdecode() automatically to translate between the operating system's preferred type and whatever type the user passes, but you can still call these manually to pass the errors argument or if you decide you actually wanted the other type.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:51 UTC (Thu) by excors (subscriber, #95769) [Link]

If you are outputting filenames as XML, it looks like you can't safely use xmlcharrefreplace, because e.g. "\udccc".encode("utf-8", errors="xmlcharrefreplace") returns b'&#56524;' which is not well-formed XML and will cause the XML parser to reject the whole document. (Character references must refer to one of https://www.w3.org/TR/xml/#NT-Char). And in HTML the parser will convert &#56524; into U+FFFD, so it's not lossless. You'll need a different scheme to encode Python's filename strings as XML or HTML.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 1:16 UTC (Thu) by roc (subscriber, #30627) [Link]

Thanks, that's helpful.