Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 14, 2020 22:11 UTC (Tue) by roc (subscriber, #30627)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by dvdeug
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

> you should probably be escaping a weird filename before dumping it to a terminal or the like.

Who does that escaping, though? What API would you even use to do it? In practice, no-one's going to do it until some user hits a "weird"-but-valid-UTF8 filename, then they're going to hack in some escaping that relies on UTF8 encoding not failing, then maybe one day some user hits a non-Unicode filename and then they hack in more escaping for strings containing lone surrogates.

That's not good if you want your software to be reliable. I don't think it makes sense to expect a dynamically-typed language like Python to be good for writing reliable software, but Python's specific choices make it unnecessarily unreliable.

As an example of a better way to do things: in Rust, the Path type is not a string and does not implement Display; if you want to print one, you call path.display(), which does the necessary (possibly lossy) conversion. (It does not return a string, but does return something that implements Display, i.e. can be printed or converted to a string.)

Posted Jan 15, 2020 1:01 UTC (Wed) by roc (subscriber, #30627)

And of course the compiler guarantees that your program, once successfully compiled, will never try to print a Path directly.

Posted Jan 15, 2020 20:27 UTC (Wed) by nim-nim (subscriber, #34454)

That’s even more brain-damaged than the Python solution, since it makes sure Rust is able to create and propagate paths that cannot be displayed in an interoperable way.

*That* means any form of argument passing from Rust to other software will fail in strange and underwhelming ways.

Posted Jan 15, 2020 20:30 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Why would it fail?

Posted Jan 16, 2020 9:12 UTC (Thu) by nim-nim (subscriber, #34454)

Because the whole system relies on a Rust-specific display() method. That won’t be supported by or compatible with all the other apps out there that will encounter Rust filenames and expect them to work as normal filename arguments.

Posted Jan 16, 2020 10:28 UTC (Thu) by smurf (subscriber, #17840)

Well, when you print anything you're expected to use the current locale. If you can't because the string is not displayable, tough luck.

The idea that file names are somehow privileged to not require that went out of the window a long time ago. It doesn't matter one whit whether the code printing said file name is written in Python, Rust, Go, C++, Z80 assembly, or LOLCODE.

If you want a standard way to carry non-UTF8 pseudo-printable data (e.g. Latin1 filenames from the stone ages), no problem, either use the surrogateescape method or do proper shell quoting. The "write the non-UTF8 data" method is fine only when limited to streams that are known to be binary. "find -print0" comes to mind.
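A minimal sketch of the two options mentioned above, using a made-up file name (the byte string and variable names are assumptions for illustration). Decoding with surrogateescape keeps each undecodable byte as a lone surrogate, so re-encoding with the same handler restores the original bytes; shlex.quote() produces a shell-safe form of an ordinary string.

```python
import shlex

# Hypothetical Latin-1 file name: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9 menu.txt"

# surrogateescape maps each undecodable byte to a lone surrogate (U+DCE9 here).
name = raw.decode("utf-8", errors="surrogateescape")

# Re-encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# shlex.quote() wraps a string in single quotes when it contains characters
# the shell would treat specially (a space, in this made-up example).
quoted = shlex.quote("my file.txt")

print(repr(name), round_tripped == raw, quoted)
```

The decoded string is fine for internal use, but it still cannot be encoded as strict UTF-8, which is why it has to be escaped or written as bytes at the output boundary.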

Posted Jan 16, 2020 16:21 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

> Well, when you print anything you're expected to use the current locale. If you can't because the string is not displayable, tough luck.
And what if you need to write a transparent proxy that needs to cope with non-UTF-8 headers?

Posted Jan 16, 2020 17:43 UTC (Thu) by smurf (subscriber, #17840)

You ask the tool's author to please add errors="surrogateescape" to their .encode("utf-8")/.decode("utf-8") calls, either unconditionally or via some special mode. Or to transparently pass unencodeable headers as bytes, either …[ditto]. Or you submit a patch to do that yourself. Or you fork the code.

None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard? Yes, this could have been easier from the beginning, but that's why Python 3.8 is a whole lot better at this than 3.0.

Posted Jan 16, 2020 17:53 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

> None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard?
The reality (that stubborn thing that doesn't go away) has agents that don't obey the standard. So a transparent proxy must accommodate it.

Posted Jan 17, 2020 6:11 UTC (Fri) by smurf (subscriber, #17840)

I know that. But most of the world is not a transparent proxy.

Posted Jan 16, 2020 10:48 UTC (Thu) by roc (subscriber, #30627)

It just means you need to think when you transmit paths to other programs. You will not use Path::display(), instead you will do whatever those other programs require.

By far the most common way to pass a path to a program is on its command line. In Rust you pass a command-line argument by calling Command::arg(...) to add the argument to the command you're building, and Command::arg(...) accepts Paths. Each platform has a standard way to pass arbitrary filenames as command-line parameters, and Rust does what the platform requires.

A few programs accept arbitrary paths as strings on stdin; they need to define how those paths are encoded on stdin. On Linux the program could read null-terminated strings of bytes; then in Rust you would use std::os::unix::ffi::OsStrExt::as_bytes() to extract the raw bytes from a Path and write them to the pipe. That code wouldn't even compile on Windows, which makes sense because a null-terminated string of bytes is not a reasonable way to represent a Windows path. A Windows program accepting paths as strings on stdin needs to define a different encoding, e.g. null-terminated strings of WCHAR, in which case the Rust program would use std::os::windows::ffi::OsStrExt::encode_wide() to produce such strings.

Rust makes it about as easy as possible to reliably pass non-UTF8 strings to other programs.

Posted Jan 15, 2020 4:22 UTC (Wed) by NYKevin (subscriber, #129325)

> Who does that escaping, though? What API would you even use to do it?

repr()?

Posted Jan 15, 2020 9:01 UTC (Wed) by roc (subscriber, #30627)

You probably don't want *every* filename to be quoted, you'd need some wrapper that quotes only "irregular" filenames.
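A wrapper of that kind could look roughly like the sketch below. The function name quote_if_irregular is hypothetical, and treating "irregular" as "contains anything str.isprintable() rejects" is an assumption rather than an established convention; it covers control characters and the lone surrogates that surrogateescape produces.

```python
def quote_if_irregular(name: str) -> str:
    """Return name unchanged if it is safe to show, else its repr().

    Hypothetical helper: "irregular" here means the name contains a
    character that str.isprintable() rejects (control characters, lone
    surrogates from surrogateescape, and so on).
    """
    if name.isprintable():
        return name
    # repr() escapes the offending characters and adds quotes.
    return repr(name)


for candidate in ["plain.txt", "two\nlines", "caf\udce9.txt"]:
    print(quote_if_irregular(candidate))
```

Ordinary names pass through untouched, so only the unusual ones pick up quotes and backslash escapes.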

Posted Jan 15, 2020 22:25 UTC (Wed) by NYKevin (subscriber, #129325)

So use one of the other options from https://docs.python.org/3.8/library/codecs.html#error-han...

str.encode(..., errors='replace') # Replaces bad chars with '?' (U+FFFD is used when decoding).
str.encode(..., errors='ignore') # Silently drops bad chars.
str.encode(..., errors='xmlcharrefreplace') # Replaces with XML &#...; character references.
str.encode(..., errors='backslashreplace') # Replaces with \x.., \u.... or \U........ escapes.
Or write your own and hook it into the standard system with https://docs.python.org/3.8/library/codecs.html#codecs.re...

Python is, after all, a "batteries included" language. This is a solved problem.
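To make the differences concrete, here is a small sketch that runs each handler on a made-up name containing one lone surrogate (the kind of string surrogateescape decoding yields for a stray 0xE9 byte). The name and the handlers list are illustrative only.

```python
# A name containing a lone surrogate, as produced by surrogateescape
# decoding of a (made-up) Latin-1 file name.
name = "caf\udce9.txt"

handlers = [
    "replace",
    "ignore",
    "xmlcharrefreplace",
    "backslashreplace",
    "surrogateescape",
]

# Encode the same string with every handler and collect the results.
results = {h: name.encode("utf-8", errors=h) for h in handlers}

for handler, encoded in results.items():
    print(f"{handler:18} {encoded!r}")
```

Only surrogateescape gives back the original byte; the others produce something displayable at the cost of changing the name.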

Posted Jan 15, 2020 23:44 UTC (Wed) by togga (subscriber, #53103)

Of which none gives back the original pathname when passed to another program that tries to find the file?

Posted Jan 16, 2020 1:17 UTC (Thu) by roc (subscriber, #30627)

To be fair, that's an unsolvable problem, because there's no way to know how any given program will unescape file names.

Posted Jan 16, 2020 9:17 UTC (Thu) by nim-nim (subscriber, #34454)

It’s only an unsolvable problem if you let your code write malformed filenames.

Defining standard ways to process filenames (text) is the whole point of the unicode standard. Remove standard compliance, and you remove the ability to safely process the result.

Posted Jan 16, 2020 10:51 UTC (Thu) by roc (subscriber, #30627)

It would be great if we could impose a requirement that all filenames are valid Unicode, but unfortunately that cat left the building a long time ago. Operating systems don't enforce that, and non-Unicode filenames exist.

Posted Jan 16, 2020 12:46 UTC (Thu) by nim-nim (subscriber, #34454)

So what? Operating systems fail to guard against all kinds of brokenness and malware; that’s no reason to write broken files or malware.

Own up to the things your code does, do not hide behind lack of OS enforcement.

Posted Jan 16, 2020 13:46 UTC (Thu) by smurf (subscriber, #17840)

The assumption is that some people created source archives with Latin1-or-worse-encoded file names. They're embedded in Mercurial archives, and you need to be able to reproduce them exactly when checking them out. Replacing the file name with its UTF-8 equivalent is not an option if you want to 1:1 reproduce these files.

That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".

Posted Jan 16, 2020 14:54 UTC (Thu) by mathstuf (subscriber, #69389)

No, but if `ls` were written in Python (2 or 3), I wouldn't want it to be unable to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately make filenames invisible to common tools. Imagine malware hiding behind a filename of `\xff` on your filesystem. Should my Python tools be blind to it, or accept the reality that, in general, filenames suck and the status quo at least needs to be supported (though not necessarily encouraged)?

System behaviors are dictated by the platform. Any tool doing an ostrich impression related to "broken" or "malformed" filenames loses a lot of usability in system recovery and introspection.

Posted Jan 16, 2020 15:17 UTC (Thu) by anselm (subscriber, #2796)

> No, but if `ls` were written in Python (2 or 3), I wouldn't want it to not be able to list files that can be created by programs that do treat filenames as a bag of bytes and deliberately makes filenames invisible to common tools.

This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.

In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).

Posted Jan 16, 2020 15:32 UTC (Thu) by mathstuf (subscriber, #69389)

Well, surrogateescape and the other modes should really only happen at the display boundary. Internally, storing paths as a bag of bytes is the way it should be done (and as 16-bit units on Windows). It's only at the display side that things need to be munged for safety. Of course, sometimes you have the display (stdout) as the communication medium and an escaping strategy needs to exist there. Flags such as `-print0` and the like bypass that, but not everything likes to communicate that way.
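For the `-print0` style of communication, a Python sketch might look like the following. The helper name write_print0 is hypothetical, and the example writes to an in-memory buffer so it stays self-contained; a real tool would pass a binary stream such as sys.stdout.buffer instead.

```python
import io
import os


def write_print0(names, stream):
    """Write each name as raw bytes followed by a NUL, like find -print0."""
    for name in names:
        # os.fsencode() returns bytes unchanged and encodes str using the
        # filesystem encoding and its error handler.
        stream.write(os.fsencode(name) + b"\0")


# Made-up names, given as bytes so the result does not depend on the locale.
buf = io.BytesIO()
write_print0([b"caf\xe9.txt", b"plain.txt"], buf)
print(buf.getvalue())
```

Because NUL cannot appear in a POSIX file name, the reader can split on it without any escaping scheme.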

Posted Jan 19, 2020 10:40 UTC (Sun) by togga (subscriber, #53103)

Yet Python 2 still solved this by treating everything as a bag of bytes. I didn't even know this was a problem until I tried Python 3 for the first time. I'd say that the problem itself is imposed on users by Python 3. This, and many more similar problems, are why Python 3 became such a pain.

As Python 2 did (at least in earlier versions), let developers choose which transforms are valid for them, and provide those in the included batteries.

Posted Jan 19, 2020 11:28 UTC (Sun) by smurf (subscriber, #17840)

Well, you obviously never added some bytes to a string in some of your code (or worse had a library do it) which then splatted you with Mojibake in some other – almost, but not quite, entirely unrelated – procedure.

Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.

It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Posted Jan 19, 2020 18:36 UTC (Sun) by anselm (subscriber, #2796)

> We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.

Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.

The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.
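As a rough illustration of how far Path objects already go, the sketch below builds a path from a made-up name with a lone surrogate and asks for the bytes the operating system would see. It assumes a POSIX system whose filesystem encoding is UTF-8 (or ASCII) with the surrogateescape handler; PurePosixPath is used so that nothing touches the real filesystem.

```python
import os
from pathlib import PurePosixPath

# Made-up name with a lone surrogate standing in for the raw byte 0xE9.
p = PurePosixPath("archive") / "caf\udce9.txt"

# Path objects implement os.PathLike, so os.fsencode() accepts them directly
# and returns the byte form of the path.
raw = os.fsencode(p)

print(raw)
```

The string form of such a path still cannot be printed as strict UTF-8, which is the gap described above.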

Posted Jan 16, 2020 6:35 UTC (Thu) by NYKevin (subscriber, #129325)

That is not a meaningful question, because encode() returns an object of a different type (str -> bytes), so you can't use the result with the same API that you started with, unless it's a polymorphic API.* You have to call bytes.decode() to undo the operation, and I don't think there's a built-in way to automatically reverse those error handlers (other than surrogateescape, which was specifically designed for round-tripping). But you don't need to do that in the first place, because both str and bytes are immutable classes, so you still have the original object. If you lose the original object, I don't think you have the right to complain that it "got lost." Just don't lose it, and then you don't need to worry about reversing the operation.

(Regardless, xmlcharrefreplace and backslashreplace are both *mostly* lossless, and in the appropriate context, can be fully lossless, so long as you escape all ampersands or backslashes respectively. If you are outputting XML or HTML, then of course you have to escape ampersands anyway, which is obviously what xmlcharrefreplace was intended for. Similarly, backslash replacement is not a very sensible thing to do, unless you are working in a context where backslashes are normally escaped.)

* For example, Python's filesystem API calls os.fsencode() and os.fsdecode() automatically to translate between the operating system's preferred type and whatever type the user passes, but you can still call these manually if you decide you actually wanted the other type. (They take no errors argument; to choose a different error handler, call str.encode() or bytes.decode() directly.)

Posted Jan 16, 2020 12:51 UTC (Thu) by excors (subscriber, #95769)

If you are outputting filenames as XML, it looks like you can't safely use xmlcharrefreplace, because e.g. "\udccc".encode("utf-8", errors="xmlcharrefreplace") returns b'&#56524;' which is not well-formed XML and will cause the XML parser to reject the whole document. (Character references must refer to one of https://www.w3.org/TR/xml/#NT-Char). And in HTML the parser will convert &#56524; into U+FFFD, so it's not lossless. You'll need a different scheme to encode Python's filename strings as XML or HTML.
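The XML half of this can be checked with the standard library parser, as in the sketch below. The percent-encoding at the end is only one possible "different scheme" and is an assumption of this example, not something proposed in the thread; the file name is made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_from_bytes, unquote_to_bytes

name = "caf\udce9.txt"  # made-up name with a lone surrogate

# xmlcharrefreplace emits a character reference to a surrogate code point...
ref = name.encode("utf-8", errors="xmlcharrefreplace")

# ...which an XML parser rejects, because surrogates are not valid XML Chars.
try:
    ET.fromstring(b"<file>" + ref + b"</file>")
    parsed = True
except ET.ParseError:
    parsed = False

# One possible alternative (an assumption): percent-encode the raw bytes,
# which yields ASCII-only text that can be reversed exactly.
raw = name.encode("utf-8", errors="surrogateescape")
escaped = quote_from_bytes(raw)
restored = unquote_to_bytes(escaped)

print(ref, parsed, escaped, restored == raw)
```

Whatever scheme is chosen, the consumer of the document has to know about it, which is the interoperability cost discussed elsewhere in the thread.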

Posted Jan 16, 2020 1:16 UTC (Thu) by roc (subscriber, #30627)

Thanks, that's helpful.

Posted Jan 15, 2020 5:54 UTC (Wed) by dvdeug (guest, #10998)

> Who does that escaping, though?

GNU Coreutils does.

> In practice, no-one's going to do it

Well, then, it really does seem like much ado about nothing. In a serious engineered program, you basically never output an unfiltered string; every line has to be marked for translation. This has always included filenames, which have security issues when dumped to a terminal, in Python 2 or 3.

Sure, I'll believe Rust has better ways of doing it. But if you're comparing Python 2 versus Python 3, it's just not that big a difference, either in normal usage or best practices.

Posted Jan 15, 2020 9:03 UTC (Wed) by roc (subscriber, #30627)

> > Who does that escaping, though?
> GNU Coreutils does.

In Python, I meant.

> In a serious engineered program, you basically never output an unfiltered string

I guess I've never seen a seriously engineered Python program. I suppose that's not unexpected.

Posted Jan 17, 2020 18:58 UTC (Fri) by cortana (subscriber, #24596)

I make it a habit to rely on repr(filename) when printing to log messages etc. It's ugly but it shouldn't risk any misunderstanding.