Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 14, 2020 22:11 UTC (Tue) by roc (subscriber, #30627) In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by dvdeug
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3
Who does that escaping, though? What API would you even use to do it? In practice, no-one's going to do it until some user hits a "weird"-but-valid-UTF8 filename, then they're going to hack in some escaping that relies on UTF8 encoding not failing, then maybe one day some user hits a non-Unicode filename and then they hack in more escaping for strings containing lone surrogates.
That's not good if you want your software to be reliable. I don't think it makes sense to expect a dynamically-typed language like Python to be good for writing reliable software, but Python's specific choices make it unnecessarily unreliable.
As an example of a better way to do things: in Rust, the Path type is not a string and does not implement Display; if you want to print one, you call path.display() which does the necessary escaping. (It does not return a string, but does return something that implements Display, i.e. can be printed or converted to a string).
Posted Jan 15, 2020 1:01 UTC (Wed)
by roc (subscriber, #30627)
Posted Jan 15, 2020 20:27 UTC (Wed)
by nim-nim (subscriber, #34454)
*That* means any form of argument passing from Rust to other software will fail in strange and underwhelming ways.
Posted Jan 15, 2020 20:30 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Posted Jan 16, 2020 9:12 UTC (Thu)
by nim-nim (subscriber, #34454)
Posted Jan 16, 2020 10:28 UTC (Thu)
by smurf (subscriber, #17840)
The idea that file names are somehow privileged to not require that went out of the window a long time ago. It doesn't matter one whit whether the code printing said file name is written in Python, Rust, Go, C++, Z80 assembly, or LOLCODE.
If you want a standard way to carry non-UTF8 pseudo-printable data (e.g. Latin1 filenames from the stone ages), no problem, either use the surrogateescape method or do proper shell quoting. The "write the non-UTF8 data" method is fine only when limited to streams that are known to be binary. "find -print0" comes to mind.
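A minimal Python sketch of the surrogateescape approach mentioned above, assuming UTF-8 as the target encoding. The helper names `to_text`, `to_bytes`, and `print0` are hypothetical, not part of any library. Undecodable bytes become lone surrogates in the str and are restored exactly on the way out; the binary stream gets NUL-terminated names in the style of "find -print0".

```python
import io


def to_text(raw: bytes) -> str:
    """Decode a filename, carrying undecodable bytes as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse of to_text(): restore the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


def print0(names, stream) -> None:
    """Write each name to a binary stream, NUL-terminated."""
    for name in names:
        stream.write(to_bytes(name) + b"\0")


latin1_name = b"caf\xe9.txt"   # Latin-1 bytes, not valid UTF-8
text = to_text(latin1_name)    # the 0xE9 byte becomes U+DCE9
out = io.BytesIO()             # stands in for sys.stdout.buffer
print0([text, "plain.txt"], out)
```

A plain `text.encode("utf-8")` would raise UnicodeEncodeError here, which is why the bytes have to leave through the same error handler they came in with.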
Posted Jan 16, 2020 16:21 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
Posted Jan 16, 2020 17:43 UTC (Thu)
by smurf (subscriber, #17840)
None of this is rocket science. Headers are supposed to be valid ASCII strings, after all, so why blame the people who try to adhere to the standard? Yes, this could have been easier from the beginning, but that's why Python 3.8 is a whole lot better at this than 3.0.
Posted Jan 16, 2020 17:53 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
Posted Jan 17, 2020 6:11 UTC (Fri)
by smurf (subscriber, #17840)
Posted Jan 16, 2020 10:48 UTC (Thu)
by roc (subscriber, #30627)
By far the most common way to pass a path to a program is on its command line. In Rust you pass a command-line argument by calling Command::arg(...) to add the argument to the command you're building, and Command::arg(...) accepts Paths. Each platform has a standard way to pass arbitrary filenames as command-line parameters, and Rust does what the platform requires.
A few programs accept arbitrary paths as strings on stdin; they need to define how those paths are encoded on stdin. On Linux the program could read null-terminated strings of bytes; then in Rust you would use std::os::unix::ffi::OsStrExt::as_bytes() to extract the raw bytes from a Path and write them to the pipe. That code wouldn't even compile on Windows, which makes sense because a null-terminated string of bytes is not a reasonable way to represent a Windows path. A Windows program accepting paths as strings on stdin needs to define a different encoding, e.g. null-terminated strings of WCHAR, in which case the Rust program would use std::os::windows::ffi::OsStrExt::encode_wide() to produce such strings.
Rust makes it about as easy as possible to reliably pass non-UTF8 strings to other programs.
Posted Jan 15, 2020 4:22 UTC (Wed)
by NYKevin (subscriber, #129325)
repr()?
Posted Jan 15, 2020 9:01 UTC (Wed)
by roc (subscriber, #30627)
Posted Jan 15, 2020 22:25 UTC (Wed)
by NYKevin (subscriber, #129325)
str.encode(..., errors='replace') # Replaces bad chars with '?' (U+FFFD is what bytes.decode() uses).
Python is, after all, a "batteries included" language. This is a solved problem.
Posted Jan 15, 2020 23:44 UTC (Wed)
by togga (subscriber, #53103)
Posted Jan 16, 2020 1:17 UTC (Thu)
by roc (subscriber, #30627)
Posted Jan 16, 2020 9:17 UTC (Thu)
by nim-nim (subscriber, #34454)
Defining standard ways to process filenames (text) is the whole point of the Unicode standard. Remove standard compliance, and you remove the ability to safely process the result.
Posted Jan 16, 2020 10:51 UTC (Thu)
by roc (subscriber, #30627)
Posted Jan 16, 2020 12:46 UTC (Thu)
by nim-nim (subscriber, #34454)
Own up to the things your code does, do not hide behind lack of OS enforcement.
Posted Jan 16, 2020 13:46 UTC (Thu)
by smurf (subscriber, #17840)
That being said, I seriously wonder how many of these archives actually exist and whether spending a lot of engineering time on fixing a legacy problem that simply doesn't exist these days – nobody who's even remotely sane still creates new files with non-UTF-8 file names – is a good idea. The far-easier solution might be "here's a tool that goes through your archive and re-encodes your broken file names, you need a flag day before you can use the latest Mercurial, sorry about that but non-UTF8 file names are broken by design and no longer supported".
Posted Jan 16, 2020 14:54 UTC (Thu)
by mathstuf (subscriber, #69389)
System behaviors are dictated by the platform. Any tools doing an ostrich impression related to "broken" or "malformed" filenames lose a lot of usability in system recovery and introspection.
Posted Jan 16, 2020 15:17 UTC (Thu)
by anselm (subscriber, #2796)
This is something of a red herring because of surrogateescape. Python won't make filenames invisible just because they contain non-UTF-8 bytes.
In any case as far as I'm concerned, ls (whether written in Python or not) should issue obvious warnings if it encounters file names whose encoding is invalid according to the current locale (in this day and age, usually something using UTF-8).
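As a rough sketch of what such a warning could look like in Python, assuming names arrive already decoded with surrogateescape (as os.listdir() returns them on POSIX) and that the locale encoding is UTF-8. The function names `is_escaped_byte` and `list_with_warnings` are hypothetical. Every name is still listed; the ones carrying undecodable bytes additionally trigger a warning with the bad bytes shown as escapes.

```python
import io
import sys


def is_escaped_byte(ch: str) -> bool:
    """True if ch is a lone surrogate produced by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


def list_with_warnings(names, warn=None):
    """Return all names, warning about those that held undecodable bytes."""
    warn = sys.stderr if warn is None else warn
    listed = []
    for name in names:
        if any(is_escaped_byte(ch) for ch in name):
            # backslashreplace makes the lone surrogates printable.
            shown = name.encode("utf-8", errors="backslashreplace").decode("utf-8")
            print(f"warning: file name is not valid UTF-8: {shown}", file=warn)
        listed.append(name)  # never hide the file itself
    return listed


log = io.StringIO()  # stands in for stderr in this demonstration
names = list_with_warnings(["ok.txt", "caf\udce9.txt"], warn=log)
```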
Posted Jan 16, 2020 15:32 UTC (Thu)
by mathstuf (subscriber, #69389)
Posted Jan 19, 2020 10:40 UTC (Sun)
by togga (subscriber, #53103)
As in Python 2 (at least in previous versions), let developers choose which transforms are valid for them, and provide those transforms in the included batteries.
Posted Jan 19, 2020 11:28 UTC (Sun)
by smurf (subscriber, #17840)
Identifying problems like this is no fun, let alone fixing them, but it's even less fun when the language silently accepts said nonsense and cannot be taught not to.
It's not as if the Python people just threw some dice labelled "fun incompatibilities", and "make strings incompatible with bytes" came up on top. This change was intended to solve real problems. We can debate until we're all blue in the face whether that was the right way to do it and whether the resulting incompatibilities were justified and whether "surrogateescape" should be the default for UTF8ifying random bytes you can reasonably expect to be ASCII these days, but without acknowledging said real problems this isn't going anywhere.
Posted Jan 19, 2020 18:36 UTC (Sun)
by anselm (subscriber, #2796)
Python has recently (for Python values of “recently”, i.e., in Python 3.4) acquired a pathlib module that purports to enable system-independent handling of file and directory names. Presumably the way forward towards fixing the whole mess as far as file names are concerned is to handle non-UTF-8 file names in this module; they could be kept as “bags of bytes” under the hood, with best-effort conversions to UTF-8 or bytes available but not mandatory. The Path class already includes methods that will open, read, and write files and list the content of directories (returning more Path objects) etc., so one could presumably go quite far without ever having to convert a path name to UTF-8.
The problem is that there are various places in the library that expect path names as strings and can't deal with Path objects, and these would need to be fixed. As I said, it might be a possible solution for the future.
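For illustration, here is a small sketch of how far pathlib already gets with a non-UTF-8 name on a POSIX system. It assumes the usual surrogateescape filesystem error handler and uses a made-up path; PurePosixPath keeps the example off the real filesystem. The path is manipulated as an object and converted back to its original bytes without ever becoming valid UTF-8.

```python
import os
from pathlib import PurePosixPath

raw = b"/srv/repo/caf\xe9.txt"          # hypothetical path with a Latin-1 byte
path = PurePosixPath(os.fsdecode(raw))  # undecodable byte -> lone surrogate

# Ordinary path manipulation works on the object.
backup = path.with_suffix(".bak")
parent = path.parent

# bytes(path) goes through os.fsencode(), restoring the original bytes.
restored = bytes(path)
backup_raw = bytes(backup)
```

What this does not cover is the point made above: library functions that insist on a str and then try to encode it strictly will still fail on such a name.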
Posted Jan 16, 2020 6:35 UTC (Thu)
by NYKevin (subscriber, #129325)
(Regardless, xmlcharrefreplace and backslashreplace are both *mostly* lossless, and in the appropriate context, can be fully lossless, so long as you escape all ampersands or backslashes respectively. If you are outputting XML or HTML, then of course you have to escape ampersands anyway, which is obviously what xmlcharrefreplace was intended for. Similarly, backslash replacement is not a very sensible thing to do, unless you are working in a context where backslashes are normally escaped.)
* For example, Python's filesystem API calls os.fsencode() and os.fsdecode() automatically to translate between the operating system's preferred type and whatever type the user passes, but you can still call these manually if you decide you actually wanted the other type. (Neither function takes an errors argument; to choose a different error handler, call str.encode() or bytes.decode() directly.)
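A short sketch of calling the two helpers by hand, assuming a POSIX system where the filesystem error handler is surrogateescape; the file name is made up. It shows the round trip for a name that is not valid UTF-8, and that each helper passes a value of its own result type through unchanged.

```python
import os

raw_name = b"caf\xe9.txt"            # bytes as the kernel might report them
text_name = os.fsdecode(raw_name)    # str, using the filesystem encoding
round_trip = os.fsencode(text_name)  # back to the original bytes

# Values that are already of the result type pass through untouched.
same_text = os.fsdecode(text_name)
same_bytes = os.fsencode(raw_name)
```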
Posted Jan 16, 2020 12:51 UTC (Thu)
by excors (subscriber, #95769)
Posted Jan 16, 2020 1:16 UTC (Thu)
by roc (subscriber, #30627)
Posted Jan 15, 2020 5:54 UTC (Wed)
by dvdeug (guest, #10998)
GNU Coreutils does.
> In practice, no-one's going to do it
Well, then, it really does seem like much ado about nothing. In a serious engineered program, you basically never output an unfiltered string; every line has to be marked for translation. This has always included filenames, which have security issues when dumped to a terminal, in Python 2 or 3.
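As a sketch of what filtering a filename before it reaches a terminal might look like in Python 3: `display_name` is a hypothetical helper, not a standard-library function, and it assumes the name was decoded with surrogateescape. It is deliberately lossy (a literal backslash in the name is not itself escaped), so the result is for display only and cannot be turned back into the real name.

```python
def display_name(name: str) -> str:
    """Return a terminal-safe rendering of a filename (for display only)."""
    # Undecodable bytes (lone surrogates) become visible \udcXX escapes.
    text = name.encode("utf-8", errors="backslashreplace").decode("utf-8")
    # Non-printable characters such as ESC or newline are escaped so they
    # cannot move the cursor or inject terminal control sequences.
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


plain = display_name("na\u00efve.txt")      # printable text is left alone
broken = display_name("caf\udce9.txt")      # non-UTF-8 byte made visible
hostile = display_name("a\x1b[31mb\n")      # escape sequence neutralised
```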
Sure, I'll believe Rust has better ways of doing it. But if you're comparing Python 2 versus Python 3, it's just not that big a difference, either in normal usage or best practices.
Posted Jan 15, 2020 9:03 UTC (Wed)
by roc (subscriber, #30627)
In Python, I meant.
> In a serious engineered program, you basically never output an unfiltered string
I guess I've never seen a seriously engineered Python program. I suppose that's not unexpected.
Posted Jan 17, 2020 18:58 UTC (Fri)
by cortana (subscriber, #24596)
And what if you need to write a transparent proxy that needs to cope with non-UTF-8 headers?
The reality (that stubborn thing that doesn't go away) has agents that don't obey the standard. So a transparent proxy must accommodate it.
str.encode(..., errors='ignore') # Silently drops bad chars.
str.encode(..., errors='xmlcharrefreplace') # Replaces with XML &-encoding
str.encode(..., errors='backslashreplace') # Replaces with \u.... syntax.
Or write your own and hook it into the standard system with https://docs.python.org/3.8/library/codecs.html#codecs.re...
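To make the built-in handlers listed above concrete, here is a small sketch that encodes one sample string to ASCII with each of them. The sample text is made up; the results in the comments follow the documented behaviour of the built-in codecs. Note that 'replace' produces '?' when encoding and U+FFFD only when decoding.

```python
text = "na\u00efve-\u4e2d"  # two characters that ASCII cannot represent

replaced = text.encode("ascii", errors="replace")          # b'na?ve-?'
ignored = text.encode("ascii", errors="ignore")            # b'nave-'
xml = text.encode("ascii", errors="xmlcharrefreplace")     # b'na&#239;ve-&#20013;'
escaped = text.encode("ascii", errors="backslashreplace")  # b'na\\xefve-\\u4e2d'

# When decoding, 'replace' substitutes U+FFFD for the undecodable byte.
decoded = b"caf\xe9".decode("utf-8", errors="replace")     # 'caf\ufffd'

# A lone surrogate left behind by surrogateescape can be made visible too.
visible = "caf\udce9".encode("utf-8", errors="backslashreplace")  # b'caf\\udce9'
```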
No, but if `ls` were written in Python (2 or 3), I wouldn't want it to be unable to list files created by programs that do treat filenames as a bag of bytes and deliberately make filenames invisible to common tools.
> GNU Coreutils does.