Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:43 UTC (Thu) by HelloWorld (guest, #56129)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by marcH
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

The point is that that doesn't matter at all. There are file systems that contain non-UTF-8 file names, and Python should be able to read and write these.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 8:53 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (39 responses)

The point is that it does not matter at all. Non-UTF-8 filenames will crash and burn in interesting ways in apps and scripts (and the crash and burning *can* *not* be avoided given that filename argument passing is widely used in all systems at all levels).

Therefore "being able to write these" means "being able to crash other apps". It's a hostile behavior, not really in line with Python's objectives.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 10:17 UTC (Thu) by roc (subscriber, #30627) [Link] (35 responses)

> the crash and burning *can* *not* be avoided given that filename argument passing is widely used in all systems at all levels

Depends on what you mean by "cannot be avoided". All platforms that I know of allow passing any filename as a command-line argument. On Linux, it is easy to write a C or Rust program that spawns another program, passing a non-UTF8 filename as a command line argument. It is easy to write the spawned program in C or Rust and have it open that file. In fact, the idiomatic C and Rust code will handle non-UTF8 filenames correctly.

That C code won't work on Windows, you'll have to use wmain() or something, but the Rust code would work on Windows too.

So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.
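
For illustration only, here is a minimal Python sketch of the spawn-and-pass-a-filename scenario described above. It assumes a POSIX system, where `subprocess` accepts `bytes` arguments; the file name `caf\xe9.txt` and the helper `roundtrip` are hypothetical. The child recovers the exact bytes with `os.fsencode()`, which reverses the surrogateescape decoding Python applies to `sys.argv`.

```python
import os
import subprocess
import sys

# Hypothetical file name: valid Latin-1, but not valid UTF-8.
RAW_NAME = b"caf\xe9.txt"

# Child program: turn argv[1] back into its original bytes and print them as hex.
CHILD = b"import os, sys; print(os.fsencode(sys.argv[1]).hex())"


def roundtrip(raw: bytes) -> bytes:
    """Spawn a child Python, pass `raw` as an argument, return the bytes it saw."""
    result = subprocess.run(
        [os.fsencode(sys.executable), b"-c", CHILD, raw],  # bytes args: POSIX only
        check=True,
        stdout=subprocess.PIPE,
    )
    return bytes.fromhex(result.stdout.decode("ascii").strip())


if __name__ == "__main__":
    print(roundtrip(RAW_NAME) == RAW_NAME)
```

The round trip works only because both sides stay at the bytes level (or use `os.fsencode()`/`os.fsdecode()`); treating the argument as ordinary text is where things can go wrong.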

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:49 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

Your app does not own the filesystem. It's a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system's user has installed and is using.

Do not feed them time bombs.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:17 UTC (Thu) by roc (subscriber, #30627) [Link]

> Your app does not own the filesystem. It's a shared data space. You do not control how other programs read and process filenames. You do not control what other programs the system's user has installed and is using.

That's exactly why your app needs to be able to cope with any garbage filenames it finds there.

> Do not feed them time bombs.

I'm not arguing that non-Unicode filenames are a good thing or that apps should create them gratuitously. But widely-deployed apps and tools should not break when they encounter them.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 12:51 UTC (Thu) by anselm (subscriber, #2796) [Link] (27 responses)

> So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.

I personally would like my file names to work with the shell and standard utilities. I'm not about to write a C or Rust program just to copy a bunch of files, because their names are in a weird encoding that can't be typed in or will otherwise mess up my command lines.

In the 2020s, it's a reasonable assumption that file names will be encoded in UTF-8. We've had a few decades to get used to the idea, after all. If there are outlandish legacy file systems that insist on doing something else, then as far as I'm concerned these file systems are the problem and they ought to be fixed.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

There are plenty of other examples where Py3 falls flat because of its string insistence. For example, we had a problem with a transparent proxy that needed to work with non-UTF-8 headers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:57 UTC (Fri) by cortana (subscriber, #24596) [Link] (3 responses)

I'm honestly not sealioning but: I thought HTTP headers were Latin-1. So they should be bytestrings in Python, not strings?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

The reality is that nobody uses the RFC-specified method of encoding non-Latin-1 characters for HTTP headers. So in reality there are tons of agents sending headers in local codepages or with US ASCII characters.

This actually works fine with most servers.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:12 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

For a real-world example of handling HTTP headers, see XMLHttpRequest.getResponseHeader(). That's defined to return a ByteString (https://xhr.spec.whatwg.org/#ref-for-dom-xmlhttprequest-g...), which is converted to a JavaScript String by effectively decoding as Latin-1 (i.e. each byte is translated directly into a single 16-bit JS character) (https://heycam.github.io/webidl/#es-ByteString). When setting a header, you should get a TypeError exception if the JS String contains any character above U+00FF.

The only restrictions on a header value (https://fetch.spec.whatwg.org/#concept-header-value) are that it can't contain 0x00, 0x0D or 0x0A, and can't have leading/trailing 0x20 or 0x09. (And browsers only agreed on rejecting 0x00 quite recently.)

So it's pretty much just bytes, and if you want to try interpreting it as Unicode then that's at your own risk.
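
As a rough Python sketch of the rules quoted above (not taken from any real HTTP library; all function names are made up for illustration): header values are checked as raw bytes, and the Latin-1 codec provides the same one-byte-to-one-character mapping that the ByteString conversion describes.

```python
FORBIDDEN = (0x00, 0x0A, 0x0D)  # NUL, LF, CR may not appear anywhere
PADDING = (0x20, 0x09)          # space and tab may not lead or trail


def is_valid_header_value(value: bytes) -> bool:
    """Apply the byte-level restrictions on a header value listed above."""
    if any(b in FORBIDDEN for b in value):
        return False
    if value and (value[0] in PADDING or value[-1] in PADDING):
        return False
    return True


def to_text(value: bytes) -> str:
    # Latin-1 maps byte N to code point N, so this never fails and loses nothing.
    return value.decode("latin-1")


def to_bytes(text: str) -> bytes:
    # Raises UnicodeEncodeError for any character above U+00FF
    # (the JavaScript API described above raises TypeError instead).
    return text.encode("latin-1")
```

With this mapping, `to_bytes(to_text(raw))` returns `raw` for every byte string, which is what makes it usable as a transparent carrier for headers in unknown encodings.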

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Hah. I thought that HTTP2 fixed this, but apparently it did not: https://tools.ietf.org/html/rfc7230#section-3.2 still allows "obs-text", which is basically any character.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 20:21 UTC (Thu) by rodgerd (guest, #58896) [Link]

It's unfortunate - to put it mildly - how many people seem wedded to "Speak ASCII or die" colonialism in their code, and then pivot to "but what about esoteric nonsense?" to try to block any progress, no?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:28 UTC (Thu) by roc (subscriber, #30627) [Link] (20 responses)

> I'm not about to write a C or Rust program just to copy a bunch of files

`cp` and many other utilities handle non-Unicode filenames correctly. That's not surprising; C programs that accept filenames in argv[] and treat them as null-terminated char strings should work.

We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does. Apparently that is not viable.

> as far as I'm concerned these file systems are the problem and they ought to be fixed.

Sounds good to me, but reality disagrees.
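
For comparison with the `cp` point above, a small Python sketch of the same "bag of bytes" approach. It assumes a Linux filesystem that accepts arbitrary bytes in names (some filesystems reject them); the file names and the `demo` helper are hypothetical. Python's `os` and `shutil` functions accept `bytes` paths and return `bytes` when given `bytes`.

```python
import os
import shutil
import tempfile


def demo() -> list:
    """Create, copy and list files whose names are not valid UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.fsencode(tmp)                       # work with bytes paths throughout
        src = os.path.join(base, b"report-\xff\xfe.txt")
        dst = os.path.join(base, b"copy-\xff\xfe.txt")
        with open(src, "wb") as f:
            f.write(b"foo\n")
        shutil.copyfile(src, dst)                     # no decoding happens anywhere
        return sorted(os.listdir(base))               # bytes in, bytes out


if __name__ == "__main__":
    print(demo())
```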

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:46 UTC (Fri) by anselm (subscriber, #2796) [Link] (4 responses)

> `cp` and many other utilities handle non-Unicode filenames correctly.

True, but you need to feed them such names in the first place. Given that, these days, Linux systems normally use UTF-8-based locales, non-Unicode filenames aren't going to be a whole lot of fun on a shell command line, or in the output of ls, long before Python 3 even comes into play.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 9:04 UTC (Fri) by mbunkus (subscriber, #87248) [Link]

zsh with tab-completion works nicely (I would think bash with tab-completion, too). It's my go-to solution for fixing file names with invalid UTF-8 encodings.

Just last week I had such a file name generated by my email program when saving an attachment from a mail created by yet another bloody email program that fucks up attachment file name encoding. And the week before by unzipping a ZIP created on a German Windows.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 21:46 UTC (Fri) by Jandar (subscriber, #85683) [Link] (2 responses)

I keep encountering Linux systems running applications that use 8-bit national encodings. The same appears in Samba-exported directories from decades-old software controlling equally old instruments.

Seeing systems with only UTF-8 filenames is a rarity for me.

Enforcing UTF-8 only filenames is a complete no-go, even considering it is crazy.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 22:47 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> I keep encountering Linux systems running applications that use 8-bit national encodings.

Interesting, how does software on these systems typically know how to decode, display, exchange and generally deal with these encodings?

I understand Python itself enforces explicit encodings, not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 19, 2020 10:01 UTC (Sun) by Jandar (subscriber, #85683) [Link]

I have no information about any Python programs on these systems. C programs using setlocale(3) seem to have no major problems. If once in a while mojibake occurs in filenames like "qwert�uiop", this is insignificant in comparison to being unable to handle these files at all.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 1:56 UTC (Fri) by anselm (subscriber, #2796) [Link] (14 responses)

> We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does.

This is really a user discipline/hygiene issue more than a Linux file system issue. In the 1980s, the official recommendation was that portable file names should stick to ASCII letters, digits, and a few choice special characters such as the dot, dash, and underscore – this wasn't enforced by the file system, but reasonable people adhered to this and stayed out of trouble. I don't have a huge problem with a similar recommendation that in the 21st century, reasonable people should stick to UTF-8 for portable file names even if the file system doesn't enforce it. Sure, there are careless ignorant bozos who will sh*t all over a file system given half a chance, but they need to be taught manners in any case. Let them suffer instead of the reasonable people.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:07 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

> ignorant bozos who will sh*t all over a file system
Like people using ShiftJIS and writing file names in Japanese?

Or Russian people using KOI-8 encoding on Linux?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:16 UTC (Fri) by anselm (subscriber, #2796) [Link] (12 responses)

If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.

It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately file names don't carry explicit encoding information and so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in a UTF-8-based locale.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 10:48 UTC (Fri) by farnz (subscriber, #17727) [Link] (11 responses)

The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
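
The split described here (bytes for manipulation, decoding only for display) might be sketched in Python roughly as follows. The function name `display_name` and the default of KOI8-R are assumptions for illustration; nothing in the file name itself says which legacy encoding was used.

```python
def display_name(raw: bytes, legacy_encoding: str = "koi8_r") -> str:
    """Return a human-readable form of a file name, for display only.

    The raw bytes remain the real identity of the file; this result should
    never be fed back into open() or rename().
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: the sender used the given legacy 8-bit encoding.
        return raw.decode(legacy_encoding, errors="replace")


if __name__ == "__main__":
    raw = "привет.txt".encode("koi8_r")   # what arrives from the KOI-8 user
    print(display_name(raw))               # readable text for humans
    print(raw)                             # what the program keeps using
```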

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 13:36 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 responses)

> When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by excors (subscriber, #95769) [Link]

> If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.

You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)

Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.
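
A short Python sketch of the behaviour discussed above, using explicit codecs so it does not depend on the locale. The helper `printable` is hypothetical; it shows one possible way to display such a name without raising, not what Python does by default.

```python
RAW = b"testfile\xff\xff"  # not valid UTF-8

# What Python 3 does with such a name in a UTF-8 locale: each undecodable
# byte 0xNN becomes the lone surrogate U+DCNN.
name = RAW.decode("utf-8", "surrogateescape")

# The escaped form converts back to the original bytes...
assert name.encode("utf-8", "surrogateescape") == RAW

# ...but a strict encode, as used by default for output, fails.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)


def printable(text: str) -> str:
    """Replace unencodable characters with backslash escapes for display."""
    return text.encode("utf-8", "backslashreplace").decode("utf-8")


print(printable(name))
```

The output of `printable()` is safe to print but is no longer a usable file name, which is one reason the escaped string and the displayed string need to be kept apart.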

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:05 UTC (Fri) by marcH (subscriber, #57642) [Link] (8 responses)

> However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 15:45 UTC (Fri) by anselm (subscriber, #2796) [Link] (7 responses)

> These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.

Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on a UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
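
To make the conversion idea concrete, here is a hedged Python sketch of a rename plan in the spirit of convmv. It is not convmv's actual algorithm; the function names are invented, and the source encoding must be supplied by whoever knows the convention for that directory.

```python
import os


def is_utf8(raw: bytes) -> bool:
    """True if the byte string is already valid UTF-8 (plain ASCII included)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def plan_renames(names, src_encoding: str):
    """Return (old, new) byte-string pairs for names that need converting."""
    plan = []
    for raw in names:
        if is_utf8(raw):
            continue  # leave names that are already fine untouched
        plan.append((raw, raw.decode(src_encoding).encode("utf-8")))
    return plan


def apply_renames(directory: bytes, plan) -> None:
    """Carry out a plan; inspect it first, since a wrong guess yields mojibake."""
    for old, new in plan:
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

A typical use would be `plan_renames(os.listdir(b"."), "koi8_r")`, followed by a manual look at the plan before calling `apply_renames(b".", plan)`.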

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (6 responses)

> Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place.

How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.

And now case insensitivity even - much trickier than filename encoding.

Or NTFS maybe.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 16:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> How do you know they use Linux?
KOI-8 was the encoding widely used in Linux for Russian language. Win1251 was used in Windows.

There was also DOS (original and alternative) and ISO code pages, but they were rarely used.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:35 UTC (Fri) by marcH (subscriber, #57642) [Link] (4 responses)

Interesting, thanks!

So how did Linux and Windows users exchange files in Russia? Not?

The question of which software layer should force users to make their encodings explicit is not obvious; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 17:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> So how did Linux and Windows users exchange files in Russia? Not?
Using codepage converters. But it was so bad that by the early 2000s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.

At that time the most commonly used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.

This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.

However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.
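
The frequency-analysis guessing mentioned above can be illustrated with a toy Python sketch. This is a deliberately crude stand-in, not how any particular browser did it: the letter set, the candidate list and the function names are all assumptions. It exploits the fact that KOI8-R and Windows-1251 place lower-case and upper-case Cyrillic letters in swapped byte ranges, so a wrong guess turns ordinary lower-case text into mostly upper-case letters.

```python
# Some of the most frequent lower-case letters in Russian text (assumed set).
COMMON = set("оеаинтсрвлкм")
CANDIDATES = ("koi8_r", "cp1251")


def score(text: str) -> int:
    """Count characters that are common lower-case Russian letters."""
    return sum(ch in COMMON for ch in text)


def guess_encoding(raw: bytes, candidates=CANDIDATES) -> str:
    """Pick the candidate whose decoding looks most like Russian text."""
    return max(candidates, key=lambda enc: score(raw.decode(enc, errors="replace")))


if __name__ == "__main__":
    sample = "привет мир, это проверка кодировки"
    print(guess_encoding(sample.encode("koi8_r")))
    print(guess_encoding(sample.encode("cp1251")))
```

Short or all-capitals input can easily fool a heuristic like this, which fits the observation that detection was a workaround rather than a fix.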

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 18:06 UTC (Fri) by marcH (subscriber, #57642) [Link] (2 responses)

> So it's pretty clear that something like Mercurial needs to support these niche users.

A cleanup flag day is IMHO the best trade off.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:40 UTC (Sat) by togga (subscriber, #53103) [Link] (1 responses)

"A cleanup flag day is IMHO the best trade off."

Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 18, 2020 22:48 UTC (Sat) by marcH (subscriber, #57642) [Link]

> Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

s/language/encodings/

This entire debate summarized in less than 25 characters. My pleasure.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 13:49 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Yeah, sure you can feed any random string to argv[], but the equally important case for file names is that somebody tries to type or paste them into their favorite editor (or its command line).

If you no longer have any way to type them because, surprise, your environment has been UTF8 for the last decade or so, then you'll need an otherwise-transparent encoding that can be pasted (or generated manually via \Udcxx), and that doesn't clash with the rest of your environment (your locale is utf-8 – and that's unlikely to change). Surrogateescape works for that. It should even be copy+paste-able.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 13:01 UTC (Wed) by niner (subscriber, #26151) [Link] (1 responses)

Create a file with a clearly non-UTF-8 name:
nine@sphinx:~> perl6 -e 'spurt ("testfile".encode ~ Buf.new(0xff, 0xff)).decode("utf8-c8"), "foo"'

The shell dutifully shows the name with surrogate characters:
nine@sphinx:~> ll testfile*
-rw-r--r-- 1 nine users 6 17. Sep 2014 testfile.latin-1
-rw-r--r-- 1 nine users 3 22. Jän 13:42 testfile??

Get that name from a directory listing, treating it like a string with a regex grep:
nine@sphinx:~> perl6 -e 'dir(".").grep(/testfile/).say'
("testfile.latin-1".IO "testfile􏿽xFF􏿽xFF".IO)

And just for fun: select+paste the file name in konsole:
nine@sphinx:~> cat testfile??
foo

Actually it looks like file names with "broken" encodings work pretty well. It's only Python 3 that stumbles:

nine@sphinx:~> python3
Python 3.7.3 (default, Apr 09 2019, 05:18:21) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for f in [f for f in os.listdir(".") if "testfile" in f]: print(f)
...
testfile.latin-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 22, 2020 23:00 UTC (Wed) by Jandar (subscriber, #85683) [Link]

> nine@sphinx:~> cat testfile??
> foo

'?' is a special character for pattern matching in sh.

$ echo foo >testfilexx
$ cat testfile??
foo

So maybe this wasn't a correct test to see if your filename worked with copy&paste.
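
The same wildcard behaviour can be reproduced in Python with the standard `fnmatch` and `glob` modules, as a small sketch; the name list is made up to mirror the example above.

```python
import fnmatch
import glob

names = [b"testfilexx", b"testfile\xff\xff", b"testfile.latin-1"]

# "?" matches any single character (here: any single byte), so the pattern
# cannot tell a pasted name apart from an unrelated ten-character one.
matches = [n for n in names if fnmatch.fnmatchcase(n, b"testfile??")]
print(matches)

# To match the question marks literally, the pattern has to be escaped first.
literal = glob.escape("testfile??")
print(literal)
print(fnmatch.fnmatchcase("testfilexx", literal))
print(fnmatch.fnmatchcase("testfile??", literal))
```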

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:34 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.

Sure. The entire software world is going to fix all its filename bugs and assumptions just because some people name their files on some filesystems in funny ways. The programs that don't get fixed will die. That plan is so much simpler and easier than renaming files. /s

Oh, and all the developers who were repeatedly told to "sanitize your input" to protect themselves and the buggy programs above are all going to relax their checks a bit too.

Best of luck!

If you can't be happy, be at least realistic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 21:49 UTC (Thu) by roc (subscriber, #30627) [Link]

Not the entire software world, no.

But it is realistic to expect that common utilities handle arbitrary filenames correctly (the most common ones do). And it is realistic to expect that common languages and libraries make idiomatic filename-handling code handle arbitrary filenames correctly, because many do (including C, Go, Rust, Python2, and even some parts of Python3).

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:05 UTC (Thu) by HelloWorld (guest, #56129) [Link] (2 responses)

So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 16:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> So you're saying that Python shouldn't be able to deal with users' files because *other* programs may (or may not) have a problem with that? What kind of logic is that?!

Not caring about funky filenames because most other programs don't care either: seems perfectly logical to me. You're confusing likeliness and logic.

Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 16, 2020 17:15 UTC (Thu) by marcH (subscriber, #57642) [Link]

Speaking of likeliness and happiness, let me share my personal preference. I'll stay brief or let's say briefer; seems doable.

I'm very happy that Python catches funky filenames at a relatively low-level with a clear, generic, usual, googlable and stackoverflowable exception rather than with some obscure crash and/or security issue specific to each Python program. These references about "garbage-in, garbage-out" surrogates that I don't have time to read scare me, I wish there were a way to turn them off.

I do not claim Python made all the right Unicode decisions; I don't know. I bet not, nothing's perfect. This comment is only about funky file names.
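
For what it's worth, a program can opt into that strictness itself, as in this hedged Python sketch (the function names are hypothetical): ask the `os` module for `bytes` names so that no surrogate escapes are produced, then decode strictly and fail early on anything that is not UTF-8.

```python
def strict_names(raw_names):
    """Decode file names strictly; raise UnicodeDecodeError on non-UTF-8 input."""
    for raw in raw_names:
        yield raw.decode("utf-8")


def has_escaped_bytes(name: str) -> bool:
    """True if a str name carries surrogateescape code points (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)
```

A caller would write something like `list(strict_names(os.listdir(b".")))`; `has_escaped_bytes()` covers the case where the names already arrived as `str` from an API that used surrogateescape.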