Szorc: Mercurial's Journey to and Reflections on Python 3

Posted Jan 17, 2020 2:16 UTC (Fri) by anselm (subscriber, #2796)
In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by Cyberax
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3

If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.

It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately, file names don't carry explicit encoding information, so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in a UTF-8-based locale.

Posted Jan 17, 2020 10:48 UTC (Fri) by farnz (subscriber, #17727)

The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
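The two cases can be sketched in Python. This is only an illustration: the byte string is a made-up KOI8-R file name, and `is_valid_utf8` is a hypothetical helper, not anything from Mercurial or the standard library.

```python
# Hypothetical file name as raw bytes in KOI8-R (the Russian word "привет").
raw_name = b"\xd0\xd2\xc9\xd7\xc5\xd4"

# Case 1: to make sense of the name, decode it with the encoding it was
# written in, then re-encode it as UTF-8.
readable = raw_name.decode("koi8_r")
utf8_name = readable.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 2: to merely open or copy the file, raw_name can be passed around
# unchanged as bytes, even though it is not valid UTF-8.
print(is_valid_utf8(raw_name), is_valid_utf8(utf8_name))
```

Nothing in the bytes themselves says "KOI8-R"; the decode step only works because the encoding is known from outside.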

Posted Jan 17, 2020 13:36 UTC (Fri) by anselm (subscriber, #2796)

> When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…

If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.
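A minimal sketch of what surrogate escapes do, using an explicit `decode`/`encode` pair rather than a real directory listing (on POSIX systems `os.fsdecode()` applies the same `surrogateescape` handler with the file system encoding). The file name is a made-up KOI8-R example.

```python
# Hypothetical KOI8-R file name; these bytes are not valid UTF-8.
raw_name = b"\xd0\xd2\xc9\xd7\xc5\xd4.txt"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
as_str = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so the name can still be handed back to the operating system.
round_trip = as_str.encode("utf-8", errors="surrogateescape")

print(ascii(as_str))
print(round_trip == raw_name)
```

The string is opaque but lossless, which is all that is needed when the name is only passed back to calls such as `open()`.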

Posted Jan 17, 2020 16:51 UTC (Fri) by excors (subscriber, #95769)

> If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.

You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.

You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)
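The failure can be reproduced without touching the real terminal. The sketch below assumes a UTF-8 text stream and uses an in-memory buffer in place of stdout; `write_name` is a hypothetical helper, and the error handler of the real `sys.stdout` depends on the Python version and locale.

```python
import io

# A file name as Python would hand it over under a UTF-8 locale:
# KOI8-R bytes decoded with surrogateescape.
name = b"\xd0\xd2\xc9\xd7\xc5\xd4".decode("utf-8", errors="surrogateescape")


def write_name(text: str, errors: str) -> bytes:
    """Print text through a UTF-8 text stream using the given error handler."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors, newline="\n")
    print("Opening file %s" % text, file=stream)
    stream.flush()
    return buffer.getvalue()


# A strict stream refuses to encode the lone surrogates.
try:
    write_name(name, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A stream opened with surrogateescape writes the original bytes back out.
escaped_output = write_name(name, "surrogateescape")
print(strict_failed, escaped_output)
```

So the program has to opt in to `surrogateescape` (or another handler such as `backslashreplace`) on every output path that might see a file name.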

Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.

Posted Jan 17, 2020 15:05 UTC (Fri) by marcH (subscriber, #57642)

> However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.

These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.

Posted Jan 17, 2020 15:45 UTC (Fri) by anselm (subscriber, #2796)

> These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?

Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.

Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on a UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
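For what it's worth, checking a name against the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen, with the hyphen discouraged as the first character) takes only a few lines. This is a sketch; `is_portable_filename` is a hypothetical helper name.

```python
import re

# POSIX portable filename character set: letters, digits, '.', '_' and '-'.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name: str) -> bool:
    """Return True if name uses only portable characters and has no leading hyphen."""
    return _PORTABLE.fullmatch(name) is not None and not name.startswith("-")


# Such names encode to the same bytes in KOI8-R and in UTF-8,
# so no conversion is ever needed.
name = "report_2020-01.txt"
same_bytes = name.encode("koi8_r") == name.encode("utf-8")
print(is_portable_filename(name), same_bytes)
```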

Posted Jan 17, 2020 16:35 UTC (Fri) by marcH (subscriber, #57642)

> Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place.

How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.

And now case insensitivity even - much trickier than filename encoding.

Or NTFS maybe.

Posted Jan 17, 2020 16:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> How do you know they use Linux?
KOI-8 was the encoding widely used on Linux for the Russian language. Win1251 was used on Windows.

There was also DOS (original and alternative) and ISO code pages, but they were rarely used.

Posted Jan 17, 2020 17:35 UTC (Fri) by marcH (subscriber, #57642)

Interesting, thanks!

So how did Linux and Windows users exchange files in Russia? Not?

The question of which software layer should force users to make the encodings they use explicit is not obvious; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.

Posted Jan 17, 2020 17:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

> So how did Linux and Windows users exchange files in Russia? Not?
Using codepage converters. But it was so bad that by the early 2000s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.
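A toy version of that kind of guessing, to show the idea only. It assumes just two candidates (`koi8_r` and `cp1251`) and scores each decoding by how many common lowercase Russian letters it produces; the letter list and the `guess_encoding` helper are illustrative assumptions, and real detectors were considerably more elaborate.

```python
# A handful of frequent lowercase Russian letters (an illustrative choice).
COMMON_LETTERS = set("оеаинтсрвл")


def guess_encoding(data: bytes, candidates=("koi8_r", "cp1251")) -> str:
    """Pick the candidate whose decoding contains the most common letters."""

    def score(encoding: str) -> int:
        text = data.decode(encoding, errors="replace")
        return sum(ch in COMMON_LETTERS for ch in text)

    # On a tie, max() returns the first candidate in the tuple.
    return max(candidates, key=score)


sample = "привет мир"
print(guess_encoding(sample.encode("koi8_r")))
print(guess_encoding(sample.encode("cp1251")))
```

This works for these two encodings because they place lowercase and uppercase Cyrillic in swapped byte ranges, so the wrong decoding of ordinary lowercase text comes out as mostly uppercase letters.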

At that time, the most commonly used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.

This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.

However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.

Posted Jan 17, 2020 18:06 UTC (Fri) by marcH (subscriber, #57642)

> So it's pretty clear that something like Mercurial needs to support these niche users.

A cleanup flag day is IMHO the best trade off.

Posted Jan 18, 2020 22:40 UTC (Sat) by togga (subscriber, #53103)

> A cleanup flag day is IMHO the best trade off.

Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

Posted Jan 18, 2020 22:48 UTC (Sat) by marcH (subscriber, #57642)

> Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?

s/language/encodings/

This entire debate summarized in less than 25 characters. My pleasure.