Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 15, 2020 20:04 UTC (Wed) by HelloWorld (guest, #56129) In reply to: Szorc: Mercurial's Journey to and Reflections on Python 3 by nim-nim
Parent article: Szorc: Mercurial's Journey to and Reflections on Python 3
Posted Jan 16, 2020 7:56 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (42 responses)
You didn't go far enough and missed that bit:
> Those are file*names* not opaque identifiers. They are supposed to be interpreted by humans (and therefore decoded) in a wide range of tools
Users don't care who's in charge of _their_ files, kernel or whatever else.
Posted Jan 16, 2020 8:01 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Posted Jan 16, 2020 8:43 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (40 responses)
Posted Jan 16, 2020 8:53 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (39 responses)
Therefore "being able to write these" means "being able to crash other apps". It's a hostile behavior, not really on par with Python objectives.
Posted Jan 16, 2020 10:17 UTC (Thu)
by roc (subscriber, #30627)
[Link] (35 responses)
Depends on what you mean by "cannot be avoided". All platforms that I know of allow passing any filename as a command-line argument. On Linux, it is easy to write a C or Rust program that spawns another program, passing a non-UTF8 filename as a command line argument. It is easy to write the spawned program in C or Rust and have it open that file. In fact, the idiomatic C and Rust code will handle non-UTF8 filenames correctly.
That C code won't work on Windows, you'll have to use wmain() or something, but the Rust code would work on Windows too.
So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.
If you mean "cannot be avoided because most programs are buggy with non-UTF8 filenames, because they use languages and libraries that don't handle non-UTF8 filenames well", that's true, *and we need to fix or move away from those languages and libraries*.
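For comparison, here is a minimal Python 3 sketch of the same spawn-and-pass scenario on a POSIX system. It assumes that `subprocess` accepts `bytes` arguments and that the child recovers the raw bytes with `os.fsencode()`; the helper name `roundtrip_argv` is made up for illustration.

```python
import subprocess
import sys

# Child program: turn its first argument back into raw bytes and print them as hex.
CHILD = "import os, sys; print(os.fsencode(sys.argv[1]).hex())"


def roundtrip_argv(raw_name: bytes) -> bytes:
    """Spawn a child Python process with raw_name as an argument and
    return the bytes the child saw (hypothetical helper, POSIX only)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, raw_name],
        check=True,
        capture_output=True,
    )
    return bytes.fromhex(result.stdout.decode("ascii").strip())


if __name__ == "__main__":
    name = b"testfile\xff\xff"  # not valid UTF-8
    print(roundtrip_argv(name) == name)
```

The bytes survive because the child decodes its arguments with the surrogateescape error handler and `os.fsencode()` reverses that; the trouble starts only when the resulting `str` is handed to an API that encodes strictly.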
Posted Jan 16, 2020 12:49 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Do not feed them time bombs.
Posted Jan 16, 2020 21:17 UTC (Thu)
by roc (subscriber, #30627)
[Link]
That's exactly why your app needs to be able to cope with any garbage filenames it finds there.
> Do not feed them time bombs.
I'm not arguing that non-Unicode filenames are a good thing or that apps should create them gratuitously. But widely-deployed apps and tools should not break when they encounter them.
Posted Jan 16, 2020 12:51 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (27 responses)
I personally would like my file names to work with the shell and standard utilities. I'm not about to write a C or Rust program just to copy a bunch of files, because their names are in a weird encoding that can't be typed in or will otherwise mess up my command lines.
In the 2020s, it's a reasonable assumption that file names will be encoded in UTF-8. We've had a few decades to get used to the idea, after all. If there are outlandish legacy file systems that insist on doing something else, then as far as I'm concerned these file systems are the problem and they ought to be fixed.
Posted Jan 16, 2020 16:17 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Jan 17, 2020 16:57 UTC (Fri)
by cortana (subscriber, #24596)
[Link] (3 responses)
Posted Jan 17, 2020 17:05 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
This actually works fine with most servers.
Posted Jan 17, 2020 19:12 UTC (Fri)
by excors (subscriber, #95769)
[Link] (1 responses)
The only restrictions on a header value (https://fetch.spec.whatwg.org/#concept-header-value) are that it can't contain 0x00, 0x0D or 0x0A, and can't have leading/trailing 0x20 or 0x09. (And browsers only agreed on rejecting 0x00 quite recently.)
So it's pretty much just bytes, and if you want to try interpreting it as Unicode then that's at your own risk.
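As a rough sketch of the restrictions listed above, a byte-level check could look like this in Python. The function name is hypothetical and the code covers only the rules quoted in this comment, not the whole Fetch specification.

```python
def is_valid_header_value(value: bytes) -> bool:
    """Check the byte-level restrictions quoted above (hypothetical helper)."""
    # Forbidden anywhere in the value: NUL (0x00), LF (0x0A), CR (0x0D).
    if any(bad in value for bad in (b"\x00", b"\x0a", b"\x0d")):
        return False
    # Forbidden as the first or last byte: space (0x20) and tab (0x09).
    if value[:1] in (b" ", b"\t") or value[-1:] in (b" ", b"\t"):
        return False
    return True


if __name__ == "__main__":
    print(is_valid_header_value(b"caf\xe9"))  # arbitrary high bytes are accepted
    print(is_valid_header_value(b"a\r\nb"))   # CR/LF are rejected
```

Nothing in the check looks at the encoding, which is the point: any byte sequence without those few control bytes passes.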
Posted Jan 17, 2020 19:27 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 16, 2020 20:21 UTC (Thu)
by rodgerd (guest, #58896)
[Link]
Posted Jan 16, 2020 21:28 UTC (Thu)
by roc (subscriber, #30627)
[Link] (20 responses)
`cp` and many other utilities handle non-Unicode filenames correctly. That's not surprising; C programs that accept filenames in argv[] and treat them as null-terminated char strings should work.
We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does. Apparently that is not viable.
> as far as I'm concerned these file systems are the problem and they ought to be fixed.
Sounds good to me, but reality disagrees.
Posted Jan 17, 2020 1:46 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (4 responses)
True, but you need to feed them such names in the first place. Given that, these days, Linux systems normally use UTF-8-based locales, non-Unicode filenames aren't going to be a whole lot of fun on a shell command line, or in the output of ls, long before Python 3 even comes into play.
Posted Jan 17, 2020 9:04 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link]
Just last week I had such a file name generated by my email program when saving an attachment from a mail created by yet another bloody email program that fucks up attachment file name encoding. And the week before by unzipping a ZIP created on a German Windows.
Posted Jan 17, 2020 21:46 UTC (Fri)
by Jandar (subscriber, #85683)
[Link] (2 responses)
Seeing systems with only UTF-8 filenames is a rarity for me.
Enforcing UTF-8 only filenames is a complete no-go, even considering it is crazy.
Posted Jan 17, 2020 22:47 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Interesting, how does software on these systems typically know how to decode, display, exchange and generally deal with these encodings?
I understand Python itself enforces explicit encodings, not UTF-8.
Posted Jan 19, 2020 10:01 UTC (Sun)
by Jandar (subscriber, #85683)
[Link]
Posted Jan 17, 2020 1:56 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (14 responses)
This is really a user discipline/hygiene issue more than a Linux file system issue. In the 1980s, the official recommendation was that portable file names should stick to ASCII letters, digits, and a few choice special characters such as the dot, dash, and underscore – this wasn't enforced by the file system, but reasonable people adhered to this and stayed out of trouble. I don't have a huge problem with a similar recommendation that in the 21st century, reasonable people should stick to UTF-8 for portable file names even if the file system doesn't enforce it. Sure, there are careless ignorant bozos who will sh*t all over a file system given half a chance, but they need to be taught manners in any case. Let them suffer instead of the reasonable people.
Posted Jan 17, 2020 2:07 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (13 responses)
Or Russian people using KOI-8 encoding on Linux?
Posted Jan 17, 2020 2:16 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (12 responses)
If you want to do that sort of thing, set your locale to one based on the appropriate encoding and not UTF-8. Even Python 3 should then do the Right Thing.
It's insisting that these legacy encodings should somehow “work” in a UTF-8-based locale that is at the root of the problem. Unfortunately file names don't carry explicit encoding information and so it isn't at all clear how that is supposed to play out in general – even the shell and the standard utilities will have issues with such file names in a UTF-8-based locale.
Posted Jan 17, 2020 10:48 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (11 responses)
The problem is that filenames get shared between people. I use a UTF-8 locale, because my primary language is English, and thus any ASCII-compatible encoding does a good job of encoding my language; UTF-8 just adds a lot of characters that I rarely use. However, I also interact with people who still use KOI-8 family character encodings, because they have legacy tooling that knows that there is one true 8-bit encoding for characters, and they neither want to update that tooling, nor stop using their native languages with it.
Thus, even though I use UTF-8, and it's the common charset at work, I still have to deal with KOI-8 from some sources. When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
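To make the two cases concrete, here is a small Python sketch, assuming the sender used KOI8-R (the `koi8_r` codec); the helper name and the sample filename are invented for illustration.

```python
def transcode_name(raw: bytes, source_encoding: str = "koi8_r") -> bytes:
    """Re-encode a filename from a legacy 8-bit encoding to UTF-8.

    Only meaningful when the source encoding is known out of band;
    the bytes themselves do not say which encoding they are in.
    """
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy = "привет.txt".encode("koi8_r")  # what arrives from the legacy tooling
    print(legacy)                            # opaque bytes, fine for open()/rename()
    print(transcode_name(legacy).decode("utf-8"))  # readable only after transcoding
```

When the name only has to be passed back to the filesystem, the untouched `bytes` value is enough and no guess about the encoding is needed.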
Posted Jan 17, 2020 13:36 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (1 responses)
If you don't care what the file names look like, you shouldn't have an issue with Python using surrogate escapes, which it will if it encounters a file name that's not UTF-8.
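A short sketch of what the surrogate escapes look like, using the codec error handler directly so the result does not depend on the locale. `os.fsdecode()` and `os.fsencode()` apply the same handler with the filesystem encoding; the two function names below are made up.

```python
def decode_name(raw: bytes) -> str:
    """Decode filename bytes as UTF-8, mapping each undecodable byte 0xNN
    to the lone surrogate U+DCNN."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Reverse decode_name(): lone surrogates become the original bytes again."""
    return name.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    raw = b"testfile\xff\xff"
    name = decode_name(raw)
    print(ascii(name))               # shows the two escaped bytes as \udcff
    print(encode_name(name) == raw)  # the round trip is lossless
```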
Posted Jan 17, 2020 16:51 UTC (Fri)
by excors (subscriber, #95769)
[Link]
You'll have an issue in Python when you say print("Opening file %s" % sys.argv[1]) or print(*os.listdir()), and it throws UnicodeEncodeError instead of printing something that looks nearly correct.
You can see the file in ls, tab-complete it in bash, pass it to Python on the command line, pass it to open() in Python, and it works; but then you call an API like print() that doesn't use surrogateescape by default and it fails. (It works in Python 2 where everything is bytes, though of course Python 2 has its own set of encoding problems.)
Anyway, I think this thread started with the comment that Mercurial's maintainers didn't want to "use Unicode for filenames", and I still think that's not nearly as simple or good an idea as it sounds. Filenames are special things that need special handling, and surrogateescape is not a robust solution. Any program that deals seriously with files (like a VCS) ought to do things properly, and Python doesn't provide the tools to help with that, which is a reason to discourage use of Python (especially Python 3) for programs like Mercurial.
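The failure mode described above comes from strict encoding of a string that contains lone surrogates. The following is one possible workaround for display purposes only, a sketch with hypothetical helper names rather than anything Python or Mercurial prescribes.

```python
def strict_encode_fails(name: str) -> bool:
    """True if encoding the name as strict UTF-8 (what print() does on a
    UTF-8 stdout) raises UnicodeEncodeError."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def display_name(name: str) -> str:
    """Return a printable form of a surrogate-escaped filename.

    The original bytes are recovered first, then decoded again with
    undecodable bytes shown as \\xNN escapes.
    """
    raw = name.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8", "backslashreplace")


if __name__ == "__main__":
    name = b"testfile\xff\xff".decode("utf-8", "surrogateescape")
    print(strict_encode_fails(name))  # the print(name) case from the comment
    print(display_name(name))         # safe to print
```

The displayed form is lossy as an identifier, so it should never be passed back to `open()`; the original `str` or `bytes` value has to be kept for that.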
Posted Jan 17, 2020 15:05 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (8 responses)
These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?
I'm surprised they haven't looked into this issue because it affects not just you but everyone else, maybe even themselves.
Posted Jan 17, 2020 15:45 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (7 responses)
Technically there is no such thing as a “KOI-8-only partition” because Linux file systems don't care about character encoding in the first place. Of course you can establish a convention among the users of your system(s) that a certain directory (or set of directories) contains files with KOI-8-encoded names; it doesn't need to be a whole partition. But you will have to remember which is which because Linux isn't going to help you keep track.
Of course there's always convmv to convert file names from one encoding to another, and presumably someone could come up with a clever method to overlay-mount a directory with file names known to be in encoding X so that they appear as if they were in encoding Y. But arguably in the year 2020 the method of choice is to move all file names over to UTF-8 and be done (and fix or replace old software that insists on using a legacy encoding). It's also worth remembering that many legacy encodings are proper supersets of ASCII, so people who anticipate that their files will be processed on a UTF-8-based system could simply, out of basic courtesy and professionalism, stick to the POSIX portable-filename character set and save their colleagues the hassle of having to do conversions.
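For reference, the POSIX portable filename character set is the ASCII letters, digits, period, underscore and hyphen, with the hyphen discouraged as the first character. A minimal check might look like this; the function name is hypothetical.

```python
import re

# ASCII letters, digits, '.', '_' and '-' only.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_filename(name: str) -> bool:
    """True if name uses only the POSIX portable filename character set
    and does not start with a hyphen."""
    return _PORTABLE.fullmatch(name) is not None and not name.startswith("-")


if __name__ == "__main__":
    for candidate in ("report_2020.txt", "отчёт.txt", "-rf"):
        print(candidate, is_portable_filename(candidate))
```

Names that pass this check have the same bytes in ASCII, UTF-8 and every ASCII-compatible legacy encoding, which is why they need no conversion.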
Posted Jan 17, 2020 16:35 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (6 responses)
How do you know they use Linux? Even if they do, they could/should still use VFAT on Linux which does have iocharset, codepage and what not.
And now case insensitivity even - much trickier than filename encoding.
Or NTFS maybe.
Posted Jan 17, 2020 16:51 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
There was also DOS (original and alternative) and ISO code pages, but they were rarely used.
Posted Jan 17, 2020 17:35 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (4 responses)
So how did Linux and Windows users exchange files in Russia? Not?
The question of which software layer should force users to make their encodings explicit is not obvious; I think we can all agree to disagree on where. If it's enforced "too low" it breaks too many use cases. Enforcing it "too high" is like not enforcing it at all. In any case I'm glad "something" is breaking stuff and forcing people to start cleaning up "bag of bytes" filename messes.
Posted Jan 17, 2020 17:49 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
At that time the most commonly used versions of Windows (95 and 98) also didn't support Unicode, adding to the problem.
This was mostly fixed by the late 2000s with the advent of UTF-8 and Windows versions with UCS-2 support.
However, I still have a historic CVS repo with KOI-8 names in it. So it's pretty clear that something like Mercurial needs to support these niche users.
Posted Jan 17, 2020 18:06 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (2 responses)
A cleanup flag day is IMHO the best trade off.
Posted Jan 18, 2020 22:40 UTC (Sat)
by togga (subscriber, #53103)
[Link] (1 responses)
Tradeoff for what? Giving an incalculable number of users problems for sticking with a broken language?
Posted Jan 18, 2020 22:48 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
s/language/encodings/
This entire debate summarized in less than 25 characters. My pleasure.
Posted Jan 16, 2020 13:49 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (2 responses)
If you no longer have any way to type them because, surprise, your environment has been UTF8 for the last decade or so, then you'll need an otherwise-transparent encoding that can be pasted (or generated manually via \Udcxx), and that doesn't clash with the rest of your environment (your locale is utf-8 – and that's unlikely to change). Surrogateescape works for that. It should even be copy+paste-able.
Posted Jan 22, 2020 13:01 UTC (Wed)
by niner (subscriber, #26151)
[Link] (1 responses)
The shell dutifully shows the name with surrogate characters:
Get that name from a directory listing, treating it like a string with a regex grep:
And just for fun: select+paste the file name in konsole:
Actually it looks like file names with "broken" encodings work pretty well. It's only Python 3 that stumbles:
nine@sphinx:~> python3
Posted Jan 22, 2020 23:00 UTC (Wed)
by Jandar (subscriber, #85683)
[Link]
'?' is a special character for pattern matching in sh.
$ echo foo >testfilexx
So maybe this wasn't a correct test to see if your filename worked with copy&paste.
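The ambiguity can be illustrated with Python's `fnmatch` module, which implements shell-style wildcards. This is only an analogy for what the shell does, and the second name is the surrogate-escaped form Python would use for the two 0xFF bytes.

```python
import fnmatch


def matching_names(pattern: str, names: list) -> list:
    """Return the names matched by a shell-style pattern (case-sensitive)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


if __name__ == "__main__":
    names = ["testfilexx", "testfile\udcff\udcff", "testfile.latin-1"]
    # '?' matches any single character, so both two-character suffixes match.
    print([ascii(n) for n in matching_names("testfile??", names)])
```

Since `testfile??` matches any name with two arbitrary characters after `testfile`, a successful `cat testfile??` does not show that the pasted bytes were the original ones.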
Posted Jan 16, 2020 16:34 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
Sure. The entire software world is going to fix all its filename bugs and assumptions just because some people name their files on some filesystems in funny ways. The programs that don't get fixed will die. That plan is so much simpler and easier than renaming files. /s
Oh, and all the developers who were repeatedly told to "sanitize your input" to protect themselves and the buggy programs above are all going to relax their checks a bit too.
Best of luck!
If you can't be happy, be at least realistic.
Posted Jan 16, 2020 21:49 UTC (Thu)
by roc (subscriber, #30627)
[Link]
But it is realistic to expect that common utilities handle arbitrary filenames correctly (the most common ones do). And it is realistic to expect that common languages and libraries make idiomatic filename-handling code handle arbitrary filenames correctly, because many do (including C, Go, Rust, Python2, and even some parts of Python3).
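One of the parts of Python 3 that does cope is the bytes flavour of the `os` functions: passing a `bytes` path makes `os.listdir()` return `bytes` names with no decoding step. A minimal sketch, assuming a Linux filesystem that accepts arbitrary bytes in names (the helper name is invented):

```python
import os
import tempfile


def write_list_read(raw_name: bytes, content: bytes):
    """Create a file with an arbitrary byte-string name in a temporary
    directory, list the directory as bytes, and read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)          # work with a bytes path throughout
        with open(os.path.join(tmp_b, raw_name), "wb") as f:
            f.write(content)
        names = os.listdir(tmp_b)         # bytes in, bytes out
        with open(os.path.join(tmp_b, names[0]), "rb") as f:
            return names, f.read()


if __name__ == "__main__":
    names, data = write_list_read(b"testfile\xff\xff", b"foo")
    print(names, data)
```

Because the names never become `str`, there is nothing to fail on until the program wants to show them to a human.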
Posted Jan 16, 2020 16:05 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (2 responses)
Posted Jan 16, 2020 16:36 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
Not caring about funky filenames because most other programs don't care either seems perfectly logical to me. You're confusing likelihood and logic.
Posted Jan 16, 2020 17:15 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
I'm very happy that Python catches funky filenames at a relatively low-level with a clear, generic, usual, googlable and stackoverflowable exception rather than with some obscure crash and/or security issue specific to each Python program. These references about "garbage-in, garbage-out" surrogates that I don't have time to read scare me, I wish there were a way to turn them off.
I do not claim Python made all the right Unicode decisions; I don't know which ones it got right. I bet not all of them, nothing's perfect. This comment is only about funky file names.
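There is no switch for this that I know of, but a program can refuse surrogate-escaped names at its own boundary. A sketch of such a policy, with a made-up function name:

```python
def require_clean_name(name: str) -> str:
    """Return name unchanged if it is valid Unicode text; raise ValueError
    if it carries surrogate-escaped (undecodable) bytes."""
    try:
        name.encode("utf-8")  # strict: lone surrogates are rejected
    except UnicodeEncodeError as exc:
        raise ValueError(f"filename is not valid UTF-8: {name!r}") from exc
    return name


if __name__ == "__main__":
    print(require_clean_name("notes.txt"))
    try:
        require_clean_name(b"testfile\xff\xff".decode("utf-8", "surrogateescape"))
    except ValueError as err:
        print("rejected:", err)
```

Applied to every name coming out of `sys.argv` or `os.listdir()`, this turns the late `UnicodeEncodeError` into an early, explicit rejection.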
Posted Jan 16, 2020 17:20 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
"were"? https://lwn.net/Articles/784041/ Case-insensitive ext4
Now _that_ (case sensitivity) really never belonged to a kernel IMHO. Realism?
https://lwn.net/Articles/325304/ "Wheeler: Fixing Unix/Linux/POSIX Filenames" 2009
So if you use the right languages "crashing and burning" *can* be avoided without the developer even having to work hard.
`cp` and many other utilities handle non-Unicode filenames correctly.
We've had a couple of decades to try to enforce that filenames are valid UTF8 and I don't know of any Linux filesystem that does.
Like people using ShiftJIS and writing file names in Japanese?
When I want to make sense of the filename, I need to transcode it to UTF-8, but when I just want to manipulate the contents, I don't actually care - even if I translated it to UTF-8, all I'd get is a block of Cyrillic characters that I can only decode letter-by-letter at a very slow rate, so basically a black box. I might as well keep the bag of bytes in its original form…
These files could be created on a KOI-8-only partition and their names automatically converted when copied out of it?
KOI-8 was the encoding widely used in Linux for Russian language. Win1251 was used in Windows.
Using codepage converters. But it was so bad that by the early 2000s all the browsers supported automatic encoding detection, using frequency analysis to guess the code page.
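A toy version of such a guess, not what any browser actually shipped: KOI8-R and Windows-1251 place lowercase and uppercase Cyrillic in swapped byte ranges, so ordinary Russian text decoded with the wrong one of the two comes out almost entirely in capitals. The function names and the scoring rule below are assumptions made for illustration.

```python
def count_lowercase_cyrillic(text: str) -> int:
    """Count lowercase Russian letters (а..я plus ё)."""
    return sum(1 for ch in text if "а" <= ch <= "я" or ch == "ё")


def guess_encoding(raw: bytes, candidates=("koi8_r", "cp1251")) -> str:
    """Pick the candidate whose decoding yields the most lowercase
    Cyrillic letters (crude stand-in for real frequency analysis)."""
    def score(encoding: str) -> int:
        try:
            return count_lowercase_cyrillic(raw.decode(encoding))
        except UnicodeDecodeError:
            return -1  # bytes not valid in this encoding at all
    return max(candidates, key=score)


if __name__ == "__main__":
    sample = "привет мир"
    print(guess_encoding(sample.encode("koi8_r")))
    print(guess_encoding(sample.encode("cp1251")))
```

A heuristic like this needs enough text to work with, which is one reason it helps far less with short filenames than with web pages.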
nine@sphinx:~> perl6 -e 'spurt ("testfile".encode ~ Buf.new(0xff, 0xff)).decode("utf8-c8"), "foo"'
nine@sphinx:~> ll testfile*
-rw-r--r-- 1 nine users 6 17. Sep 2014 testfile.latin-1
-rw-r--r-- 1 nine users 3 22. Jän 13:42 testfile??
nine@sphinx:~> perl6 -e 'dir(".").grep(/testfile/).say'
("testfile.latin-1".IO "testfilexFFxFF".IO)
nine@sphinx:~> cat testfile??
foo
Python 3.7.3 (default, Apr 09 2019, 05:18:21) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for f in [f for f in os.listdir(".") if "testfile" in f]: print(f)
...
testfile.latin-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
> foo
$ cat testfile??
foo