Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 16:21 UTC (Tue) by mb (subscriber, #50428)
In reply to: Watson: Launchpad now runs on Python 3 by martin.langhoff
Parent article: Watson: Launchpad now runs on Python 3

>It still blows up in random ways.

What does that mean?
Do you have an example?

Watson: Launchpad now runs on Python 3

Posted Aug 3, 2021 21:32 UTC (Tue) by martin.langhoff (guest, #61417) [Link] (8 responses)

> What does that mean?
> Do you have an example?

Just today, some XML files that seem to be UTF-16 throw a UnicodeDecodeError: utf-16-le can't decode bytes ... illegal encoding.

There will be an explanation for this -- an infrequent codepath, maybe some open() or decode() I haven't hunted down yet, because Python 3 needs significantly more handholding when opening a file than Py2.
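For illustration only, since the actual cause isn't known here: a minimal sketch of one way this class of error can show up. The sample document and the byte string `bad_units` are made up. Handing the raw bytes to the XML parser lets it use the byte-order mark and the XML declaration, whereas decoding by hand with a guessed codec fails as soon as the bytes are not valid UTF-16.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Python's "utf-16" codec prepends a byte-order mark.
xml_bytes = '<?xml version="1.0" encoding="UTF-16"?><note>caf\u00e9</note>'.encode("utf-16")

# Passing bytes (not str) lets the parser work out the encoding itself.
root = ET.fromstring(xml_bytes)
print(root.text)

# Made-up input: a high surrogate code unit (D800) followed by "A" is not valid UTF-16,
# so a strict decode with a hand-picked codec raises UnicodeDecodeError.
bad_units = b"\x00\xd8A\x00"
try:
    bad_units.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("strict decode failed:", exc.reason)
```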

surrogateescape

Posted Aug 4, 2021 1:35 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (7 responses)

I very much doubt that "surrogateescape" is what you actually need here.

The idea of "surrogateescape" is that hey, if we could smuggle gibberish through the decoder/encoder pipeline, and the rest of our code never looks at the resulting nonsense, then we can have what we "want" of just pretending it's ASCII. This is a bad idea, and for security reasons it blows up in some cases, so it can't achieve this false panacea anyway.

Without knowing way too many details about your application it's impossible to be certain, but I'd suggest "replace". Here are the consequences of "replace" so you can assess whether they're closer to what you need:

1. Any time your guess doesn't work and the decoder is looking at stuff that is not in fact Unicode text in whatever encoding you hoped for, sequences that don't decode turn into the Unicode code point U+FFFD, a code point reserved specifically for this purpose. This code point is definitely not: a letter, a digit, punctuation, whitespace, part of a variable name, an ASCII control character, or anything else, but it is definitely itself. There's an excellent chance your code needs no extra work to cope with this, which is nice. If it does need work, this code point can (but usually doesn't) exist in Unicode data anyway, so you likely should fix that. No exceptions, no decode errors, just Unicode data, some of which may be U+FFFD which, again, was already valid Unicode.

2. If your users/customers see this replacement character U+FFFD it looks like this: � and that's pretty obviously not what they meant. Or I guess, in a few cases maybe it's exactly what they meant. Maybe they're writing a Unicode decoder. But unlike escaping and other nonsense it's very obvious what's going on here. You can even Google for it.

3. On output, in the worst case where you were sure the output should be ASCII, Python has no choice but to choose ? because that's the best it can do. The good news is that humans can generally realise that function_???????_??? means something went wrong. The better news is that you probably don't see this too often when you're outputting ASCII. Any time your output is some flavour of Unicode, Python can emit an actual U+FFFD.
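A minimal sketch of the three points above, using a made-up byte string (Latin-1 data wrongly assumed to be UTF-8). Only the standard `errors="replace"` handler of `bytes.decode` and `str.encode` is used.

```python
# Hypothetical input: Latin-1 bytes that the program guessed were UTF-8.
data = b"caf\xe9 au lait"

# Point 1: the undecodable byte becomes U+FFFD, no exception is raised.
text = data.decode("utf-8", errors="replace")
print(repr(text))

# Point 3: ASCII output cannot represent U+FFFD, so the encoder falls back to "?".
ascii_out = text.encode("ascii", errors="replace")
print(ascii_out)

# Any Unicode output encoding can emit the real U+FFFD.
utf8_out = text.encode("utf-8")
print(utf8_out)
```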

Now, personally I think "surrogateescape" is as much Python's fault as yours. A programming language should encourage its users not to set themselves on fire. Providing "I'm sure I know what I'm doing" features is near guaranteed to be contrary to that purpose. If you're C++ then you have some excuse, but I don't see how Python does. PEP 383 should not have been accepted.

surrogateescape

Posted Aug 4, 2021 2:07 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (6 responses)

The sole (intended) purpose of surrogateescape is to work around the "feature" of Unix where filenames can be any arbitrary string of bytes, but at the same time they are usually UTF-8 (or, on older systems, some ASCII-superset 8-bit encoding like ISO-8859-1). If you want to enumerate a directory and open the files in it, you need to be able to smuggle mis-encoded garbage through your strings and have exactly the same mis-encoded garbage come out the other side, because if the byte sequences don't match, the filesystem won't be able to find the file you wanted.
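As a rough sketch of that round trip, with a made-up filename and the codec spelled out explicitly rather than relying on the platform's filesystem encoding: the undecodable byte is carried as a lone surrogate and comes back out byte-for-byte, but the resulting string cannot be encoded strictly.

```python
# Hypothetical filename containing a byte that is not valid UTF-8.
raw = b"report-\xff.txt"

# The bad byte 0xFF is smuggled through as the lone surrogate U+DCFF.
smuggled = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = smuggled.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict encode (e.g. when writing the name to a log) refuses the lone surrogate.
try:
    smuggled.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print("strict encode failed:", exc.reason)
```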

Of course, the *correct* way to do that is to use a high-level library like pathlib instead, so that you don't have to fiddle around with individual strings in the first place, and it can do all the ugly string magic that Unix requires. So in practice, nobody has any business using surrogateescape, unless you're implementing a pathlib-like library and know exactly what you are doing.

Alternatively, the correct way is to keep your paths as bytes, but then you can't do most string-like operations on them, which is painful. Also, the Windows API will not give you filenames in any format other than UTF-16, unless you want to use some random 8-bit encoding that Microsoft confusingly refers to as "ANSI," but which in fact could be almost any Windows code page depending on the user's locale. Faced with two bad options, Python chose UTF-16, which means it is now in the position of having to convert those UTF-16 strings (which can also contain invalid code points and IIRC even mismatched surrogate pairs!) in such a way that your code (which you wrote to run correctly on a Unix platform) doesn't break on them. Hence, "do nothing and return raw bytes" was never really a good option at the language level, for a language that wants to be cross-platform.
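A small sketch of the "keep your paths as bytes" option, assuming a Unix-like system. `list_raw_names` is a hypothetical helper, and the file name is an arbitrary example. Passing a bytes directory to `os.listdir` makes it return bytes names, which can be handed straight back to `open`.

```python
import os
import tempfile


def list_raw_names(directory: bytes) -> list:
    """Hypothetical helper: return directory entries as bytes, undecoded."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create one example file, addressed by a bytes path.
    with open(os.path.join(tmp_bytes, b"example.txt"), "wb") as f:
        f.write(b"hello")
    # Enumerate and reopen without ever decoding the names.
    for name in list_raw_names(tmp_bytes):
        with open(os.path.join(tmp_bytes, name), "rb") as f:
            print(name, f.read())
```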

surrogateescape

Posted Aug 4, 2021 17:08 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (5 responses)

> If you want to enumerate a directory and open the files in it, you need to be able to smuggle mis-encoded garbage through your strings and have exactly the same mis-encoded garbage come out the other side, because if the byte sequences don't match, the filesystem won't be able to find the file you wanted.

The better answer here would be to treat filenames as opaque blobs of unstructured data, and perform any necessary conversion at the UI level—without surrogates. The same goes for other interfaces such as the argument list and environment variables where the data is not guaranteed to be UTF-8. It's not "mis-encoded garbage", you're just applying an inappropriate decoding. There is no good reason to attempt UTF-8 decoding while enumerating a directory when you're just getting data from one filesystem API and passing it to another without presenting any of this to the user.

If you do need to display the data to the user, or otherwise treat it as human-readable text e.g. for collation, *at that point* attempt to decode it as UTF-8 (or whatever the actual locale is set to) and substitute the U+FFFD replacement character for anything that can't be decoded. This does imply that there is probably no way to name certain files through, say, a GUI text box, but one should at least be able to select them from a file chooser if the program is properly designed.

Not being able to do string operations on filenames isn't really much of a handicap. This is like complaining that you can't use strcmp() to compare arbitrary binary data; filenames aren't strings, so it follows that string operations are not well-defined on filenames. Apart from concatenation, which works in (almost) every encoding, the only other operations you need are platform-specific anyway: joining with a platform-specific path separator, splitting on the same separator (or platform-specific alternatives, like '\' and '/' on Windows), and pattern matching. These operations are better handled by something like pathlib than by string functions. If you *really* need to treat a filename as a string for some reason (e.g. to perform a search) then go ahead and convert to UTF-8, but with the understanding that this conversion is lossy and converting the result back won't necessarily give you the same filename.
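To make that concrete, a sketch with made-up names: joining, splitting, and pattern matching all work on bytes through `os.path` and `fnmatch`, and the lossy step only happens when a display string is requested.

```python
import fnmatch
import os

# Hypothetical directory and filename; the name is not valid UTF-8.
directory = b"incoming"
name = b"report-\xff.txt"

# Joining and splitting use the platform separator and never decode the bytes.
joined = os.path.join(directory, name)
head, tail = os.path.split(joined)

# Pattern matching also accepts bytes, as long as the pattern is bytes too.
matches = fnmatch.fnmatchcase(tail, b"*.txt")

# Conversion for display is lossy: encoding the result does not give the name back.
display = tail.decode("utf-8", errors="replace")
back = display.encode("utf-8")
print(display, back == name)
```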

surrogateescape

Posted Aug 4, 2021 17:45 UTC (Wed) by Wol (subscriber, #4433) [Link]

> If you *really* need to treat a filename as a string for some reason (e.g. to perform a search) then go ahead and convert to UTF-8, but with the understanding that this conversion is lossy and converting the result back won't necessarily give you the same filename.

So do what DataBASIC does (which can get confusing) and have multiple representations of the data within a single variable. The canonical form is always a string, but it can be a file id, a number, etc etc so the variable is internally a structure. You can compare against the utf-8 version to check whether it's what you're looking for, and then go back to the original when you actually want to access the file.
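This is not how DataBASIC does it, just a loose Python sketch of the same idea: a hypothetical `FileName` type keeps the original bytes and derives a lossy text view for searching, and `find` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileName:
    """Hypothetical wrapper: the raw bytes are the source of truth."""
    raw: bytes

    @property
    def text(self) -> str:
        # Lossy view, only for comparison and display.
        return self.raw.decode("utf-8", errors="replace")


def find(names, needle: str):
    """Search on the text view, return the objects that still carry the raw name."""
    return [n for n in names if needle in n.text]


names = [FileName(b"caf\xc3\xa9.txt"), FileName(b"log-\xff.txt")]
hits = find(names, "caf\u00e9")
print([h.raw for h in hits])
```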

Cheers,
Wol

surrogateescape

Posted Aug 4, 2021 18:40 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (3 responses)

> This does imply that there is probably no way to name certain files through, say, a GUI text box, but one should at least be able to select them from a file chooser if the program is properly designed.

You can see the fruits of this "do it for the UI" munging, and of then losing track of reality, in `explorer.exe`. First, make a file named `NUL` available in some way (usually via a Samba share hosted on Linux). Explorer doesn't like this, so it gives it some other mangled name (let's say `zxcv` for example). Whatever it is, make another file with *that* name in the share on the host. Explorer will render these as the same filename and deleting *either one* will delete the file actually named `zxcv`, after which you can delete the `NUL` file by selection. I have no idea what the file open dialog ends up doing though. Or what happens if one is a directory and the other is a file, for that matter.

surrogateescape

Posted Aug 4, 2021 19:20 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

You don't need a Samba share to get into that sort of trouble. Just write a Python (or any language) program that creates a file with a name like the following:

\\?\C:\Path\To\File

It has to be an absolute path, and it has to start with that funny \\?\ prefix. This will let you smuggle almost any string under the sun through the vast majority of Windows's sanity checks, as long as it's valid UTF-16 ("Unicode" to use Microsoft's confusing terminology) and the underlying volume is NTFS (and not FAT or some other legacy filesystem). It bypasses the MAX_PATH check, and purportedly even allows you to use names like "." or ".." without path resolution interpreting them as current/parent directory.

Of course, nothing in the Microsoft ecosystem can handle such files correctly. Explorer is completely lost, most programs will get confused and tell you the file does not exist, etc.

surrogateescape

Posted Aug 5, 2021 21:18 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

NTFS doesn't require UTF-16 names. It hardly could, it was invented before UTF-16. Like Linux, file names are a bunch of meaningless symbols, in this case 16-bit rather than 8-bit, with certain rules about which symbols can be in names.

The symbols are in some sense UTF-16 code units, but the actual name needn't be a UTF-16 string. So the four u16s DFFF DFFF D800 0041 make a perfectly reasonable name for NTFS, but obviously that's not a valid UTF-16 string.
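The same four code units can be checked from Python. This sketch only exercises the codecs, not NTFS itself: a strict UTF-16 decode rejects the sequence, while the standard `surrogatepass` error handler carries the lone surrogates through unchanged.

```python
import struct

# The four 16-bit code units from the example above, packed little-endian.
code_units = (0xDFFF, 0xDFFF, 0xD800, 0x0041)
raw_name = struct.pack("<4H", *code_units)

# Strict decoding fails: the surrogates are unpaired.
try:
    raw_name.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" keeps each lone surrogate as its own code point.
loose = raw_name.decode("utf-16-le", errors="surrogatepass")
round_trip = loose.encode("utf-16-le", errors="surrogatepass")
print(strict_ok, len(loose), round_trip == raw_name)
```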

Your Windows UI won't like that very much but the filesystem and core OS services think that's a perfectly reasonable name for a file.

This is all getting far off topic. I was rather hoping martin.langhoff might have feedback on my suggestion instead :(

surrogateescape

Posted Aug 6, 2021 0:18 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

> Explorer will render these as the same filename and deleting *either one* will delete the file actually named `zxcv` after which you can delete the `NUL` file by selection.

Yes, that's where the "if the program is properly designed" part comes into play. There would be situations where different filenames were rendered as the same UTF-8—you can get this even with valid UTF-8 if you're not enforcing a particular normalization at the filesystem level—but the file list should be keeping track of the original opaque filename for each file in the list so that when you select a file (distinguished e.g. by modification date) and instruct the tool to delete it the tool deletes the correct file. It shouldn't take the converted UTF-8 name which was only suited for presentation and delete something unrelated which just happens to have a similar UTF-8 version of its name.
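A minimal sketch of that design, with hypothetical names throughout (`Entry`, `build_listing`, `delete_selected`). The label is only for presentation. The delete operation takes a position in the list and acts on the stored raw name, so two files that render identically stay distinct.

```python
import os
from typing import Callable, List, NamedTuple


class Entry(NamedTuple):
    label: str  # lossy text, shown to the user
    raw: bytes  # opaque name, passed back to the filesystem


def build_listing(raw_names: List[bytes]) -> List[Entry]:
    """Pair each raw name with a display label. Labels may collide."""
    return [Entry(r.decode("utf-8", errors="replace"), r) for r in raw_names]


def delete_selected(listing: List[Entry], index: int,
                    remove: Callable[[bytes], None] = os.remove) -> None:
    """Delete by position in the listing, never by label."""
    remove(listing[index].raw)


# Two made-up names that both render as a single U+FFFD.
listing = build_listing([b"\xff", b"\xfe"])
deleted = []
delete_selected(listing, 1, remove=deleted.append)  # record instead of deleting
print(listing[0].label == listing[1].label, deleted)
```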