Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 23:08 UTC (Mon) by tialaramex (subscriber, #21167)
In reply to: Python 3, ASCII, and UTF-8 by Cyberax
Parent article: Python 3, ASCII, and UTF-8

IMNSHO this "pass it through and pretend that's fine" approach isn't avoiding working with text at all.

Insisting on only processing OIDs, or telephone numbers, or EANs is avoiding working with text. If you try hard enough then insisting on working only with DNS A-labels, or email addresses, might count too. And if you _really_ squint maybe you can try this argument with even filenames [ but you might regret that ], or database schemas.

But that's still so very far from most things. If you're really staying away from "text", Python 3 is fine. OIDs work the same, and you can be annoyed when (in both Python 2.x and 3.x) Python thinks U-labels are the "right" way to think about DNS names even though DNS makes it very clear that this is a Bad Idea™. The problems you've run into exist only because you persist in working with text even while saying that's not what you're doing. That's how we got into this mess in the first place.
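
For anyone who hasn't met the two label forms, here is a small sketch using the standard library's built-in "idna" codec (which implements the older IDNA 2003 rules). The host name is made up for illustration; the point is only that the A-label is the plain-ASCII form that DNS itself carries, while the U-label is the human-facing spelling.

```python
# U-label: the Unicode spelling a person would type (hypothetical host name).
u_label = "bücher.example"

# A-label: the ASCII-only "xn--" form that is actually stored and queried in DNS.
a_label = u_label.encode("idna")

# Decoding the A-label gives back the U-label for display purposes.
round_trip = a_label.decode("idna")

print(a_label)
print(round_trip)
```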

There are a bunch of hacks in existence to try to smash Latin text in particular into a form that suits machines better. This can make it seem as though "ASCII just works" but er, no, at most it has meant that Latin-consumed cultures have been brutalised into putting up with the consequences. Most English speakers will grudgingly tolerate a machine insisting they "Add 1 coin(s) to continue" but that's not actually how English works. W and i aren't actually supposed to be the same width, but we've made people sigh and roll their eyes when a machine insists they should be for the convenience of the programmer.
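
As a rough illustration of the "coin(s)" point, and only for English, the standard library's `gettext.ngettext` can pick between a singular and a plural template. The function name `coin_prompt` and the message text are invented for this sketch; other languages need their own plural rules supplied via a translation catalogue.

```python
import gettext


def coin_prompt(n: int) -> str:
    # ngettext chooses the singular or plural template. With no translation
    # catalogue installed it falls back to the English rule (singular iff n == 1).
    template = gettext.ngettext(
        "Add {n} coin to continue",
        "Add {n} coins to continue",
        n,
    )
    return template.format(n=n)


print(coin_prompt(1))
print(coin_prompt(3))
```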

Just as we should almost always choose garbage collection over manual memory management, because it's "pretty good" and doing better than "pretty good" is so annoyingly difficult that it's better to focus on something else, so we should almost always use languages that do something sensible with Unicode rather than just shrugging and crossing our fingers that treating everything as raw bytes will work out OK. Are they perfect? No. Are they better than you were going to manage in the five minutes you had spare? Yes.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 23:26 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

> IMNSHO this "pass it through and pretend that's fine" approach isn't avoiding working with text at all.
Yes, it is. In 99% of cases you need not go beyond ASCII (which is itself a subset of UTF-8) to parse structural data.
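
A minimal sketch of what that can look like in Python 3, assuming a made-up "Name: value" line format: the structure is found by looking only at ASCII bytes, and the payload is passed through untouched even when it is not valid UTF-8. The function name `parse_header` is hypothetical.

```python
from typing import Tuple


def parse_header(line: bytes) -> Tuple[bytes, bytes]:
    """Split b'Name: value' on the first colon without decoding anything."""
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("not a header line")
    # lower() and strip() on bytes only touch ASCII letters and whitespace,
    # so any non-ASCII bytes in the value survive unchanged.
    return name.strip().lower(), value.strip()


# The value mixes valid UTF-8 (caf\xc3\xa9) with bytes that are not valid UTF-8.
name, value = parse_header(b"Subject: caf\xc3\xa9 \xff\xfe\n")
print(name, value)
```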

And when you do, quite often you need to use additional rules to treat the text correctly. For example, filesystems are not guaranteed to have names that are valid UTF-8 sequences. DNS IDNs have additional restrictions and so on.
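
To make the filename case concrete, here is a sketch with an invented name containing a byte that is not valid UTF-8. It uses the "surrogateescape" error handler explicitly so the result does not depend on the platform; on POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler with the filesystem encoding.

```python
raw = b"report-\xff.txt"  # hypothetical file name; \xff is never valid UTF-8

# Strict decoding refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the bad byte through as a lone surrogate (U+DCFF).
name = raw.decode("utf-8", "surrogateescape")

# The original bytes can be recovered, but only with the same error handler.
restored = name.encode("utf-8", "surrogateescape")

# Treating the result as ordinary text fails as soon as it is encoded strictly.
try:
    name.encode("utf-8")
    text_ok = True
except UnicodeEncodeError:
    text_ok = False

print(strict_ok, repr(name), restored == raw, text_ok)
```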

That's why the code written to work with "unicode" is most often incorrect. Its authors simply slather "unicode" stuff without thinking for a second about encodings, combining characters or locales and hope that it works.
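
Combining characters are an easy place to see this go wrong. A small sketch with the standard `unicodedata` module: two strings that render identically compare unequal and have different lengths until they are normalised to the same form.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

naive_equal = composed == decomposed
lengths = (len(composed), len(decomposed))
utf8_sizes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# After NFC normalisation the two spellings become the same string.
normalised_equal = unicodedata.normalize("NFC", decomposed) == composed

print(naive_equal, lengths, utf8_sizes, normalised_equal)
```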

Again, it's perfectly OK to have strings in the language that are guaranteed to be valid UTF-8/UCS-2/whatever sequences at all times. What is absolutely NOT OK is pretending that EVERYTHING is text by default. That's why Python3 had to grow a veritable menagerie of workarounds, starting with PathLike and ending with surrogateescape handling for stdin.

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 5:15 UTC (Thu) by rixed (guest, #120242)

As programmers, we should care about the relative widths of letters and proper capitalisation rules according to locales when we use computers for typography. But that is only a tiny fraction of what we use computers for. Other than that, all we need are symbols. Having a fundamentally symbolic process, such as parsing a data structure or logging a piece of information, fail because of typography sounds very questionable.

I wish it were easier to distinguish between strings destined to be rendered for a user to see and vectors of bytes that happen to have a printable representation, which is what most data, and code itself, are. Restricting user interaction to HTML displayed in a web browser and giving up the terminal for UI is a trend going in the right direction, from this point of view.