
Python 3, ASCII, and UTF-8

By Jake Edge
December 17, 2017

The dreaded UnicodeDecodeError exception is one of the signature "features" of Python 3. It is raised when the language encounters a byte sequence that it cannot decode into a string; strictly treating strings differently from arrays of byte values was something that came with Python 3. Two Python Enhancement Proposals (PEPs) bound for Python 3.7 look toward reducing those errors (and the related UnicodeEncodeError) for environments where they are prevalent—and often unexpected.

Two related problems are being addressed by PEP 538 ("Coercing the legacy C locale to a UTF-8 based locale") and PEP 540 ("Add a new UTF-8 Mode"). The problems stem from the fact that locales are often incorrectly specified and that the default locale (the "POSIX" or "C" locale) specifies an ASCII encoding, which is often not what users actually want. Over time, more and more programs and developers are using UTF-8 and are expecting things to "just work".

PEP 538

PEP 538 was first posted by its author, Nick Coghlan, back in January 2017 on the Python Linux special interest group mailing list; after a stop on python-ideas, it made its way to the python-dev mailing list in May. Naoki Inada had been designated as the "BDFL-delegate" for the PEP (and for PEP 540); a BDFL-delegate is sometimes chosen to make the decision for PEPs that Guido van Rossum, Python's benevolent dictator for life (BDFL), doesn't have the time, interest, or necessary background to pass judgment on. Inada accepted PEP 538 for inclusion in Python 3.7 back in May.

As demonstrated in the voluminous text of PEP 538, many container images do not set the locale of the distribution, which means that Python defaults to the POSIX locale, thus to ASCII. This is unexpected behavior. Developers may well find that their local system handles UTF-8 just fine:

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

However, running that in a container on the same system (using a generic distribution container image) may fail with a UnicodeEncodeError. The LC_CTYPE locale environment variable can fix the problem, but it must be set to C.UTF-8 (or variants that are available on other Unix platforms) inside the container. The PEP notes that new application distribution formats (e.g. Flatpak, Snappy) may suffer from similar problems.
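The failure can be reproduced locally by forcing the POSIX locale; a rough sketch with a pre-3.7 interpreter (error output abbreviated):

    $ LC_ALL=C python3.6 -c 'print("ℙƴ☂ℌøἤ")'
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
    $ LC_ALL=C.UTF-8 python3.6 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ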

So Python 3.7 and beyond will determine if the POSIX/C locale is active at startup time and will switch its LC_CTYPE to an appropriate UTF-8 setting if one is available. It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8. That setting will be inherited by any subprocesses or other components that get run from the initial interpreter.

This change could, of course, cause problems for those who are actually expecting (and wanting) to get ASCII-only behavior. The PEP changes the interpreter to emit a warning noting that it has changed the locale whenever it does so. It will also emit a warning if it does not change the locale because the (legacy) POSIX/C locale has been explicitly set. These warnings do not go through the regular warnings module, so that -Werror (which turns warnings into errors) will not cause the program to exit; it was found during testing that doing so led to various problems.

This idea rests on two fundamental assumptions that are laid out in the PEP:

  • in desktop application use cases, the process locale will already be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer
  • in network service development use cases (especially those based on Linux containers), the process locale may not be configured at all, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does

Beyond that, locale coercion will default to the surrogateescape error handler for sys.stdin and sys.stdout for the new coerced UTF-8 locales. As described in PEP 383, surrogateescape effectively turns strings containing characters that cannot be decoded using the current encoding into their equivalent byte values, rather than raising a decoding exception as the strict error handler would.
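A small Python sketch of how the handler round-trips otherwise-undecodable input:

    # b'\xff' is invalid UTF-8; the strict handler would raise UnicodeDecodeError
    data = b'abc\xff'
    text = data.decode('utf-8', errors='surrogateescape')
    print(ascii(text))  # 'abc\udcff': the bad byte became a lone surrogate
    # encoding with the same handler restores the original bytes exactly
    assert text.encode('utf-8', errors='surrogateescape') == data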

The PEP is meant to make Python do what users generally expect. The warnings should help alert users (and distributors and the like) if the assumptions are not correct in their environment.

PEP 540

The second PEP has only recently been accepted for 3.7; though it began around the same time as the first, it languished for some time. It was brought back up by its author, Victor Stinner, in early December 2017. But that version was rather large and unwieldy, leading Van Rossum to say:

I am very worried about this long and rambling PEP, and I propose that it not be accepted without a major rewrite to focus on clarity of the specification. The "Unicode just works" summary is more a wish than a proper summary of the PEP.

Stinner acknowledged that and posted a much shorter version a day later. Inada expressed interest in accepting it, but there were some things that still needed to be worked out.

The basic idea is complementary to PEP 538. In fact, there are fairly subtle differences between the two, which has led to some confusion along the way. PEP 540 creates a "UTF-8 mode" that is decoupled from the locale of the system. When UTF-8 mode is active, the interpreter acts much the same as if it had been coerced into a new locale (à la PEP 538), except that it will not export those changes to the environment. Thus, subprocesses will not be affected.

In some ways, UTF-8 mode is more far-reaching than locale coercion. For environments where there are either no locales at all or no suitable locale is found (i.e. no UTF-8 support), locale coercion will not work, but UTF-8 mode will be available to fill the gap.

UTF-8 mode will be disabled by default, unless the POSIX/C locale is active. It can be enabled by way of the "-X utf8" command line option or by setting PYTHONUTF8=1 in the environment (the latter will affect subprocesses, of course). Since the POSIX locale has ASCII encoding, UTF-8 (which is ASCII compatible at some level) is seen as a better encoding choice, much like with PEP 538.
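Either mechanism should yield UTF-8 regardless of the locale; a sketch of the expected 3.7 behavior:

    $ python3.7 -X utf8 -c 'import sys; print(sys.getfilesystemencoding())'
    utf-8
    $ PYTHONUTF8=1 python3.7 -c 'import sys; print(sys.stdout.encoding)'
    utf-8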

The confusion between the two PEPs led to both referring to the other in order to make it clear why both exist and why both are needed. PEP 538 explains the main difference:

PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other locale-aware components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like those for Go, Node.js (V8), and Rust.

Beyond that, PEP 538 goes into an example of the Python read-eval-print loop (REPL), which uses GNU Readline. In order for Readline to properly handle UTF-8 input, it must have the locale propagated in LC_CTYPE, as locale coercion does. On the other hand, some situations may not want to push the locale on subprocesses, as PEP 540 notes:

The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python's encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings.

The two earliest versions of PEP 540 (it went through four versions on the way to acceptance) proposed changing to the surrogateescape error handler for files returned by open(), but Inada was concerned about that. A common mistake that new Python programmers make is to open a binary file without using the "b" option to open(). Currently, that would typically generate lots of decoding errors, which would lead the programmer to spot their error. Using surrogateescape would silently mask the problem; it is also inconsistent with the locale coercion behavior, which does not change open() at all. After some discussion, Stinner changed the PEP to keep the strict error handler for files returned by open().
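The mistake in question looks something like this (a sketch; the file name is hypothetical):

    # Wrong: text mode on binary data; strict decoding fails loudly, which
    # is exactly what points the programmer at the missing "b"
    data = open('photo.jpg').read()         # raises UnicodeDecodeError
    # Right: binary mode returns the raw bytes untouched
    data = open('photo.jpg', 'rb').read()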

So Python 3.7 will work better than previous versions with respect to Unicode character handling. While the switch to Unicode was one of the most significant features of Python 3, there have been some unintended and unexpected hiccups along the way. In addition, the computing landscape has changed in some fundamental ways over that time, especially with the advent of containers. Strict ASCII was once a reasonable choice in many environments, but that has been steadily changing, even before Python 3 was released just over nine years ago.




Python 3, ASCII, and UTF-8

Posted Dec 17, 2017 18:17 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

Speaking of Unicode in Python, this: https://bugs.python.org/issue28414 was discussed here on LWN and eventually resulted in this:
https://github.com/python/cpython/pull/3010

[ Summary: Python doesn't do IDN properly, somehow this has infected its implementation of subject dnsName checks, even though the _whole point_ of IDN is that it lives in the presentation layer and shouldn't be gumming up low-level stuff - so if you use TLS/SSL with IDNs in Python what you get is probably not what you wanted ]

It is not clear to me (as someone who can program Python but doesn't work on its core) whether this has or will land. There's a line from November saying "I'll make sure a complete patch lands in 3.7" but hey, they managed to ship Python 3.0 without fixing all the "must have for Python 3" items so like President Bush said, fool me twice, won't get fooled again.

Python 3, ASCII, and UTF-8

Posted Dec 17, 2017 21:13 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (74 responses)

Unicode in Python... The gift that keeps on giving.

Meanwhile, Python 2.7 is a great improvement on its successor. Strings are simply treated as sequences of bytes and this works just fine for 99% of applications.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 7:26 UTC (Mon) by andrewsh (subscriber, #71043) [Link] (23 responses)

Ňø, îť đöéšń’ŧ.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 8:36 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (22 responses)

cyberax@20c9d0485705:~$ echo "Ňø, îť đöéšń’ŧ." | python2.7 -c "import sys; print sys.stdin.read();"
Ňø, îť đöéšń’ŧ.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 8:49 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

Does the LWN comment box code use Python 2 or 3?

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 15:31 UTC (Mon) by corbet (editor, #1) [Link]

LWN is still Python 2. The last dependency just moved forward a month or two ago, so all that's needed is a bit of time to make the switch. That "bit of time" part can be surprisingly hard, though.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 9:42 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (19 responses)

[njl@totoro ~]$ echo "Ňø, îť đöéšń’ŧ." | python2.7 -c "import sys; print(sys.stdin.read().upper());"
Ňø, îť đöéšń’ŧ.

(in contrast, since I have a UTF-8 locale)

[njl@totoro ~]$ echo "Ňø, îť đöéšń’ŧ." | python3.6 -c "import sys; print(sys.stdin.read().upper());"
ŇØ, ÎŤ ĐÖÉŠŃ’Ŧ.

No doubt Cyberax has a great future ahead of them in Enterprise Security software or some other sector where it doesn't matter whether your software actually works so long as you can do a demo that looks passable, once the poor customer has handed over their money you can ignore them until it's time to sell them an "upgrade".

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 10:41 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (18 responses)

How about this:

> cyberax@20c9d0485705:~$ echo "Is this correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
> IS THIS CORRECT?

Whoops. You've caused a murder: https://gizmodo.com/382026/a-cellphones-missing-dot-kills...

See, the letter "I" is actually a Turkish dotless-i and so I expect "i" to capitalize to "İ". I can also give an example with German "ß".

So please, spare me this nonsense. Python's "Unicode" implementation is simply misleading bullshit which lulls developers into thinking that their naïve code works correctly everywhere. It doesn't.

Meanwhile, Python 2.7 has the correct behavior - it doesn't try to do anything outside of the well-defined ASCII range. If you really really need questionable Unicode character transformations, you can do that in Py2.7 as well, but it won't do it silently.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 11:58 UTC (Mon) by cortana (subscriber, #24596) [Link] (1 responses)

> See, the letter "I" is actually a Turkish dotless-i and so I expect "i" to capitalize to "İ"

Perhaps I'm missing something, but this does not appear to be the case:

$ echo "Is this correct?" | hd
00000000  49 73 20 74 68 69 73 20  63 6f 72 72 65 63 74 3f  |Is this correct?|
00000010  0a                                                |.|
00000011

> So please, spare me this nonsense. Python's "Unicode" implementation is simply misleading bullshit which lulls developers into thinking that their naïve code works correctly everywhere. It doesn't.

It doesn't matter whether the developer is using Python 2.7, Python 3 or another language: their code must be tested to ensure that it actually works. If it is not tested then the business has fucked up and this needs to be rectified.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 18:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Perhaps I'm missing something, but this does not appear to be the case:
Nope. That's the whole deal - Turkish uses latin "I" but has different capitalization rules.

> It doesn't matter whether the developer is using Python 2.7, Python 3 or another language: their code must be tested to ensure that it actually works. If it is not tested then the business has fucked up and this needs to be rectified.
This is a complete non-answer. Python 3 makes it easy to write code that is mostly correct under the assumption that it is ALWAYS correct.

Python 2.7 makes you more aware of that.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 13:13 UTC (Mon) by eru (subscriber, #2753) [Link] (2 responses)

Did you set your locale to Turkish before testing that? Capitalization is language-specific (as you probably know).

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 18:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Nope. Python should do it right by default, correct? It's Unicode(tm)!

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 1:03 UTC (Tue) by anselm (subscriber, #2796) [Link]

Pretty much every language other than Turkish (and Azerbaijani) capitalises “i” as “I” (or for that matter lowercases “I” as “i”). Doing it like this is therefore a reasonable default unless the computer is explicitly told it is dealing with Turkish (or Azerbaijani) rather than English, French, German, Dutch, Portuguese, Finnish, Latin, Esperanto, …, none of which even have a dotless ı, and where capitalising “i” as “İ” or lowercasing “I” as “ı” would not just be silly but wrong.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 13:36 UTC (Mon) by MarcB (subscriber, #101804) [Link] (2 responses)

> See, the letter "I" is actually a Turkish dotless-i and so I expect "i" to capitalize to "İ". I can also give an example with German "ß".

Your expectation is wrong: dotless-lower maps to dotless-upper. Dotted-lower maps to dotted-upper.
echo "Is thıs correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
IS THIS CORRECT?

> I can also give an example with German "ß".

"ß" is special because it has (had) no uppercase form. So what Python should do here, according to Unicode rules, is map to "SS".

Which is exactly what it does:
echo "Iß thiß correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
ISS THISS CORRECT?

Note that lower("SS") will obviously not produce "ß", but "ss" instead. So this does not round-trip (which is where case-folding comes in).

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 18:09 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

$ echo "Is this correct?" | LC_ALL=tr_TR.UTF-8 python3.6 -c "import sys; print(sys.stdin.read().upper());"
IS THIS CORRECT?

This is most certainly not correct. (And yes, this system has locales-all.)

Expect Unicode to change towupper('ß') though; de-DE-2017 changed the orthography to mandate 'ẞ'. (About the only orthography change since 1995 I agree with.)

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 19:04 UTC (Mon) by Tara_Li (guest, #26706) [Link]

And about here, someone needs to link to Tom Scott's video for Computerphile on Internationalization.

https://www.youtube.com/watch?v=0j74jcxSunY

And while we're at it, let's check out time zones:

https://www.youtube.com/watch?v=-5wpm-gesOY

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 21:43 UTC (Mon) by ceplm (subscriber, #41334) [Link] (6 responses)

Please, don't tell me your native language is English (or that you are an American). That would fall so neatly into the stereotype of the idiotic American who doesn't care about the world outside his home.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 22:18 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

My native language is Russian, I also speak Ukrainian and Polish, I understand German and I'm studying Mandarin.

So I'm far more aware of encoding problems than most people in the US. After all, back in the '90s there were about six different Russian encodings in concurrent use (Win1251, Cp866, GOST, Koi8-RU, ISO and of course UTF-8).

That's why I maintain that the only sane way to work with text is to not work with text unless you absolutely HAVE to do it.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 23:08 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (2 responses)

IMNSHO this "pass it through and pretend that's fine" approach isn't avoiding working with text at all.

Insisting on only processing OIDs, or telephone numbers, or EANs is avoiding working with text. If you try hard enough then insisting on working only with DNS A-labels, or email addresses, might count too. And if you _really_ squint maybe you can try this argument with even filenames [ but you might regret that ], or database schemas.

But that's still so very far from most things. If you're really staying away from "text" Python 3 is fine. OIDs work the same, and you can be annoyed when (in both Python 2.x and 3.x) Python thinks U-labels are the "right" way to think about DNS names even though DNS makes it very clear that this is a Bad Idea™. The problems you've run into exist only because you persist in working with text even while saying that's not what you're doing. That's how we got into this mess in the first place.

There are a bunch of hacks in existence to try to smash Latin text in particular into a form that suits machines better. This can make it seem as though "ASCII just works" but er, no, at most it has meant that Latin-consumed cultures have been brutalised into putting up with the consequences. Most English speakers will grudgingly tolerate a machine insisting they "Add 1 coin(s) to continue" but that's not actually how English works. W and i aren't actually supposed to be the same width, but we've made people sigh and roll their eyes when a machine insists they should be for the convenience of the programmer.

Just as we should almost always choose garbage collection over manual memory management (it's "pretty good", and doing better than "pretty good" is annoyingly difficult, so it's better to focus on something else), so we should almost always use languages that do something sensible with Unicode rather than just shrugging and crossing our fingers that treating everything as raw bytes will work out OK. Are they perfect? No. Are they better than you were going to manage in the five minutes you had spare? Yes.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 23:26 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> IMNSHO this "pass it through and pretend that's fine" approach isn't avoiding working with text at all.
Yes, it is. In 99% of cases you need not go past ASCII or its UTF-8 subset to parse structured data.

And when you do, quite often you need to use additional rules to treat the text correctly. For example, filesystems are not guaranteed to have names that are valid UTF-8 sequences. DNS IDNs have additional restrictions and so on.

That's why the code written to work with "unicode" is most often incorrect. Its authors simply slather "unicode" stuff without thinking for a second about encodings, combining characters or locales and hope that it works.

Again, it's perfectly OK to have strings in the language that are guaranteed to be valid UTF-8/UCS-2/whatever sequences at all times. What is absolutely NOT OK is pretending that EVERYTHING is text by default. That's why Python3 had to grow a veritable menagerie of workarounds, starting from PathLike and ending with surrogate encoding for stdin.

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 5:15 UTC (Thu) by rixed (guest, #120242) [Link]

As programmers, we should care about the respective widths of letters and proper capitalisation rules according to locales when we use computers for typography. But that is only a tiny fraction of what we use computers for. Other than that, all we need are symbols. Having a fundamentally symbolic process, such as parsing a data structure or logging information, fail because of typography sounds very questionable.

I wish it were easier to distinguish between strings destined to be rendered for a user to see and vectors of bytes that happen to have a printable representation, for most data and for code itself. Restricting user interaction to html displayed in a web browser and giving up the terminal for UI is a trend going in the right direction, from this point of view.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 6:20 UTC (Tue) by ceplm (subscriber, #41334) [Link]

OK, I am sorry for my suspicion. Then I just don’t agree with you, nothing more. And no, I don't have time at the moment for thorough argument.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 6:24 UTC (Tue) by ceplm (subscriber, #41334) [Link]

BTW, I have ported M2Crypto from py2k to py3k (any code reviews on https://gitlab.com/m2crypto/m2crypto/merge_requests/65 would help mankind, bring you eternal glory, and my thankfulness … or at least one of those three), and exactly these people who want to ignore the problem were the target of many of my curses.

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 6:15 UTC (Thu) by dvdeug (guest, #10998) [Link] (2 responses)

> cyberax@20c9d0485705:~$ echo "Is this correct?" | python2.7 -c "import sys; print sys.stdin.read().upper();"
> IS THIS CORRECT?

So two programs have the exact same behavior, and yet one is wrong and the other is right. Huh. The only way that Turks are going to get proper capitalization of i to İ in most software is to have a Turkish locale and software written in Unicode-supporting languages that automatically take care of that little detail. Otherwise, software written by people who aren't familiar with Turkish linguistic needs, or who assume that it won't be used for Turkish (YAGNI lovers), will silently cause problems like the one in the link. Case-insensitive search is a real need, for example.

Python 3, ASCII, and UTF-8

Posted Dec 22, 2017 17:04 UTC (Fri) by nim-nim (subscriber, #34454) [Link] (1 responses)

There are lots of cases where correct text processing depends on the correct locale (CJK de-unification, Balkan Cyrillic vs. Russian Cyrillic, selecting the correct spell and grammar checker).

Unfortunately, too many free software devs think they can avoid locale particularities by setting C.UTF-8 everywhere, or deriving the locale from the input method (any Latin layout can write other Latin languages, and no, I don't want to switch between qwerty and azerty when switching between English and French), or whatever. And they still invent text formats where the locale of a run of text is not explicitly specified.

It sort of works, as long as the software is used by devs writing in pidgin English. It fails spectacularly when exposed to normal people or normal human text.

Text = locale + encoding; remove one and you're full of fail.

Python 3, ASCII, and UTF-8

Posted Jan 2, 2018 5:40 UTC (Tue) by dvdeug (guest, #10998) [Link]

> And they still invent text formats where the locale of a run of text is not explicitly specified

There's certainly logic there, but when exposed to normal people and normal human text, that often doesn't work. The theory is hard; at Wiktionary, we often have arguments about whether a word is French in English or has been adopted into English. Apple, Inc. is Apple even in many languages where they don't write in Latin script. Even texts that could be clearly language-tagged with human effort, like

> > Esperantists who dream of the final victory are non-existent in the modern day.
> Ne, mi estas finvenkisto!
Okay, so that's one finvenkisto. Most of us no longer dream of Esperanto being the be-all and end-all of world language.

are likely to appear undifferentiated, with none of the posters having bothered to set the locale correctly (by default, those posts could have come from a German, a Canadian, and an Estonian user, and been given tags of de_DE, en_CA and et_EE), and even then those locales would not have followed into the final post. Only for serious texts I'm writing for permanent archival would I bother carefully language tagging; even for a published PDF I might not bother, if you couldn't tell in the final copy.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 12:03 UTC (Mon) by vstinner (subscriber, #42675) [Link] (25 responses)

> Meanwhile, Python 2.7 is a great improvement on its successor. Strings are simply treated as sequences of bytes and this works just fine for 99% of applications.

If you ignore the actual Python type, Python 3 is already processing stdin and stdout "as bytes" for the POSIX locale: since Python 3.5, stdin and stdout use the "surrogateescape" error handler if the locale is POSIX.

With PEP 538 and PEP 540, the difference is that not only are undecodable bytes still passed through, but if the input is correctly encoded as UTF-8, Unicode functions work as well. You get the best of both worlds (the bytes of the Unix world, and the text of Python 3).

PEP 540 makes this trick usable with any locale, not only the POSIX locale. The new UTF-8 Mode must be enabled explicitly, since ignoring the locale *can* introduce mojibake (even if it should not in practice, since we are already all using UTF-8, right? :-)).

The UTF-8 Mode allows writing Unix-like applications that process stdin and stdout "as bytes": undecodable bytes are stored as surrogate characters. That's the surrogateescape error handler of PEP 383.

It's not only a matter of decoding errors (reading stdin); it's also a matter of encoding errors: print("<funny unicode characters>") will work as expected (as soon as your terminal decodes stdout from UTF-8). Example:

$ python2 -c 'print(u"euro sign: \u20ac")'
euro sign: €
$ python3.7 -c 'print(u"euro sign: \u20ac")'
euro sign: €

$ LANG= python2 -c 'print(u"euro sign: \u20ac")'
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 11: ordinal not in range(128)
$ LANG= python3.7 -c 'print(u"euro sign: \u20ac")' # enjoy PEP 538+PEP 540
euro sign: €

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 19:49 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (24 responses)

They should have gone a step further and replaced all the "unicode" nonsense with simple honest byte sequences.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 8:22 UTC (Tue) by peter-b (guest, #66996) [Link] (23 responses)

It disappoints me that your obstinacy on this point is impossibly crass.

Text is hard. A "byte sequence" is never an adequate representation of text. A single character may consist of as many as 10 codepoints. There are different rules, depending on language, for what a "word" is, even.

Treating text as a series of bytes suggests to me that some subset of the following apply:

1. You never need to deal with meaningful amounts of non-English text.
2. You never need to deal with non-trivial text manipulation, such as interactive editing. Pressing <Backspace> removes the last glyph; how many bytes is that?
3. You never need to deal with parsing or segmenting text *per se*. What's the first word in this sentence?

Your ridiculous Turkish capitalisation strawman shows *exactly* how hard text is to get right if you treat it as a context-free bucket of bytes or codepoints. Unicode merely lets you represent the characters; operations like "uppercase" and "lowercase" are entirely meaningless in most languages, so you actually do need to know the text's context before applying them.

The Rust standard library has 13 different string types. A String is a UTF-8 Unicode codepoint sequence. A Vec<u8> is a series of bytes. There are other string types that represent other platform- and context-dependent text representations. It's well thought out (Mozilla's years of experience dealing with arbitrary text in the trenches really shows), but it's complicated.

"Byte sequences" are neither a simple nor an honest way of handling and processing text. Your argument is the same as that of many people who objected to the excellent features that systemd provides for handling dependency-based service startup: "I don't have that problem, so it's not a valid problem." Please stop it.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 10:09 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (1 responses)

I think Cyberax's position is actually the Homer Simpson position

"You tried your best, and you failed miserably. The lesson is: Never try"

We've been here before on many things, it's how lots of people managed to persuade themselves that CVS was fine (remember the era when large successful Free Software projects used CVS? That seems pretty crazy now). Sure, it doesn't do what anybody actually wants, but trivial fixes don't seem to solve all our problems, so let's stop trying.

I would say that a byte sequence _is_ an adequate representation of text, but only in the same sense that a byte sequence is an adequate representation of a symphony or photograph. We should probably not try to write software that tries to process symphonies or photographs one byte at a time just because they were stored as a byte sequence.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 1:08 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> "You tried your best, and you failed miserably. The lesson is: Never try"
Nope. My position is:

1) Don't try to hide complexity by replacing it with simple, neat and wrong solutions.

2) But if you really want to dive deep then do it. Perl 6 is a credible attempt to make Unicode support useful. And they had to do quite a bit of work for it to become real: https://perl6advent.wordpress.com/2015/12/07/day-7-unicod...

Your CVS comparison is off-target. "Unicode" support in Python is kinda like RCS: an actual step back that creates more complexity than it removes.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 13:08 UTC (Tue) by foom (subscriber, #14868) [Link] (19 responses)

Yes, the point is that a "codepoint" sequence is effectively no more useful as an adequate representation of text than a byte sequence is. All those concepts you mention, like glyph breaks and word breaks, are just as easy to implement and use on top of a utf8"ish" byte buffer as they are on top of a codepoint sequence.

Python3 goes to a lot of effort to convert just about everything, be it actually in a valid known encoding or not, into a codepoint sequence, in the name of Unicode Support. But that is unfortunately mostly wasted effort.

Imagine if, instead of a Unicode String type, Python just had functions to iterate over and otherwise manipulate a utf8 buffer by whichever grouping you like: byte, codepoint, glyph or word?

It could have ended up with something more like golang, where "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text."

And that would've been good.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 23:24 UTC (Wed) by togga (guest, #53103) [Link]

"But that is unfortunately mostly wasted effort."

Not only that, it also wastes its users' time and effort. It trades rare complex tasks looking easy on the surface for common simple tasks getting annoyingly complex.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 4:36 UTC (Thu) by ras (subscriber, #33059) [Link] (17 responses)

> Yes, the point is that a "codepoint" sequence is effectively no more useful as an adequate representation of text than a byte sequence is.

I'm not a huge fan of the Python3 Unicode implementation, but this over-eggs the issue. We have had byte sequences representing text for as long as I can remember (which is longer than I care to admit). The issue was that if it really was text, you had to convert those byte sequences into text. There were standards, code pages and whatnot that told you how to do it, of course. But there were so many to choose from.

ISO 10646 solved the problem by dividing it in half. It assigned each unique character a number it called a code point. Then, as a separate exercise, it proposed some ways of encoding those numbers into bytes. Everyone seems to be happy enough to accept the ISO 10646 mapping from characters to numbers, meaning everyone is just as happy accepting that the only true code point for € is 0x20ac as they are accepting that the only true code point for 'a' is 0x61. (Well, everyone except Cyberax apparently, who wants a code point for a Turkish i and presumably '\r\n' as well.) But maybe even Cyberax would agree 10646 was a major step forward over code pages. The encoding into bytes will remain a blight until everyone agrees to just use UTF-8. Given that UCS2 is hardwired into Windows and Javascript, that might be never. Sigh.

My only issue with Python3 is that it is far too eager to convert bytes into text. stdin, stdout, os.listdir(), the default return of open(): they should all be byte streams. I don't know why they thought they could safely convert them to text. Doing so in a 'nix environment looked to me to be a recipe for disaster, because LANG= could be different on each run, and that's pretty much how it turned out. They had languages to copy from that did it mostly right, like Java for instance, and they still got it wrong.
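For what it's worth, the byte-oriented variants do remain reachable in Python 3 when asked for explicitly; a sketch (the file name is hypothetical):

    import os
    os.listdir('.')                  # str in, str out (bad names get surrogates)
    os.listdir(b'.')                 # bytes in, bytes out: the raw names
    data = open('blob.bin', 'rb').read()   # a byte stream, no decoding at all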

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 6:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

One small comment - I'm not against Unicode or UCS.

I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 9:47 UTC (Thu) by anselm (subscriber, #2796) [Link] (5 responses)

I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.

The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded. When you read text from a file, socket, … (where it tends to be encoded according to UTF-8, UCS2, or whatever, and that encoding is an attribute of the communication channel in question) it is decoded into a Unicode string for internal processing, and when you write it back it is again encoded according to UTF-8, UCS2, or whatever, depending on where it goes. The system assumes that people writing Python programs are more likely to deal with Unicode text rather than sequences of arbitrary bytes, and therefore leans in the former direction by default.

On the whole this is not an entirely unreasonable approach to take. There are problems with (inadvertently) feeding this mechanism streams of arbitrary bytes, which tends to fail due to decoding issues, with fuzzy boundaries between what should be treated as text vs. arbitrary bytes (such as path names on Linux), and with the general quagmire that Unicode quickly becomes when it comes to more complicated scripts. But saying that the way Python 3 handles Unicode is much worse in general than the way Python 2 handled Unicode doesn't ring true. My personal experience after mostly moving over to Python 3 is that now I seem to have much fewer issues with Unicode handling than I used to when I was mostly using Python 2, so as far as I'm concerned, Python 3 is an improvement in this area – but then again I'm one of those programmers who have to deal with Unicode text much more often than with streams of arbitrary bytes.
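For illustration, a minimal sketch of that decode-at-the-boundary pattern (the file names are hypothetical):

    with open('in.txt', encoding='utf-8') as f:       # decode at the edge
        text = f.read()                               # internally: a str
    processed = text.upper()                          # operate on code points
    with open('out.txt', 'w', encoding='utf-8') as f:
        f.write(processed)                            # re-encode on the way out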

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 19:28 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

> The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded.
As I said, they are magic strings that solve everything.

As a result, they are strictly not better than simple byte sequences. You can't even concatenate them without getting strange results (just play around with combining characters and RTL switching). Pretty much the only safe thing you can do with Py3 strings is to spit them out verbatim. At which point it's just easier to deal with good old honest raw byte sequences.

And it doesn't have to be so - Perl 6 does Unicode the correct way. They most definitely specify the encoding and normalize the text to make indexing and regexes sane. In particular, combining characters and languages with complex scripts are treated well by normalizing the text into a series of graphemes.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 17:51 UTC (Fri) by anselm (subscriber, #2796) [Link] (3 responses)

As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7, where you had to jump through your choice of flaming hoops (involving, e.g., the codecs module) to be able to accomplish the same thing. (People who would rather read binary data from a file can still open it in binary mode and see the raw bytes.)

So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point). It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course. Having the concept of Unicode strings in the first place, however, is a prerequisite for being able to deal with these cases at all.

It may well be the case that Perl 6's handling of Unicode data is currently better than Python's. They certainly took their own sweet time to figure it out, after all. But given the way things work between the Python and Perl communities, if the Perl people have in fact caught on to something worthwhile, chances are that the same – or very similar – functionality will pop up in Python in due course (and vice versa). So I wouldn't consider the Unicode functionality in current Python to be the last word on the topic.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:47 UTC (Fri) by jwilk (subscriber, #63328) [Link]

open() by default uses locale encoding, which is rarely what you want.
If you want UTF-8, you need to specify the encoding explicitly.
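That is, the default comes from locale.getpreferredencoding(False); portable code names the encoding itself, as in this one-line sketch (hypothetical file name):

    open('data.txt', encoding='utf-8')   # explicit, locale-independent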

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 18:51 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file.
I can do so in Py2 as well: 'content = open("file").read().decode("utf-8")'. I still don't see why it justifies huge breaking changes that require a mass re-audit of huge amounts of code.

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points
I've yet to hear why I would want to deal with Unicode codepoints instead of bytes everywhere by default.

> It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course.
No they won't, without yet another round of incompatible changes. Py3 is stuck with Uselesscode strings.

Python 3, ASCII, and UTF-8

Posted Jan 5, 2018 23:46 UTC (Fri) by lsl (subscriber, #86508) [Link]

> As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7,

I can do that using the raw open and read syscalls. Why would I need magic Unicode strings for that?

> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point).

Except for the fact that arrays of Unicode code points aren't a terribly useful representation most of the time. Byte strings can carry UTF8-encoded Unicode text just as well, in addition to being able to hold everything else users might want to feed into my program and expect it to round-trip correctly. On the odd day when I really need to apply some Unicode-aware text transformation that works on code points, I can convert there and then. Or just package it up such that it accepts and returns UTF-8 strings.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 3:03 UTC (Wed) by HelloWorld (guest, #56129) [Link] (9 responses)

> It assigned each unique character a number it called a code point.
A code point is *not* a character. Umlauts like ä, ö and ü can be written in two ways in Unicode: either with a single code point for the whole thing, or with one code point for the underlying vowel and one code point for the diacritic. Such a sequence of code points is called a grapheme cluster. There are entire Unicode blocks that contain only code points that only make sense as part of grapheme clusters, like Hangul Jamo for Korean. Many variations of emoji are also implemented this way. For this reason I don't think it makes sense to treat strings as sequences of Unicode code points; it should be grapheme clusters, and that's what Perl 6 does while Python, like always, fucked it up (and Java did even worse, because its strings are sequences of 16-bit "char" values, which are not even code points, because there are more than 65k of those by now).
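A quick Python illustration of the code point vs. grapheme cluster distinction, using an umlaut that has both a precomposed and a combining form:

    import unicodedata
    single = '\u00e4'            # 'ä' as one precomposed code point
    cluster = 'a\u0308'          # 'ä' as 'a' plus combining diaeresis
    assert (len(single), len(cluster)) == (1, 2)  # both render as one character
    assert single != cluster                      # unequal without normalization
    assert unicodedata.normalize('NFC', cluster) == single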

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 8:29 UTC (Wed) by jem (subscriber, #24231) [Link] (1 responses)

>Java did even worse, because its strings are sequences of 16-Bit "char" values which are not even code points, because there are more than 65k of those by now

Java moved from UCS-2 (64K code points) to UTF-16 (1M+ code points) in version 1.5 (2004). Of course, the transition is not completely transparent to applications, which can still think they are dealing with UCS-2.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 11:30 UTC (Wed) by HelloWorld (guest, #56129) [Link]

Yeah, so? The problem is that a "string" is defined to be a sequence of things ("code units") that have no semantic meaning whatsoever. It's basically the same as treating UTF-8 strings as a sequence of bytes (except with 16 bits) and having the user sort out all that pesky code point/grapheme cluster/normalisation etc. stuff. The language simply doesn't help at all.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

And don't forget about Unihan and variant selectors: https://en.wikipedia.org/wiki/Han_unification#Examples_of... - yet ANOTHER can of worms.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:20 UTC (Wed) by ras (subscriber, #33059) [Link] (5 responses)

> yet ANOTHER can of worms.

Having spent my life in a country where ASCII covers every character we use, plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 22:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

A desire to group multiple graphical variants of one character into one code point. It also gives a possibility of fallbacks - variant-unaware fonts can just have one variant which will (probably) be kinda understood by most speakers. With separate code points you'll just get boxes in place of missing variants.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:15 UTC (Wed) by ras (subscriber, #33059) [Link] (3 responses)

> A desire to group multiple graphical variants of one character into one code point.

Sounds almost reasonable.

I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test's "Hello World" displays properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi-code-point graphemes.

Python 3, ASCII, and UTF-8

Posted Jan 10, 2018 23:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Complex graphemes are a foregone conclusion anyway, so why not add more?

It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in a browser.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:40 UTC (Thu) by ras (subscriber, #33059) [Link]

> देवनागरी लिपि

I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.

I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about it (and I suspect only a tiny few care, and only some of the time), you have to make code point == grapheme.

It is nice to have a fallback from a grapheme you can't display to its root, but that could also be handled by the libraries that do the displaying. There are really only a few programs that care overly: browsers, email clients, and word processors spring to mind. ls, cat and the hundreds of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is a '0' in the string, albeit preceded by a code point that makes it not a '0'. Who is going to check for that?

What I would class as a good reason for not making code point == grapheme would be if the number of code points exceeded 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.

Python 3, ASCII, and UTF-8

Posted Jan 11, 2018 0:44 UTC (Thu) by ras (subscriber, #33059) [Link]

Oh, and thanks for responding. I've learned a lot through this exchange.

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 7:42 UTC (Thu) by togga (guest, #53103) [Link]

The issue here, which the (somewhat smaller) "python3 community" misses, is that this "obstinacy" is measured in productivity, maintenance, and development hours. I'd say it's up there along with .NET interoperability and Windows "\" path/quoting issues for multiplatform scripts. Man-made problems, detached from reality, popping up from thin air.

Python is going from "productivity" to "cumbersome" measured in development hours on a day-to-day basis, and now has a hard time compensating for all its (other) weaknesses (multithreading with the GIL, performance, ...).

How much lipstick can you put on the pig? "Coercing", "C.UTF-8", "new UTF-8", "PYTHONUTF8", "-X utf8", ... My favourite is the "surrogateescape error handler".

You can be this or that type of script with different settings. But which is best for all? What if I need to be both, or (the most common case) something not exactly matching any of them? This mess is getting so complex that for a pro it's almost impossible to get full control, and newbies never even know what hit them when they're floored by python3 (I've helped a few...). Take, for example, the simple case of ctypes attributes, often read from an external source, that need to be of string type. The text mess is a nasty swamp to wade through.

I agree with Cyberax. Keep it simple (bytestrings) and let the user do explicit things with the data via library functions and classes (maybe include a more advanced regex) when text functionality is really needed.

That said, Python2 gave me 14 pretty good years of scripting productivity (mostly attributed to numpy). With this, it's time to move on.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 13:16 UTC (Mon) by MarcB (subscriber, #101804) [Link] (23 responses)

It works fine for all applications that basically pass through text data without processing or analyzing it in any form.

Depending on what you usually do, that might indeed be 99% of your applications - or it might be closer to 0%. For myself, it is pretty much 0% (but I do not use Python).

It breaks as soon as you want to do such things as getting the length of a string in characters. Or do correct uppercasing, lowercasing, or case folding. Or if you need to re-encode in any form. Or even if you just want to compare strings: for Unicode, the following is NOT TRUE: string_a = string_b <=> encoded(string_a) = encoded(string_b).
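Two of these are easy to demonstrate; a quick Python 3 sketch:

    # Case folding is not the same as lowercasing:
    assert 'Straße'.casefold() == 'STRASSE'.casefold()   # 'ß' folds to 'ss'
    assert 'Straße'.lower() != 'STRASSE'.lower()
    # Canonically "equal" text need not compare equal code point by code point:
    assert '\u00e4' != 'a\u0308'     # both display as 'ä'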

Now, the question is: Does it make sense to use the current locale settings to decode STDIN? Obviously, it is not always correct: The content might have been written under another setting. Or the writing application might not have honored that setting. Or it might not even be text.
But then: The application would have to handle this anyway. A correct application would produce the same results in Python 2 and Python 3. An incorrect application would throw an exception in Python 3 while it could silently produce wrong results in Python 2.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 0:45 UTC (Tue) by khim (subscriber, #9252) [Link] (5 responses)

> It breaks as soon as you want to do such things as getting the length of a string in characters.

Okay. Let's see how Python 3 solves that problem:

$ python3
Python 3.4.3 (default, Nov 28 2017, 16:41:13) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len('Length')
6
>>> len(' Len ')
5
>>> len('👦🏼👦🏼👦🏼')
6

So where is this non-solution to a non-problem useful?

> Or do correct uppercasing, lowercasing, or case folding.

Those do not work there, either.

> An incorrect application would throw an exception in Python 3 while it could silently produce wrong results in Python 2.

No. Both would introduce errors unless a lot of care is taken. Except that with Python 2 the errors would be immediately obvious, while with Python 3 they would be subtle and not immediately obvious.

Unicode is complex - because human languages are complex.

The Zen of Python says: explicit is better than implicit - yet with Unicode Python 3 tries very hard to do implicit - and fails. Even after all these years.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 10:31 UTC (Tue) by kigurai (guest, #85475) [Link] (2 responses)

The Zen of Python also says "Simple is better than complex" and "[...] practicality beats purity", both of which I think could be applied here: if it does the right thing in most cases, then implicit and simple is very practical.

Granted, I work very little with text these days, so I don't claim to know what the ultimate policy is.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 20:23 UTC (Wed) by kjp (guest, #39639) [Link] (1 responses)

The entire python3 project is "purity over practicality". They've totally lost their way. Putting magic surrogate escapes into "strings" from POSIX system calls, and trying to pretend unix is windows, is insane.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 23:51 UTC (Wed) by togga (guest, #53103) [Link]

"The entire python3 project is "purity over practicality"."

True, and this is the sad part. It shows all over, from changing "Queue" to "queue" to the whole string mess; it's like they're creating a piece of art that gets more and more abstract and detached from reality. Having once chosen Python for practicality, every time I've tried to switch to python3 over the years I've run into more and more practical issues, up until the current mess...

"The Linux way; never ever break user experience".
https://felipec.wordpress.com/2013/10/07/the-linux-way/

That shouldn't be so hard; it works for most other major languages out there. When did C ever have a regression like this?

What every programmer and sysadmin should know

Posted Dec 24, 2017 15:20 UTC (Sun) by CChittleborough (subscriber, #60775) [Link] (1 responses)

The reason that len('👦🏼👦🏼👦🏼') returns 6 is that the string is actually the two-unicode sequence [U+1F466 U+1F3FC] repeated 3 times.

(Aside: U+1F466 = BOY and U+1F3FC = EMOJI MODIFIER FITZPATRICK TYPE-3, so that two-unicode sequence represents a single glyph.)
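This is easy to confirm interactively (a sketch):

    >>> len('👦🏼')
    2
    >>> [hex(ord(c)) for c in '👦🏼']
    ['0x1f466', '0x1f3fc']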

There is a more general and more serious problem here. Too many programmers assume that Unicode is an improvement on the preceding mess of charsets (especially ISO-2022), but it is actually a union of every significant charset, with none of their problems fixed, and new problems added (such as the distinction between unicodes and glyphs seen here).

The Unicode Consortium prioritized compatibility with previous encodings over programmer convenience, because they knew that the people who decided whether to switch to Unicode would care a lot more about round-trip compatibility than code simplicity. (For example, titlecase is a hack to deal with 4 (four) little-used characters in a Yugoslavia-specific charset.) Now we programmers have to live with that lack of simplicity.

Sadly, few of us can just close our eyes and pretend that everything is a byte sequence. Most programmers will need to at least understand at a high level the problems of handling Unicode text. (The important lessons: (1) use library routines instead of writing your own string-handling code and (2) be grateful to the poor folk who write those library routines.) System admins will have to decide on what encodings to use in file names, and what to do when (not if) someone downloads files which use a different encoding. And so on. This is a significant challenge, and there is no way to avoid it.

What every programmer and sysadmin should know

Posted Dec 24, 2017 15:28 UTC (Sun) by CChittleborough (subscriber, #60775) [Link]

Something I forgot to mention: one of the challenges is setting locales correctly. It is easy to get the locale almost right, and have nasty, hard-to-track-down problems show up years later.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 18:11 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (16 responses)

"the length of a string in characters"

This is almost invariably not going to be what you wanted. Character is a very loaded word in this context, and whatever programming structure you build probably won't handle the weight of it.

I think Python provides a code of how many Unicode code points were in your string. It is left as an exercise to the reader to imagine why you'd want to count those, as I can't think of any good examples. That you think it's counting "Characters" should sound alarm bells.

In particular, you cannot cut strings by just lopping off some of the Unicode code points. That's not how a string works in human languages, and particularly not in Unicode. Yes, it worked in ASCII, sort of, but this isn't ASCII.

Python 3, ASCII, and UTF-8

Posted Dec 19, 2017 20:44 UTC (Tue) by nybble41 (subscriber, #55106) [Link] (14 responses)

To summarize: Python's Unicode strings are not nearly sufficient to handle all the nuances involved in properly manipulating text with respect to locale. On the other hand, for merely passing text around unchanged a simple immutable byte array with an 'encoding' tag would be a better fit. Python's Unicode strings thus combine the worst aspects of both, lacking the simplicity and reliability of byte arrays and yet not powerful enough for proper locale-aware handling of text.

A significant problem with the pervasive use of Unicode strings in Python is that decoding is eager and strict—all data is coerced into Unicode on input by default, whether it needs to be or not. This means that programs which otherwise have no reason to worry about encoding issues are still forced to deal with errors when presented with non-Unicode input, or else expend considerable effort to preserve the original binary form. The default should be to keep input in its original form (a byte array) until and unless the program actually needs to treat it as a sequence of Unicode codepoints, at which point it should be prepared to deal with any decoding errors.

Fortunately, the most common operations on text input—concatenating strings and splitting them into fields on 7-bit ASCII delimiters—work perfectly well on UTF-8 byte strings (by design) without decoding into codepoints. The less common operations which would require the program to deal with Unicode issues, such as case conversion and inexact comparison of arbitrary non-ASCII strings, are generally ill-advised without a much more comprehensive framework than the one provided by the Python string libraries.
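
For example, splitting UTF-8 bytes on an ASCII delimiter without ever decoding:

    >>> data = 'naïve|café'.encode('utf-8')
    >>> data.split(b'|')   # ASCII byte values never occur inside multi-byte sequences
    [b'na\xc3\xafve', b'caf\xc3\xa9']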

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 0:32 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (13 responses)

Unfortunately processing Unicode text without understanding it is likely to end badly for somebody, somewhere in the chain, once there's an adversary choosing the input.

For example, suppose your tool "simply" takes pipe-separated data and concatenates fields four and six, then outputs them. It _seems_ very easy to implement this on UTF-8 input without grokking Unicode: we can just say "bytes are bytes", the pipe symbol is ASCII, and so it all works fine, right?

Nope. Now an adversary can hide a symbol by splitting the representation bytes across the two fields, making it into nonsense in the process, and relying on our dumb non-Unicode process to re-assemble it. A symbol that we're absolutely sure we prohibited elsewhere in our process chain sneaks in this way and causes havoc. Unicode has some nice simple rules that allow us to avoid this mistake, but to implement them we're back to grokking Unicode, which you've said we needn't bother doing.
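
A minimal sketch of such an attack; the field layout and the "prohibited" symbol (U+2028 LINE SEPARATOR) are invented for illustration:

    # U+2028 is e2 80 a8 in UTF-8; its bytes are split across fields four
    # and six. Each field on its own is invalid UTF-8, so per-field
    # validation would catch this, but byte-level concatenation does not.
    record = b'a|b|c|x\xe2\x80|y|\xa8z|d'
    fields = record.split(b'|')
    joined = fields[3] + fields[5]                 # fields four and six
    assert joined.decode('utf-8') == 'x\u2028z'    # the symbol reappears intact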

Sometimes, near the end of a processing chain whose output we're confident we don't care about, omitting Unicode error handling is safe‡. But as with so many things in computing, the fact that it is only "sometimes" safe is reason enough never to do it up front until you've proved you really need to.

‡ Jamie Zawinski swears this is true for his XScreenSaver text output: if nonsense pixels come out, he would argue, the screen is garbled but the screensaver still works, so an attacker has achieved nothing of consequence. I have not examined that complicated program to confirm that "nonsense pixels" are the worst possible outcome, but perhaps so.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 1:01 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (12 responses)

> Nope. Now an adversary can hide a symbol by splitting the representation bytes across the two fields, making it into nonsense in the process, and relying on our dumb non-Unicode process to re-assemble it.
You can do UTF-8 validation for it. In most cases it's not even needed; split bytes are fine for CSV-like processing.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 15:59 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (11 responses)

> You can do UTF-8 validation for it.

Exactly. This is the sort of thing the runtime should provide a library for—given a byte array, is it valid UTF-8? And of course, if you want to filter for specific Unicode codepoints then you'll have to decode the string anyway.

Not that Python's Unicode strings would necessarily protect against this scenario in the first place. In surrogateescape mode the incomplete codepoints would be passed through unchanged, just as if you'd used byte arrays. But how often does one see code concatenating fields together without a known (and typically ASCII) delimiter anyway? No matter how it was implemented, the transformation would lose information.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 17:17 UTC (Wed) by brouhaha (subscriber, #1698) [Link] (10 responses)

Python 3 provides that: it's the decode method applied to the array of bytes. It will return a string if the bytes are a valid UTF-8 sequence, and raise an exception if not.
#!/usr/bin/env python3
import sys

def is_valid_utf8(b):
    # decode() either succeeds or raises UnicodeDecodeError
    try:
        b.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

b = bytes([int(x, 16) for x in sys.argv[1:]])
print(is_valid_utf8(b))
Example:
$ ./validutf8.py ce bc e0 b8 99 f0 90 8e b7 e2 a1 8d 0a
True
$ ./validutf8.py 2d 66 5b 1a f7 53 e3 f6 fd 47 a2 07 fc
False
I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 17:49 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Yes, it is bad if the language defaults to "string-of-codepoints" under the hood and goes through a lot of contortions to maintain this pretense.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 18:36 UTC (Wed) by vstinner (subscriber, #42675) [Link] (7 responses)

I don't know where I should put my comment in this long thread.

To be clear: Python doesn't force anyone to use Unicode.

All OS functions accept bytes, and Python commonly provides two flavors of the same API: one for Unicode, one for bytes.

Examples: urllib accepts URLs as str and bytes; os.listdir(str)->str and os.listdir(bytes)->bytes; os.environ:str and os.environb:bytes; os.getcwd():str and os.getcwdb():bytes; open(str) and open(bytes); etc.
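
For instance (directory contents hypothetical):

    >>> import os
    >>> os.listdir('.')    # str argument, str results
    ['notes.txt']
    >>> os.listdir(b'.')   # bytes argument, bytes results
    [b'notes.txt']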

sys.stdin.buffer.read() returns bytes which can be written back into stdout using sys.stdout.buffer.write().
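
For example, a complete byte-level "cat" that can never raise UnicodeDecodeError:

    import sys
    # raw bytes in, raw bytes out; no decoding or encoding takes place
    sys.stdout.buffer.write(sys.stdin.buffer.read())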

PEPs 538 and 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons. Again, for Unix-like tools (imagine a Python "grep"), stdin and stdout are configured to be able to "pass through bytes". So it also makes life easier for programmers who want to process data as "bytes" (even if technically it's Unicode; Unicode is easier to manipulate in Python 3).

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 18:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> To be clear: Python doesn't force anyone to use Unicode.
You mean: "Python3 provides enough hoops through which you can jump to achieve parity with Python2".

> PEPs 538 and 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons.
No, they don't try to make it easier.
No they don't try to make it easier.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 20:42 UTC (Wed) by nas (subscriber, #17) [Link] (1 responses)

>> PEPs 538 and 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons.

> No they don't try to make it easier.

Your trolling on this site is reaching new heights. Now you're actually claiming that the people working on these PEPs are not even intending to make Python better? What devious motivation do you think they have, then? I mean, come on.

You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, I find Python 3.6 works vastly better.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 20:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Your trolling on this site is reaching new heights. Now you're actually claiming that the people working on these PEPs are not even intending to make Python better? What devious motivation do you think they have, then? I mean, come on.
People who designed register_globals for PHP were also trying to make life easier for other developers.

> You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, I find Python 3.6 works vastly better.
You assume that I haven't written i18n-ed applications? How exactly is Py3 better for it?

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 11:56 UTC (Thu) by jwilk (subscriber, #63328) [Link] (1 responses)

There's no byte equivalent for sys.argv.

Here's a familiar Python developer arguing against adding it: https://bugs.python.org/issue8776#msg217416

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 12:00 UTC (Thu) by vstinner (subscriber, #42675) [Link]

"argvb can be computed in one line: list(map(os.fsencode, sys.argv))."

sys.argvb wasn't added because it's hard to keep two separate lists in sync; that's harder than with os.environ and os.environb, which are mappings.
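
Spelled out as a runnable snippet (argvb being the name from that discussion):

    import os, sys
    # reverse the surrogateescape-based decoding applied to argv at startup
    argvb = list(map(os.fsencode, sys.argv))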

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 17:53 UTC (Thu) by kjp (guest, #39639) [Link]

But unicode doesn't mean unicode with python3! What is this surrogate escaping nonsense coming back from system APIs? What happened to "errors shouldn't pass silently"?
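
Concretely, the mechanism being complained about, shown with an arbitrary undecodable byte:

    >>> b'caf\xff'.decode('utf-8', 'surrogateescape')
    'caf\udcff'

The invalid 0xff byte comes back as the lone surrogate U+DCFF instead of raising an error.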

Python 3, ASCII, and UTF-8

Posted Dec 23, 2017 16:35 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

The problems I've seen are usually with parsing JSON or XML files and the like. Python 2 is fine, but Python 3 all of a sudden starts complaining that things aren't ASCII when it encounters UTF-8. The thing is that JSON is defined to be UTF-8, so I don't know why Python ignores that; and our XML has the encoding declared at the top of the file.

Python 3, ASCII, and UTF-8

Posted Dec 20, 2017 21:22 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> Python 3 provides that. It's the method decode applied to the array of bytes.

That is a partial solution which works so long as you don't mind dealing with exceptions and don't care about the performance impact of actually decoding a potentially large byte string into a Unicode string. It would be nice to have a function which just scanned the array in place and returned the result without allocating memory or throwing exceptions.

In any case I was not trying to say that Python 3 does not provide a function like this, just that it's something that does belong in the standard library.

> Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?

Having separate types is good. However, making the more complex and restrictive string-of-characters type the default or customary form when array-of-arbitrary-bytes would be sufficient is a mistake. Duplicating all the APIs (a byte version and a Unicode version) is yet another mistake, and basing return types on the types of unrelated parameters makes it even worse. (Why assume the file _contents_ are in UTF-8 just because a Unicode string was used for the filename?) Putting aside the minority of APIs which inherently relate to Unicode, the rest should only accept and return byte arrays, leaving the conversions up to the user _if they are needed_.

Arguably, filenames should be a third type, neither byte arrays nor Unicode strings. On some platforms (e.g. Linux) they are byte arrays with a variety of encodings, most commonly UTF-8 but not restricted to valid UTF-8 sequences. On others (e.g. Windows) they are sequences of 16-bit code units (nominally UTF-16, but not necessarily well-formed). Some platforms (macOS) apply transformations to the strings, for normalization or case-insensitivity, so equality as bytes or as Unicode codepoints may not be the same as equality as filenames. Handling all of this portably is a difficult problem, and the solutions which are most suitable for filenames are unlikely to be applicable to strings in general.
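
Python's own compromise here is the os.fsencode()/os.fsdecode() pair; a quick sketch, assuming a Linux system whose filesystem encoding is UTF-8:

    >>> import os
    >>> name = b'caf\xff'          # not valid UTF-8
    >>> s = os.fsdecode(name)      # decoded with surrogateescape, losslessly
    >>> os.fsencode(s) == name     # round-trips back to the same bytes
    True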

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 5:22 UTC (Mon) by eru (subscriber, #2753) [Link] (2 responses)

Sounds to me like the containers should be fixed (which ought to be quite easy), instead of applying band-aids to Python.

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 11:50 UTC (Mon) by vstinner (subscriber, #42675) [Link] (1 responses)

PEP 538 states: "The Alpine Linux based Python images provided by Docker, Inc. already use the C.UTF-8 locale by default: (...) LC_CTYPE="C.UTF-8" (...)". So PEP 538 is not needed on Alpine Linux. But containers are more diverse than just Alpine Linux :-)

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 13:36 UTC (Mon) by Otus (subscriber, #67685) [Link]

So have Python fail to run if the system locale is non-UTF-8.

/somewhat-unserious

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 17:52 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (2 responses)

PEP 538 is roughly what we do in MirBSD. MirBSD has one locale, and it’s about the same as C.UTF-8 (which I proposed for Debian/eglibc, and is now available there).

However, recently I got told off for roughly this:

tg@tglase-bsd:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # MirBSD
<>

The author of the script explicitly set LC_ALL to C to get ASCII behaviour, which was ignored on MirBSD “because everything is UTF-8”; however, the POSIX C locale requires different classifications (like the iswalpha(3) one used here) than a full C.UTF-8 environment would have. In fact, only the 26 lower-case and 26 upper-case Latin letters are required to be matched by the alpha character class, as is the case on Debian:

tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # Debian
<ä>

(That being said, GNU’s tr(1) is defective, as this…

tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C.UTF-8 tr -d '[[:alpha:]]' # Debian
<ä>

… should have output just “<>”, but that’s a different story.)

tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 18:01 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

Huh.

「It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8.」

That could actually work. Still, someone with time on their hands please think over my parent post and provide feedback (also to me).

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 11:11 UTC (Thu) by vstinner (subscriber, #42675) [Link]

> tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹

No solution is perfect, and that's why both PEP 538 (C locale coercion) and PEP 540 (UTF-8 Mode) can be explicitly disabled, when you really want the C locale with the ASCII encoding.

Example:

$ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import locale, sys; print(sys.stdout.encoding, locale.getpreferredencoding())'
ANSI_X3.4-1968 ANSI_X3.4-1968

The bet is that most users will prefer having the C locale coercion and UTF-8 Mode enabled when the POSIX locale is what would otherwise apply.

Python 3, ASCII, and UTF-8

Posted Jan 4, 2018 0:28 UTC (Thu) by jklowden (guest, #107637) [Link]

"The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings."

No, that's a downside: added complexity without added power. All the user of an embedding application has to do is set the environment appropriately before using it. If host and child need different environments, the problem is deep indeed, and the probability that any PEP will solve it is correspondingly low.

PEP 540 adds a weird automagical transformation, convenient (if it works) for the user who forgets to set the environment, or doesn't know how. That's not a service to them, or to anyone else.

