Python 3, ASCII, and UTF-8
The dreaded UnicodeDecodeError exception is one of the signature "features" of Python 3. It is raised when the language encounters a byte sequence that it cannot decode into a string; strictly treating strings differently from arrays of byte values was something that came with Python 3. Two Python Enhancement Proposals (PEPs) bound for Python 3.7 look toward reducing those errors (and the related UnicodeEncodeError) for environments where they are prevalent—and often unexpected.
Two related problems are being addressed by PEP 538 ("Coercing the legacy C locale to a UTF-8 based locale") and PEP 540 ("Add a new UTF-8 Mode"). The problems stem from the fact that locales are often incorrectly specified and that the default locale (the "POSIX" or "C" locale) specifies an ASCII encoding, which is often not what users actually want. Over time, more and more programs and developers are using UTF-8 and are expecting things to "just work".
PEP 538
PEP 538 was first posted by its author, Nick Coghlan, back in January 2017 on the Python Linux special interest group mailing list; after a stop on python-ideas, it made its way to the python-dev mailing list in May. Naoki Inada had been designated as the "BDFL-delegate" for the PEP (and for PEP 540); a BDFL-delegate sometimes is chosen to make the decision for PEPs that Guido van Rossum, Python's benevolent dictator for life (BDFL), doesn't have the time, interest, or necessary background to pass judgment on. Inada accepted PEP 538 for inclusion in Python 3.7 back in May.
As demonstrated in the voluminous text of PEP 538, many container images do not set the locale of the distribution, which means that Python defaults to the POSIX locale, thus to ASCII. This is unexpected behavior. Developers may well find that their local system handles UTF-8 just fine:
$ python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
However, running that in a container on the same system (using a generic distribution container image) may fail with a UnicodeEncodeError. The LC_CTYPE locale environment variable can fix the problem, but it must be set to C.UTF-8 (or variants that are available on other Unix platforms) inside the container. The PEP notes that new application distribution formats (e.g. Flatpak, Snappy) may suffer from similar problems.
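For example, a rough reproduction of the failure on a pre-3.7 interpreter might look like the following (abridged; the exact traceback will vary):
$ LC_ALL=C python3 -c 'print("ℙƴ☂ℌøἤ")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)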
So Python 3.7 and beyond will determine if the POSIX/C locale is active at startup time and will switch its LC_CTYPE to the appropriate UTF-8 setting if one is available. It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8. That setting will be inherited by any subprocesses or other components that get run from the initial interpreter.
This change could, of course, cause problems for those who are actually expecting (and wanting) to get ASCII-only behavior. The PEP changes the interpreter to emit a warning noting that it has changed the locale whenever it does so. It will also emit a warning if it does not change the locale because it has been explicitly set to the (legacy) POSIX/C locale. These warnings do not go through the regular warnings module, so that -Werror (which turns warnings into errors) will not cause the program to exit; it was found during testing that doing so led to various problems.
This idea rests on two fundamental assumptions that are laid out in the PEP:
- in desktop application use cases, the process locale will already be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer
- in network service development use cases (especially those based on Linux containers), the process locale may not be configured at all, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does
Beyond that, locale coercion will default to the surrogateescape error handler for sys.stdin and sys.stdout for the new coerced UTF-8 locales. As described in PEP 383, surrogateescape effectively turns strings containing characters that cannot be decoded using the current encoding into their equivalent byte values, rather than raising a decoding exception as the strict error handler would.
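As a minimal sketch of that behavior (just the PEP 383 error handler itself, independent of the PEPs discussed here), undecodable bytes survive a decode/encode round trip as surrogate code points rather than raising an exception:
raw = b"valid \xff invalid"                    # 0xff is not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))                              # 'valid \udcff invalid'
assert text.encode("utf-8", errors="surrogateescape") == raw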
The PEP is meant to make Python do what users actually, generally expect. The warnings should help alert users (and distributors and the like) if the assumptions are not correct in their environment.
PEP 540
The second PEP has only recently been accepted for 3.7; though it began around the same time as the first, it languished for some time. It was brought back up by its author, Victor Stinner, in early December 2017. But that version was rather large and unwieldy, leading Van Rossum to say:
Stinner acknowledged that and posted a much shorter version a day later. Inada expressed interest in accepting it, but there were some things that still needed to be worked out.
The basic idea is complementary to PEP 538. In fact, there are fairly subtle differences between the two, which has led to some confusion along the way. PEP 540 creates a "UTF-8 mode" that is decoupled from the locale of the system. When UTF-8 mode is active, the interpreter acts much the same as if it had been coerced into a new locale (à la PEP 538), except that it will not export those changes to the environment. Thus, subprocesses will not be affected.
In some ways, UTF-8 mode is more far reaching than locale coercion. For environments where there are either no locales or no suitable locale is found (i.e. no UTF-8 support), locale coercion will not work. But UTF-8 mode will be available to fill the gap.
UTF-8 mode will be disabled by default, unless the POSIX/C locale is active. It can be enabled by way of the "-X utf8" command line option or by setting PYTHONUTF8=1 in the environment (the latter will affect subprocesses, of course). Since the POSIX locale has ASCII encoding, UTF-8 (which is ASCII compatible at some level) is seen as a better encoding choice, much like with PEP 538.
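For example, a hypothetical session might look like the following (the reported encoding name may differ slightly between versions):
$ python3.7 -X utf8 -c 'import sys; print(sys.stdout.encoding)'
utf-8
$ PYTHONUTF8=1 python3.7 -c 'import sys; print(sys.stdout.encoding)'
utf-8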
The confusion between the two PEPs led to both referring to the other in order to make it clear why both exist and why both are needed. PEP 538 explains the main difference:
Beyond that, PEP 538 goes into an example of the Python read-eval-print loop (REPL), which uses GNU Readline. In order for Readline to properly handle UTF-8 input, it must have the locale propagated in LC_CTYPE, as locale coercion does. On the other hand, some situations may not want to push the locale onto subprocesses, as PEP 540 notes:
The two earliest versions of PEP 540 (it went through four versions on the way to acceptance) proposed changing to the surrogateescape error handler for files returned by open(), but Inada was concerned about that. A common mistake that new Python programmers make is to open a binary file without using the "b" option to open(). Currently, that would typically generate lots of decoding errors, which would lead the programmer to spot their error. Using surrogateescape would silently mask the problem; it is also inconsistent with the locale coercion behavior, which does not change open() at all. After some discussion, Stinner changed the PEP to keep the strict error handler for files returned by open().
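As a rough sketch of the mistake in question (the file name here is hypothetical):
with open("photo.jpg", "rb") as f:   # correct: binary mode returns bytes
    data = f.read()

with open("photo.jpg") as f:         # mistake: text mode with the default strict handler
    text = f.read()                  # typically fails quickly with UnicodeDecodeError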
So Python 3.7 will work better than previous versions with respect to Unicode character handling. While the switch to Unicode was one of the most significant features of Python 3, there have been some unintended and unexpected hiccups along the way. In addition, the computing landscape has changed in some fundamental ways over that time, especially with the advent of containers. Strict ASCII was once a reasonable choice in many environments, but that has been steadily changing, even before Python 3 was released just over nine years ago.
Posted Dec 17, 2017 18:17 UTC (Sun)
by tialaramex (subscriber, #21167)
[Link]
[ Summary: Python doesn't do IDN properly, somehow this has infected its implementation of subject dnsName checks, even though the _whole point_ of IDN is that it lives in the presentation layer and shouldn't be gumming up low-level stuff - so if you use TLS/SSL with IDNs in Python what you get is probably not what you wanted ]
It is not clear to me (as someone who can program Python but doesn't work on its core) whether this has or will land. There's a line from November saying "I'll make sure a complete patch lands in 3.7" but hey, they managed to ship Python 3.0 without fixing all the "must have for Python 3" items so like President Bush said, fool me twice, won't get fooled again.
Posted Dec 17, 2017 21:13 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (74 responses)
Meanwhile, Python 2.7 is a great improvement on its successor. Strings are simply treated as sequences of bytes and this works just fine for 99% of applications.
Posted Dec 18, 2017 7:26 UTC (Mon)
by andrewsh (subscriber, #71043)
[Link] (23 responses)
Posted Dec 18, 2017 8:36 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (22 responses)
Posted Dec 18, 2017 8:49 UTC (Mon)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Dec 18, 2017 15:31 UTC (Mon)
by corbet (editor, #1)
[Link]
Posted Dec 18, 2017 9:42 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (19 responses)
(in contrast, since I have a UTF-8 locale)
[njl@totoro ~]$ echo "Ňø, îť đöéšń’ŧ." | python3.6 -c "import sys; print(sys.stdin.read().upper());"
No doubt Cyberax has a great future ahead of them in Enterprise Security software or some other sector where it doesn't matter whether your software actually works so long as you can do a demo that looks passable, once the poor customer has handed over their money you can ignore them until it's time to sell them an "upgrade".
Posted Dec 18, 2017 10:41 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (18 responses)
> cyberax@20c9d0485705:~$ echo "Is this correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
Whoops. You've caused a murder: https://gizmodo.com/382026/a-cellphones-missing-dot-kills...
See, the letter "I" is actually a Turkish dotless-i and so I expect "i" to capitalize to "İ". I can also give an example with German "ß".
So please, spare me this nonsense. Python's "Unicode" implementation is simply misleading bullshit which lulls developers into thinking that their naïve code works correctly everywhere. It doesn't.
Meanwhile, Python 2.7 has the correct behavior - it doesn't try to do anything outside of well-defined ASCII range. If you really really need questionable Unicode character transformations, you can do that in Py2.7 as well, but it won't do it silently.
Posted Dec 18, 2017 11:58 UTC (Mon)
by cortana (subscriber, #24596)
[Link] (1 responses)
Perhaps I'm missing something, but this does not appear to be the case:
It doesn't matter whether the developer is using Python 2.7, Python 3 or another language: their code must be tested to ensure that it actually works. If it is not tested then the business has fucked up and this needs to be rectified.
Posted Dec 18, 2017 18:52 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> It doesn't matter whether the developer is using Python 2.7, Python 3 or another language: their code must be tested to ensure that it actually works. If it is not tested then the business has fucked up and this needs to be rectified.
Python 2.7 makes you more aware of that.
Posted Dec 18, 2017 13:13 UTC (Mon)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Dec 18, 2017 18:52 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Dec 19, 2017 1:03 UTC (Tue)
by anselm (subscriber, #2796)
[Link]
Pretty much every language other than Turkish (and Azerbaijani) capitalises “i” as “I” (or for that matter lowercases “I” as “i”). Doing it like this is therefore a reasonable default unless the computer is explicitly told it is dealing with Turkish (or Azerbaijani) rather than English, French, German, Dutch, Portuguese, Finnish, Latin, Esperanto, …, none of which even have a dotless ı, and where capitalising “i” as “İ” or lowercasing “I” as “ı” would not just be silly but wrong.
Posted Dec 18, 2017 13:36 UTC (Mon)
by MarcB (subscriber, #101804)
[Link] (2 responses)
Your expectation is wrong: dotless-lower maps to dotless-upper. Dotted-lower maps to dotted-upper.
> I can also give an example with German "ß".
"ß" is special because it has (had) no uppercase form. So what Python should do here, according to Unicode rules, is map to "SS".
Which is exactly what it does:
Note that lower("SS") will obviously not produce "ß", but "ss" instead. So this is not a round trip (this is where case folding comes in).
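(A quick Python 3 illustration of the mappings described above:)
>>> "ß".upper()
'SS'
>>> "SS".lower()
'ss'
>>> "ß".casefold()
'ss'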
Posted Dec 18, 2017 18:09 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link]
This is most certainly not correct. (And yes, this system has locales-all.)
Expect Unicode to change about towupper('ß') though, de-DE-2017 changed it to mandate 'ẞ'. (About the only orthography change since 1995 I agree with.)
Posted Dec 18, 2017 19:04 UTC (Mon)
by Tara_Li (guest, #26706)
[Link]
https://www.youtube.com/watch?v=0j74jcxSunY
And while we're at it, let's check out time zones:
Posted Dec 18, 2017 21:43 UTC (Mon)
by ceplm (subscriber, #41334)
[Link] (6 responses)
Posted Dec 18, 2017 22:18 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
So I'm far more aware of encoding problems than most people in the US. After all, back in the '90s there were about six different Russian encodings in concurrent use (Win1251, Cp866, GOST, Koi8-RU, ISO and of course UTF-8).
That's why I maintain that the only sane way to work with text is to not work with text unless you absolutely HAVE to do it.
Posted Dec 18, 2017 23:08 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
Insisting on only processing OIDs, or telephone numbers, or EANs is avoiding working with text. If you try hard enough then insisting on working only with DNS A-labels, or email addresses, might count too. And if you _really_ squint maybe you can try this argument with even filenames [ but you might regret that ], or database schemas.
But that's still so very far from most things. If you're really staying away from "text" Python 3 is fine. OIDs work the same, and you can be annoyed when (in both Python 2.x and 3.x) Python thinks U-labels are the "right" way to think about DNS names even though DNS makes it very clear that this is a Bad Idea™. The problems you've run into exist only because you persist in working with text even while saying that's not what you're doing. That's how we got into this mess in the first place.
There are a bunch of hacks in existence to try to smash Latin text in particular into a form that suits machines better. This can make it seem as though "ASCII just works" but er, no, at most it has meant that Latin-consumed cultures have been brutalised into putting up with the consequences. Most English speakers will grudgingly tolerate a machine insisting they "Add 1 coin(s) to continue" but that's not actually how English works. W and i aren't actually supposed to be the same width, but we've made people sigh and roll their eyes when a machine insists they should be for the convenience of the programmer.
Just as we should almost always choose garbage collection over manual memory management because it's "pretty good" and doing better than "pretty good" is annoyingly difficult enough that it's better to focus on something else, so we should almost always use languages that do something sensible with Unicode rather than just shrugging and crossing our fingers that treating everything as raw bytes will work out OK. Are they perfect? No. Are they better than you were going to manage in the five minutes you had spare? Yes.
Posted Dec 18, 2017 23:26 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
And when you do, quite often you need to use additional rules to treat the text correctly. For example, filesystems are not guaranteed to have names that are valid UTF-8 sequences. DNS IDNs have additional restrictions and so on.
That's why the code written to work with "unicode" is most often incorrect. Its authors simply slather "unicode" stuff without thinking for a second about encodings, combining characters or locales and hope that it works.
Again, it's perfectly OK to have strings in the language that guaranteed that they are at all times valid UTF-8/UCS-2/whatever sequences. What is absolutely NOT OK is pretending that EVERYTHING is text by default. That's why Python3 had to grow a veritable menagerie of workarounds, starting from PathLike and ending with surrogate encoding for stdin.
Posted Dec 21, 2017 5:15 UTC (Thu)
by rixed (guest, #120242)
[Link]
I wish it was easier to distinguish between strings destined to be rendered for a user to see and vectors of bytes that happen to have a printable representation, for most of data and code itself. Restricting user interaction to html displayed in a web browser and giving up the terminal for UI is a trend going in the right direction, from this point of view.
Posted Dec 19, 2017 6:20 UTC (Tue)
by ceplm (subscriber, #41334)
[Link]
Posted Dec 19, 2017 6:24 UTC (Tue)
by ceplm (subscriber, #41334)
[Link]
Posted Dec 21, 2017 6:15 UTC (Thu)
by dvdeug (guest, #10998)
[Link] (2 responses)
So two programs have the exact same behavior, and yet one is wrong and the other is right. Huh. The only way that Turks are going to get proper capitalization of i to İ in most software is to have a Turkish locale and software written in Unicode-supporting languages that automatically take care of that little detail. Otherwise software that's written by people who aren't familiar with Turkish linguistic needs or assume that it won't used for Turkish (YAGNI lovers) will silently cause problems like the one in the link. Case-insensitive search is a real need, for example.
Posted Dec 22, 2017 17:04 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Unfortunately, too many free software devs think they can avoid locale particularities by setting C.UTF-8 everywhere, or deriving the locale from the input method (any latin layout can write other latin languages, and no I don't want to switch between qwerty and azerty when switching between English and French), or whatever. And they still invent text formats where the locale of a run of text is not explicitly specified.
It sort of works, as long as the software is used by devs writing in pidgin English. It fails spectacularly when exposed to normal people or normal human text.
Text = locale + encoding; remove one and you're full of fail.
Posted Jan 2, 2018 5:40 UTC (Tue)
by dvdeug (guest, #10998)
[Link]
> And they still invent text formats where the locale of a run of text is not explicitly specified
There's certainly logic there, but when exposed to normal people and normal human text, that often doesn't work. The theory is hard; at Wiktionary, we often have arguments about whether a word is French in English or has been adopted in English. Apple, Inc. is Apple even in many languages where they don't write in Latin script. Even texts where they could be clearly language tagged with human effort, like
> > Esperantists who dream of the final victory are non-existent in the modern day.
are likely to appear undifferentiated, with none of the posters having bothered to set the locale correctly (by default, that could have come from a German, Canadian, and an Estonian user, and given tags of de_DE, en_CA and ee_ET), and even those locales not having followed into the final post. Only serious texts I'm writing for permanent archival would I bother carefully language tagging; even a published PDF I might not bother, if you couldn't tell in the final copy.
Posted Dec 18, 2017 12:03 UTC (Mon)
by vstinner (subscriber, #42675)
[Link] (25 responses)
If you ignore the actual Python type, Python 3 is already processing stdin and stdout "as bytes" for the POSIX locale: since Python 3.5, stdin and stdout use the "surrogateescape" error handler if the locale is POSIX.
With the PEP 538 and PEP 540, the difference is that not only are undecodable bytes still passed through, but if the input is correctly encoded as UTF-8, Unicode functions work as well. You get the best of the two worlds (bytes of the Unix world, and text of Python 3).
The PEP 540 makes this trick usable with any locale, not only the POSIX locale. The new UTF-8 Mode must be enabled explicitly since ignoring the locale *can* introduce mojibake (even if it should not in practice, since we are already all using UTF-8, right? :-)).
The UTF-8 Mode allows writing Unix-like applications that process stdin and stdout "as bytes": undecodable bytes are stored as surrogate characters. It's the surrogateescape error handler, PEP 383.
It's not only a matter of decoding errors (reading stdin). It's also a matter of encoding errors: print("<funny unicode characters>") will work as expected (as soon as your terminal decodes stdout from UTF-8). Example:
$ python2 -c 'print(u"euro sign: \u20ac")'
$ LANG= python2 -c 'print(u"euro sign: \u20ac")'
Posted Dec 18, 2017 19:49 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (24 responses)
Posted Dec 19, 2017 8:22 UTC (Tue)
by peter-b (guest, #66996)
[Link] (23 responses)
Text is hard. A "byte sequence" is never an adequate representation of text. A word may contain up to 10 codepoints per character. There are different rules, depending on language, for what a "word" is, even.
Treating text as a series of bytes suggests to me that some subset of the following apply:
1. You never need to deal with meaningful amounts of non-English text.
Your ridiculous Turkish capitalisation strawman shows *exactly* how hard text is to get right if you treat it as a context-free bucket of bytes or codepoints. Unicode merely lets you represent the characters; operations like "uppercase" and "lowercase" are entirely meaningless in most languages, so you actually do need to know the text's context before applying them.
The Rust standard library has 13 different string types. A String is a UTF-8 Unicode codepoint sequence. A Vec<u8> is a series of bytes. There are other string types that represent other platform- and context-dependent text representations. It's well thought out (Mozilla's years of experience of dealing with arbitrary text in the trenches really shows), but it's complicated.
"Byte sequences" are neither a simple nor an honest way of handling and processing text. Your argument is the same as that of many people who objected to the excellent features that systemd provides for handling dependency-based service startup: "I don't have that problem, so it's not a valid problem." Please stop it.
Posted Dec 19, 2017 10:09 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
"You tried your best, and you failed miserably. The lesson is: Never try"
We've been here before on many things, it's how lots of people managed to persuade themselves that CVS was fine (remember the era when large successful Free Software projects used CVS? That seems pretty crazy now). Sure, it doesn't do what anybody actually wants, but trivial fixes don't seem to solve all our problems, so let's stop trying.
I would say that a byte sequence _is_ an adequate representation of text, but only in the same sense that a byte sequence is an adequate representation of a symphony or photograph. We should probably not try to write software that tries to process symphonies or photographs one byte at a time just because they were stored as a byte sequence.
Posted Dec 20, 2017 1:08 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
1) Don't try to hide complexity by replacing it with simple, neat and wrong solutions.
2) But if you really want to dive deep then do it. Perl 6 is a credible attempt to make Unicode support useful. And they had to do quite a bit of work for it to become real: https://perl6advent.wordpress.com/2015/12/07/day-7-unicod...
Your CVS comparison is off-target. "Unicode" support in Python is kinda like RCS - an actual step back, that introduced more complexity than it solves.
Posted Dec 19, 2017 13:08 UTC (Tue)
by foom (subscriber, #14868)
[Link] (19 responses)
Python3 goes to a lot of effort to convert just about everything, be it actually in a valid known encoding or not, into a codepoint sequence, in the name of Unicode Support. But that is unfortunately mostly wasted effort.
Imagine if, instead of a Unicode String type, python just had functions to iterate over and otherwise manipulate a utf8 buffer by whichever grouping you like: byte, codepoint, glyph or word?
It could have ended up with something more like golang, where "string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text."
And that would've been good.
Posted Dec 20, 2017 23:24 UTC (Wed)
by togga (guest, #53103)
[Link]
Not only that, it also wastes its users' time and effort. It trades making rare complex tasks look easy on the surface for making common simple tasks annoyingly complex.
Posted Jan 4, 2018 4:36 UTC (Thu)
by ras (subscriber, #33059)
[Link] (17 responses)
I'm not a huge fan of Python3's Unicode implementation, but this over-eggs the issue. We have had byte sequences representing text for as long as I can remember (which is longer than I care to admit). The issue was that if it really was text, you had to convert those byte sequences into text. There were standards, code pages and whatnot that told you how to do it, of course. But there were so many to choose from.
ISO 10646 solved the problem by dividing it in half. It assigned each unique character a number it called a code point. Then, as a separate exercise, it proposed some ways of encoding those numbers into bytes. Everyone seems to be happy enough to accept the ISO 10646 mapping from characters to numbers, meaning everyone is just as happy accepting that the only true code point for € is 0x20ac as they are accepting that the only true code point for 'a' is 0x61. (Well, everyone except Cyberax apparently, who wants a code point for an Italian i and presumably '\r\n' as well.) But maybe even Cyberax would agree 10646 was a major step forward over code pages. The encoding into bytes will remain a blight until everyone agrees to just use UTF-8. Given UCS2 is hard wired into Windows and Javascript, that might be never. Sigh.
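(A quick Python 3 illustration of that split between code points and byte encodings:)
>>> hex(ord("€")), hex(ord("a"))
('0x20ac', '0x61')
>>> "€".encode("utf-8")
b'\xe2\x82\xac'
>>> "€".encode("cp1252")
b'\x80'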
My only issue with Python3 is it is far too eager to convert bytes into text. stdin, stdout, os.listdir(), the default return of open() - they should all be byte streams. I don't know why they thought they could safely convert it to text. Doing so in a 'nix environment looked to me to be a recipe for disaster because the LANG= could be different on each run, and that's pretty much how it turned out. They had languages to copy from that did it mostly right, like Java for instance, and they still got it wrong.
Posted Jan 4, 2018 6:32 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.
Posted Jan 4, 2018 9:47 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (5 responses)
The idea behind “Unicode” strings is that they are not encoded at all (so much for “unspecified encoding”) because they don't need to be encoded. When you read text from a file, socket, … (where it tends to be encoded according to UTF-8, UCS2, or whatever, and that encoding is an attribute of the communication channel in question) it is decoded into a Unicode string for internal processing, and when you write it back it is again encoded according to UTF-8, UCS2, or whatever, depending on where it goes. The system assumes that people writing Python programs are more likely to deal with Unicode text rather than sequences of arbitrary bytes, and therefore leans in the former direction by default.
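(A minimal sketch of that model, with hypothetical file names and encodings:)
with open("in.txt", encoding="utf-8") as f:
    text = f.read()              # str: a sequence of code points, no encoding attached
with open("out.txt", "w", encoding="utf-16") as f:
    f.write(text)                # re-encoded for the destination channel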
On the whole this is not an entirely unreasonable approach to take. There are problems with (inadvertently) feeding this mechanism streams of arbitrary bytes, which tends to fail due to decoding issues, with fuzzy boundaries between what should be treated as text vs. arbitrary bytes (such as path names on Linux), and with the general quagmire that Unicode quickly becomes when it comes to more complicated scripts. But saying that the way Python 3 handles Unicode is much worse in general than the way Python 2 handled Unicode doesn't ring true. My personal experience after mostly moving over to Python 3 is that now I seem to have much fewer issues with Unicode handling than I used to when I was mostly using Python 2, so as far as I'm concerned, Python 3 is an improvement in this area – but then again I'm one of those programmers who have to deal with Unicode text much more often than with streams of arbitrary bytes.
Posted Jan 4, 2018 19:28 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
As a result, they are strictly not better than simple byte sequences. You can't even concatenate them without getting strange results (just play around with combining characters and RTL switching). Pretty much the only safe thing you can do with Py3 strings is to spit them out verbatim. At which point it's just easier to deal with good old honest raw byte sequences.
And it doesn't have to be so - Perl 6 does Unicode the correct way. They most definitely specify the encoding and normalize the text to make indexing and regexes sane. In particular, combining characters and languages with complex scripts are treated well by normalizing the text into a series of graphemes.
Posted Jan 5, 2018 17:51 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (3 responses)
As far as I'm concerned, it is silly to claim that Python's Unicode strings are “strictly not better” than byte sequences when it is possible, in Python 3, to open a text file using the built-in open() function and read UTF-8-encoded text from that file. This is arguably what people expect to be able to do in 2017, and it is an obvious improvement over Python 2.7, where you had to jump through your choice of flaming hoops (involving, e.g., the codecs module) to be able to accomplish the same thing. (People who would rather read binary data from a file can still open it in binary mode and see the raw bytes.)
So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point). It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course. Having the concept of Unicode strings in the first place, however, is a prerequisite for being able to deal with these cases at all.
It may well be the case that Perl 6's handling of Unicode data is currently better than Python's. They certainly took their own sweet time to figure it out, after all. But given the way things work between the Python and Perl communities, if the Perl people have in fact caught on to something worthwhile, chances are that the same – or very similar – functionality will pop up in Python in due course (and vice versa). So I wouldn't consider the Unicode functionality in current Python to be the last word on the topic.
Posted Jan 5, 2018 18:47 UTC (Fri)
by jwilk (subscriber, #63328)
[Link]
Posted Jan 5, 2018 18:51 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points
> It is true that there are situations which Python doesn't handle well right now, but it is reasonable to expect that these will be sorted out in due course.
Posted Jan 5, 2018 23:46 UTC (Fri)
by lsl (subscriber, #86508)
[Link]
I can do that using the raw open and read syscalls. Why would I need magic Unicode strings for that?
> So, Python's Unicode strings are better than simple byte sequences because they allow programmers to deal with Unicode code points (as opposed to bytes which might or might not be part of the representation of a Unicode code point).
Except for the fact that arrays of Unicode code points aren't a terribly useful representation most of the time. Byte strings can carry UTF8-encoded Unicode text just as well, in addition to being able to hold everything else users might want to feed into my program and expect it to roundtrip correctly. The odd day where I really need to apply some Unicode-aware text transformation that works on code points I can convert it there and then. Or just package it up such that it accepts and returns UTF-8 strings.
Posted Jan 10, 2018 3:03 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (9 responses)
Posted Jan 10, 2018 8:29 UTC (Wed)
by jem (subscriber, #24231)
[Link] (1 responses)
Java moved from UCS-2 (64k code points) to UTF-16 (1M+ code points) in version 1.5 (2004). Of course, the transition is not completely transparent to applications, which can still think they are dealing with UCS-2.
Posted Jan 10, 2018 11:30 UTC (Wed)
by HelloWorld (guest, #56129)
[Link]
Posted Jan 10, 2018 17:40 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Jan 10, 2018 22:20 UTC (Wed)
by ras (subscriber, #33059)
[Link] (5 responses)
Having spent my life in a country where ASCII covers every character we use plus some, this is all news to me. It sounds like a right royal balls-up. Was there a good reason for not making code point == grapheme?
Posted Jan 10, 2018 22:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Jan 10, 2018 23:15 UTC (Wed)
by ras (subscriber, #33059)
[Link] (3 responses)
Sounds almost reasonable.
I wonder if they realised how many bugs that feature would create? Most programmers don't care about this stuff, to the point that if the unit test's "Hello World" displays properly, job done. I'd be tempted to say "no programmer cares", but I guess there must be at least one renegade out there who has tested whether their regex chokes on multi-code-point graphemes.
Posted Jan 10, 2018 23:31 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
It's not like Unihan is even the worst offender. Look at this, for example: देवनागरी लिपि - try to edit it in browser.
Posted Jan 11, 2018 0:40 UTC (Thu)
by ras (subscriber, #33059)
[Link]
I may as well be looking at hieroglyphs. In fact I might have more chance with hieroglyphs as the pictures are sometimes recognisable.
I guess the point I was trying to make is that if you want this stuff to have any chance of just working in a program written by a programmer who doesn't care much about this stuff (and I suspect those who do care are a tiny few, and only some of the time), the way to do it is to make code point == grapheme.
It is nice to have a fallback for a grapheme you can't display to its root, but that could also be handled by libraries that do the displaying. There are really only a few programs that care overly much - browsers, email clients and word processors spring to mind. ls, cat and the 100's of little scripts I write to help me through the day don't care if someone can't read every character in the data they display, but mangling the data on the way through (which is what Python3 manages to pull off!) is an absolute no-no. For security it's even worse. This is like 'O' looking like '0', but now there really is a '0' in the string, albeit preceded by a code point that makes it not a '0'. Who is going to check for that?
What I would class as good reason for not making code point == grapheme would be the number of code points exceed 2^31. But since you say Perl 6 emulates it, I guess that's not a problem.
Posted Jan 11, 2018 0:44 UTC (Thu)
by ras (subscriber, #33059)
[Link]
Posted Dec 21, 2017 7:42 UTC (Thu)
by togga (guest, #53103)
[Link]
Python is going from "productivity" to "cumbersome" measured in development hours on a day-to-day basis, and now has a hard time compensating for all its (other) weaknesses (multithreading with the GIL, performance, ...).
How much lipstick can you paint the pig with? "Coercing", "C.UTF-8", "new UTF-8", "PYTHONUTF", "-x utf8", ... My favourite is "surrogateescape error handler".
You can either be this and that type of script with different settings. Best for all?
I agree with Cyberax. Keep it simple (bytestrings) and let the user do explicit things with the data with library functions and classes (maybe include a more advanced regex) when text functionality really needed.
That said. Python2 gave me 14 pretty good years of scripting productivity (mostly attributed to numpy). With this, it's time to move on.
Posted Dec 18, 2017 13:16 UTC (Mon)
by MarcB (subscriber, #101804)
[Link] (23 responses)
Depending on what you usually do, that might indeed be 99% of your applications - or it might be closer to 0%. For myself, it is pretty much 0% (but I do not use Python).
It breaks, as soon as you want to do such things as getting the length of a string in characters. Or do correct uppercasing, lowercasing or case folding. Or if you need to re-encode in any form. Or even if you just want to compare strings: for Unicode, the following is NOT TRUE: string_a = string_b <=> encoded(string_a) = encoded(string_b).
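(Relatedly, canonically equivalent strings need not even compare equal without normalization; a quick Python 3 illustration with a precomposed and a decomposed "é":)
>>> import unicodedata
>>> a, b = "\u00e9", "e\u0301"        # both render as "é"
>>> a == b
False
>>> unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
True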
Now, the question is: Does it make sense to use the current locale settings to decode STDIN? Obviously, it is not always correct: The content might have been written under another setting. Or the writing application might not have honored that setting. Or it might not even be text.
Posted Dec 19, 2017 0:45 UTC (Tue)
by khim (subscriber, #9252)
[Link] (5 responses)
Unicode is complex - because human languages are complex.
The Zen of python says: explicit is better than implicit - yet with Unicode Python3 tries very hard to do implicit - and fails. Even after all these years.
Posted Dec 19, 2017 10:31 UTC (Tue)
by kigurai (guest, #85475)
[Link] (2 responses)
Granted, I work very little with text these days, so I don't claim to know what the ultimate policy is.
Posted Dec 20, 2017 20:23 UTC (Wed)
by kjp (guest, #39639)
[Link] (1 responses)
Posted Dec 20, 2017 23:51 UTC (Wed)
by togga (guest, #53103)
[Link]
True, and this is the sad part. It shows all over, from changing "Queue" to "queue" to the whole string mess; it's like they're doing a piece of art that gets more and more abstract and detached from reality. Having once chosen Python for practicality, every time I've tried to switch to python3 over the years I've run into more and more practical issues, up until the current mess...
"The Linux way; never ever break user experience".
That shouldn't be so hard; it works for most other major languages out there. When did C have a regression like this?
Posted Dec 24, 2017 15:20 UTC (Sun)
by CChittleborough (subscriber, #60775)
[Link] (1 responses)
(Aside: U+1F466 = BOY and U+1F3FC = EMOJI MODIFIER FITZPATRICK TYPE-3, so that two-unicode sequence represents a single glyph.)
There is a more general and more serious problem here. Too many programmers assume that Unicode is an improvement on the preceding mess of charsets (especially ISO-2022), but it is actually a union of every significant charset, with none of their problems fixed, and new problems added (such as the distinction between unicodes and glyphs seen here).
The Unicode Consortium prioritized compatibility with previous encodings over programmer convenience, because they knew that the people who decided whether to switch to Unicode would care a lot more about round-trip compatibility than code simplicity. (For example, titlecase is a hack to deal with 4 (four) little-used characters in a Yugoslavia-specific charset.) Now we programmers have to live with that lack of simplicity.
Sadly, few of us can just close our eyes and pretend that everything is a byte sequence. Most programmers will need to at least understand at a high level the problems of handling Unicode text. (The important lessons: (1) use library routines instead of writing your own string-handling code and (2) be grateful to the poor folk who write those library routines.) System admins will have to decide on what encodings to use in file names, and what to do when (not if) someone downloads files which use a different encoding. And so on. This is a significant challenge, and there is no way to avoid it.
Posted Dec 24, 2017 15:28 UTC (Sun)
by CChittleborough (subscriber, #60775)
[Link]
Posted Dec 19, 2017 18:11 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (16 responses)
This is almost invariably not going to be what you wanted. Character is a very loaded word in this context, and whatever programming structure you build probably won't handle the weight of it.
I think Python provides a count of how many Unicode code points are in your string. It is left as an exercise to the reader to imagine why you'd want to count those, as I can't think of any good examples. That you think it's counting "Characters" should sound alarm bells.
In particular you cannot try to cut Strings by just lopping off some of the Unicode code points. That's not how a String works in Human languages and particularly Unicode. Yes it worked in ASCII, sort of, but this isn't ASCII.
Posted Dec 19, 2017 18:12 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link]
Posted Dec 19, 2017 20:44 UTC (Tue)
by nybble41 (subscriber, #55106)
[Link] (14 responses)
A significant problem with the pervasive use of Unicode strings in Python is that it's based on a strict final encoding—all data is coerced into Unicode on input by default, whether it needs to be or not. This means that programs which otherwise have no reason to worry about encoding issues are still forced to deal with errors when presented with non-Unicode input, or else expend considerable effort to preserve the original binary format. The default should be to keep input in its original form (a byte array) until and unless the program actually needs to treat it as a sequence of Unicode codepoints, at which point it should be prepared to deal with any decoding errors.
Fortunately, the single most common operations on text input—concatenating strings and splitting them into fields based on 7-bit ASCII delimiters—work perfectly well on UTF-8 byte strings (by design) without decoding into codepoints. The less common operations which would require the program to deal with Unicode issues, such as case conversion and inexact comparison of arbitrary non-ASCII strings, are generally ill-advised without a much more comprehensive framework than that provided by the Python string libraries.
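(A minimal sketch of why splitting on an ASCII delimiter is safe for UTF-8 byte strings: bytes below 0x80 never occur inside a multi-byte sequence, by design.)
>>> line = "naïve|café|résumé".encode("utf-8")
>>> line.split(b"|")
[b'na\xc3\xafve', b'caf\xc3\xa9', b'r\xc3\xa9sum\xc3\xa9']
>>> line.split(b"|")[1].decode("utf-8")
'café'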
Posted Dec 20, 2017 0:32 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (13 responses)
For example, suppose your tool "simply" takes pipe separated data and concatenates fields four and six then outputs them. It _seems_ very easy to implement this on UTF-8 input without grokking Unicode, we can just say "bytes are bytes" the pipe symbol is ASCII and so it all works fine, right?
Nope. Now an adversary can hide a symbol by splitting the representation bytes across the two fields, making it into nonsense in the process, and relying on our dumb non-Unicode process to re-assemble it. A symbol that we're absolutely sure we prohibited elsewhere in our process chain sneaks in this way and causes havoc. Unicode has some nice simple rules that allow us to avoid this mistake, but to implement them we're back to grokking Unicode, which you've said we needn't bother doing.
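(A rough sketch of the reassembly problem with made-up field values; the UTF-8 encoding of "€" is the byte sequence e2 82 ac, split here across two fields:)
>>> f4, f6 = b"abc\xe2\x82", b"\xacdef"   # each field on its own is invalid UTF-8
>>> (f4 + f6).decode("utf-8")             # naive byte-level concatenation "repairs" it
'abc€def'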
Sometimes, near the end of a processing chain whose output we're confident we don't care about, omitting Unicode error handling is safe‡. But like so many things in computing the fact it's "sometimes" safe but not often is enough reason to never do it up front until you've proved you really need to for some reason.
‡ Jamie Zawinski swears this is true for his Xscreensaver text output, if nonsense pixels come out, he would argue, the screen is garbled but the screensaver still works, so an attacker has achieved nothing of consequence. I have not examined the complicated program to confirm "nonsense pixels" are the worst possible outcome, but perhaps so.
Posted Dec 20, 2017 1:01 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (12 responses)
Posted Dec 20, 2017 15:59 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (11 responses)
Exactly. This is the sort of thing the runtime should provide a library for—given a byte array, is it valid UTF-8? And of course, if you want to filter for specific Unicode codepoints then you'll have to decode the string anyway.
Not that Python's Unicode strings would necessarily protect against this scenario in the first place. In surrogateescape mode the incomplete codepoints would be passed through unchanged, just as if you'd used byte arrays. But how often does one see code concatenating fields together without a known (and typically ASCII) delimiter anyway? No matter how it was implemented the transformation would lose information.
Posted Dec 20, 2017 17:17 UTC (Wed)
by brouhaha (subscriber, #1698)
[Link] (10 responses)
Posted Dec 20, 2017 17:49 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted Dec 20, 2017 18:36 UTC (Wed)
by vstinner (subscriber, #42675)
[Link] (7 responses)
To be clear: Python doesn't force anyone to use Unicode.
All OS functions accept bytes, and Python commonly provides two flavors of the same API: one for Unicode, one for bytes.
Examples: urllib accepts URL as str and bytes, os.listdir(str)->str and os.listdir(bytes)->bytes, os.environ:str and os.environb:bytes, os.getcwd():str and os.getcwdb():bytes, open(str) and open(bytes), etc.
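(For instance, the return type follows the argument type; a hypothetical session in a directory containing a single file, setup.py:)
>>> import os
>>> os.listdir(".")      # str argument: names are decoded to str
['setup.py']
>>> os.listdir(b".")     # bytes argument: names are left as bytes
[b'setup.py']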
sys.stdin.buffer.read() returns bytes which can be written back into stdout using sys.stdout.buffer.write().
The PEP 538 and 540 try to make life easier for programmers who care about Unicode and want to use Unicode for good reasons. Again, for Unix-like tools (imagine a Python "grep"), stdin and stdout are configured to be able to "pass through bytes". So it also makes life easier for programmers who want to process data as "bytes" (even if technically it's Unicode, Unicode is easier to manipulate in Python 3).
Posted Dec 20, 2017 18:55 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
> The PEP 538 and 540 try to make the life easier for programmers who cares of Unicode and want to use Unicode for good reasons.
Posted Dec 20, 2017 20:42 UTC (Wed)
by nas (subscriber, #17)
[Link] (1 responses)
> No they don't try to make it easier.
Your trolling on this site is reaching new heights. Now you are actually claiming that the people working on developing these PEPs are not even intending to make Python better? What devious motivation do you think they have then? I mean, come on.
You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.
Posted Dec 20, 2017 20:46 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> You like the string model of Python 2.7 better. Okay, you can have it. As someone who writes internationalized web applications, Python 3.6 works vastly better for me.
Posted Dec 21, 2017 11:56 UTC (Thu)
by jwilk (subscriber, #63328)
[Link] (1 responses)
Here's a familiar Python developer arguing against adding it: https://bugs.python.org/issue8776#msg217416
Posted Dec 21, 2017 12:00 UTC (Thu)
by vstinner (subscriber, #42675)
[Link]
sys.argvb wasn't added since it's hard to maintain two separate lists in sync. Harder than os.environ and os.environb, which are mappings.
Posted Dec 21, 2017 17:53 UTC (Thu)
by kjp (guest, #39639)
[Link]
Posted Dec 23, 2017 16:35 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Dec 20, 2017 21:22 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link]
That is a partial solution which works so long as you don't mind dealing with exceptions or care about the performance impact of actually decoding a potentially large byte string into a Unicode string. It would be nice to have a function which just scanned the array in place and returned the result without allocating memory or throwing exceptions.
In any case I was not trying to say that Python 3 does not provide a function like this, just that it's something that does belong in the standard library.
> Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Having separate types is good. However, making the more complex and restrictive string-of-characters type the default or customary form when array-of-arbitrary-bytes would be sufficient is a mistake. Duplicating all the APIs (a byte version and a Unicode version) is yet another mistake, and basing return types on the types of unrelated parameters makes it even worse. (Why assume the file _contents_ are in UTF8 just because a Unicode string was used for the filename?) Putting aside the minority of APIs which inherently relate to Unicode, the rest should only accept and return byte arrays, leaving the conversions up to the user _if they are needed_.
Arguably, filenames should be a third type, neither byte arrays nor Unicode strings. On some platforms (e.g. Linux) they are byte arrays with a variety of encodings, most commonly UTF8 but not restricted to valid UTF8 sequences. On others (e.g. Windows) they are a restricted subset of Unicode (UCS-2). Some platforms (MacOS) apply transformations to the strings, for normalization or case-insensitivity, so equality as bytes or as Unicode codepoints may not be the same as equality as filenames. Handling all of this portably is a difficult problem, and the solutions which are most suitable for filenames are unlikely to be applicable to strings in general.
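(One illustration of how CPython currently papers over the Linux case: os.fsdecode() and os.fsencode() use surrogateescape, so a file name that is not valid UTF-8 still round-trips. The name below is made up, and a UTF-8 filesystem encoding is assumed.)
>>> import os
>>> name = b"caf\xe9.txt"          # Latin-1 era leftover, not valid UTF-8
>>> os.fsdecode(name)
'caf\udce9.txt'
>>> os.fsencode(os.fsdecode(name)) == name
True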
Posted Dec 18, 2017 5:22 UTC (Mon)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Dec 18, 2017 11:50 UTC (Mon)
by vstinner (subscriber, #42675)
[Link] (1 responses)
Posted Dec 18, 2017 13:36 UTC (Mon)
by Otus (subscriber, #67685)
[Link]
/somewhat-unserious
Posted Dec 18, 2017 17:52 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (2 responses)
However, recently I got told off for roughly this:
tg@tglase-bsd:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # MirBSD
The author of the script explicitly set LC_ALL to C to get ASCII behaviour, which was ignored in MirBSD “because everything is UTF-8”, however the POSIX C locale requires different classifications (like the iswalpha(3) one used here) than a full C.UTF-8 environment would have. In fact, only the 26 lower- and 26 upper-case latin letters are required to be matched by the alpha cclass, as is on Debian:
tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # Debian
(That being said, GNU’s tr(1) is defective, as this…
tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C.UTF-8 tr -d '[[:alpha:]]' # Debian
… should have output just “<>”, but that’s a different story.)
tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹
Posted Dec 18, 2017 18:01 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link]
「It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8.」
That could actually work. Still, someone with time on their hands please think over my parent post and provide feedback (also to me).
Posted Dec 21, 2017 11:11 UTC (Thu)
by vstinner (subscriber, #42675)
[Link]
No solution is perfect, and that's why both PEP 538 (C locale coercion) and PEP 540 (UTF-8 Mode) can be explicitly disabled, when you really want the C locale with the ASCII encoding.
Example:
$ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import locale, sys; print(sys.stdout.encoding, locale.getpreferredencoding())'
The bet is that most users will prefer the C locale coercion and UTF-8 Mode enabled by the POSIX locale.
Posted Jan 4, 2018 0:28 UTC (Thu)
by jklowden (guest, #107637)
[Link]
No, that's a downside: added complexity without added power. All the user of an embedded application has to do is set the environment appropriately before using it. If host and child need different environments, the problem is deep indeed, and the probability that any PEP will solve it correspondingly low.
PEP 540 adds weird automagical transformation, convenient (if it works) for the user who forgets to set the environment, or doesn't know how. That's not a service to them, or to anyone else.
https://github.com/python/cpython/pull/3010
Ňø, îť đöéšń’ŧ.
LWN is still Python 2. The last dependency just moved forward a month or two ago, so all that's needed is a bit of time to make the switch. That "bit of time" part can be surprisingly hard, though.
Ňø, îť đöéšń’ŧ.
ŇØ, ÎŤ ĐÖÉŠŃ’Ŧ.
> IS THIS CORRECT?
See, the letter "I" is actually a Turkish dotless-i and so I expect "i" to capitalize to "İ"
$ echo "Is this correct?" | hd
00000000 49 73 20 74 68 69 73 20 63 6f 72 72 65 63 74 3f |Is this correct?|
00000010 0a |.|
00000011
So please, spare me this nonsense. Python's "Unicode" implementation is simply misleading bullshit which lulls developers into thinking that their naïve code works correctly everywhere. It doesn't.
Nope. That's the whole deal - Turkish uses latin "I" but has different capitalization rules.
This is a complete non-answer. Python 3 makes it easy to write code that is mostly correct under the assumption that it is ALWAYS correct.
echo "Is thıs correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
IS THIS CORRECT?
echo "Iß thiß correct?" | python3.6 -c "import sys; print(sys.stdin.read().upper());"
ISS THISS CORRECT?
IS THIS CORRECT?
Yes, it does. In 99% of cases you need not go past ASCII or its UTF-8 subset to parse structural data.
> IS THIS CORRECT?
> Ne, mi estas finvenkisto!
Okay, so that's one finvenkisto. Most of us no longer dream of Esperanto being the be-all and end-all of world language.
euro sign: €
$ python3.7 -c 'print(u"euro sign: \u20ac")'
euro sign: €
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 11: ordinal not in range(128)
$ LANG= python3.7 -c 'print(u"euro sign: \u20ac")' # enjoy PEP 538+PEP 540
euro sign: €
2. You never need to deal with non-trivial text manipulation, such as interactive editing. Pressing <Backspace> removes the last glyph; how many bytes is that?
3. You never need to deal with parsing or segmenting text *per se*. What's the first word in this sentence?
Nope. My position is:
I'm against making magic "unicode" strings with unspecified encoding and then forcing everything to use these strings.
As I said, they are magic strings that solve everything.
If you want UTF-8, you need to specify the encoding explicitly.
I can do so in Py2 as well: 'content = open("file").read().decode("utf-8")'. I still don't see why it justifies huge breaking changes that require mass re-audit of huge amount of code.
I've yet to hear why I would want to deal with Unicode codepoints instead of bytes everywhere by default.
No they won't, without yet another round of incompatible changes. Py3 is stuck with Uselesscode strings.
A code point is *not* a character. Umlauts like ä, ö and ü can be written in two ways in Unicode, either with a single code point for the whole thing or with one code point for the underlying vowel and one code point for the diacritic. This sequence of two code points is called a grapheme cluster. There are entire Unicode blocks that contain only code points that only make sense as part of grapheme clusters, like Hangul Jamo for Korean. Many variations of emoji are also implemented this way. For this reason I don't think it makes sense to treat strings as sequences of Unicode code points, it should be grapheme clusters, and that's what Perl 6 does while Python, like always, fucked it up (and Java did even worse, because its strings are sequences of 16-Bit "char" values which are not even code points, because there are more than 65k of those by now)
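(A quick Python 3 illustration of the code point vs. grapheme cluster distinction:)
>>> import unicodedata
>>> s = "a\u0308"                            # 'a' plus COMBINING DIAERESIS, renders as "ä"
>>> len(s)
2
>>> len(unicodedata.normalize("NFC", s))     # composes to the single code point U+00E4
1
>>> len("👦🏼")                                # emoji plus skin-tone modifier: no composed form
2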
What if I need to be both, or the most common case doesn't exactly match either of them? This mess is getting so complex that for a pro it's almost impossible to get full control, and newbies never even know what hit them when they're floored by python3 (I've helped a few...). For example, the simple case of ctypes attributes, often read from an external source, that need to be the string type. The text mess is a nasty swamp to wade through.
But then: The application would have to handle this anyway. A correct application would produce the same results in Python 2 and Python 3. An incorrect application would throw an exception in Python 3 while it could silently produce wrong results in Python 2.
> It breaks, as soon as you want to do such things as getting the length of a string in characters.
Okay. Let's see how python3 solves that problem:
$ python3
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len('Length')
6
>>> len(' Len ')
5
>>> len('👦🏼👦🏼👦🏼')
6
So where is this non-solution for a non-problem useful?
> Or do correct uppercasing, lowercasing or case folding.
Does not work there either.
> An incorrect application would throw an exception in Python 3 while it could silently produce wrong results in Python 2.
No. Both would introduce errors unless lots of care is taken. Except with python2 it would be immediately obvious and with python3 the errors would be subtle and not immediately obvious.
If it does the right thing in most cases, then implicit and simple is very practical.
https://felipec.wordpress.com/2013/10/07/the-linux-way/
The reason that
len('👦🏼👦🏼👦🏼')
returns 6 is that the string is actually the two-unicode sequence [U+1F466 U+1F3FC] repeated 3 times.
What every programmer and sysadmin should know
You can do UTF-8 validation for it. In most cases it's not even needed, split bytes are fine for CSV-like processing.
Python 3 provides that. It's the method decode applied to the array of bytes. It will return a string if the bytes are a valid UTF-8 sequence, and raise an exception if not.
#!/usr/bin/env python3
import sys
def is_valid_unicode(b):
    try:
        s = b.decode('utf-8')
    except:
        return False
    return True
b = bytes([int(x, 16) for x in sys.argv[1:]])
print(is_valid_unicode(b))
Example:
$ ./validutf8.py ce bc e0 b8 99 f0 90 8e b7 e2 a1 8d 0a
True
$ ./validutf8.py 2d 66 5b 1a f7 53 e3 f6 fd 47 a2 07 fc
False
I'm confused by aspects of this discussion. Do some people think that having distinct data types for array-of-arbitrary-bytes and string-of-characters is a bad thing?
Yes it is bad, if the language defaults to "string-of-codepoints" under the hood and goes through a lot of contortions to maintain this pretense.
You mean: "Python3 provides enough hoops through which you can jump to achieve parity with Python2".
No they don't try to make it easier.
People who designed register_globals for PHP were also trying to make life easier for other developers.
You assume that I haven't written i18n-ed applications? How exactly is Py3 better for it?
<>
<ä>
<ä>
ANSI_X3.4-1968 ANSI_X3.4-1968