Fedora and Python 2

Posted Apr 9, 2018 2:17 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
In reply to: Fedora and Python 2 by smurf
Parent article: Fedora and Python 2

> PHP6-full-of-Unicode idea was doomed to fail the second they decided to base their work on UTF-16
Why is that? What is wrong with UTF-16?

> "the day Django stops supporting Python 2 they will be able to rip out a ton of code that exists purely because it was so easy to mix text and binary data and get it wrong" (cited from the page you linked to).
I really doubt it.

> If Go is a better fit than Python2/3 for whatever it is you want to do, well, go for it. I'm not going to infer any perceived inferiority of Python2 vs. Python3 from that. It's probably more like "if we need to spend some effort to switch over anyway, let's examine what else might work even better".
Exactly. And this is self-inflicted entirely.

Fedora and Python 2

Posted Apr 9, 2018 2:58 UTC (Mon) by roc (subscriber, #30627) [Link] (5 responses)

I know nothing about PHP but

> What is wrong with UTF-16?

Outside UTF-16-based platforms (Windows, Java, JS), UTF-16 is strictly worse than UTF-8:
* Twice the space usage for ASCII (almost never more compact than UTF-8 on any text in practice)
* Multi-code-point characters are rarer, therefore less well tested
* Byte-order-dependent
* Needs special logic to sort lexicographically
More details: http://utf8everywhere.org/

Fedora and Python 2

Posted Apr 9, 2018 7:27 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

So why did Py3 default to it on Windows and many Linuxes until Py3.3?

Fedora and Python 2

Posted Apr 9, 2018 10:09 UTC (Mon) by smurf (subscriber, #17840) [Link]

Presumably they now know better. Correcting mistakes and all that.

Fedora and Python 2

Posted Apr 9, 2018 23:49 UTC (Mon) by roc (subscriber, #30627) [Link] (2 responses)

I don't know.

People use UTF16 a lot more than they should because in the 90s everyone was taught that 2-byte code points was *the* way to support Unicode, and Windows, Java, JS and others designed their APIs around that assumption. That kinda made sense if you believed, as many people did, that the 2-byte encoding with no multi-code-point characters (UCS2) would suffice for every Unicode character forever. That teaching was pervasive enough that many people came to *define* Unicode support as 2-byte code points, and some still do. So I'm not surprised that Python people got it wrong.

It's all a multibillion-dollar mistake :-(. Platform and application vendors did a ton of work to shift to UCS2 and then (implicitly) UTF16, to end up with a worse encoding than UTF8, a lot more work than if they'd just reinterpreted their byte strings as UTF8. Linux got this right, I think largely by just delaying Unicode support until it was clear UTF16 was a loser.

People at the Unicode consortium who set everyone down the UTF16 path really should fess up and apologize. Their mistake caused vast resources to be wasted and has actually caused Unicode support to be worse than it otherwise would have been.

Fedora and Python 2

Posted Apr 12, 2018 23:18 UTC (Thu) by dvdeug (guest, #10998) [Link]

There's people at Unicode still pissed at UTF-8. We could have had one standard line end marker, one paragraph marker, proper dashes and quotes, but instead we got UTF-8 and forced Unicode to be a fancier ASCII. UTF-1 was sort of ASCII compatible back at the start of Unicode, but it was ... horrifying. Doing "mod 190" to encode characters, anyone?

The other argument is that nobody could have sold a 32 bit encoding in the early 1990s. In 1996, they declared that it was going to have expand from 16 bit to 32 bit (or 20.1 bit). In 2001, Deseret was one of the first scripts encoded beyond above FFFF because they needed to start encoding stuff up there, but they didn't want to start with scripts people were going to fight to keep in the BMP. And yet it wasn't until 2010 that MySQL, even in UTF-8, supported characters above FFFF. Unless they've made some changes since I checked last time, it still bites people that MySQL charset utf8 is for FFFF and below only, and utf8mb4 is needed to actually encode UTF-8.

With a bunch of foresight on everyone's part, it might have been better. But pushing a 32 bit encoding in 1990 could also have mired the idea and left us working with an ISO 2022-style pile of encodings or at least stalled things by a decade where more legacy data in legacy encodings, and even more legacy encodings, were created, and more protocols were designed around the idea that everything has its own encoding instead of everything being in a fixed encoding, or at least a Unicode-compatible encoding.

Han Unification means UCS-2 was doomed and UTF-16 makes no sense

Posted Apr 15, 2018 1:14 UTC (Sun) by DHR (guest, #81356) [Link]

When UNICODE-1 (ucs-2) was designed, with a maximum of 64k code points, it was 100% forseeable that this was a mistake.

The key was it require "Han unification". Japanese, traditional Chinese, Korean, and simplified Chinese symbols would have to share code points. This was never going to be acceptable to people using those languages.

The analogy I heard was: would the Greek and Roman alphabets share code points? Alpha and A are really the same, are they not? How about Aleph? No way!

The fact is that UNICODE-1 was doomed before birth.

UTF-16 was always a bad idea. Some tried to ignore that and we live with that mistake.

<https://en.wikipedia.org/wiki/Han_unification>

UTF-8 was designed by the Plan9 folks. Quite early. On a napkin. Some of them later brought us Go.