Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 6:15 UTC (Thu) by dvdeug (guest, #10998)
In reply to: Python 3, ASCII, and UTF-8 by Cyberax
Parent article: Python 3, ASCII, and UTF-8

> cyberax@20c9d0485705:~$ echo "Is this correct?" | python2.7 -c "import sys; print sys.stdin.read().upper();"
> IS THIS CORRECT?

So two programs have the exact same behavior, and yet one is wrong and the other is right. Huh. The only way Turks are going to get proper capitalization of i to İ in most software is to have a Turkish locale and software written in Unicode-supporting languages that automatically take care of that little detail. Otherwise, software written by people who aren't familiar with Turkish linguistic needs, or who assume it won't be used for Turkish (YAGNI lovers), will silently cause problems like the one in the link. Case-insensitive search is a real need, for example.
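To make the point concrete, here is a minimal Python 3 sketch. `str.upper()` takes no locale argument, so it always applies the default Unicode mapping of i to I. The `turkish_upper` helper is hypothetical (it is not part of the standard library) and only handles the dotted/dotless i pairs; a real application would more likely rely on a locale-aware library.

```python
# Default Unicode case mapping: no locale is consulted.
default = "is this correct?".upper()

def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using the Turkish i/ı rules.

    Turkish pairs dotted i with İ (U+0130) and dotless ı (U+0131) with I,
    so both letters are mapped by hand before the default mapping
    handles every other character.
    """
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

turkish = turkish_upper("is this correct?")

print(default)
print(turkish)
```

The two results differ only in the letter i, which is exactly the kind of difference that goes unnoticed until a case-insensitive comparison fails.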

Python 3, ASCII, and UTF-8

Posted Dec 22, 2017 17:04 UTC (Fri) by nim-nim (subscriber, #34454)

There are lots of cases where correct text processing depends on the correct locale (CJK de-unification, Balkan Cyrillic vs. Russian Cyrillic, selecting the correct spelling and grammar checker).

Unfortunately, too many free software devs think they can avoid locale particularities by setting C.UTF-8 everywhere, or by deriving the locale from the input method (any Latin layout can write other Latin-script languages, and no, I don't want to switch between qwerty and azerty when switching between English and French), or whatever. And they still invent text formats where the locale of a run of text is not explicitly specified.

It sort of works, as long as the software is used by devs writing in pidgin English. It fails spectacularly when exposed to normal people or normal human text.

Text = locale + encoding. Remove one and you're full of fail.
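A small Python 3 sketch of both halves of that equation, assuming a byte string of unknown origin (the sample bytes and the place names are illustrative only). The same bytes decode to different letters under different encodings, and even once decoded, the default case mapping cannot know which language it is handling.

```python
# The same byte string read under two different encoding assumptions.
raw = b"\xddstanbul"

# ISO 8859-9 (Latin-5, Turkish): 0xDD is U+0130, capital I with dot above.
as_turkish = raw.decode("iso8859_9")
# ISO 8859-1 (Latin-1, Western European): 0xDD is U+00DD, capital Y with acute.
as_western = raw.decode("latin_1")

print(as_turkish)
print(as_western)

# With the encoding settled, the language is still unknown to str.lower():
# it applies the default mapping I -> i, whereas Turkish expects I -> ı.
lowered = "ISPARTA".lower()
print(lowered)
```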

Python 3, ASCII, and UTF-8

Posted Jan 2, 2018 5:40 UTC (Tue) by dvdeug (guest, #10998)

> And they still invent text formats where the locale of a run of text is not explicitly specified

There's certainly logic there, but when exposed to normal people and normal human text, that often doesn't work. The theory is hard; at Wiktionary, we often have arguments about whether a word used in English is still French or has been adopted into English. Apple, Inc. is Apple even in many languages that aren't written in Latin script. Even texts that could be clearly language-tagged with human effort, like

> > Esperantists who dream of the final victory are non-existent in the modern day.
> Ne, mi estas finvenkisto!
Okay, so that's one finvenkisto. Most of us no longer dream of Esperanto being the be-all and end-all of world language.

are likely to appear undifferentiated, with none of the posters having bothered to set the locale correctly (by default, that could have come from a German, a Canadian, and an Estonian user, and been given tags of de_DE, en_CA and et_EE), and even those locales would not have followed through into the final post. Only serious texts I'm writing for permanent archival would I bother to language-tag carefully; even for a published PDF I might not bother, if you couldn't tell the difference in the final copy.
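As a rough Python 3 sketch of what per-run tagging would involve for the exchange quoted above: the `Run` class and the `flatten` function are hypothetical, and the language tags are assigned by hand for illustration rather than taken from any real posting system.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Run:
    """Hypothetical run of text with an optional language tag."""
    text: str
    lang: Optional[str] = None  # e.g. "en" or "eo"; None means untagged

# The quoted exchange, tagged by hand (assumed tags, for illustration).
thread = [
    Run("Esperantists who dream of the final victory are non-existent in the modern day.", "en"),
    Run("Ne, mi estas finvenkisto!", "eo"),
    Run("Okay, so that's one finvenkisto.", "en"),
]

def flatten(runs: List[Run]) -> str:
    """Join runs the way a plain-text post does: the tags are dropped."""
    return "\n".join(run.text for run in runs)

post = flatten(thread)
print(post)
```

Nothing in `post` records which line was Esperanto, and the third run still contains an Esperanto word inside an English sentence, which a single tag per run does not capture.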