Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 17:52 UTC (Mon) by mirabilos (subscriber, #84359)
Parent article: Python 3, ASCII, and UTF-8

PEP 538 is roughly what we do in MirBSD. MirBSD has one locale, and it’s about the same as C.UTF-8 (which I proposed for Debian/eglibc, and is now available there).

However, recently I got told off for roughly this:

tg@tglase-bsd:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # MirBSD
<>

The author of the script explicitly set LC_ALL to C to get ASCII behaviour, but MirBSD ignored this “because everything is UTF-8”. However, the POSIX C locale requires different character classifications (like the iswalpha(3) one used here) than a full C.UTF-8 environment would have. In fact, only the 26 lower-case and 26 upper-case Latin letters are required to be matched by the alpha cclass, as is the case on Debian:

tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # Debian
<ä>
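
To make the difference between the two classifications concrete, here is a minimal Python sketch. It only models the character classes, not tr(1) itself; the helper name delete_alpha is hypothetical, and Python's str.isalpha() stands in for a Unicode-aware iswalpha(3), which is an assumption rather than an exact equivalence.

```python
# Sketch: two notions of "alphabetic" applied to the sample string.
# delete_alpha is a hypothetical helper, not part of tr(1) or any library.

def delete_alpha(text: str, ascii_only: bool) -> str:
    """Drop every character classified as alphabetic."""

    def is_alpha(ch: str) -> bool:
        if ascii_only:
            # POSIX C locale view: only A-Z and a-z count as alpha.
            return ch.isascii() and ch.isalpha()
        # Unicode-aware view, roughly what a C.UTF-8 environment implies.
        return ch.isalpha()

    return "".join(ch for ch in text if not is_alpha(ch))


c_locale_result = delete_alpha("<mäh>", ascii_only=True)    # "<ä>"
utf8_locale_result = delete_alpha("<mäh>", ascii_only=False)  # "<>"
```

The first call mirrors the Debian output under LC_ALL=C, the second the MirBSD output.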

(That being said, GNU’s tr(1) is defective, as this…

tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C.UTF-8 tr -d '[[:alpha:]]' # Debian
<ä>

… should have output just “<>”, but that’s a different story.)
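
One plausible reading of that result, assuming GNU tr processes its input one byte at a time instead of decoding multibyte characters, can be sketched in Python. The function name delete_alpha_bytewise is made up for illustration and this is not how tr is actually implemented.

```python
import string

# The 52 ASCII letters as single bytes.
ASCII_LETTERS = string.ascii_letters.encode("ascii")


def delete_alpha_bytewise(text: str) -> str:
    """Delete ASCII letters byte by byte, never looking at whole characters."""
    raw = text.encode("utf-8")
    # "ä" is the two bytes 0xC3 0xA4; neither is an ASCII letter,
    # so both survive no matter which locale is active.
    kept = raw.translate(None, ASCII_LETTERS)
    return kept.decode("utf-8")


bytewise_result = delete_alpha_bytewise("<mäh>")  # "<ä>"
```

A tool working this way never sees “ä” as one character, so a UTF-8 locale cannot change what it deletes.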

tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹

Python 3, ASCII, and UTF-8

Posted Dec 18, 2017 18:01 UTC (Mon) by mirabilos (subscriber, #84359)

Huh.

「It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8.」

That could actually work. Still, could someone with time on their hands please think over my parent post and provide feedback (also to me)?

Python 3, ASCII, and UTF-8

Posted Dec 21, 2017 11:11 UTC (Thu) by vstinner (subscriber, #42675)

> tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹

No solution is perfect, and that's why both PEP 538 (C locale coercion) and PEP 540 (UTF-8 Mode) can be explicitly disabled, when you really want the C locale with the ASCII encoding.

Example:

$ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import locale, sys; print(sys.stdout.encoding, locale.getpreferredencoding())'
ANSI_X3.4-1968 ANSI_X3.4-1968
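
The same check can be scripted. The sketch below is a rough equivalent of the command above, assuming Python 3.7 or later: it starts a child interpreter with both features switched off and reports the encodings it sees. The function name encodings_with_opt_out is hypothetical. Unlike env -i, it keeps the rest of the environment and only removes the locale-related variables, so the exact encoding names depend on the platform.

```python
import os
import subprocess
import sys

# Same one-liner as in the shell example above.
CHILD_CODE = (
    "import locale, sys; "
    "print(sys.stdout.encoding, locale.getpreferredencoding())"
)


def encodings_with_opt_out():
    """Return (stdout encoding, preferred encoding) of a child Python
    started with C locale coercion and UTF-8 Mode both disabled."""
    # Drop variables that would select a locale or override the encodings.
    dropped = ("LANG", "LANGUAGE", "PYTHONUTF8", "PYTHONIOENCODING")
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in dropped and not key.startswith("LC_")
    }
    env["PYTHONCOERCECLOCALE"] = "0"  # opt out of PEP 538
    result = subprocess.run(
        [sys.executable, "-X", "utf8=0", "-c", CHILD_CODE],  # opt out of PEP 540
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    stdout_encoding, preferred_encoding = result.stdout.split()
    return stdout_encoding, preferred_encoding


if __name__ == "__main__":
    print(*encodings_with_opt_out())
```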

The bet is that most users will prefer to have C locale coercion and UTF-8 Mode enabled automatically when the locale is POSIX/C.