Python 3, ASCII, and UTF-8
Posted Dec 18, 2017 17:52 UTC (Mon) by mirabilos (subscriber, #84359)
Parent article: Python 3, ASCII, and UTF-8
However, recently I got told off for roughly this:
tg@tglase-bsd:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # MirBSD
<>
The author of the script explicitly set LC_ALL to C to get ASCII behaviour, which MirBSD ignored “because everything is UTF-8”. However, the POSIX C locale requires different character classifications (such as the iswalpha(3) one used here) than a full C.UTF-8 environment would. In fact, the alpha cclass is only required to match the 26 lower-case and 26 upper-case Latin letters, as it does on Debian:
tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C tr -d '[[:alpha:]]' # Debian
<ä>
(That being said, GNU’s tr(1) is defective, as this…
tglase@tglase:~ $ echo '<mäh>' | LC_ALL=C.UTF-8 tr -d '[[:alpha:]]' # Debian
<ä>
… should have output just “<>”, but that’s a different story.)
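The two classifications contrasted above can be mimicked in Python. This is only an illustrative sketch, not how any tr(1) implementation works: the helper names strip_ascii_alpha and strip_unicode_alpha are made up for this example, and Python's str.isalpha() is assumed here as a stand-in for a Unicode-aware alpha class.

```python
import string


def strip_ascii_alpha(text: str) -> str:
    """Drop only the 52 ASCII letters, the minimum the POSIX C locale's alpha class must match."""
    return "".join(ch for ch in text if ch not in string.ascii_letters)


def strip_unicode_alpha(text: str) -> str:
    """Drop every character that Python's Unicode-aware str.isalpha() treats as a letter."""
    return "".join(ch for ch in text if not ch.isalpha())


sample = "<m\u00e4h>"  # "<mäh>", written with an escape so the source stays ASCII-only

# ASCII-only classification keeps the umlaut, like LC_ALL=C on Debian above.
assert strip_ascii_alpha(sample) == "<\u00e4>"
# Unicode-aware classification removes it too, as expected under C.UTF-8.
assert strip_unicode_alpha(sample) == "<>"
```

Note that tr(1) in the C locale operates on bytes rather than code points, so the analogy only holds for the visible result, not for the mechanism.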
tl;dr: MirBSD has to go back ☹ to a two-locale model (C with ASCII being the default, C.UTF-8 with UTF-8 being the nōn-default alternative), and therefore PEP 538 will end up breaking things and need to be reverted, even if it improves the situation for most users ☹
Posted Dec 18, 2017 18:01 UTC (Mon) by mirabilos (subscriber, #84359)
「It will only do that if the POSIX/C locale has not been explicitly chosen by the user or environment (e.g. using LC_ALL=C), so it will essentially override a default to the POSIX/C locale with UTF-8.」
That could actually work. Still, could someone with time on their hands please think over my parent post and provide feedback (to me as well)?
Posted Dec 21, 2017 11:11 UTC (Thu) by vstinner (subscriber, #42675)
No solution is perfect, and that's why both PEP 538 (C locale coercion) and PEP 540 (UTF-8 Mode) can be explicitly disabled, when you really want the C locale with the ASCII encoding.
Example:
$ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import locale, sys; print(sys.stdout.encoding, locale.getpreferredencoding())'
ANSI_X3.4-1968 ANSI_X3.4-1968
The bet is that most users will prefer the C locale coercion and UTF-8 Mode enabled by the POSIX locale.
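To see which of these settings are in effect in a running interpreter, something like the following sketch may help. The helper name describe_encodings is hypothetical, and the values it reports depend on the platform, the Python version, and the environment, so no particular output is implied. It passes False to locale.getpreferredencoding() so that the call does not touch the process locale, which differs slightly from the one-liner above.

```python
import locale
import sys


def describe_encodings() -> dict:
    """Collect the encoding-related settings discussed in this thread."""
    return {
        # 1 when UTF-8 Mode (PEP 540) is active; the flag exists since Python 3.7.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
        # Encoding used for text written to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the current locale settings.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names and other OS-level strings.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

Running it once normally and once with PYTHONCOERCECLOCALE=0 and -X utf8=0 under the C locale should make the effect of the two opt-outs visible.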