For the various reasons outlined in the text, we are considering
moving the C locale to using UTF-8 rather than US-ASCII as its
locale codeset. This won't be done immediately; we will first create
a C.UTF-8 locale for testing before considering a full switch of the default.
This will give us native UTF-8 end-to-end from source code to
compiled binary to program output and subsequent terminal display.
Posted May 7, 2009 6:47 UTC (Thu) by nix (subscriber, #2304)
It'll be fascinating to see what that breaks when someone throws in a
character with the high bit set :) Stuff that relies upon the C locale
rarely makes a distinction between bytes and characters, even where it
should... Of course, one would hope that not much such software is left.
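To make the bytes-versus-characters distinction concrete, here is a minimal Python sketch. The sample string is an arbitrary illustration, not something taken from the thread. It counts code points, counts encoded bytes (what a byte-oriented routine such as C's strlen() would report), and picks out the bytes with the high bit set.

```python
# Illustrative only: the sample text is an assumption, not from the discussion.
text = "na\u00efve caf\u00e9"          # contains two non-ASCII characters
encoded = text.encode("utf-8")

char_count = len(text)                 # number of Unicode code points
byte_count = len(encoded)              # number of bytes after UTF-8 encoding

# In UTF-8, every byte of a multi-byte sequence has its high bit set,
# while plain ASCII bytes never do.
high_bit_bytes = [b for b in encoded if b & 0x80]

print(char_count, byte_count, len(high_bit_bytes))
```

Code that treats the byte count as a character count would be off by two for this string, which is the kind of mismatch the comment is worried about.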
Debian switching to EGLIBC
Posted May 8, 2009 2:02 UTC (Fri) by spitzak (guest, #4593)
Nothing will break when a byte has a high bit set, since it will just be copied to the output unchanged.
Don't panic about UTF-8. The biggest problem with it is people who do not understand it. Some of them are good enough programmers that they might write very damaging code that actually tries to interpret the UTF-8 encoding.
The only real bug in Unix with UTF-8 is a whole lot of documentation that says "character" where it should say "byte". There is nothing wrong with the current implementations.
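The pass-through behaviour described here can be sketched as follows. This is a hypothetical filter (copy_bytes is a name made up for illustration) that moves raw bytes from input to output without ever decoding them, so it preserves valid UTF-8 and invalid byte sequences alike. It covers only the pure-copy case and says nothing about programs that split, truncate, or compare their input.

```python
import io

def copy_bytes(src, dst, chunk_size=4096):
    """Copy src to dst as raw bytes, never interpreting the encoding."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)

# Valid UTF-8 followed by two bytes that are not valid UTF-8 (sample data
# chosen for illustration only).
data = "Gr\u00fc\u00dfe, \u4e16\u754c\n".encode("utf-8") + b"\xff\xfe"

src = io.BytesIO(data)
dst = io.BytesIO()
# A tiny chunk size forces reads to land in the middle of multi-byte
# sequences; the copy is still exact because nothing is decoded.
copy_bytes(src, dst, chunk_size=3)
copied = dst.getvalue()
```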
Debian switching to EGLIBC
Posted May 8, 2009 13:57 UTC (Fri) by nix (subscriber, #2304)
I covered this 'nothing will care if you feed UTF-8 to a program expecting
a byte stream' canard in my other response. It's trivially wrong.
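The other response referred to is not reproduced here, so the following is only one assumed illustration of how byte-oriented code can go wrong, not necessarily the case nix had in mind. A hypothetical truncate_bytes helper cuts a field to a fixed number of bytes and ends up splitting a multi-byte sequence.

```python
def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Cut data to at most limit bytes, ignoring character boundaries."""
    return data[:limit]

name = "caf\u00e9".encode("utf-8")   # 5 bytes: 'c', 'a', 'f', 0xC3, 0xA9
cut = truncate_bytes(name, 4)        # keeps 0xC3 but drops its continuation byte

try:
    cut.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    # The truncated result ends in an incomplete multi-byte sequence.
    valid = False

print(cut, valid)
```

A program that only copies bytes never hits this, but anything that truncates, pads, or column-aligns by byte count can emit output that is no longer valid UTF-8.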