Posted May 6, 2009 16:46 UTC (Wed) by ajross (subscriber, #4563)
And what exactly do you need to do with UTF-8 for which you need locale support? I mean, the whole point of UTF-8 is that it works great, unchanged, with code that assumes LANG=C. If we'd picked it from the start, we'd never have had "locale" (well, the wide/multibyte character APIs, which I assume is what you mean) support in the C library to begin with.
Debian switching to EGLIBC
Posted May 6, 2009 19:50 UTC (Wed) by nix (subscriber, #2304)
There's a lot more to locale support than the wchar stuff. In fact the
wchar stuff is decidedly minority and doesn't use more than a few tens of
Kb.
Most of the space consumption is charmaps (which we can't do without, and
the largest ones are Far Eastern and Unicode), converters to/from UTF-8
(obviously necessary if we want to handle other encodings at *all*) and
timezones (which we can't do without, although it would be nice if glibc
contained an interface to let us use the historical data in them properly:
Ulrich has explicitly (and bluntly) ruled this out without giving a
rationale. As usual. Maybe eglibc can add an appropriate interface.)
Debian switching to EGLIBC
Posted May 8, 2009 1:56 UTC (Fri) by spitzak (guest, #4593)
I would be extremely happy if I could rely on printf not changing periods to commas when I don't expect it, and on strcmp not changing its behavior. I have to force the C locale at startup just so our software can write files that I can be sure can be read.
Any intelligent person would have just made a different %-command. It really is not very hard for a programmer to choose what to do based on the locale. The C library, if it provides anything at all, should only provide a "this is the locale" call. It should have ZERO effect on the behavior of any functions that do not actually take a locale as an argument.
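The workaround mentioned in this comment, forcing the C locale so that printf always writes a period, might look roughly like the sketch below. The helper name `format_plain` is hypothetical, and pinning only LC_NUMERIC (rather than LC_ALL) is an assumption about what a data-file writer needs.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format a double for a data file with '.' as the
 * decimal point, whatever the user's environment says. */
int format_plain(char *buf, size_t size, double value)
{
    /* Pin numeric formatting to the C locale. A program that never calls
     * setlocale() is already in the C locale; this only matters once
     * something has called setlocale(LC_ALL, ""). */
    setlocale(LC_NUMERIC, "C");
    return snprintf(buf, size, "%.2f", value);
}
```

Note that setlocale() changes process-wide state, which is exactly the complaint: code hidden in a library or scripting layer can change it back.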
Locales and UTF-8
Posted May 6, 2009 20:47 UTC (Wed) by rfunk (subscriber, #4054)
You still need wide/multibyte character APIs with UTF-8. A UTF-8 character
can be up to four bytes. It's only the old ASCII characters that are still
only one byte in UTF-8; the Latin-1 extensions, for example, are two
bytes. Therefore the old APIs that assume 1 byte = 1 character are not so
useful with UTF-8.
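As a small illustration of the byte lengths described above, the lead byte alone tells you how long a UTF-8 sequence is. This is a sketch only; the function name `utf8_seq_len` is made up for the example.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence (continuation or invalid byte). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                 /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;   /* e.g. the Latin-1 extensions */
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}
```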
Locales and UTF-8
Posted May 6, 2009 21:35 UTC (Wed) by nix (subscriber, #2304)
What you *don't* need with UTF-8 is the misbegotten horror that is wchar_t
and all the interfaces to deal with that.
Locales and UTF-8
Posted May 8, 2009 1:54 UTC (Fri) by spitzak (guest, #4593)
Anybody who thinks strlen(utf8) should return anything other than the number of bytes in the string does not know what they are talking about. Sorry.
UTF-8 is TRIVIAL if people would just WAKE UP and realize that it *is* trivial. The ONLY people who care about character boundaries are people writing low-level rendering routines that have to look up font glyphs.
But for some reason the fact that a byte array represents a series of characters causes otherwise intelligent programmers to turn into complete morons. It suddenly becomes IMPOSSIBLE to work with the bytes, just because of the type of data in the string!
Here is a thought experiment: why in the world are we capable of making files containing English text when all the *words* are different sizes! Why it must be impossible! Counting words will be so slow and inefficient! How could the programs ever work?
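To make the byte-versus-character distinction concrete: strlen() keeps returning the byte count, and a separate count of code points is a short loop for the few callers that want one. A minimal sketch, assuming the input is already valid UTF-8; `utf8_count` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Count code points by counting every byte that is not a continuation
 * byte (bit pattern 10xxxxxx). Assumes s is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 encoding of "naïve", strlen() reports 6 bytes while utf8_count() reports 5 code points; buffer sizes and copies still want the former.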
Locales and UTF-8
Posted May 8, 2009 13:55 UTC (Fri) by nix (subscriber, #2304)
The only people who care where character boundaries are in a UTF-8
string are people writing routines taking textual input, routines producing
textual output, routines modifying text strings, routines manipulating
text strings in *any* way that depends on anything a human would care
about. I can see how this could be considered rare.
Touching individual bytes in a Unicode string outside of something like
serialization makes as much sense as touching individual bits in it does
(except of course that you have to touch both in order to convert the
UTF-8 into actual Unicode code points and back).
This is all library stuff, yes, sure... except when it isn't.
Locales and UTF-8
Posted May 8, 2009 15:18 UTC (Fri) by endecotp (guest, #36428)
nix, maybe you're overlooking some of the helpful properties of UTF-8 that make many, though of course not all, of those textual operations work if you treat it as a byte stream? In my experience, programmers often go to lengths to process a UTF-8 string character-by-character when doing so byte-by-byte would be just as correct, simpler to code and faster.
For example, if I'm parsing a UTF-8 CSV file into rows and columns then I can treat it as a byte stream, since the punctuation characters (eg ,"\NL) are all single bytes and those bytes are guaranteed not to occur in multi-byte characters.
As another example, I can search-and-replace one character sequence with another character sequence by treating the text, pattern and replacement as byte sequences - even if there are multibyte characters in the text, pattern, or replacement.
My experience is that the only places where UTF-8 cannot be treated as byte streams are: GUI and similar I/O, sorting and case conversion when the result needs to look right for a human, and interfaces that specify an encoding other than UTF-8.
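The CSV case described above can be sketched with nothing but byte functions. This is an illustration under simplifying assumptions: no quoting or escaping is handled, and `split_fields` is a hypothetical helper, not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Split one CSV line in place on ',' bytes. Stores at most max field
 * pointers in fields[] and returns how many were stored; any further
 * fields are dropped. No quoting support (a deliberate simplification). */
size_t split_fields(char *line, char **fields, size_t max)
{
    size_t n = 0;
    char *p = line;
    while (n < max) {
        fields[n++] = p;
        p = strchr(p, ',');     /* ',' never occurs inside a multibyte sequence */
        if (p == NULL)
            break;
        *p++ = '\0';
    }
    return n;
}
```

Searching works the same way: strstr() with a multibyte pattern finds it in a multibyte text, because a valid UTF-8 sequence cannot match starting in the middle of another one.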
Locales and UTF-8
Posted May 8, 2009 16:56 UTC (Fri) by spitzak (guest, #4593)
Thank you for a breath of fresh air! Somebody who gets it!
I do not understand why so many otherwise intelligent and experienced software engineers turn into such complete morons when they think about UTF-8.
Even more annoying: programmers do not seem to have this mental block when presented with the older multibyte Asian encodings, or with UTF-16 which is variable length as well. For some reason people only assign these made-up problems to UTF-8.
Locales and UTF-8
Posted May 8, 2009 17:28 UTC (Fri) by nix (subscriber, #2304)
Um, not all the punctuation characters are single bytes. A huge variety of
punctuation is up above U+2000, for instance, including U+2010 (the
hyphen) and U+2003 (the em space). Helpfully this is somewhat jumbled up
with nonpunctuation stuff like numeric superscripts.
The rest of your points stand: if all you want to do is manipulate ASCII
characters in a UTF-8 stream, you can do that without being Unicode-aware
at all. But this will tend to annoy your users when they type in a € and
find that your program can't manipulate it because it's U+20AC. It'll
annoy your users even more to find that they can remove some characters,
but that others take several keystrokes to remove and miraculously
transmogrify into other characters as they do so. (More mess: the Euro
cent sign is U+00A2!)
I suppose misbehaviour from this change is unlikely *if* you're in the US.
Anywhere else? Bite your knuckles.
Locales and UTF-8
Posted May 8, 2009 17:35 UTC (Fri) by ajross (subscriber, #4563)
You're simultaneously overstating the complexity of this problem and the ability of the ANSI C locale facility to solve it.
The product I work on for my day job does natural language processing of internet content in arbitrary languages and encodings. I did the encoding transformation and "word breaker" lexical analyzer for it. The whole system works by transforming the data into UTF-8 and operating on it at the byte level. So sorry to pull the "domain expert" card here, but you're basically just wrong. This stuff has its subtleties, but it's absolutely not something that requires special API support. And if we *had* to pick an API, I can guarantee you it wouldn't be ANSI C's locale stuff, which is a complete non-starter for many of the reasons already detailed.
Locales and UTF-8
Posted May 8, 2009 18:47 UTC (Fri) by nix (subscriber, #2304)
I certainly don't think the ANSI C locale facility solves everything (or
even *much*, it's pretty nasty). And, as I said, it'll be interesting to
see what breaks. (I suspect not much will: most things that need to be
*are* Unicode-aware, on Debian at least. But it might get hair-raising.)
-- N., just wasted three months auditing and fixing countless places in a
horrible financial application to allow for UTF-8 awareness (the simplest
example: lots of places in that software cared if something
was 'alphanumeric', for instance, and isalpha() really doesn't work). It
could have been worse: before I came along they were planning to move to
UCS-2, hark at the forward planning and lovely C-compatibility...
Locales and UTF-8
Posted May 8, 2009 21:55 UTC (Fri) by spitzak (guest, #4593)
Yes, isalpha() and ctype is one thing that should be fixed. There are only 3 types of byte with the high bit set:
1. bytes that are not allowed in UTF-8.
2. "second" bytes
3. "first" bytes
I think first & second bytes should pass the isalpha() test. This will allow UTF-8 letters to be put into identifiers and keywords (of course it also allows UTF-8 punctuation and lots of other stuff but that is about the best that can be done). I also think ctype should not vary depending on locale; this is another thing that causes me nothing but trouble. Most programmers revert to doing ">='a' && <='z'" and thus make their software even less portable.
Probably the ctype tables should add some bits to identify these byte types.
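The three kinds of high-bit byte listed above can be told apart with a few comparisons. The sketch below is not a proposal for the actual ctype tables; the enum and function names are invented for illustration.

```c
enum utf8_byte_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONT, UTF8_INVALID };

/* Classify a single byte by the role it can play in UTF-8. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b < 0x80) return UTF8_ASCII;
    if (b < 0xC0) return UTF8_CONT;         /* 10xxxxxx: a "second" byte */
    if (b == 0xC0 || b == 0xC1 || b > 0xF4)
        return UTF8_INVALID;                /* never appears in valid UTF-8 */
    return UTF8_LEAD;                       /* "first" byte of 2 to 4 bytes */
}
```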
Locales and UTF-8
Posted May 8, 2009 21:48 UTC (Fri) by spitzak (guest, #4593)
Actually the Euro is U+20AC. It is 0xA2 in the CP1252 encoding used by Microsoft but not in official Unicode. However, I do think the Unicode standard should just realize that CP1252 is really common and change the characters 0x80-0xAF to what it defines.
I do hope a program trying to parse for a period only looks for the ASCII period. As soon as you start saying other Unicode characters are "equivalent" then you get a huge mess because different programs may disagree on what is in the equivalent set, and Unicode could add a new character at any time. We already have quite a mess with newlines, let's not make it worse! The only software that should be looking for Unicode punctuation is actual glyph layout and rendering.
Locales and UTF-8
Posted May 11, 2009 16:01 UTC (Mon) by endecotp (guest, #36428)
> not all the punctuation characters are single bytes
I was referring to the punctuation characters used to delimit CSV, which are all ASCII characters (as are those used in XML).
> The rest of your points stand: if all you want to do is manipulate
> ASCII characters in a UTF-8 stream, you can do that without being
> Unicode-aware
My points were that you can do all of those things (e.g. search and replace) EVEN IF the input is non-ASCII.
Your example of delete key behaviour is an interesting one that comes under my category of "GUI and similar I/O". It is clearly necessary to delete back as far as the last character-starting byte. Doing so is not very hard.
> I suppose misbehaviour from this change is unlikely *if* you're in
> the US. Anywhere else? Bite your knuckles.
I am not in the U.S., and my code works with UTF-8 without the sort of major headaches that you allude to.
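Deleting back to the last character-starting byte, as described in this comment, could be sketched like this. It removes one code point, not one user-perceived character (combining marks are a separate question), and `utf8_delete_last` is a hypothetical name.

```c
#include <stddef.h>

/* Given a buffer holding len bytes of UTF-8, return the length after
 * deleting the last code point: drop the final byte, then keep stepping
 * back while the byte there is a continuation byte (10xxxxxx). */
size_t utf8_delete_last(const char *buf, size_t len)
{
    if (len == 0)
        return 0;
    len--;
    while (len > 0 && ((unsigned char)buf[len] & 0xC0) == 0x80)
        len--;
    return len;
}
```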
Locales and UTF-8
Posted May 10, 2009 14:46 UTC (Sun) by epa (subscriber, #39769)
> For example, if I'm parsing a UTF-8 CSV file into rows and columns then I can treat it as a byte stream, since the punctuation characters (eg ,"\NL) are all single bytes and those bytes are guaranteed not to occur in multi-byte characters.
This is true if you know that your input is valid UTF-8. However if it might be malformed, then your program could end up splitting a row in the middle of an (invalid) character sequence and producing different invalid sequences as output. This is often fine: garbage in, garbage out. But there can be interesting security holes where malformed UTF-8 is treated differently by different code. Luckily, checking for valid UTF-8 is a fast operation, so there is no reason not to check every string that comes from the user before doing anything with it - even if the processing you do is just treating it as a byte stream.
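For a sense of how cheap the validity check is, here is one possible single-pass validator. It is a sketch, not a reference implementation: the name `utf8_is_valid` is made up, and it rejects overlong forms, surrogates and values above U+10FFFF by decoding each sequence and checking the result.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the len bytes at s are well-formed UTF-8. */
bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80)      { i++; continue; }
        else if (b < 0xC2) return false;   /* stray continuation or overlong lead */
        else if (b < 0xE0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if (b < 0xF0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if (b < 0xF5) { need = 3; cp = b & 0x07; min = 0x10000; }
        else               return false;   /* 0xF5..0xFF never valid */

        if (len - i - 1 < need)
            return false;                  /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return false;                  /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                  /* UTF-16 surrogate */
        i += need + 1;
    }
    return true;
}
```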
Locales and UTF-8
Posted May 11, 2009 16:06 UTC (Mon) by ajross (subscriber, #4563)
Everything you say is true of ASCII too. You have to validate untrusted input, regardless of what it is. ASCII doesn't have the high bit set, but any ASCII format is by necessity going to have escaping mechanisms that need equivalent validation. For this specific example, you *are* counting your matching quote characters, right? Everything is an "encoding" at some level.
Avoiding UTF-8 in the blind expectation that it somehow makes your code more "secure" is just plain wrong. This kind of mistake is exactly what I'm talking about. People attribute to encoding transformation and I18N all sorts of complexities that aren't actually there in practice.
Locales and UTF-8
Posted May 19, 2009 9:18 UTC (Tue) by epa (subscriber, #39769)
I agree that using some hacky alternative instead of UTF-8 will not improve security. Nothing I wrote should be taken as a reason to avoid UTF-8. (Though it's not true that you *always* have to include escaping mechanisms for ASCII input - some file formats such as /etc/passwd can get away with being completely stupid and not supporting escaping or accented characters at all.)
Locales and UTF-8
Posted May 11, 2009 17:27 UTC (Mon) by spitzak (guest, #4593)
Invalid UTF-8 is not a problem. In fact one HUGE advantage of working with UTF-8 is that you can defer invalid UTF-8 until display, where it can safely be changed into the matching CP1252 glyph or whatever is needed to provide the user with a readable result so they can figure out what went wrong. Converting earlier can result in security and other errors.
Errors in UTF-8 should be treated as single byte entities. Four four-byte prefixes in a row are 4 errors, not a single 4-byte error. You can't split an error if it is only one byte long.
This also means that ASCII characters cannot be "inside an error" so that errors have zero effect on programs that are looking for ASCII only.
It also means it is impossible to make a pointer "inside" an error or to split one. It is also vital to treat errors this way (even if converting to other encodings) so that concatenation to a string ending in an error cannot convert a good character at the start of the next string into an error.
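One way to read the "errors are single bytes" rule is as a stepping function that consumes either a whole well-formed sequence or exactly one bad byte. The sketch below follows that reading; `utf8_next` is a hypothetical name, and the second-byte ranges are the ones that exclude overlong forms, surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Bytes consumed by the next unit at s (len must be > 0): the full length
 * of a well-formed sequence with *err = 0, or 1 with *err = 1 when the
 * byte at s does not start a well-formed sequence. */
size_t utf8_next(const unsigned char *s, size_t len, int *err)
{
    unsigned char b = s[0], lo = 0x80, hi = 0xBF;
    size_t n;

    *err = 0;
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) n = 2;
    else if (b >= 0xE0 && b <= 0xEF) n = 3;
    else if (b >= 0xF0 && b <= 0xF4) n = 4;
    else { *err = 1; return 1; }

    /* Some lead bytes narrow the legal range of the second byte. */
    if (b == 0xE0) lo = 0xA0;
    if (b == 0xED) hi = 0x9F;
    if (b == 0xF0) lo = 0x90;
    if (b == 0xF4) hi = 0x8F;

    if (len < n || s[1] < lo || s[1] > hi) { *err = 1; return 1; }
    for (size_t k = 2; k < n; k++)
        if ((s[k] & 0xC0) != 0x80) { *err = 1; return 1; }
    return n;
}
```

Stepping over four 0xF0 bytes in a row with this function reports four one-byte errors, and an ASCII byte following a broken lead byte is returned as itself rather than being swallowed by the error.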
Locales and UTF-8
Posted May 8, 2009 16:49 UTC (Fri) by spitzak (guest, #4593)
Hmm. I seem to be able to cat a UTF-8 file to my UTF-8 terminal and it works perfectly. Yet cat has no concept whatsoever of UTF-8 and quite likely is splitting the text into blocks right in the middle of UTF-8 characters! How is this possible?
UTF-8 is in fact trivial. You are basically doing exactly what I am complaining about: panicking that there is some magical problem with not looking for the character boundaries. Try comparing it to words: how much of a word processor is able to ignore word boundaries? Almost all of it. But that does not somehow make it impossible for word wrap and word deletion to work.
It's not rocket science. The problem is people who are so convinced it is that they complicate things to no end and are hurting I18N and everybody.
Locales and UTF-8
Posted May 8, 2009 16:57 UTC (Fri) by ajross (subscriber, #4563)
Amen. I've found the same thing -- developers who have no trouble with complicated algorithms and who have exhaustive knowledge of their platforms at all levels still turn into quivering voodoo practitioners when it comes to I18N stuff. All I can think is that because the data involved contains indecipherable foreign text, they get fooled into thinking the code for handling it must be equally inscrutable.
Really, this stuff is easy once you get used to it.
Locales and UTF-8
Posted May 8, 2009 17:33 UTC (Fri) by nix (subscriber, #2304)
Note that at no point did I say that it was horribly hard to deal with
streams of UTF-8 chars. It's trivial to decode, and it's just as trivial
to interpose a wrapper so that your strings *appear* to contain single
bytes with arbitrarily large values :) but it does require a bit of extra
work. (I'm just thinking here of how long it took to get zsh's
Unicode-awareness right. Its ZLE wheel-reimplementation of readline was
the trickiest part, which is not surprising.)
Debian switching to EGLIBC
Posted May 6, 2009 23:37 UTC (Wed) by rleigh (subscriber, #14622)
For the various reasons outlined in the text, we are considering
moving the C locale to using UTF-8 rather than US-ASCII as its
locale codeset. This won't be done immediately; we will create
a C.UTF-8 for testing before considering the full switch to default it.
This will give us native UTF-8 end-to-end from source code to
compiled binary to program output and subsequent terminal display.
Regards,
Roger
Debian switching to EGLIBC
Posted May 7, 2009 6:47 UTC (Thu) by nix (subscriber, #2304)
It'll be fascinating to see what that breaks when someone throws in a
character with the high bit set :) stuff that relies upon the C locale
rarely makes a distinction between bytes and characters, even where it
should... of course, one would hope that not much such software is left.
Debian switching to EGLIBC
Posted May 8, 2009 2:02 UTC (Fri) by spitzak (guest, #4593)
Nothing will break when a byte has a high bit set, since it will just be copied to the output unchanged.
Don't panic about UTF-8. The biggest problem with it is people who do not understand it. Some of them are good enough programmers that they might write some very damaging code in which they actually try to interpret the UTF-8 encoding.
The only real bug in Unix with UTF-8 is a whole lot of documentation that says "character" where it should say "byte". There is nothing wrong with the current implementations.
Debian switching to EGLIBC
Posted May 8, 2009 13:57 UTC (Fri) by nix (subscriber, #2304)
I covered this 'nothing will care if you feed UTF-8 to a program expecting
a byte stream' canard in my other response. It's trivially wrong.
Debian switching to EGLIBC
Posted May 6, 2009 18:43 UTC (Wed) by jordanb (guest, #45668)
Unicode doesn't magically make localization issues disappear.
Programs still need to know what language the user(s) of the system prefer to see responses in, if their radix mark is a ',' or a '.', if they count days using the Gregorian or Chinese calendar, etc.
I agree that the world would be a brighter place if it could be said "all text on disk or streamed on the network MUST be in Unicode's UTF-8 encoding" and then locales could just say "en_US" instead of "en_US.UTF-8", but issues of representation and encoding are only half of the localization problem.
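The radix mark mentioned above is one of the things the standard C library will report directly, independent of any encoding question. A minimal sketch using setlocale() and localeconv(); the wrapper name `radix_for` is hypothetical.

```c
#include <locale.h>

/* Return the decimal-point string of the locale called name ("" means:
 * take it from the environment), or NULL if that locale is not
 * available on this system. Changes the process-wide LC_NUMERIC. */
const char *radix_for(const char *name)
{
    if (setlocale(LC_NUMERIC, name) == NULL)
        return NULL;
    return localeconv()->decimal_point;
}
```

radix_for("C") gives "."; what a name such as "de_DE.UTF-8" gives depends on whether that locale has been generated on the machine at all.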
Debian switching to EGLIBC
Posted May 6, 2009 19:18 UTC (Wed) by jreiser (subscriber, #11027)
/usr/lib/locale/locale-archive is 79MB, and subsetting is not supported actively. Many developers would be overjoyed to have only LANG=C, or to select just the 5 locales that cover 99.99% of the users for their product.
Debian switching to EGLIBC
Posted May 6, 2009 20:15 UTC (Wed) by kleptog (subscriber, #1183)
That's weird. I've never done anything special with locales and my locale-archive is only 1.3MB. I just selected the locales I wanted during install, they're listed in /etc/locale.gen and they're the only locales in the archive. How did you get your archive to be so large? (Debian BTW)
Debian switching to EGLIBC
Posted May 6, 2009 22:13 UTC (Wed) by vmole (guest, #111)
Are you using localepurge, by chance? It removes undesired locales after each apt run.
Debian switching to EGLIBC
Posted May 7, 2009 0:48 UTC (Thu) by ABCD (subscriber, #53650)
localepurge only touches /usr/share/locale and /usr/share/man; /usr/lib/locale/locale-archive is not modified at all by localepurge, so I'm not sure what would cause it to be so large.
Debian switching to EGLIBC
Posted May 6, 2009 20:29 UTC (Wed) by nix (subscriber, #2304)
Um, subsetting is trivial and documented. Just don't run localedef for
every single locale in the world, only for those you use.
(Debian has a /usr/sbin/locale-gen script and /etc/locale.gen file for
exactly this reason.)
Debian switching to EGLIBC
Posted May 12, 2009 4:59 UTC (Tue) by dirtyepic (subscriber, #30178)
1.5M here, consisting of en_US and en_US.UTF-8. subsetting has been a standard part of Gentoo as long as I've used it (2004-ish). we moved to locale-gen from Debian in 2006.
Debian switching to EGLIBC
Posted May 8, 2009 1:45 UTC (Fri) by spitzak (guest, #4593)
The purpose of turning off Locales is so you *can* use UTF-8 and rely on the string functions not doing strange things such as not sorting strings in the obvious way.
Locales are an obsolete idea that UTF-8 is intended to solve. (Yes, I know locales also change the formatting that some printing functions and strcmp do, which has caused nothing but grief and could be done trivially by the program writer if they really wanted it rather than predictable results.)
Debian switching to EGLIBC
Posted May 8, 2009 7:08 UTC (Fri) by anselm (subscriber, #2796)
With respect, I think that you're oversimplifying things here.
It turns out that »the obvious way« to sort strings doesn't work for many
languages other than English, which is precisely one of the reasons the
locale concept was invented in the first place. Look up the collating
rules for German and Swedish, for example, to see three different ways of
collating the »ä« character, none of which corresponds to »the obvious
way«. IMHO it
does make some sense to put this sort of arcane knowledge into the
standard library so that programmers (who are usually not also linguists)
do not have to wonder where »ø« goes in the Danish alphabet, and so that a
program has half a chance of doing string collation correctly in languages
that the original programmers didn't even know existed, let alone catered
for in their code. (I'm saying »half a chance« because of the next
paragraph.)
Also it turns out that strcmp(3) doesn't, in fact, care about locales at
all, so if you use strcmp(3) only in your programs you will not be
surprised if the user changes their locale; it's the strcoll(3) function
that is supposed to be used for locale-dependent string comparisons. (I do
agree with you about the decimal separator issue in printf(3), though.)
I18N is a difficult issue at best, and it isn't helped by people who try
cutting corners. Unicode/ISO-10646 and UTF-8 play an important role in
making the problem easier to handle, but they're a fairly low-level part
of the grand scheme of things. They're like the wheels on a car:
indispensable for a smooth ride, but one would generally still like seats
and a steering wheel, too.
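The strcmp(3)/strcoll(3) split described in this comment can be shown with two qsort() comparators. This is only a sketch; the comparator names are invented, and what strcoll() actually does depends on which locales exist on the system.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte order, identical on every system. */
int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort comparator: order defined by the current LC_COLLATE locale. */
int cmp_collate(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}
```

With cmp_bytes, a UTF-8 "äpfel" always sorts after "zebra", because its first byte is 0xC3. With cmp_collate the result follows whatever setlocale(LC_COLLATE, ...) selected: in the "C" locale it matches byte order, and in a German locale (assuming one is installed) the »ä« would be expected to land near »a«.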
Debian switching to EGLIBC
Posted May 8, 2009 16:45 UTC (Fri) by spitzak (guest, #4593)
The most common reason to sort strings is so that a set can be implemented and identical strings found. It would not matter if the sorting order had nothing to do with English or any language; what does matter is that every program in the world sort the strings in exactly the same way.
You are right that strcmp() does what is wanted. I believe I was remembering some scripting languages where the string comparison changed depending on the locale, which was a nightmare because people rarely test in other locales.
The printf problem is really a pain and forces me to always force the locale to C at startup. I need to use printf, sometimes hidden inside scripting languages where I can't change it, to write data files that are expected to be readable by the same program even if the locale is different.
strcoll() is approximately the right idea. Make it perfectly clear that this is some human-oriented sorting function. I think the real solution is to make all such functions take the locale as an argument, rather than using a static variable.