The purpose of turning off locales is so that you *can* use UTF-8 and rely on the string functions not doing strange things, such as not sorting strings in the obvious way.
Locales are an obsolete idea, addressing a problem that UTF-8 is intended to solve. (Yes, I know locales also change the formatting done by some printing functions and what strcmp does, which has caused nothing but grief; the program writer could trivially do that themselves if they really wanted it rather than predictable results.)
Posted May 8, 2009 7:08 UTC (Fri) by anselm (subscriber, #2796)
With respect, I think that you're oversimplifying things here.
It turns out that »the obvious way« to sort strings doesn't work for many
languages other than English, which is precisely one of the reasons the
locale concept was invented in the first place. Look up the collating
rules for German and Swedish, for example, to see three different ways of
collating the »ä« character, none of which corresponds to »the obvious way«. IMHO it
does make some sense to put this sort of arcane knowledge into the
standard library so that programmers (who are usually not also linguists)
do not have to wonder where »ø« goes in the Danish alphabet, and so that a
program has half a chance of doing string collation correctly in languages
that the original programmers didn't even know existed, let alone catered
for in their code. (I'm saying »half a chance« because of the next
paragraph.)
Also it turns out that strcmp(3) doesn't, in fact, care about locales at
all, so if you use only strcmp(3) in your programs you will not be
surprised if the user changes their locale; it's the strcoll(3) function
that is supposed to be used for locale-dependent string comparisons. (I do
agree with you about the decimal separator issue in printf(3), though.)
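To make the strcmp(3) versus strcoll(3) distinction concrete, here is a minimal sketch in C. The helper names (`compare_bytes`, `compare_collated`, `use_collation_locale`) are made up for illustration, and whether a locale such as `de_DE.UTF-8` is installed depends on the system, so the code reports failure instead of assuming it.

```c
#include <locale.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1 so results are easy to compare. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Byte-wise comparison: strcmp() never consults the current locale. */
int compare_bytes(const char *a, const char *b) {
    return sign(strcmp(a, b));
}

/* Locale-aware comparison: strcoll() uses the LC_COLLATE category
   of the current process-wide locale. */
int compare_collated(const char *a, const char *b) {
    return sign(strcoll(a, b));
}

/* Try to switch the collation category to the named locale.
   Returns 1 on success, 0 if the locale is not available. */
int use_collation_locale(const char *name) {
    return setlocale(LC_COLLATE, name) != NULL;
}
```

With these, `compare_bytes("Zebra", "apple")` gives the same answer whatever locale is selected, while `compare_collated` matches it only in the "C" locale; under a language-specific locale it may well order the two words the other way round.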
I18N is a difficult issue at best, and it isn't helped by people who try
cutting corners. Unicode/ISO-10646 and UTF-8 play an important role in
making the problem easier to handle, but they're a fairly low-level part
of the grand scheme of things. They're like the wheels on a car:
indispensable for a smooth ride, but one would generally still like seats
and a steering wheel, too.
Debian switching to EGLIBC
Posted May 8, 2009 16:45 UTC (Fri) by spitzak (guest, #4593)
The most common reason to sort strings is so that a set can be implemented and identical strings found. It would not matter if the sorting order had nothing to do with English or any other language; what does matter is that every program in the world sorts the strings in exactly the same way.
You are right that strcmp() does what is wanted. I believe I was remembering some scripting languages where string comparison changed depending on the locale, which was a nightmare because people rarely test in other locales.
The printf problem is really a pain and forces me to always set the locale to C at startup. I need to use printf, sometimes hidden inside scripting languages where I can't change it, to write data files that are expected to be readable by the same program even if the locale is different.
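For illustration, a minimal sketch of one way to keep printf-style output stable in C. It is a narrower variant of forcing the whole locale to C: the rest of the user's locale is left alone and only LC_NUMERIC is pinned. The function names are hypothetical, and this does nothing for a scripting runtime that changes the locale again behind your back.

```c
#include <locale.h>
#include <stdio.h>

/* Pin only the numeric category to "C" so that '.' is always the
   decimal separator in printf-family output. Returns 1 on success. */
int pin_numeric_locale(void) {
    return setlocale(LC_NUMERIC, "C") != NULL;
}

/* Hypothetical helper: format one value for a data file.
   Returns the number of characters snprintf() would have written. */
int format_value(char *buf, size_t size, double value) {
    return snprintf(buf, size, "%.2f", value);
}
```

A program would typically call `setlocale(LC_ALL, "")` first to pick up the user's settings and then `pin_numeric_locale()`, so that data files written through `format_value` can be read back regardless of the locale they were written under.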
strcoll() is approximately the right idea: make it perfectly clear that this is some human-oriented sorting function. I think the real solution is to make all such functions take the locale as an argument, rather than using a static variable.
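For what it's worth, POSIX.1-2008 specifies functions along these lines: newlocale(3) creates a locale object and strcoll_l(3) takes it as an explicit argument. Below is a minimal sketch, assuming a system that provides those interfaces; the wrapper name `compare_in_locale` is made up for illustration, and some platforms may declare these functions in a different header.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

/* Compare a and b under the collation rules of the named locale without
   touching the process-wide locale. Stores the strcoll_l() result in
   *result and returns 1 on success, or returns 0 if the locale object
   could not be created (for example, the locale is not installed). */
int compare_in_locale(const char *locale_name,
                      const char *a, const char *b, int *result) {
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return 0;
    *result = strcoll_l(a, b, loc);
    freelocale(loc);
    return 1;
}
```

Creating and freeing the locale object on every call keeps the sketch short; code that compares many strings would presumably create it once and reuse it.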