Locales and UTF-8
Posted May 8, 2009 13:55 UTC (Fri) by nix (subscriber, #2304)
In reply to: Locales and UTF-8 by spitzak
Parent article: Debian switching to EGLIBC
The people who manipulate strings are people writing routines taking textual input, routines producing
textual output, routines modifying text strings, routines manipulating
text strings in *any* way that depends on anything a human would care
about. I can see how this could be considered rare.
Touching individual bytes in a Unicode string outside of something like
serialization makes as much sense as touching individual bits in it does
(except of course that you have to touch both in order to convert the
UTF-8 into actual Unicode code points and back).
This is all library stuff, yes, sure... except when it isn't.
Posted May 8, 2009 15:18 UTC (Fri) by endecotp (guest, #36428) [Link] (11 responses)
For example, if I'm parsing a UTF-8 CSV file into rows and columns then I can treat it as a byte stream, since the punctuation characters (e.g. the comma, double quote, and newline) are all single bytes and those bytes are guaranteed not to occur in multi-byte characters.
As another example, I can search-and-replace one character sequence with another character sequence by treating the text, pattern and replacement as byte sequences - even if there are multibyte characters in the text, pattern, or replacement.
My experience is that the only places where UTF-8 cannot be treated as byte streams are: GUI and similar I/O, sorting and case conversion when the result needs to look right for a human, and interfaces that specify an encoding other than UTF-8.
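To make the byte-stream argument concrete, here is a minimal C sketch (mine, not endecotp's actual code; it deliberately ignores quote stripping and escaped quotes): every byte of a UTF-8 multi-byte character has its high bit set, so a byte-level scan for ASCII delimiters like ',' (0x2C) or '"' (0x22) can never fire in the middle of a character.

    #include <stdio.h>

    /* Split one line of UTF-8 CSV on unquoted commas, treating the
     * text purely as bytes.  Safe because UTF-8 lead and continuation
     * bytes are all >= 0x80 and so never collide with ASCII ',' or '"'.
     * Quote characters are left in the output for simplicity. */
    static void split_csv_line(char *line)
    {
        int in_quotes = 0;
        char *field = line;

        for (char *p = line; ; p++) {
            if (*p == '"') {
                in_quotes = !in_quotes;
            } else if ((*p == ',' && !in_quotes) || *p == '\0') {
                int at_end = (*p == '\0');
                *p = '\0';
                printf("field: [%s]\n", field);
                if (at_end)
                    break;
                field = p + 1;
            }
            /* Bytes >= 0x80 simply pass through untouched. */
        }
    }

    int main(void)
    {
        char line[] = "na\xC3\xAFve,\"a, quoted field\",42";
        split_csv_line(line);
        return 0;
    }

The same reasoning covers the byte-wise search-and-replace case: a pattern that is itself valid UTF-8 can only ever match at character boundaries.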
Posted May 8, 2009 16:56 UTC (Fri) by spitzak (guest, #4593) [Link]
I do not understand why so many otherwise intelligent and experienced software engineers turn into such complete morons when they think about UTF-8.
Even more annoying: programmers do not seem to have this mental block when presented with the older multibyte Asian encodings, or with UTF-16, which is variable-length as well. For some reason people only assign these made-up problems to UTF-8.
Posted May 8, 2009 17:28 UTC (Fri) by nix (subscriber, #2304) [Link] (5 responses)
A lot of Unicode punctuation is up above U+2000, for instance, including U+2010 (the hyphen) and U+2003 (the em space). Helpfully this is somewhat jumbled up with nonpunctuation stuff like numeric superscripts.
The rest of your points stand: if all you want to do is manipulate ASCII characters in a UTF-8 stream, you can do that without being Unicode-aware at all. But this will tend to annoy your users when they type in a € and find that your program can't manipulate it because it's U+20AC. It'll annoy your users even more to find that they can remove some characters, but that others take several keystrokes to remove and miraculously transmogrify into other characters as they do so. (More mess: the Euro cent sign is U+00A2!)
I suppose misbehaviour from this change is unlikely *if* you're in the US. Anywhere else? Bite your knuckles.
Posted May 8, 2009 17:35 UTC (Fri) by ajross (guest, #4563) [Link] (2 responses)
The product I work on for my day job does natural language processing of internet content in arbitrary languages and encodings. I did the encoding transformation and "word breaker" lexical analyzer for it. The whole system works by transforming the data into UTF-8 and operating on it at the byte level. So sorry to pull the "domain expert" card here, but you're basically just wrong. This stuff has its subtleties, but it's absolutely not something that requires special API support. And if we *had* to pick an API, I can guarantee you it wouldn't be ANSI C's locale stuff, which is a complete non-starter for many of the reasons already detailed.
Posted May 8, 2009 18:47 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)
even *much*, it's pretty nasty). And, as I said, it'll be interesting to see what breaks. (I suspect not much will: most things that need to be *are* Unicode-aware, on Debian at least. But it might get hair-raising.)
-- N., just wasted three months auditing and fixing countless places in a horrible financial application to allow for UTF-8 awareness (the simplest example: lots of places in that software cared if something was 'alphanumeric', for instance, and isalpha() really doesn't work). It could have been worse: before I came along they were planning to move to UCS-2, hark at the forward planning and lovely C-compatibility...
Posted May 8, 2009 21:55 UTC (Fri) by spitzak (guest, #4593) [Link]
There are only three kinds of non-ASCII bytes in UTF-8:
1. bytes that are not allowed in UTF-8.
2. "second" bytes (continuation bytes)
3. "first" bytes (lead bytes)
I think first & second bytes should pass the isalpha() test. This will allow UTF-8 letters to be put into identifiers and keywords (of course it also allows UTF-8 punctuation and lots of other stuff, but that is about the best that can be done). I also think ctype should not vary depending on locale; this is another thing that causes me nothing but trouble: most programmers revert to doing ">= 'a' && <= 'z'" and thus make their software even less portable.
Probably the ctype tables should add some bits to identify these byte types.
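For concreteness, a sketch of such a classification (mine; these are not glibc's actual ctype bits):

    /* Classify a byte into the three non-ASCII classes listed above,
     * plus ASCII.  0xC0, 0xC1 and 0xF5-0xFF can never appear in valid
     * UTF-8; 0x80-0xBF are continuation ("second") bytes; 0xC2-0xF4
     * are lead ("first") bytes. */
    enum utf8_byte_class {
        UTF8_ASCII,
        UTF8_ILLEGAL,
        UTF8_SECOND,
        UTF8_FIRST
    };

    static enum utf8_byte_class utf8_classify(unsigned char c)
    {
        if (c < 0x80)
            return UTF8_ASCII;
        if (c < 0xC0)
            return UTF8_SECOND;
        if (c == 0xC0 || c == 0xC1 || c > 0xF4)
            return UTF8_ILLEGAL;
        return UTF8_FIRST;
    }

    /* An isalpha()-style test along the lines suggested above: ASCII
     * letters pass, and so does any lead or continuation byte. */
    static int utf8_isalpha(unsigned char c)
    {
        enum utf8_byte_class k = utf8_classify(c);
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || k == UTF8_FIRST || k == UTF8_SECOND;
    }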
Posted May 8, 2009 21:48 UTC (Fri) by spitzak (guest, #4593) [Link]
I do hope a program trying to parse for a period only looks for the ASCII period. As soon as you start saying other Unicode characters are "equivalent", you get a huge mess, because different programs may disagree on what is in the equivalence set, and Unicode could add a new character at any time. We already have quite a mess with newlines; let's not make it worse! The only software that should be looking for Unicode punctuation is actual glyph layout and rendering.
Posted May 11, 2009 16:01 UTC (Mon) by endecotp (guest, #36428) [Link]
I was referring to the punctuation characters used to delimit CSV, which are all ASCII characters (as are those used in XML).
> The rest of your points stand: if all you want to do is manipulate
> ASCII characters in a UTF-8 stream, you can do that without being
> Unicode-aware
My points were that you can do all of those things (e.g. search and replace) EVEN IF the input is non-ASCII.
Your example of delete key behaviour is an interesting one that comes under my category of "GUI and similar I/O". It is clearly necessary to delete back as far as the last character-starting byte. Doing so is not very hard.
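For the record, a sketch of that backward scan (my code, not endecotp's): UTF-8 continuation bytes are exactly those matching 10xxxxxx, so the editor only needs to step back over them to find the byte that starts the character.

    #include <stddef.h>

    /* Given a byte position just past the character to delete, return
     * the position of that character's first byte.  Assumes the buffer
     * holds valid UTF-8 up to pos. */
    static size_t utf8_prev_boundary(const char *s, size_t pos)
    {
        while (pos > 0 && ((unsigned char)s[pos - 1] & 0xC0) == 0x80)
            pos--;          /* skip continuation bytes (10xxxxxx) */
        if (pos > 0)
            pos--;          /* step over the lead or ASCII byte   */
        return pos;
    }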
> I suppose misbehaviour from this change is unlikely *if* you're in
> the US. Anywhere else? Bite your knuckles.
I am not in the U.S., and my code works with UTF-8 without the sort of major headaches that you allude to.
Posted May 10, 2009 14:46 UTC (Sun) by epa (subscriber, #39769) [Link] (3 responses)
> For example, if I'm parsing a UTF-8 CSV file into rows and columns then I can treat it as a byte stream, since the punctuation characters (e.g. the comma, double quote, and newline) are all single bytes and those bytes are guaranteed not to occur in multi-byte characters.
This is true if you know that your input is valid UTF-8. However, if it might be malformed, then your program could end up splitting a row in the middle of an (invalid) character sequence and producing different invalid sequences as output. This is often fine: garbage in, garbage out. But there can be interesting security holes where malformed UTF-8 is treated differently by different code. Luckily, checking for valid UTF-8 is a fast operation, so there is no reason not to check every string that comes from the user before doing anything with it - even if the processing you do is just treating it as a byte stream.
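Since the cost claim is easy to check, here is a sketch of such a validity test (mine, following the Unicode well-formedness table; it rejects bad lead bytes, truncated and overlong sequences, surrogates, and anything above U+10FFFF):

    #include <stddef.h>

    static int utf8_valid(const unsigned char *s, size_t len)
    {
        for (size_t i = 0; i < len; ) {
            unsigned char c = s[i];
            size_t n;
            unsigned char lo = 0x80, hi = 0xBF;  /* 2nd-byte limits */

            if (c < 0x80) { i++; continue; }     /* ASCII           */
            else if (c >= 0xC2 && c <= 0xDF) n = 2;
            else if (c >= 0xE0 && c <= 0xEF) {
                n = 3;
                if (c == 0xE0) lo = 0xA0;        /* no overlongs    */
                if (c == 0xED) hi = 0x9F;        /* no surrogates   */
            } else if (c >= 0xF0 && c <= 0xF4) {
                n = 4;
                if (c == 0xF0) lo = 0x90;        /* no overlongs    */
                if (c == 0xF4) hi = 0x8F;        /* <= U+10FFFF     */
            } else {
                return 0;                        /* bad lead byte   */
            }

            if (i + n > len) return 0;           /* truncated       */
            if (s[i + 1] < lo || s[i + 1] > hi) return 0;
            for (size_t j = 2; j < n; j++)
                if ((s[i + j] & 0xC0) != 0x80) return 0;
            i += n;
        }
        return 1;
    }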
Posted May 11, 2009 16:06 UTC (Mon) by ajross (guest, #4563) [Link] (1 responses)
Avoiding UTF-8 in the blind expectation that it somehow makes your code more "secure" is just plain wrong. This kind of mistake is exactly what I'm talking about. People attribute to encoding transformation and I18N all sorts of complexities that aren't actually there in practice.
Posted May 19, 2009 9:18 UTC (Tue) by epa (subscriber, #39769) [Link]
Posted May 11, 2009 17:27 UTC (Mon) by spitzak (guest, #4593) [Link]
Errors in UTF-8 should be treated as single-byte entities: four 4-byte lead bytes in a row are four errors, not a single 4-byte error. You can't split an error if it is only one byte long.
This also means that ASCII characters cannot be "inside an error" so that errors have zero effect on programs that are looking for ASCII only.
It also means it is impossible to make a pointer "inside" an error or to split one. It is also vital to treat errors this way (even if converting to other encodings) so that concatenation to a string ending in an error cannot convert a good character at the start of the next string into an error.
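A sketch of a decoder that follows this rule (my code, not spitzak's; for brevity it does not reject overlong forms or surrogates): on any malformed byte it consumes exactly one byte and reports one error, so the next byte is examined afresh and a good character after an error is never swallowed.

    #include <stddef.h>

    /* Decode one unit from s (len > 0 bytes available).  Returns the
     * code point and sets *used to the bytes consumed; any malformed
     * byte is consumed alone and reported as U+FFFD. */
    static unsigned decode_one(const unsigned char *s, size_t len,
                               size_t *used)
    {
        unsigned c = s[0];
        size_t n;

        if (c < 0x80) { *used = 1; return c; }       /* ASCII         */
        else if (c >= 0xC2 && c <= 0xDF) n = 2;
        else if (c >= 0xE0 && c <= 0xEF) n = 3;
        else if (c >= 0xF0 && c <= 0xF4) n = 4;
        else { *used = 1; return 0xFFFD; }           /* bad lead byte */

        if (n > len) { *used = 1; return 0xFFFD; }   /* truncated     */

        c &= 0x7F >> n;                    /* payload bits of the lead */
        for (size_t i = 1; i < n; i++) {
            if ((s[i] & 0xC0) != 0x80) {
                *used = 1;                 /* the lead byte alone is   */
                return 0xFFFD;             /* the error; retry at s+1  */
            }
            c = (c << 6) | (s[i] & 0x3F);
        }
        *used = n;
        return c;
    }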
Posted May 8, 2009 16:49 UTC (Fri) by spitzak (guest, #4593) [Link] (2 responses)
UTF-8 is in fact trivial. You are basically doing exactly what I am complaining about: panicking that there is some magical problem with not looking for the character boundaries. Try comparing it to words: how much of a word processor is able to ignore word boundaries? Almost all of it. But that does not somehow make it impossible for word wrap and word deletion to work.
It's not rocket science. The problem is people who are so convinced it is that they complicate things to no end and are hurting I18N and everybody.
Posted May 8, 2009 16:57 UTC (Fri) by ajross (guest, #4563) [Link] (1 responses)
Really, this stuff is easy once you get used to it.
Posted May 8, 2009 17:33 UTC (Fri) by nix (subscriber, #2304) [Link]
Most of it *is* easy, as long as you can treat your strings as streams of UTF-8 chars. It's trivial to decode, and it's just as trivial to interpose a wrapper so that your strings *appear* to contain single bytes with arbitrarily large values :) but it does require a bit of extra work. (I'm just thinking here of how long it took to get zsh's Unicode-awareness right. Its ZLE wheel-reimplementation of readline was the trickiest part, which is not surprising.)
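A sketch of the sort of wrapper meant here (mine, and it assumes the string has already been validated as UTF-8):

    #include <stdio.h>

    /* Advance *sp over one UTF-8 character and return it as a single
     * arbitrarily large value (the code point). */
    static unsigned utf8_next(const char **sp)
    {
        const unsigned char *s = (const unsigned char *)*sp;
        unsigned c = *s++;
        int extra = 0;

        if (c >= 0xF0)      { c &= 0x07; extra = 3; }
        else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
        else if (c >= 0xC0) { c &= 0x1F; extra = 1; }
        while (extra--)
            c = (c << 6) | (*s++ & 0x3F);
        *sp = (const char *)s;
        return c;
    }

    int main(void)
    {
        const char *p = "zsh \xE2\x82\xAC";   /* "zsh €" */
        while (*p)
            printf("U+%04X\n", utf8_next(&p));
        return 0;
    }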