Um, not all the punctuation characters are single bytes. A huge variety of
punctuation is up above U+2000, for instance, including U+2010 (the
hyphen) and U+2003 (the em space). Helpfully this is somewhat jumbled up
with nonpunctuation stuff like numeric superscripts.
The rest of your points stand: if all you want to do is manipulate ASCII
characters in a UTF-8 stream, you can do that without being Unicode-aware
at all. But this will tend to annoy your users when they type in a € and
find that your program can't manipulate it because it's U+20AC. It'll
annoy your users even more to find that they can remove some characters,
but that others take several keystrokes to remove and miraculously
transmogrify into other characters as they do so. (More mess: the Euro
cent sign is U+00A2!)
I suppose misbehaviour from this change is unlikely *if* you're in the US.
Anywhere else? Bite your knuckles.
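
To make the byte counts concrete, here is a small Python sketch (not part of the original comment; the helper name `utf8_len` is made up for illustration). It encodes the characters mentioned above and reports how many bytes each one occupies in UTF-8.

```python
def utf8_len(ch):
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Characters mentioned in the comment, plus an ASCII comma for contrast.
samples = {
    "COMMA": ",",
    "CENT SIGN": "\u00a2",
    "EM SPACE": "\u2003",
    "HYPHEN": "\u2010",
    "EURO SIGN": "\u20ac",
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s)")
```

Only the ASCII comma is a single byte; the cent sign takes two bytes and the three characters above U+2000 take three each.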
Posted May 8, 2009 17:35 UTC (Fri) by ajross (subscriber, #4563)
You're simultaneously overstating the complexity of this problem and the ability of the ANSI C locale facility to solve it.
The product I work on for my day job does natural language processing of internet content in arbitrary languages and encodings. I did the encoding transformation and "word breaker" lexical analyzer for it. The whole system works by transforming the data into UTF-8 and operating on it at the byte level. So sorry to pull the "domain expert" card here, but you're basically just wrong. This stuff has its subtleties, but it's absolutely not something that requires special API support. And if we *had* to pick an API, I can guarantee you it wouldn't be ANSI C's locale stuff, which is a complete non-starter for many of the reasons already detailed.
Locales and UTF-8
Posted May 8, 2009 18:47 UTC (Fri) by nix (subscriber, #2304)
I certainly don't think the ANSI C locale facility solves everything (or
even *much*, it's pretty nasty). And, as I said, it'll be interesting to
see what breaks. (I suspect not much will: most things that need to be
*are* Unicode-aware, on Debian at least. But it might get hair-raising.)
-- N., just wasted three months auditing and fixing countless places in a
horrible financial application to allow for UTF-8 awareness (the simplest
example: lots of places in that software cared if something
was 'alphanumeric', for instance, and isalpha() really doesn't work). It
could have been worse: before I came along they were planning to move to
UCS-2, hark at the forward planning and lovely C-compatibility...
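
As a rough illustration of why a per-byte alphabetic test falls over on UTF-8 data, here is a Python analogy (an assumption-laden sketch, not the financial application's code). `bytewise_alpha` stands in for calling isalpha() on each byte in the "C" locale, where only ASCII letters qualify; `codepoint_alpha` decodes first. Both function names are hypothetical.

```python
def bytewise_alpha(data):
    """True only if every byte is an ASCII letter.

    This mimics testing each byte separately with an ASCII-only
    classifier, which is the failure mode described above.
    """
    return all(bytes([b]).isalpha() for b in data)

def codepoint_alpha(data):
    """Decode the UTF-8 bytes first, then test whole characters."""
    return data.decode("utf-8").isalpha()

word = "café".encode("utf-8")  # the final letter becomes two bytes: 0xC3 0xA9

print(bytewise_alpha(word))   # the two high-bit bytes are rejected
print(codepoint_alpha(word))  # the decoded text is all letters
```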
Locales and UTF-8
Posted May 8, 2009 21:55 UTC (Fri) by spitzak (guest, #4593)
Yes, isalpha() and the rest of ctype are one thing that should be fixed. There are only 3 types of byte with the high bit set:
1. bytes that are not allowed in UTF-8.
2. "second" bytes
3. "first" bytes
I think first and second bytes should pass the isalpha() test. This will allow UTF-8 letters to be put into identifiers and keywords (of course it also allows UTF-8 punctuation and lots of other stuff, but that is about the best that can be done). I also think ctype should not vary depending on locale. This is another thing that causes me nothing but trouble: most programmers revert to doing ">='a' && <='z'" and thus make their software even less portable.
Probably the ctype tables should add some bits to identify these byte types.
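
The three byte types can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposal for the real ctype tables: the function names `classify_byte` and `is_identifier_byte` are invented here, and the byte ranges are those of UTF-8 as defined in RFC 3629 (0x80-0xBF for "second" bytes, 0xC2-0xF4 for "first" bytes, everything else with the high bit set never appears).

```python
def classify_byte(b):
    """Classify one byte value (0..255) by its role in UTF-8."""
    if b < 0x80:
        return "ascii"
    if b <= 0xBF:
        return "second"   # continuation byte, 10xxxxxx
    if 0xC2 <= b <= 0xF4:
        return "first"    # lead byte of a 2-, 3- or 4-byte sequence
    return "invalid"      # 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8

def is_identifier_byte(b):
    """Permissive identifier test: ASCII alphanumerics, '_', and any
    "first" or "second" byte, as suggested above."""
    kind = classify_byte(b)
    if kind == "ascii":
        return chr(b).isalnum() or b == ord("_")
    return kind in ("first", "second")

euro = "\u20ac".encode("utf-8")
print([classify_byte(b) for b in euro])
print(all(is_identifier_byte(b) for b in "na\u00efve_1".encode("utf-8")))
```

As the comment says, this lets non-letters such as the euro sign through as well; the test only guarantees that a multi-byte sequence is never split by the tokenizer.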
Locales and UTF-8
Posted May 8, 2009 21:48 UTC (Fri) by spitzak (guest, #4593)
Actually the Euro is U+20AC. It is 0x80 in the CP1252 encoding used by Microsoft but not in official Unicode. However I do think the Unicode standard should just realize that CP1252 is really common and change the characters 0x80-0x9F to what it defines.
I do hope a program trying to parse for a period only looks for the ASCII period. As soon as you start saying other Unicode characters are "equivalent" then you get a huge mess because different programs may disagree on what is in the equivalent set, and Unicode could add a new character at any time. We already have quite a mess with newlines, let's not make it worse! The only software that should be looking for Unicode punctuation is actual glyph layout and rendering.
Locales and UTF-8
Posted May 11, 2009 16:01 UTC (Mon) by endecotp (guest, #36428)
> not all the punctuation characters are single bytes
I was referring to the punctuation characters used to delimit CSV, which are all ASCII characters (as are those used in XML).
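
A minimal sketch of that point in Python, assuming a naive comma-separated line with no quoting or escaping (the function name `split_fields` is made up for illustration). Every byte of a multi-byte UTF-8 sequence has its high bit set, so the ASCII comma byte 0x2C can never appear inside one, and splitting on raw bytes is safe.

```python
def split_fields(line, delimiter=b","):
    """Split a UTF-8 encoded line on an ASCII delimiter, byte by byte.

    No decoding is needed: bytes below 0x80 only ever represent
    themselves in UTF-8.
    """
    return line.split(delimiter)

line = "na\u00efve,caf\u00e9,5 \u20ac".encode("utf-8")
fields = split_fields(line)

# Decode only for display; the split itself never looked at characters.
print([f.decode("utf-8") for f in fields])
```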
> The rest of your points stand: if all you want to do is manipulate
> ASCII characters in a UTF-8 stream, you can do that without being
> Unicode-aware
My points were that you can do all of those things (e.g. search and replace) EVEN IF the input is non-ASCII.
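
For example, a byte-level search and replace might look like the following Python sketch (the helper name `replace_utf8` is hypothetical, and both input and pattern are assumed to be valid UTF-8). Because lead bytes and continuation bytes occupy disjoint ranges, a valid UTF-8 pattern can only match at a character boundary, so plain byte substitution does not corrupt neighbouring characters.

```python
def replace_utf8(haystack, old, new):
    """Replace every occurrence of `old` with `new` in UTF-8 bytes.

    `haystack` is bytes; `old` and `new` are str and are encoded once.
    The replacement itself is an ordinary byte-string operation.
    """
    return haystack.replace(old.encode("utf-8"), new.encode("utf-8"))

text = "5 \u20ac for a caf\u00e9".encode("utf-8")
result = replace_utf8(text, "\u20ac", "EUR")
print(result.decode("utf-8"))
```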
Your example of delete key behaviour is an interesting one that comes under my category of "GUI and similar I/O". It is clearly necessary to delete back as far as the last character-starting byte. Doing so is not very hard.
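
A sketch of that backward scan in Python, assuming the buffer holds valid UTF-8 (the function name `delete_last_char` is invented for illustration). Continuation bytes all match the bit pattern 10xxxxxx, so the loop steps back over them until it reaches a character-starting byte. Note that this removes one code point; combining sequences that a user perceives as a single character would need more work.

```python
def delete_last_char(buf):
    """Remove the last UTF-8 encoded code point from a bytes buffer."""
    i = len(buf)
    if i == 0:
        return buf
    i -= 1
    # Step back over continuation bytes (10xxxxxx) to the byte
    # that starts the final character.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:i]

buf = "a\u20ac".encode("utf-8")   # 'a' followed by the 3-byte euro sign
print(delete_last_char(buf))      # one "keystroke" removes all three bytes
```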
> I suppose misbehaviour from this change is unlikely *if* you're in
> the US. Anywhere else? Bite your knuckles.
I am not in the U.S., and my code works with UTF-8 without the sort of major headaches that you allude to.