I certainly don't think the ANSI C locale facility solves everything (or
even *much*, it's pretty nasty). And, as I said, it'll be interesting to
see what breaks. (I suspect not much will: most things that need to be
*are* Unicode-aware, on Debian at least. But it might get hair-raising.)
-- N., just wasted three months auditing and fixing countless places in a
horrible financial application to allow for UTF-8 awareness (the simplest
example: lots of places in that software cared if something
was 'alphanumeric', for instance, and isalpha() looks at one byte at a
time, so it really doesn't work on multibyte UTF-8 text). It
could have been worse: before I came along they were planning to move to
UCS-2. Hark at the forward planning and lovely C-compatibility...
Posted May 8, 2009 21:55 UTC (Fri) by spitzak (guest, #4593)
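Yes, isalpha() and ctype are one thing that should be fixed. There are only three types of byte with the high bit set:
1. bytes that are not allowed in UTF-8 (0xC0, 0xC1, and 0xF5 to 0xFF)
2. "second" bytes (continuation bytes, 0x80 to 0xBF)
3. "first" bytes (lead bytes of a multibyte sequence, 0xC2 to 0xF4)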
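To make the three categories concrete, here is a rough sketch of a classifier in C. The names (utf8_byte_kind, utf8_classify) are made up for illustration and are not from any real library; the byte ranges assume the modern four-byte limit on UTF-8 sequences.

```c
/* Hypothetical names, for illustration only. */
enum utf8_byte_kind {
    UTF8_ASCII,    /* 0x00..0x7F: high bit clear */
    UTF8_INVALID,  /* 0xC0, 0xC1, 0xF5..0xFF: never appear in UTF-8 */
    UTF8_SECOND,   /* 0x80..0xBF: continuation ("second") bytes */
    UTF8_FIRST     /* 0xC2..0xF4: lead ("first") bytes */
};

/* Classify a single byte by value alone; no locale is consulted. */
static enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if (b < 0x80)
        return UTF8_ASCII;
    if (b < 0xC0)
        return UTF8_SECOND;
    if (b < 0xC2 || b > 0xF4)
        return UTF8_INVALID;
    return UTF8_FIRST;
}
```

For example, the "é" in "café" is encoded as 0xC3 0xA9, so utf8_classify() would report a first byte followed by a second byte. Note that this only looks at individual bytes; it does not check that a first byte is followed by the right number of second bytes.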
I think first and second bytes should pass the isalpha() test. This would allow UTF-8 letters to be put into identifiers and keywords (of course it also allows UTF-8 punctuation and lots of other stuff, but that is about the best that can be done at the byte level). I also think ctype should not vary depending on locale. This is another thing that causes me nothing but trouble: most programmers revert to writing "c >= 'a' && c <= 'z'" and thus make their software even less portable.
Probably the ctype tables should add some bits to identify these byte types.
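Something along these lines is one way the idea could look, as a sketch only. The XC_* flag bits and the xc_* functions are hypothetical and are not part of any real <ctype.h>. The table is built from fixed byte values (assuming an ASCII-compatible execution character set), so it never changes with the locale.

```c
/* Hypothetical flag bits for a locale-independent byte table. */
enum {
    XC_ALPHA   = 1 << 0,  /* ASCII letter */
    XC_DIGIT   = 1 << 1,  /* ASCII digit */
    XC_FIRST   = 1 << 2,  /* UTF-8 lead byte, 0xC2..0xF4 */
    XC_SECOND  = 1 << 3,  /* UTF-8 continuation byte, 0x80..0xBF */
    XC_INVALID = 1 << 4   /* 0xC0, 0xC1, 0xF5..0xFF */
};

static unsigned char xc_table[256];
static int xc_ready = 0;

/* Fill the table once. Lazy init like this is not thread-safe. */
static void xc_init(void)
{
    for (int b = 0; b < 256; b++) {
        unsigned char f = 0;
        if ((b >= 0x41 && b <= 0x5A) || (b >= 0x61 && b <= 0x7A))
            f |= XC_ALPHA;
        if (b >= 0x30 && b <= 0x39)
            f |= XC_DIGIT;
        if (b >= 0x80 && b <= 0xBF)
            f |= XC_SECOND;
        if (b >= 0xC2 && b <= 0xF4)
            f |= XC_FIRST;
        if (b == 0xC0 || b == 0xC1 || b >= 0xF5)
            f |= XC_INVALID;
        xc_table[b] = f;
    }
    xc_ready = 1;
}

/* ASCII letters plus any UTF-8 first or second byte count as "alpha". */
static int xc_isalpha(unsigned char b)
{
    if (!xc_ready)
        xc_init();
    return (xc_table[b] & (XC_ALPHA | XC_FIRST | XC_SECOND)) != 0;
}

/* Identifier: starts with "alpha" or '_', continues with those or digits. */
static int xc_is_identifier(const char *s)
{
    if (*s == '\0')
        return 0;
    if (!(xc_isalpha((unsigned char)*s) || *s == '_'))
        return 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if (!(xc_isalpha(b) || (xc_table[b] & XC_DIGIT) || b == '_'))
            return 0;
    }
    return 1;
}
```

With this, xc_is_identifier("caf\xC3\xA9") accepts "café" as an identifier, while a string containing the invalid byte 0xFF is rejected. As noted above, UTF-8 punctuation gets through too, since the table only knows about byte types and not about code points.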