Locales and UTF-8

Posted May 8, 2009 21:55 UTC (Fri) by spitzak (guest, #4593)
In reply to: Locales and UTF-8 by nix
Parent article: Debian switching to EGLIBC

Yes, isalpha() and the rest of ctype are one thing that should be fixed. There are only three types of byte with the high bit set:

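1. bytes that are not allowed in UTF-8 (0xC0, 0xC1, and 0xF5 through 0xFF)
2. "second" bytes, i.e. continuation bytes (0x80 through 0xBF)
3. "first" bytes, i.e. lead bytes of a multi-byte sequence (0xC2 through 0xF4)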

I think first and second bytes should pass the isalpha() test. This would allow UTF-8 letters to be put into identifiers and keywords (of course it also lets through UTF-8 punctuation and lots of other stuff, but that is about the best that can be done). I also think ctype should not vary depending on locale. This is another thing that causes me nothing but trouble: most programmers revert to doing ">='a' && <='z'" and thus make their software even less portable.
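A rough sketch of what such a locale-independent test might look like, for illustration only. The names ident_isalpha() and ident_span() are made up here and are not part of any libc. The sketch assumes the input is UTF-8 (and therefore ASCII-compatible), which is why it can get away with fixed byte ranges:

```c
#include <stddef.h>

/* Hypothetical helper, not a libc function: a byte-wise "letter" test
 * that never consults the locale. Assumes an ASCII-compatible encoding. */
static int ident_isalpha(unsigned char b)
{
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z'))
        return 1;   /* ASCII letters */
    if (b >= 0x80 && b <= 0xBF)
        return 1;   /* "second" (continuation) bytes */
    if (b >= 0xC2 && b <= 0xF4)
        return 1;   /* "first" (lead) bytes */
    return 0;       /* digits, ASCII punctuation, bytes invalid in UTF-8 */
}

/* Hypothetical helper: number of leading bytes of s that could belong
 * to an identifier (letters as defined above, plus underscore). */
static size_t ident_span(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' &&
           (ident_isalpha((unsigned char)s[n]) || s[n] == '_'))
        n++;
    return n;
}
```

With this, ident_span("caf\xC3\xA9 = 1") covers all five bytes of "café" and stops at the space. It would just as happily accept a UTF-8 punctuation character, which is the trade-off mentioned above.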

Probably the ctype tables should add some bits to identify these byte types.
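Something like the following is one way the extra bits could look. This is only a sketch: the CT_* flag names, ct_table, and ct_table_init() are invented for illustration, and real ctype tables in glibc or EGLIBC have their own private layout that this does not try to match.

```c
/* Hypothetical flag bits for a 256-entry classification table. */
enum {
    CT_ALPHA       = 1 << 0,  /* ASCII letter */
    CT_UTF8_FIRST  = 1 << 1,  /* lead byte, 0xC2 through 0xF4 */
    CT_UTF8_SECOND = 1 << 2,  /* continuation byte, 0x80 through 0xBF */
    CT_UTF8_BAD    = 1 << 3   /* 0xC0, 0xC1, 0xF5 through 0xFF */
};

static unsigned char ct_table[256];

/* Fill the table once; the contents never depend on the locale. */
static void ct_table_init(void)
{
    for (int b = 0; b < 256; b++) {
        unsigned char flags = 0;
        if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z'))
            flags |= CT_ALPHA;
        else if (b >= 0x80 && b <= 0xBF)
            flags |= CT_UTF8_SECOND;
        else if (b == 0xC0 || b == 0xC1 || b >= 0xF5)
            flags |= CT_UTF8_BAD;
        else if (b >= 0xC2)
            flags |= CT_UTF8_FIRST;
        ct_table[b] = flags;
    }
}

/* True if byte c has any of the bits in mask set. */
#define CT_TEST(c, mask) ((ct_table[(unsigned char)(c)] & (mask)) != 0)
```

A caller that wants the permissive identifier test would then check CT_TEST(c, CT_ALPHA | CT_UTF8_FIRST | CT_UTF8_SECOND), while a validator could look at CT_UTF8_BAD alone.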



Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds