
Locales and UTF-8

Posted May 8, 2009 16:49 UTC (Fri) by spitzak (guest, #4593)
In reply to: Locales and UTF-8 by nix
Parent article: Debian switching to EGLIBC

Hmm. I seem to be able to cat a UTF-8 file to my UTF-8 terminal and it works perfectly. Yet cat has no concept whatsoever of UTF-8 and quite likely is splitting the text into blocks right in the middle of UTF-8 characters! How is this possible?
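To make the point concrete, here is a minimal Python sketch of a cat-like copy loop. It is not how any real cat is implemented; `copy_in_blocks` and the block size of 3 are made up for illustration. The block size is deliberately tiny so that a block boundary lands inside a multi-byte character, yet the output is still byte-for-byte identical to the input, which is all the terminal needs.

```python
import io

def copy_in_blocks(src, dst, block_size):
    """Copy raw bytes from src to dst in fixed-size blocks.

    Hypothetical helper: it never inspects the bytes, so it has no idea
    where UTF-8 characters begin or end.
    """
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)

text = "naïve café, 日本語"
data = text.encode("utf-8")
block_size = 3  # assumption: chosen so a boundary falls mid-character

# A block boundary is "mid-character" if the first byte of the next
# block is a UTF-8 continuation byte (bit pattern 10xxxxxx).
splits_mid_character = any(
    (data[i] & 0xC0) == 0x80 for i in range(block_size, len(data), block_size)
)

dst = io.BytesIO()
copy_in_blocks(io.BytesIO(data), dst, block_size)
result = dst.getvalue()
```

The copy splits characters and it does not matter: the reader at the other end sees the same byte stream it would have seen from a single write.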

UTF-8 is in fact trivial. You are basically doing exactly what I am complaining about: panicking that there is some magical problem with not looking for the character boundaries. Try comparing it to words: how much of a word processor is able to ignore word boundaries? Almost all of it. But that does not somehow make it impossible for word wrap and word deletion to work.
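For the few places that do need character boundaries, finding them is a one-line test: a byte starts a character unless its top two bits are 10. The sketch below assumes valid UTF-8 input and treats "character" as a code point, not a user-perceived grapheme; the function names are invented for this example.

```python
def is_char_start(byte):
    """True unless the byte is a UTF-8 continuation byte (10xxxxxx)."""
    return (byte & 0xC0) != 0x80

def char_starts(data):
    """Byte offsets at which a code point begins."""
    return [i for i, b in enumerate(data) if is_char_start(b)]

def delete_last_char(data):
    """Drop the final code point by backing up to its lead byte."""
    i = len(data)
    while i > 0:
        i -= 1
        if is_char_start(data[i]):
            break
    return data[:i]

sample = "aé日".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
starts = char_starts(sample)
shorter = delete_last_char(sample)
```

Only the code that does something like "delete one character" has to call this; everything else can keep passing bytes around untouched.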

It's not rocket science. The problem is people who are so convinced it is rocket science that they complicate things no end, and that hurts I18N and everybody else.


Locales and UTF-8

Posted May 8, 2009 16:57 UTC (Fri) by ajross (subscriber, #4563)

Amen. I've found the same thing -- developers who have no trouble with complicated algorithms and who have exhaustive knowledge of their platforms at all levels still turn into quivering voodoo practitioners when it comes to I18N stuff. All I can think is that because the data involved contains indecipherable foreign text, they get fooled into thinking the code for handling it must be equally inscrutable.

Really, this stuff is easy once you get used to it.

Locales and UTF-8

Posted May 8, 2009 17:33 UTC (Fri) by nix (subscriber, #2304)

Note that at no point did I say that it was horribly hard to deal with
streams of UTF-8 chars. It's trivial to decode, and it's just as trivial
to interpose a wrapper so that your strings *appear* to contain single
bytes with arbitrarily large values :) but it does require a bit of extra
work. (I'm just thinking here of how long it took to get zsh's
Unicode-awareness right. Its ZLE wheel-reimplementation of readline was
the trickiest part, which is not surprising.)
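
As a rough illustration of that kind of wrapper (nothing to do with how zsh actually does it), here is a Python sketch that turns a stream of arbitrarily split byte chunks into text, using the standard library's incremental UTF-8 decoder. `decode_stream` is a name made up for this example.

```python
import codecs

def decode_stream(chunks):
    """Yield decoded text from byte chunks that may split characters.

    The incremental decoder buffers a trailing partial character and
    completes it when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

raw = "日本語".encode("utf-8")                      # 3 characters, 9 bytes
chunks = [raw[i:i + 2] for i in range(0, len(raw), 2)]  # 2-byte chunks split every character
decoded = "".join(decode_stream(chunks))
code_points = [ord(c) for c in decoded]
```

After decoding, each element is a single value that can be far larger than 255, which is the "single bytes with arbitrarily large values" view; the extra work is in whatever code then has to reason about display width, cursor movement and the like.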

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds