Locales and UTF-8

Posted May 11, 2009 17:27 UTC (Mon) by spitzak (guest, #4593)
In reply to: Locales and UTF-8 by epa
Parent article: Debian switching to EGLIBC

Invalid UTF-8 is not a problem. In fact, one HUGE advantage of working with UTF-8 is that you can defer handling of invalid sequences until display time, where each bad byte can safely be rendered as the matching CP1252 glyph, or whatever else gives the user a readable result, so they can figure out what went wrong. Converting earlier can cause security holes and other errors.
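A minimal Python sketch of that display-time fallback, assuming the raw bytes are carried around untouched until they have to be shown. The handler name "cp1252_fallback" and the helper to_display are made up for this illustration; codecs.register_error is the standard library hook for custom decode error handlers.

```python
import codecs


def _cp1252_fallback(exc):
    # Only decoding errors are handled; anything else is re-raised.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # CP1252 is a single-byte encoding, so each bad byte maps to one glyph.
    # The few byte values CP1252 leaves undefined become U+FFFD.
    return bad.decode("cp1252", "replace"), exc.end


# "cp1252_fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_display(raw: bytes) -> str:
    """Convert to text only at display time; valid UTF-8 passes through."""
    return raw.decode("utf-8", "cp1252_fallback")


print(to_display(b"caf\xe9"))          # Latin-1/CP1252 bytes in a UTF-8 stream
print(to_display("café".encode()))     # well-formed UTF-8 is unchanged
```

The stored bytes are never modified here; only the string handed to the display layer is.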

Errors in UTF-8 should be treated as single-byte entities. Four four-byte lead bytes in a row are four errors, not a single four-byte error. An error that is only one byte long cannot be split.
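One way to sketch this rule in Python: walk the buffer and emit either a complete, valid UTF-8 sequence or a single error byte. The function name utf8_units is invented for this example, and it leans on Python's strict built-in UTF-8 decoder to decide what counts as valid (so overlong forms and encoded surrogates are errors too).

```python
def utf8_units(data: bytes):
    """Split data into (chunk, is_valid) pairs.

    A valid chunk is one complete UTF-8 sequence (1 to 4 bytes).
    Anything else is reported as a one-byte error.
    """
    units = []
    i = 0
    while i < len(data):
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                chunk.decode("utf-8")  # strict validation
            except UnicodeDecodeError:
                continue
            units.append((chunk, True))
            i += len(chunk)
            break
        else:
            # No valid sequence starts here: exactly one byte is the error.
            units.append((data[i:i + 1], False))
            i += 1
    return units


# Four four-byte lead bytes in a row: four separate errors.
print(utf8_units(b"\xf0\xf0\xf0\xf0"))
# A truncated four-byte sequence followed by ASCII: three errors, then "b".
print(utf8_units(b"a\xf0\x9f\x98b"))
```

This is written for clarity, not speed; a real decoder would inspect the lead byte instead of trying each length in turn.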

This also means that ASCII characters cannot be "inside an error", so errors have zero effect on programs that look only for ASCII.
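To illustrate, here is a small Python comparison. naive_units is a deliberately flawed, hypothetical decoder that trusts the length announced by a lead byte; it is not taken from any real library. A plain byte search for "/" is unaffected by the stray lead byte, while the naive decoder swallows the separator into the "error".

```python
def naive_units(data: bytes):
    """Flawed on purpose: consumes as many bytes as the lead byte claims."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0xF0:
            n = 4
        elif b >= 0xE0:
            n = 3
        elif b >= 0xC0:
            n = 2
        else:
            n = 1
        units.append(data[i:i + n])
        i += n
    return units


path = b"dir\xe2/file"  # a stray three-byte lead byte just before the "/"

# Searching for the ASCII byte still finds the separator.
print(path.split(b"/"))

# The naive decoder hides "/" and "f" inside a three-byte "error".
print(naive_units(path))
```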

It also means it is impossible to make a pointer "inside" an error or to split one. It is also vital to treat errors this way (even when converting to other encodings) so that appending to a string that ends in an error cannot turn a good character at the start of the next string into part of an error.
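The concatenation case can be checked with Python's standard "surrogateescape" error handler, which maps each undecodable byte to its own lone surrogate and so behaves like the per-byte rule for this example. The helper name decode_keep_errors is made up here, and the check covers only this one case, not every possible pair of strings.

```python
def decode_keep_errors(data: bytes) -> str:
    # Each undecodable byte becomes one lone surrogate (U+DC80..U+DCFF),
    # so the original bytes can be recovered later.
    return data.decode("utf-8", "surrogateescape")


first = b"abc\xe2"            # ends in a stray three-byte lead byte
second = "é".encode("utf-8")  # starts with a good two-byte character

joined = decode_keep_errors(first + second)

# The good character at the start of the second string survives.
print(joined.endswith("é"))
# Decoding the pieces separately gives the same text as decoding the whole.
print(joined == decode_keep_errors(first) + decode_keep_errors(second))
# The error byte is still recoverable when converting back.
print(joined.encode("utf-8", "surrogateescape") == first + second)
```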


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds