LWN.net Logo

Locales and UTF-8

Locales and UTF-8

Posted May 10, 2009 14:46 UTC (Sun) by epa (subscriber, #39769)
In reply to: Locales and UTF-8 by endecotp
Parent article: Debian switching to EGLIBC

For example, if I'm parsing a UTF-8 CSV file into rows and columns then I can treat it as a byte stream, since the punctuation characters (eg ,"\NL) are all single bytes and those bytes are guaranteed not to occur in multi-byte characters.
This is true if you know that your input is valid UTF-8. However if it might be malformed, then your program could end up splitting a row in the middle of an (invalid) character sequence and producing different invalid sequences as output. This is often fine: garbage in, garbage out. But there can be interesting security holes where malformed UTF-8 is treated differently by different code. Luckily, checking for valid UTF-8 is a fast operation, so there is no reason not to check every string that comes from the user before doing anything with it - even if the processing you do is just treating it as a byte stream.


(Log in to post comments)

Locales and UTF-8

Posted May 11, 2009 16:06 UTC (Mon) by ajross (subscriber, #4563) [Link]

Everything you say is true of ASCII too. You have to validate untrusted input, regardless of what it is. ASCII doesn't have the high bit set, but any ASCII format is by necessity going to have escaping mechanisms that need equivalent validation. For this specific example, you *are* counting your matching quote characters, right? Everything is an "encoding" at some level.

Avoiding UTF-8 in the blind expectation that it somehow makes your code more "secure" is just plain wrong. This kind of mistake is exactly what I'm talking about. People attribute to encoding transformation and I18N all sorts of complexities that aren't actually there in practice.

Locales and UTF-8

Posted May 19, 2009 9:18 UTC (Tue) by epa (subscriber, #39769) [Link]

I agree that using some hacky alternative instead of UTF-8 will not improve security. Nothing I wrote should be taken as a reason to avoid UTF-8. (Though it's not true that you *always* have to include escaping mechanisms for ASCII input - some file formats such as /etc/passwd can get away with being completely stupid and not supporting escaping or accented characters at all.)

Locales and UTF-8

Posted May 11, 2009 17:27 UTC (Mon) by spitzak (guest, #4593) [Link]

Invalid UTF-8 is not a problem. In fact one HUGE advantage of working with UTF-8 is that you can defer invalid UTF-8 until display, where it can safely be changed into the matching CP1252 glyph or whatever is needed to provide the user with a readable result so they can figure out what went wrong. Converting earlier can result in security and other errors.

Errors in UTF-8 should be treated as single byte entities. Four four-byte prefixes in a row are 4 errors, not a single 4-byte error. You can't split an error if it is only one byte long.

This also means that ASCII characters cannot be "inside an error" so that errors have zero effect on programs that are looking for ASCII only.

It also means it is impossible to make a pointer "inside" an error or to split one. It is also vital to treat errors this way (even if converting to other encodings) so that concatenation to a string ending in an error cannot convert a good character at the start of the next string into an error.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds