|
|
Log in / Subscribe / Register

Bad understanding of UTF-8

Bad understanding of UTF-8

Posted Mar 30, 2009 16:08 UTC (Mon) by spitzak (guest, #4593)
In reply to: Bad understanding of UTF-8 by njs
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

Yes I am sure.

The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).

The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.


to post comments

Bad understanding of UTF-8

Posted Mar 31, 2009 4:49 UTC (Tue) by njs (subscriber, #40338) [Link] (3 responses)

> Yes I am sure.

So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".

> The first two references are about programs failing to recognize overlong encodings as being invalid.

Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.

> But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).

I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.

> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.

Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.

Bad understanding of UTF-8

Posted Mar 31, 2009 17:59 UTC (Tue) by spitzak (guest, #4593) [Link] (2 responses)

I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!

I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.

I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.

Bad understanding of UTF-8

Posted Apr 1, 2009 5:12 UTC (Wed) by njs (subscriber, #40338) [Link] (1 responses)

> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".

> An overlong encoding consists of a leading byte with the high bit set. This is an error.

All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.

(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)

Bad understanding of UTF-8

Posted Apr 1, 2009 16:38 UTC (Wed) by spitzak (guest, #4593) [Link]

A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simper and thus more reliable and less likely to have bugs.

Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DOS error. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.

And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds