
Bad understanding of UTF-8

Posted Mar 31, 2009 17:59 UTC (Tue) by spitzak (guest, #4593)
In reply to: Bad understanding of UTF-8 by njs
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!
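To make that concrete, here is a rough Python sketch. The helper name `error_bytes` is made up for illustration, and it leans on how CPython's strict decoder reports error spans via `UnicodeDecodeError.start` and `.end`; other decoders may group the bad bytes differently. It collects every byte the decoder rejects and checks that each one has the high bit set.

```python
def error_bytes(data: bytes) -> list:
    """Return the bytes that CPython's strict UTF-8 decoder rejects."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward is valid
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice we decoded
            bad.extend(data[pos + exc.start : pos + exc.end])
            pos += exc.end
    return bad


# Overlong, stray lead byte before ASCII, lone 0xFF, truncated sequence, valid text
samples = [b"\xc0\x80", b"\xc0A", b"abc\xff", b"\xe2\x82", "é".encode("utf-8")]
for raw in samples:
    # Every rejected byte is >= 0x80; ASCII bytes are never part of an error
    assert all(b >= 0x80 for b in error_bytes(raw))
```

Note that in `b"\xc0A"` only the `0xC0` is reported; the `A` that follows decodes as ordinary ASCII.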

I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.
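One existing scheme along these lines is Python's `surrogateescape` error handler, which maps each undecodable byte 0xNN to the code point U+DC00 + 0xNN, so error bytes land in the 128-long range U+DC80..U+DCFF and round-trip exactly. The sketch below shows that, plus a purely hypothetical `quote_char` helper that uses the adjacent range U+DC00..U+DC7F for quoting; that second half is only an illustration of the idea, not an existing convention, and such quoted characters are not something the `surrogateescape` encoder will turn back into bytes.

```python
ESCAPE_BASE = 0xDC00  # surrogateescape maps byte 0xNN to U+DC00 + 0xNN


def to_text(raw: bytes) -> str:
    """Decode UTF-8, mapping each invalid byte into U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: escaped code points become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def quote_char(ch: str) -> str:
    """Hypothetical: quote an ASCII character into the adjacent U+DC00..U+DC7F range."""
    if len(ch) != 1 or ord(ch) >= 0x80:
        raise ValueError("only single ASCII characters can be quoted")
    return chr(ESCAPE_BASE + ord(ch))


raw = b"bad\xffname"
text = to_text(raw)
assert text == "bad\udcffname"   # 0xFF became U+DCFF
assert to_bytes(text) == raw     # and the original bytes come back unchanged
```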

I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment, when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving the identity of invalid strings. People who think invalid strings will not occur, or that it is acceptable for them to compare equal to, or be silently changed into, other invalid strings or valid strings, are living in a fantasy land.
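A small Python illustration of the identity problem, using made-up filenames: two different invalid byte strings stay distinct as bytes, but become indistinguishable once they are converted early with a lossy decoder (here the standard `errors="replace"` handler, which substitutes U+FFFD).

```python
# Two distinct names, each containing one byte that is invalid in UTF-8
a = b"report\xff.txt"
b = b"report\xfe.txt"

# Compared as raw bytes they are different, as they should be
assert a != b

# Converted early with a lossy decoder, both collapse to the same string
lossy_a = a.decode("utf-8", errors="replace")
lossy_b = b.decode("utf-8", errors="replace")
assert lossy_a == lossy_b
```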



Bad understanding of UTF-8

Posted Apr 1, 2009 5:12 UTC (Wed) by njs (subscriber, #40338)

> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bit set "do not cause any problems with any programs".

> An overlong encoding consists of a leading byte with the high bit set. This is an error.

All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.
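For reference, a quick Python check of both halves of this; the helper names are just for illustration. Valid non-ASCII characters encode entirely to high-bit bytes, while an overlong sequence such as `C0 AF` (an overlong form of "/") is rejected by a strict decoder.

```python
def all_high_bit(ch: str) -> bool:
    """True if every byte of ch's UTF-8 encoding has the high bit set."""
    return all(b & 0x80 for b in ch.encode("utf-8"))


def is_valid_utf8(raw: bytes) -> bool:
    """True if a strict decoder accepts raw."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


assert "é".encode("utf-8") == b"\xc3\xa9"   # both bytes >= 0x80, and perfectly valid
assert all_high_bit("é") and all_high_bit("€")
assert not all_high_bit("A")                # ASCII stays below 0x80
assert not is_valid_utf8(b"\xc0\xaf")       # overlong "/" is refused outright
```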

(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)

Bad understanding of UTF-8

Posted Apr 1, 2009 16:38 UTC (Wed) by spitzak (guest, #4593)

A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simpler and thus more reliable and less likely to have bugs.
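Roughly what I mean, as a Python sketch working on bytes. The identifier grammar here (ASCII letters, digits, underscore, plus any byte 0x80..0xFF) is just an assumed example, not any particular language's rules.

```python
import re

# Any byte with the high bit set is lumped into the "identifier" class,
# so the scanner never needs to know whether the UTF-8 is valid.
IDENT = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*")


def identifiers(source: bytes) -> list:
    """Return identifier tokens found in source, as raw bytes."""
    return IDENT.findall(source)


# Valid UTF-8 (ï = C3 AF) and garbage (FF FE) are both passed through untouched
tokens = identifiers(b"na\xc3\xafve = x1 + \xff\xfe")
assert tokens == [b"na\xc3\xafve", b"x1", b"\xff\xfe"]
```

The scanner never decodes anything, so there is no error path at all: bad bytes come out exactly as they went in.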

Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DoS error. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.
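The kind of reusable routine I have in mind is small. Here is a sketch in Python, where `display_form` is a name made up for this example and U+FFFD is just one reasonable choice of placeholder: it never raises, and the original bytes are kept for everything other than drawing.

```python
def display_form(raw: bytes) -> str:
    """Convert bytes to something drawable; never raises on invalid UTF-8."""
    # Each invalid sequence becomes U+FFFD instead of an exception
    return raw.decode("utf-8", errors="replace")


name = b"caf\xe9.txt"          # Latin-1 bytes, not valid UTF-8
label = display_form(name)     # used only for drawing
assert label == "caf\ufffd.txt"
assert name == b"caf\xe9.txt"  # the real name is untouched and still usable
```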

And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.

