|
|
Log in / Subscribe / Register

Bad understanding of UTF-8

Bad understanding of UTF-8

Posted Apr 1, 2009 5:12 UTC (Wed) by njs (subscriber, #40338)
In reply to: Bad understanding of UTF-8 by spitzak
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".

> An overlong encoding consists of a leading byte with the high bit set. This is an error.

All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.

(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)


to post comments

Bad understanding of UTF-8

Posted Apr 1, 2009 16:38 UTC (Wed) by spitzak (guest, #4593) [Link]

A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simper and thus more reliable and less likely to have bugs.

Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DOS error. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.

And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds