Bad understanding of UTF-8
Posted Apr 1, 2009 5:12 UTC (Wed) by njs (subscriber, #40338)
In reply to: Bad understanding of UTF-8 by spitzak
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".
> An overlong encoding consists of a leading byte with the high bit set. This is an error.
All characters with codepoint >= 128 are encoded in UTF-8 as a sequence of bytes that all have the high bit set (including the leading byte). Having the high bit set is *certainly* not an error; what makes an overlong encoding invalid is that it uses more bytes than the codepoint needs.
I can't tell what your overall point is, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.
(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)
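To make the distinction concrete, here is a minimal sketch in Python (my choice of language; nothing in the thread depends on it). The helper names `high_bit_set` and `decodes_strictly` are hypothetical and exist only for this illustration. It compares a perfectly valid two-byte character with the classic overlong two-byte form of "/", and relies on Python's built-in strict UTF-8 decoder raising an error on the overlong form rather than repairing it.

```python
def high_bit_set(data: bytes) -> bool:
    """Return True if every byte in data has its top bit (0x80) set."""
    return all(b & 0x80 for b in data)


def decodes_strictly(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# U+00E9 (e with acute accent): a valid two-byte sequence.
valid = "\u00e9".encode("utf-8")  # b'\xc3\xa9'

# Overlong two-byte form of "/" (U+002F), which must be the single byte 0x2F.
overlong_slash = b"\xc0\xaf"

for label, data in (("valid", valid), ("overlong", overlong_slash)):
    print(label, data.hex(), high_bit_set(data), decodes_strictly(data))
```

Both sequences consist entirely of bytes with the high bit set, so that property cannot be what separates them. Only the second is rejected, and it is rejected with a hard error instead of being quietly turned into "/".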
