Bad understanding of UTF-8
Posted Mar 30, 2009 16:08 UTC (Mon) by spitzak (guest, #4593)
In reply to: Bad understanding of UTF-8 by njs
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
The first two references are about programs failing to reject overlong encodings as invalid. But those invalid sequences start with a byte that has the high bit set. The following bytes may not have it set, but a decoder that treats them as part of the sequence begun by the first byte is making an error of its own. A fixed decoder would report a one-byte error for the byte with the high bit set, followed by normal ASCII characters, which are unchanged and thus cannot cause a security hole.
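To make the "one-byte error" behavior concrete, here is a rough Python sketch of such a decoder. The function name decode_units and the (chunk, ok) output shape are made up for illustration; this is not code from any of the programs in those references, just one way the rule could be written down.

```python
def decode_units(data: bytes):
    """Yield (chunk, ok) pairs. A bad byte is always a ONE-byte error,
    so it can never swallow the ASCII bytes that follow it."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII always passes through unchanged.
            yield data[i:i + 1], True
            i += 1
            continue
        # Lead byte: number of continuation bytes, payload bits,
        # and the smallest code point this length may encode.
        if 0xC2 <= b <= 0xDF:
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            # Stray continuation byte, or a lead byte that can only
            # start an overlong or out-of-range sequence (0xC0, 0xC1, 0xF5+).
            yield data[i:i + 1], False
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        if len(tail) == need and all(t & 0xC0 == 0x80 for t in tail):
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            is_surrogate = 0xD800 <= cp <= 0xDFFF
            if lowest <= cp <= 0x10FFFF and not is_surrogate:
                yield data[i:i + 1 + need], True
                i += 1 + need
                continue
        # Overlong, truncated, or otherwise invalid: only the lead byte
        # is the error; the following bytes are re-examined on their own.
        yield data[i:i + 1], False
        i += 1


print(list(decode_units(b"\xc0/")))
# [(b'\xc0', False), (b'/', True)]
```

With the overlong form b"\xe0\x80\xaf", every byte has the high bit set, so this sketch reports three one-byte errors and never produces a "/" at all.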
The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that the replacement maps more than one distinct byte string to the same result. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except the code that draws it on the screen.
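A small Python illustration of the collision, assuming two made-up filenames that differ only in one invalid byte. The helper names lossy_key and for_display are hypothetical; the only library behavior relied on is bytes.decode with errors="replace", which substitutes U+FFFD for each invalid byte.

```python
def lossy_key(name: bytes) -> str:
    # The approach being criticized: translate first, compare afterwards.
    return name.decode("utf-8", errors="replace")


def for_display(name: bytes) -> str:
    # Decoding only at the point of drawing; the bytes stay the identity.
    return name.decode("utf-8", errors="replace")


a = b"report\xff.txt"
b = b"report\xfe.txt"

print(a == b)                        # False
print(lossy_key(a) == lossy_key(b))  # True
print(len({a, b}), len({lossy_key(a), lossy_key(b)}))  # 2 1
```

Compared as bytes, the two names stay distinct. Compared after replacement, they are the same string, which is exactly the many-to-one mapping described above. The decode in for_display is harmless because nothing uses its result to identify the file.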
