Bad understanding of UTF-8
Posted Mar 31, 2009 4:49 UTC (Tue) by njs (subscriber, #40338)
In reply to: Bad understanding of UTF-8 by spitzak
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".
> The first two references are about programs failing to recognize overlong encodings as being invalid.
Right -- if invalid encodings are interpreted differently in different parts of a system, then that creates bugs and security holes.
> But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).
I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the decoder accepts over-long sequences (which, like every sequence in the bugs mentioned here, consist only of bytes with the high bit set -- do you know how UTF-8 works?). The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.
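To make the missing check concrete, here is a minimal sketch in Python. It is not the code from any of the linked advisories; the function name `decode_utf8` and the `check_overlong` flag are made up for illustration, and the sketch deliberately ignores other validity rules (surrogates, code points above U+10FFFF) that a complete decoder also has to enforce.

```python
# Smallest code point that legitimately needs a sequence of each length.
# Anything below this minimum is an over-long encoding.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8(data: bytes, check_overlong: bool = True) -> str:
    """Illustrative decoder; check_overlong=False reproduces the bug."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Work out the sequence length and payload bits from the lead byte.
        if lead < 0x80:
            cp, n = lead, 1
        elif lead & 0xE0 == 0xC0:
            cp, n = lead & 0x1F, 2
        elif lead & 0xF0 == 0xE0:
            cp, n = lead & 0x0F, 3
        elif lead & 0xF8 == 0xF0:
            cp, n = lead & 0x07, 4
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + n > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Every continuation byte must look like 10xxxxxx.
        for c in data[i + 1:i + n]:
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i}")
            cp = (cp << 6) | (c & 0x3F)
        # The explicit check that the buggy decoders left out.
        if check_overlong and cp < MIN_CODE_POINT[n]:
            raise ValueError(f"over-long encoding at offset {i}")
        out.append(chr(cp))
        i += n
    return "".join(out)


# 0xC0 0xAF is an over-long two-byte encoding of "/" (U+002F).
overlong_slash = b"\xc0\xaf"
print(repr(decode_utf8(overlong_slash, check_overlong=False)))
try:
    decode_utf8(overlong_slash)
except ValueError as exc:
    print("rejected:", exc)
```

With the check disabled, the two bytes 0xC0 0xAF (both with the high bit set) decode to a plain "/", so a filter that scanned the raw bytes for "/" would never have seen it. With the check enabled the input is rejected outright, which is also what Python's built-in `bytes.decode("utf-8")` does for this input.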
> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.
Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.
