Bad understanding of UTF-8
Posted Mar 27, 2009 22:56 UTC (Fri) by spitzak (guest, #4593)
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.
The problem is the stupid Python guys who believe in a magic fairy land where all UTF-8 is valid. This is also causing havoc with using Python3 for URLs and HTML. No, I'm sorry, if a file contains UTF-8, it is going to have invalid sequences. They need to get their heads out of their *** and do something so that invalid UTF-8 is preserved ALL THE TIME and never throws an exception, unless you specifically call "throw_exception_if_not_valid_utf8()".
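A minimal Python sketch of that policy, assuming names are carried around as raw bytes and only checked when the caller asks. throw_exception_if_not_valid_utf8() is the hypothetical name from the paragraph above and display_name() is made up for illustration; neither is a real library call.

```python
def throw_exception_if_not_valid_utf8(data: bytes) -> None:
    """Validate only on request (hypothetical helper, named after the comment)."""
    # A strict decode raises UnicodeDecodeError on the first invalid sequence.
    data.decode("utf-8", errors="strict")


def display_name(data: bytes) -> str:
    """Best-effort text for showing to a user; never raises (hypothetical helper)."""
    # Each invalid byte becomes U+FFFD. This is lossy, so it is for display only;
    # the original bytes remain the real name.
    return data.decode("utf-8", errors="replace")


# A Latin-1 encoded name: 0xE9 is not a valid UTF-8 sequence here.
name = b"caf\xe9.txt"

# The bytes pass through the program untouched; only display is approximate.
shown = display_name(name)
```

Nothing in this throws unless throw_exception_if_not_valid_utf8() is called explicitly, and the bytes in name are never modified.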
Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems, we have to live with them and interoperate. The most popular solution is to translate each invalid byte in UTF-8 into the 16-bit code 0xDCxx, where xx is the byte value. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution. This substitution does not really fix things because it does not give a clean round-trip conversion. Supporting round-trip means your system cannot name invalid UTF-16 file names, and if you think those don't exist you are really living in a fantasy world!
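For what the 0xDCxx mapping looks like in practice, here is a short sketch using Python's "surrogateescape" error handler (PEP 383), which is one implementation of the same idea: each undecodable byte 0xNN becomes the lone surrogate U+DCNN. The helper names bytes_to_text() and text_to_bytes() are made up for illustration.

```python
def bytes_to_text(raw: bytes) -> str:
    # Invalid bytes (always in 0x80-0xFF) map to lone surrogates U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    # The same handler turns U+DC80..U+DCFF back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"           # 0xE9 is invalid as UTF-8 in this position
text = bytes_to_text(raw)      # "caf\udce9.txt"
restored = text_to_bytes(text)

# Code points produced by the escape; all fall in the 0xDC80-0xDCFF range.
escaped = [ord(ch) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
```

The escaped string is not valid Unicode text: a strict text.encode("utf-8") raises UnicodeEncodeError because of the lone surrogate, which is the kind of interoperability cost the stopgap carries.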
I think therefore the escape character can easily be the UTF-8 encoding of 0xDCxx. This will not conflict with the above because none of the escaped characters have the high bit set, while the invalid bytes mapped above always do. This will survive a translation to UTF-16 and thus provides a way to put the exact same filenames on Windows UTF-16 filesystems.
His proposed rules for disallowed bytes seem pretty reasonable, though I would not disallow any printing characters in the interior of the filename; backslash escaping works pretty well in there.
