Bad understanding of UTF-8
Bad understanding of UTF-8
Posted Mar 31, 2009 17:59 UTC (Tue) by spitzak (guest, #4593)In reply to: Bad understanding of UTF-8 by njs
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!
I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.
I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.
