The most obvious catastrophe is the inability to access data that is not valid UTF-8, even to fix it.
You can't fix incorrect UTF-8 if your editor refuses to load the file. For a more concrete example,
you cannot correct an incorrect UTF-8 filename if your filesystem API refuses to provide a way to
identify that file in the rename call.
If you have not seen a junior programmer "fix" these by treating the UTF-8 as ISO-8859-1
(sometimes done by "double encoding UTF-8" but the result is the same) then I don't think you
have worked very much with teams of programmers. This is destroying I18N on Linux and in many
internet standards. On Windows it is destroying UTF-16 but it is less of a problem as only non-BMP
characters are lost.
I think changing to iterators is the first step to correctly handling canonical forms and all the other
Unicode problems. This insistence on changing it to a fixed size array and ignoring patterns is
actually a deterrent to correct comparisons.
Posted Oct 31, 2009 12:37 UTC (Sat) by nix (subscriber, #2304)
How often do you *see* allegedly-UTF-8 data that isn't valid UTF-8? In my
experience it's vanishingly rare, much less common than encountering
Unicode mapping to unmapped codepoints. What's more, both are dealt with
the same way: the latter is shown using a square box glyph (losing
information about precisely which character it is, but you rarely care);
the former is dealt with by transforming it into a convenient valid
character, often a form of ? or the replacement character, or a graphical
box containing the invalid bytes (you sometimes lose information about
precisely what the invalid string was, but you rarely care). (Noncanonical
UTF-8 is generally quietly canonicalized.)
You are making a mountain out of a very, very small and already-levelled
molehill: Python's behaviour is known bad and will almost certainly be
fixed, that's why there was such a lot of noise over it. To claim that
it's 'destroying' UTF-8 is utterly laughable.
As for filenames in UTF-8, well, that's why POSIX considers filenames to
be a byte string. So should interfaces to POSIX. This is unlikely to
affect anything but a language that nobody much uses yet and an OS
(Windows/NTFS) that has taken considerable pain over this (especially
combined with case-insensitivity) and which thankfully is not an OS this
site is about.
"Unicode"
Posted Nov 2, 2009 18:37 UTC (Mon) by spitzak (guest, #4593)
I agree it is very rare, but it just takes ONE failure to make a programmer say "forget that, I'll treat it as ISO-8859-1 because I don't give a s**t about Chinese..."
You would like errors to turn into boxes, but the majority of software does not do that. Instead it throws exceptions, which in most cases is equivalent to a Denial of Service if there is no other way to convey the string to the back end. Particularly nasty for me are Python's conversions to "Unicode", Qt strings, Qt's HTML renderer, and the XRender "draw this UTF-8 string" API. I am sure there are many, many other examples.
In my ideal solution, conversion is deferred until as late as possible, probably as part of the glyph layout code (e.g. Pango). At this point it is harmless to make a lossy conversion (since layout is lossy anyway, as it canonicalizes), and I would convert the error bytes to the matching characters in the Microsoft CP1252 character set. This has the advantage that accidental non-UTF-8 is readable by the users. Believe me, they really don't want to see boxes!
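A minimal Python sketch of that display-time fallback, for illustration only. The handler name cp1252_fallback and the function display_text are made up here, and falling back to Latin-1 for the five bytes CP1252 leaves undefined is an assumption of this sketch, not something stated above.

```python
import codecs


def cp1252_fallback(err):
    """Decode-error handler: show each invalid byte as its CP1252 character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
    # for those bytes this sketch falls back to the Latin-1 code point.
    text = "".join(
        bytes([b]).decode("cp1252", errors="ignore") or chr(b) for b in bad
    )
    # Resume decoding right after the bytes that were just handled.
    return text, err.end


codecs.register_error("cp1252_fallback", cp1252_fallback)


def display_text(raw: bytes) -> str:
    """Lossy conversion meant only for the last step before layout."""
    return raw.decode("utf-8", errors="cp1252_fallback")
```

Because the result is lossy, it should be used only for display; the original bytes are what gets stored or passed back to the filesystem.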
The Python "solution" of turning errors into 0xDCxx sort of works, but has the nasty problem that you must track the original source of a string to properly convert back to UTF-8 or UTF-16. If you don't, you either make it impossible to produce all possible UTF-16 strings (very bad because you will be unable to name all files on Windows), or you make it possible for a malicious invalid UTF-16 string to turn into a valid UTF-8 string. For this reason I don't think this solution is going to work, and I think keeping the strings as UTF-8 (and converting UTF-16 to UTF-8, which is lossless) is the only way to go.
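For reference, a short sketch of the behaviour being criticized, using Python 3's "surrogateescape" error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The helper names are hypothetical. The last function shows the ambiguity: a string that merely contains lone surrogates (as an invalid UTF-16 string could) encodes to bytes that are perfectly valid UTF-8.

```python
def bytes_to_str(raw: bytes) -> str:
    # Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def str_to_bytes(text: str) -> bytes:
    # Lone surrogates U+DC80..U+DCFF are turned back into single bytes.
    return text.encode("utf-8", errors="surrogateescape")


def smuggled_text() -> str:
    # Two lone surrogates that did NOT come from escaped bytes...
    suspicious = "\udcc3\udca9"
    # ...still encode to the bytes C3 A9, which decode strictly as U+00E9.
    return str_to_bytes(suspicious).decode("utf-8")
```

The round trip works only if every string with such surrogates is known to have come from escaped bytes, which is the "track the original source" problem described above.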
"POSIX considers filenames to be byte strings": this sort of statement is the problem. Of course it is byte strings. What you are really saying is "I will pretend the problem does not exist by declaring anything that might contain errors to be "not UTF-8"". The problem is that at some point somebody wants to look at what the byte string means to the user, they will indeed have to say "oh yes this *is* UTF-8". Or worse, they might say "oh this is ISO-8859-1 because then I know my program won't throw a damn exception". Statements like this are exactly the problem I am hoping can be fixed.
In reality, of course filenames are "byte strings", but this is because UTF-8 is a byte string. ALL BYTE STRINGS ARE UTF-8. They are also ASCII and ISO-8859-1 or JP encoded or random binary garbage! They can have invalid UTF-8 sequences in them. They can also have misspelled words, control characters, or they can have French words in them while the program thinks they are English. They can have invalid Unicode glyph sequences such as misplaced combining accents. They can spell out a false math proof, or a political opinion that you disagree with. There are billions of errors that can be in the string. Deal with it correctly, instead of declaring that some tiny ill-defined subset of possible errors makes the string "not UTF-8".
"Unicode"
Posted Nov 3, 2009 6:47 UTC (Tue) by Cato (subscriber, #7643)
I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
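To illustrate the kind of configurability meant here, a small sketch in Python rather than Perl (a substitution made for this example; Perl's Encode has its own interface, not shown). It uses only the standard error policies of bytes.decode; the helper name decode_with is hypothetical.

```python
def decode_with(raw: bytes, policy: str) -> str:
    """Decode as UTF-8, letting the caller choose what happens on errors."""
    return raw.decode("utf-8", errors=policy)


raw = b"na\xefve"  # Latin-1 bytes for "naive" with a diaeresis; invalid UTF-8

substituted = decode_with(raw, "replace")         # U+FFFD in place of the bad byte
escaped = decode_with(raw, "backslashreplace")    # bad byte shown as \xef (Python 3.5+)
dropped = decode_with(raw, "ignore")              # bad byte silently removed

try:
    decode_with(raw, "strict")                    # stop the conversion entirely
    stopped = False
except UnicodeDecodeError:
    stopped = True
```

Which policy is right depends on the context: "strict" suits validation at a trust boundary, while the substituting policies suit display.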
Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ expect valid UTF-8 (HFS+ uses UTF-16 internally, and it must also be decomposed Unicode i.e. NFD).
"Unicode"
Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593)
> filenames are not always byte strings, unfortunately - every filesystem has various illegal characters
That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.
The only non-byte-stream filename API that is used at all is UTF-16. However, UTF-16 (including invalid UTF-16) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore all filesystems in existence can be controlled by a byte stream API, using UTF-8 as the encoding.
It is true that there are UTF-8 streams that cannot be turned into UTF-16; these would be "illegal characters" for the filenames. If the filesystem does not have a byte API then this can be replicated by turning all errors into "illegal characters" in UTF-16 so that an equivalent error is thrown.
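One way to sketch that round trip in Python is the "surrogatepass" error handler, which encodes an unpaired surrogate as the three-byte sequence a normal code point of that value would get. Whether this is exactly the translation intended above is an assumption, and the function names are made up for the example.

```python
def utf16le_to_utf8(data: bytes) -> bytes:
    """UTF-16-LE (possibly with unpaired surrogates) to UTF-8-style bytes."""
    text = data.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def utf8_to_utf16le(data: bytes) -> bytes:
    """Inverse direction, accepting the three-byte surrogate sequences."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# "A", an unpaired high surrogate 0xD800, then "B": invalid UTF-16.
bad_name = b"A\x00" + b"\x00\xd8" + b"B\x00"
as_bytes = utf16le_to_utf8(bad_name)
restored = utf8_to_utf16le(as_bytes)
```

Here restored equals bad_name, so the invalid UTF-16 name survives the trip through the byte-string form.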
"Unicode"
Posted Nov 2, 2009 9:40 UTC (Mon) by njs (subscriber, #40338)
> This is destroying I18N on Linux and in many internet standards.
I think if you want us to take such catastrophic declarations seriously you should perhaps name some examples of specific free software or internet standards that have had their I18N "destroyed" (or even negatively affected).