"Unicode"

Posted Nov 2, 2009 18:37 UTC (Mon) by spitzak (guest, #4593)
In reply to: "Unicode" by nix
Parent article: Proposal: Moratorium on Python language changes

I agree it is very rare, but it just takes ONE failure to make a programmer say "forget that, I'll treat it as ISO-8859-1 because I don't give a s**t about Chinese..."

You would like errors to turn into boxes, but most software does not do that. Instead it throws exceptions, which in most cases is equivalent to a denial of service if there is in fact no other way to convey the string to the back end. Particularly nasty for me are Python's conversions to "Unicode", Qt strings, Qt's HTML renderer, and the XRender "draw this UTF-8 string" API. I am sure there are many, many other examples.
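A minimal sketch of that failure mode, assuming Python 3 and a made-up filename (the helper names `to_text_strict` and `to_text_lenient` are hypothetical, not from any library):

```python
def to_text_strict(raw: bytes) -> str:
    """Decode with the default 'strict' policy; raises on any invalid byte."""
    return raw.decode("utf-8")


def to_text_lenient(raw: bytes) -> str:
    """Decode with 'replace'; each invalid byte becomes U+FFFD (a 'box')."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
name = b"caf\xe9.txt"

try:
    to_text_strict(name)
    strict_failed = False
except UnicodeDecodeError:
    # The caller gets no string at all, so the file cannot be handled.
    strict_failed = True

lenient = to_text_lenient(name)
```

With the strict default the name never reaches the rest of the program; with `errors="replace"` it does, at the cost of losing the original byte.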

In my ideal solution, conversion is deferred until as late as possible, probably as part of the glyph layout code (i.e. Pango, etc.). At this point it is harmless to make a lossy conversion (since layout is lossy anyway, doing canonicalization), and I would convert the error bytes to the matching characters in the Microsoft CP1252 character set. This has the advantage that accidental non-UTF-8 is readable by users. Believe me, they really don't want to see boxes!
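One way this display-time fallback could look, as a sketch only: it assumes Python 3 and uses `codecs.register_error` to install a custom handler. The handler name `cp1252-fallback` and the function `display_text` are invented for illustration. Because CP1252 leaves five byte values unassigned, the sketch falls back to Latin-1 for those, which is an assumption rather than something the comment specifies.

```python
import codecs


def cp1252_fallback(err):
    """Error handler: render each undecodable byte as its CP1252 character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    chars = []
    for b in err.object[err.start:err.end]:
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in CP1252;
            # Latin-1 is used here (an assumption) so the handler never fails.
            chars.append(one.decode("latin-1"))
    # Resume decoding right after the bytes that were handled.
    return "".join(chars), err.end


codecs.register_error("cp1252-fallback", cp1252_fallback)


def display_text(raw: bytes) -> str:
    """Lossy conversion meant only for display, never for round-tripping."""
    return raw.decode("utf-8", errors="cp1252-fallback")
```

Valid UTF-8 passes through unchanged, while `display_text(b"caf\xe9")` yields a readable "café" rather than a box. The result cannot be converted back to the original bytes, which is why it belongs at the layout stage.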

The Python "solution" of turning errors into 0xDCxx sort of works, but has the nasty problem that you must track the original source of a string to properly convert back to UTF-8 or UTF-16. If you don't, you either make it impossible to produce all possible UTF-16 strings (very bad because you will be unable to name all files on Windows), or you make it possible for a malicious invalid UTF-16 string to turn into a valid UTF-8 string. For this reason I don't think this solution is going to work, and I think that keeping the strings as UTF-8 (and converting UTF-16 TO UTF-8, which is lossless) is the only way to go.
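The second hazard can be shown in a few lines. This sketch assumes the mechanism referred to is the `surrogateescape` error handler from PEP 383 in Python 3, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The sample strings are made up.

```python
# Invalid bytes survive a round trip as lone surrogates U+DC80..U+DCFF.
raw = b"caf\xe9"
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# The ambiguity: this string did not come from bytes at all. Imagine it
# arrived from a UTF-16 source that contained two unpaired low surrogates.
from_utf16 = "\udcc3\udca9"

# Encoding it with the same handler turns the surrogates into raw bytes,
# and those bytes happen to form a perfectly valid UTF-8 sequence.
smuggled = from_utf16.encode("utf-8", errors="surrogateescape")
decoded_again = smuggled.decode("utf-8")
```

Here `text` is `'caf\udce9'` and round-trips correctly, but `smuggled` is `b'\xc3\xa9'`, which decodes to "é": an invalid UTF-16 string has become valid UTF-8 because the handler cannot know where the surrogates came from.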

"POSIX considers filenames to be byte strings": this sort of statement is the problem. Of course they are byte strings. What you are really saying is "I will pretend the problem does not exist by declaring anything that might contain errors to be 'not UTF-8'". The problem is that at some point somebody will want to look at what the byte string means to the user, and they will indeed have to say "oh yes, this *is* UTF-8". Or worse, they might say "oh, this is ISO-8859-1, because then I know my program won't throw a damn exception". Statements like this are exactly the problem I am hoping can be fixed.

In reality, of course filenames are "byte strings", but this is because UTF-8 is a byte string. ALL BYTE STRINGS ARE UTF-8. They are also ASCII and ISO-8859-1 or JP encoded or random binary garbage! They can have invalid UTF-8 sequences in them. They can also have misspelled words, control characters, or they can have French words in them while the program thinks they are English. They can have invalid Unicode glyph sequences such as misplaced combining accents. They can spell out a false math proof, or a political opinion that you disagree with. There are billions of errors that can be in the string. Deal with it correctly, instead of declaring that some tiny ill-defined subset of possible errors makes the string "not UTF-8".

"Unicode"

Posted Nov 3, 2009 6:47 UTC (Tue) by Cato (subscriber, #7643)

I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
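The comment names Perl's Encode module. As a rough analogue rather than Perl's API, the sketch below uses the error handlers built into Python 3 (`strict`, `replace`, `backslashreplace`, `ignore`, `xmlcharrefreplace`) to show how the caller can pick a policy per context. The sample data and the helper `decode_with` are made up.

```python
raw = b"caf\xe9"  # sample input that is not valid UTF-8


def decode_with(policy: str) -> str:
    """Decode the sample bytes as UTF-8 under the given error policy."""
    return raw.decode("utf-8", errors=policy)


# Policy 1: stop the conversion.
try:
    decode_with("strict")
    stopped = False
except UnicodeDecodeError:
    stopped = True

# Policy 2: substitute something visible for the bad byte.
replaced = decode_with("replace")          # U+FFFD replacement character
escaped = decode_with("backslashreplace")  # literal backslash escape
ignored = decode_with("ignore")            # silently drop the byte

# Policy 3 (encoding direction): emit XML character references for
# characters the target encoding cannot represent.
as_entities = "caf\u00e9 \u20ac".encode("ascii", errors="xmlcharrefreplace")
```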

Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ expect valid UTF-8 (HFS+ uses UTF-16 internally, and it must also be decomposed Unicode i.e. NFD).

"Unicode"

Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593)

> filenames are not always byte strings, unfortunately - every filesystem has various illegal characters

That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.

The only non-byte-stream filename API that is used at all is UTF-16. However, UTF-16 (including invalid UTF-16) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore all filesystems in existence can be controlled by a byte stream API, using UTF-8 as the encoding.

It is true that there are UTF-8 streams that cannot be turned into UTF-16; these would be "illegal characters" for the filenames. If the filesystem does not have a byte API, then this can be replicated by turning all errors into "illegal characters" in UTF-16 so that an equivalent error is thrown.
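Both halves of this argument can be sketched in Python 3.4 or later with the `surrogatepass` error handler, which encodes unpaired surrogates using the ordinary three-byte UTF-8 pattern. This is only an illustration under that assumption: the function names are hypothetical, and strict UTF-8 decoders will reject the bytes it produces for lone surrogates.

```python
def utf16_to_bytes(units: bytes) -> bytes:
    """UTF-16-LE code units (possibly invalid) -> UTF-8-style byte string."""
    text = units.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def bytes_to_utf16(data: bytes) -> bytes:
    """UTF-8-style byte string -> UTF-16-LE code units; raises on bad bytes."""
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# Invalid UTF-16: 'a', an unpaired high surrogate U+D800, then 'b'.
bad_name = b"a\x00\x00\xd8b\x00"

as_bytes = utf16_to_bytes(bad_name)
back = bytes_to_utf16(as_bytes)

# The other direction is not total: a stray 0xFF byte has no UTF-16
# counterpart, so the conversion raises, much like an "illegal character".
try:
    bytes_to_utf16(b"\xff")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The invalid name survives the trip through bytes unchanged, so a byte-stream API can address it, while byte strings with no UTF-16 equivalent fail with an error at conversion time.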

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds