"Unicode"

Posted Nov 3, 2009 6:47 UTC (Tue) by Cato (subscriber, #7643)
In reply to: "Unicode" by spitzak
Parent article: Proposal: Moratorium on Python language changes

I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
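As a rough illustration of that kind of configurability, here is a minimal sketch using Python's built-in codec error handlers rather than Perl's Encode. The helper name `convert` and the sample string are made up for the example; the handler names (`strict`, `replace`, `xmlcharrefreplace`, `backslashreplace`) are standard Python.

```python
# Sketch: the same un-encodable text converted under different error policies.
text = "na\u00efve \u2603"  # "naive" with a diaeresis, then a snowman (not in Latin-1)

def convert(s, policy):
    """Encode s to Latin-1 using the named codec error handler."""
    return s.encode("latin-1", errors=policy)

as_entities = convert(text, "xmlcharrefreplace")  # XML numeric character references
as_escapes = convert(text, "backslashreplace")    # backslash escapes
as_marks = convert(text, "replace")               # substitute character '?'

try:
    convert(text, "strict")                       # stop the conversion instead
    stopped = False
except UnicodeEncodeError:
    stopped = True
```

Which policy is right depends on the context: `strict` suits data that must not be silently altered, while the substituting handlers suit display or logging.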

Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ store names as Unicode text rather than arbitrary bytes (both use UTF-16 internally, and HFS+ additionally requires decomposed Unicode, i.e. a variant of NFD).
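To show what decomposition means in practice, here is a small sketch using Python's standard `unicodedata` module. It uses plain NFD; HFS+ applies its own variant of the decomposition rules, so treat this as an approximation of what that filesystem does, not a reproduction of it.

```python
import unicodedata

composed = "caf\u00e9"  # e-acute as a single code point (composed form)
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

# The two strings display identically but differ at the byte level.
utf8_composed = composed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")

# Recomposing with NFC recovers the original string.
same_text = unicodedata.normalize("NFC", decomposed) == composed
```

A name written in composed form can therefore come back from a directory listing with different bytes, which is one reason byte-for-byte comparison of filenames is unreliable on such filesystems.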


"Unicode"

Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593)

> filenames are not always byte strings, unfortunately - every filesystem has various illegal characters

That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.

The only non-byte-stream filename API in real use is UTF-16. However, UTF-16 (including invalid UTF-16) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore all filesystems in existence can be controlled by a byte-stream API, using UTF-8 as the encoding.
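A minimal sketch of that round trip, assuming Python 3.4 or later, where the `surrogatepass` error handler works with both the UTF-8 and UTF-16 codecs. The function names are hypothetical. Note that the bytes produced for an unpaired surrogate follow the UTF-8 bit layout but would be rejected by a strict UTF-8 decoder.

```python
# UTF-16-LE code units: 'a', an unpaired high surrogate U+D800, then 'b'.
# This is not valid UTF-16, but it is still a sequence of 16-bit units.
utf16_name = b"a\x00\x00\xd8b\x00"

def utf16_to_bytes(raw):
    """UTF-16-LE code units -> UTF-8-style bytes, keeping lone surrogates."""
    return raw.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")

def bytes_to_utf16(raw):
    """Inverse of utf16_to_bytes."""
    return raw.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")

byte_name = utf16_to_bytes(utf16_name)
round_tripped = bytes_to_utf16(byte_name)  # identical to utf16_name
```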

It is true that there are UTF-8 streams that cannot be turned into UTF-16; these would be the "illegal characters" for such filenames. If the filesystem does not have a byte API, this can be replicated by turning all conversion errors into "illegal characters" in UTF-16, so that an equivalent error is thrown.
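One way this could look, as a sketch only: a wrapper that sits between a byte-string API and a UTF-16-only filesystem call. `IllegalFilename` and `byte_name_to_utf16` are hypothetical names, and the sketch raises the error directly rather than substituting an illegal UTF-16 character and letting the filesystem reject it.

```python
class IllegalFilename(ValueError):
    """Hypothetical error type standing in for the platform's own error."""

def byte_name_to_utf16(name):
    """Translate a byte-string name for a UTF-16-only filesystem API."""
    try:
        text = name.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError as exc:
        # No UTF-16 equivalent exists: treat it like any other illegal name.
        raise IllegalFilename(name) from exc
    return text.encode("utf-16-le", "surrogatepass")

ok = byte_name_to_utf16(b"caf\xc3\xa9")  # well-formed UTF-8 converts cleanly

try:
    byte_name_to_utf16(b"bad\xff")       # the byte 0xFF never occurs in UTF-8
    rejected = False
except IllegalFilename:
    rejected = True
```

The caller sees the same kind of failure it would get for any other forbidden character, so no separate error path is needed for encoding problems.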

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.