I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
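As a rough illustration of that kind of configurability (using Python's codec error handlers rather than Perl's Encode, purely as a stand-in; the sample string is made up), the same conversion can be told to stop or to substitute:

```python
# Sketch only: Python's built-in error handlers stand in for the
# configurable conversion described above. The sample text is hypothetical.
text = "na\u00efve \u2603"  # contains two characters outside ASCII

# "strict" (the default) stops the conversion with an exception.
try:
    text.encode("ascii")
    stopped = False
except UnicodeEncodeError:
    stopped = True

# "xmlcharrefreplace" substitutes XML numeric character references.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# "backslashreplace" substitutes backslash escapes.
as_escapes = text.encode("ascii", "backslashreplace")

print(stopped, as_xml, as_escapes)
```

Which behaviour is right depends on where the converted text is going, which is the point: the caller has to be able to choose.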
Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ expect Unicode names rather than arbitrary bytes (both store names as UTF-16 internally, and HFS+ additionally requires decomposed Unicode, close to NFD).
Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593)
> filenames are not always byte strings, unfortunately - every filesystem has various illegal characters
That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.
The only non-byte-stream filename API in real use is UTF-16. However, UTF-16 (including invalid UTF-16, if unpaired surrogates are encoded as their own three-byte sequences) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore every filesystem in existence can be controlled by a byte-stream API, using UTF-8 as the encoding.
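A minimal sketch of that round trip, assuming Python's "surrogatepass" error handler as the way to carry unpaired surrogates through UTF-8 (the function names and sample filenames are made up for illustration, and input is assumed to be a whole number of 16-bit units):

```python
# Sketch only: round-trip UTF-16 code units through a byte-string name.
# "surrogatepass" lets unpaired surrogates survive as 3-byte sequences.

def utf16_to_bytes(units: bytes) -> bytes:
    """UTF-16-LE code units (possibly invalid) -> byte-string filename."""
    return units.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")

def bytes_to_utf16(name: bytes) -> bytes:
    """Byte-string filename -> UTF-16-LE code units."""
    return name.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")

# A valid name, and an invalid one with a lone high surrogate (U+D800)
# sitting between "a" and "b".
valid = "snow\u2603.txt".encode("utf-16-le")
broken = b"a\x00\x00\xd8b\x00"

for original in (valid, broken):
    assert bytes_to_utf16(utf16_to_bytes(original)) == original
```

The byte form of the invalid name is not strictly valid UTF-8, but it is an ordinary byte string, which is all a byte-stream API needs.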
It is true that there are UTF-8 streams that cannot be turned into UTF-16; these would be the "illegal characters" for such filenames. If the filesystem does not have a byte API, this can be replicated by turning every conversion error into an "illegal character" in UTF-16, so that an equivalent error is thrown.
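One way this could look, as a sketch: a hypothetical wrapper (the name `to_utf16_filename` and the choice of EINVAL are assumptions, not any real library's behaviour) that rejects unconvertible bytes the same way a filesystem rejects an illegal character:

```python
import errno

def to_utf16_filename(name: bytes) -> bytes:
    """Hypothetical helper: byte-string filename -> UTF-16-LE code units.

    Bytes that cannot be converted are reported like any other illegal
    filename character, here as an OSError with EINVAL (an assumption).
    """
    try:
        text = name.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError as exc:
        raise OSError(errno.EINVAL, "illegal byte sequence in filename") from exc
    return text.encode("utf-16-le", "surrogatepass")

# A convertible name passes through; a stray 0xFF byte is rejected.
print(to_utf16_filename(b"a.txt"))
try:
    to_utf16_filename(b"bad\xffname")
except OSError as exc:
    print("rejected:", exc.errno == errno.EINVAL)
```

From the caller's side this is indistinguishable from trying to create a file with any other forbidden character in its name.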