Those arguments could equally be made against UTF-8, where there are different byte sequences that some UTF-8 parsers will consider equal while others will consider to be invalid (e.g. encoding a '\u0000' as '\xC0\x80'). The solution to this problem is to require that inputs be in a canonical form.
Of course, once you start working with Unicode it isn't really enough to just require unique representations for each code point. You can have multiple sequences of Unicode code points that have the same meaning. So you really want a normalised code point sequence encoded in a canonical form.
Posted Nov 29, 2010 18:18 UTC (Mon) by iabervon (subscriber, #722)
[Link]
UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; while some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.
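To make the contrast concrete, here is a small sketch using Python's built-in codecs. The helper name decodes_as is made up for illustration, and the behaviour shown is that of CPython's decoders, not necessarily of any other parser.

```python
def decodes_as(data: bytes, encoding: str):
    """Return the decoded text, or None if the decoder rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# UTF-8: the overlong two-byte form of U+0000 is rejected by a strict decoder,
# so only the one-byte form is valid.
nul_canonical = decodes_as(b"\x00", "utf-8")
nul_overlong = decodes_as(b"\xc0\x80", "utf-8")

# UTF-7: "A" may be written directly or inside a base64 shift sequence,
# and both spellings are accepted.
a_direct = decodes_as(b"A", "utf-7")
a_shifted = decodes_as(b"+AEE-", "utf-7")

print(repr(nul_canonical), repr(nul_overlong))
print(repr(a_direct), repr(a_shifted))
```

So with UTF-8 a strict decoder is enough to get a unique byte sequence, while with UTF-7 two different byte strings are both valid spellings of the same text.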
The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.
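As a rough illustration of why the choice of normalization has to be agreed on, the sketch below uses Python's unicodedata module. The function canonical_name is a hypothetical helper, and picking NFC as the default is only an assumption for the example, not a recommendation from this thread.

```python
import unicodedata

def canonical_name(name: str, form: str = "NFC") -> bytes:
    """Normalize the code point sequence, then encode it as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Under one agreed form, both spellings give the same bytes.
same_under_nfc = canonical_name(composed) == canonical_name(decomposed)

# Under different forms, the same name gives different bytes.
nfc_bytes = canonical_name(composed, "NFC")
nfd_bytes = canonical_name(composed, "NFD")

print(same_under_nfc, nfc_bytes, nfd_bytes)
```

Two systems that each normalize, but to different forms, will still disagree about the bytes of the name, which is the renaming problem described above.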
Control characters in file names
Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159)
[Link]
I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).
My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...
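One way to sketch "pick a canonical representation and require it" is a round-trip check: decode the name, re-encode it with a single fixed encoder, and reject anything that does not come back byte for byte. The function is_canonical_utf7 is hypothetical, and treating the output of Python's UTF-7 encoder as the canonical form is purely an assumption for the sake of the example.

```python
def is_canonical_utf7(data: bytes) -> bool:
    """Accept only byte strings that the reference encoder would itself produce.

    Assumption: "canonical" means "what Python's utf-7 encoder emits".
    """
    try:
        text = data.decode("utf-7")
    except UnicodeDecodeError:
        return False
    return text.encode("utf-7") == data

direct_ok = is_canonical_utf7(b"A")          # "A" written directly
shifted_ok = is_canonical_utf7(b"+AEE-")     # "A" hidden in a shift sequence
encoder_output_ok = is_canonical_utf7("\u00e9".encode("utf-7"))

print(direct_ok, shifted_ok, encoder_output_ok)
```

Any other fixed, deterministic encoder would serve equally well as the reference; the point is only that everyone has to agree on the same one.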