UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; which some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.
The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.
Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159)
[Link]
I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).
My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...