All of the encodings I could think of consider byte values less than 0x20 to be either invalid or control characters in any context. In fact, I couldn't find any that disagree with ASCII on the interpretation of any valid byte less than 0x40, and only Shift-JIS seems to disagree with ASCII at all below 0x80 (and there only as the second byte of two-byte characters, aside from a few direct character replacements). So it should be viable to consider filenames to be a sequence of bytes with only 0x2F and 0x00 having special meanings, but 0x01-0x1F prohibited entirely. (I think 0x7F could be prohibited as well.) Unfortunately, there are also other control characters, in the 0x80-0x9F range, which cannot be recognized directly from bytes. Of those, 0x9B is the interesting one, because it can start ANSI escape sequences.
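As a rough illustration of the rule proposed above, here is a minimal Python sketch. The function name `filename_problems` is made up for this example, and the check covers only a single path component treated as raw bytes. It also shows why the 0x80-0x9F range can't be judged from bytes alone: in UTF-8, 0x9B turns up as an ordinary continuation byte.

```python
def filename_problems(name: bytes) -> list[str]:
    """Report bytes that the proposed rule would treat as special or prohibited.

    Hypothetical helper: 0x00 and 0x2F are special, 0x01-0x1F are prohibited,
    and 0x7F is flagged as the optional extra mentioned above.
    """
    problems = []
    for offset, byte in enumerate(name):
        if byte == 0x00:
            problems.append(f"offset {offset}: NUL (terminates the name)")
        elif byte == 0x2F:
            problems.append(f"offset {offset}: '/' (path separator)")
        elif byte < 0x20:
            problems.append(f"offset {offset}: control byte 0x{byte:02X}")
        elif byte == 0x7F:
            problems.append(f"offset {offset}: DEL (0x7F)")
    return problems


print(filename_problems(b"notes.txt"))
print(filename_problems(b"bad\x1b[31mname"))

# U+00DB encodes in UTF-8 as C3 9B. The 0x9B byte here is harmless, but the
# same byte read as Latin-1 is the C1 control character CSI.
utf8_name = "\u00db".encode("utf-8")
print(utf8_name, filename_problems(utf8_name))
```

A byte-level check like this deliberately says nothing about 0x80-0x9F; deciding whether those are control characters needs knowledge of the encoding.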
Posted Nov 23, 2010 22:06 UTC (Tue) by Simetrical (guest, #53439)
UTF-7, UTF-16, UTF-32, and EBCDIC all treat some byte values below 0x20 differently from ASCII.
Control characters in file names
Posted Nov 23, 2010 22:29 UTC (Tue) by foom (subscriber, #14868)
...and you can't use any of those as a locale encoding on an ASCII-centric UNIX system. It is expressly prohibited by POSIX.
(If you didn't have any ASCII locales, you could use an EBCDIC locale -- your system just needs to be self-consistent for all the characters in the Portable Character Set, across locales. UTF-7/16/32 are right out, though, since all characters in the Portable Character Set need to be encoded by a single byte.)
Control characters in file names
Posted Nov 25, 2010 16:19 UTC (Thu) by Spudd86 (guest, #51683)
UTF-16 and UTF-32 are out entirely, since they would end up with NUL bytes. You could conceivably use UTF-7 to name a file and it would work; it just wouldn't show the correct name anywhere...
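A quick way to see the NUL-byte problem, sketched in Python with an arbitrary example name (`a.txt` is just an illustration):

```python
name = "a.txt"

# Record, per encoding, whether the encoded name contains a 0x00 byte.
has_nul = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = name.encode(codec)
    has_nul[codec] = b"\x00" in encoded
    print(codec, encoded, has_nul[codec])
```

Any ASCII character in UTF-16 or UTF-32 carries at least one zero byte, so a kernel interface that takes NUL-terminated strings would cut the name short.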
Control characters in file names
Posted Nov 25, 2010 21:03 UTC (Thu) by iabervon (subscriber, #722)
UTF-7 would be terrible, because the encoded form isn't even unique for a sequence of codepoints. (That is, even if you knew the character sequence for a filename and how it was decomposed and knew it was encoded as UTF-7 in the filesystem, you wouldn't know what sequence of bytes to ask the kernel for.) Also, encoders may not represent a '/' literally in between two blocks of characters outside the Latin-1 range, because it can be more efficient to use all 16 bits instead of the necessary padding to finish the encoded chunk.
In any case, it still wouldn't use bytes in the 0x00-0x1f range.
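Both UTF-7 points can be seen with Python's built-in `utf-7` codec. This is only a sketch using hand-picked byte strings; other decoders may be stricter or looser about what they accept.

```python
# Two different byte strings that decode to the same single code point "a".
spellings = [b"a", b"+AGE-"]
decoded = {raw.decode("utf-7") for raw in spellings}
print(decoded)

# A slash hidden inside a base64 block: no 0x2F byte appears in the encoded
# form, yet the decoded text is "/".
hidden_slash = b"+AC8-"
print(hidden_slash.decode("utf-7"), 0x2F in hidden_slash)

# The reverse: a 0x2F byte that is part of a base64 block and is not a slash.
# Encoding U+03FF produces a '/' from the base64 alphabet.
fake_slash = "\u03ff".encode("utf-7")
print(fake_slash, 0x2F in fake_slash)
```

So a byte-oriented kernel that splits paths on 0x2F would disagree with a UTF-7-aware program about where the directory separators are.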
Control characters in file names
Posted Nov 29, 2010 10:09 UTC (Mon) by jamesh (guest, #1159)
Those arguments could equally be made against UTF-8, where there are different byte sequences that some UTF-8 parsers will consider equal while others will consider to be invalid (e.g. encoding a '\u0000' as '\xC0\x80'). The solution to this problem is to require that inputs be in a canonical form.
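For illustration, a strict decoder rejects the overlong forms. The sketch below uses Python 3's built-in UTF-8 codec; `strict_decode` is a name invented for this example.

```python
from typing import Optional


def strict_decode(raw: bytes) -> Optional[str]:
    """Decode UTF-8 strictly, returning None for invalid input."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The canonical one-byte forms decode normally.
print(repr(strict_decode(b"\x00")), repr(strict_decode(b"/")))

# Overlong two-byte spellings of U+0000 and of '/' are rejected.
print(strict_decode(b"\xc0\x80"), strict_decode(b"\xc0\xaf"))
```

A lenient decoder that mapped `\xC0\xAF` to '/' would let a slash slip past any check done on the raw bytes, which is the kind of bug at issue here.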
Of course, once you start working with Unicode it isn't really enough to just require unique representations for each code point. You can have multiple sequences of Unicode code points that have the same meaning. So you really want a normalised code point sequence encoded in a canonical form.
Control characters in file names
Posted Nov 29, 2010 18:18 UTC (Mon) by iabervon (subscriber, #722)
UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; while some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.
The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.
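To make the normalization issue concrete, here is a small Python sketch using the standard `unicodedata` module. The example character (e with an acute accent) is just a convenient choice.

```python
import unicodedata

composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two look identical when rendered, but are different code point
# sequences and therefore different bytes on disk.
print(composed == decomposed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# Normalizing both to the same form makes them comparable. Which form is
# chosen matters: NFC and NFD give different results from each other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

Two systems that each normalize, but to different forms, will still end up with different byte sequences for what a user sees as the same name.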
Control characters in file names
Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159)
I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).
My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...