Control characters in file names

Posted Nov 23, 2010 22:06 UTC (Tue) by Simetrical (guest, #53439)
In reply to: Control characters in file names by iabervon
Parent article: Ghosts of Unix past, part 4: High-maintenance designs

UTF-7, UTF-16, UTF-32, and EBCDIC all treat some byte values below 0x20 differently from ASCII.
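The EBCDIC part of this claim is easy to check. A minimal sketch, assuming Python's `cp037` codec as a stand-in for "EBCDIC" (it is only one of several EBCDIC code pages, and the others differ in detail):

```python
# cp037 is one EBCDIC code page shipped with Python; chosen here only as an example.
tab_ebcdic = "\t".encode("cp037")      # horizontal tab as an EBCDIC byte
slash_ebcdic = "/".encode("cp037")     # the path separator as an EBCDIC byte

# The same low byte value means different things in the two encodings.
low_byte_as_ebcdic = b"\x05".decode("cp037")
low_byte_as_ascii = b"\x05".decode("ascii")

print(tab_ebcdic, slash_ebcdic)
print(repr(low_byte_as_ebcdic), repr(low_byte_as_ascii))
```

In this code page the byte 0x05 is a tab, whereas in ASCII the tab is 0x09 and 0x05 is ENQ.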



Control characters in file names

Posted Nov 23, 2010 22:29 UTC (Tue) by foom (subscriber, #14868)

...and you can't use any of those as a locale encoding on an ASCII-centric UNIX system. It is expressly prohibited by POSIX.

(If you didn't have any ASCII locales, you could use an EBCDIC locale -- your system just needs to be self-consistent for all the characters in the Portable Character Set, across locales. UTF-7/16/32 are right out, though, since all characters in the Portable Character Set need to be encoded by a single byte.)

Control characters in file names

Posted Nov 25, 2010 16:19 UTC (Thu) by Spudd86 (guest, #51683)

UTF-16 and UTF-32 are out entirely, since they would end up with NUL bytes in the name. You could conceivably use UTF-7 to name a file and it would work; it just wouldn't show the correct name anywhere...
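The NUL-byte problem can be seen by encoding a plain ASCII name in each encoding. This is only a sketch; the file name `report.txt` is a made-up example:

```python
name = "report.txt"  # hypothetical file name, pure ASCII

# For each encoding, record whether the encoded name contains a NUL byte,
# which the kernel would treat as the end of the path string.
has_nul = {
    codec: b"\x00" in name.encode(codec)
    for codec in ("utf-8", "utf-7", "utf-16-le", "utf-32-le")
}

for codec, found in has_nul.items():
    print(f"{codec:10} contains NUL: {found}")
```

UTF-16 and UTF-32 pad every ASCII character with zero bytes, so the name would be cut off after the first character; UTF-8 and UTF-7 leave ASCII text free of NULs.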

Control characters in file names

Posted Nov 25, 2010 21:03 UTC (Thu) by iabervon (subscriber, #722)

UTF-7 would be terrible, because the encoded form isn't even unique for a sequence of codepoints. (That is, even if you knew the character sequence for a filename and how it was decomposed and knew it was encoded as UTF-7 in the filesystem, you wouldn't know what sequence of bytes to ask the kernel for.) Also, encoders may not represent a '/' literally in between two blocks of characters outside the Latin-1 range, because it can be more efficient to use all 16 bits instead of the necessary padding to finish the encoded chunk.

In any case, it still wouldn't use bytes in the 0x00-0x1f range.
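Both points can be illustrated with Python's built-in `utf-7` codec. This is a sketch, not a statement about what any particular encoder emits; the byte strings below were worked out by hand from the modified-base64 rules, and the sample name is made up:

```python
# Two different UTF-7 byte sequences for the single character "A":
# written directly, and written as a base64 block.
spellings_of_A = [b"A", b"+AEE-"]

# Two different UTF-7 byte sequences for "éé":
# one shared base64 block, or one block per character.
spellings_of_ee = [b"+AOkA6Q-", b"+AOk-+AOk-"]

decoded_A = {s.decode("utf-7") for s in spellings_of_A}
decoded_ee = {s.decode("utf-7") for s in spellings_of_ee}
print(decoded_A, decoded_ee)

# A name mixing CJK text and a slash (escapes used to keep this file ASCII).
sample = "\u65e5\u672c\u8a9e/\u30d5\u30a1\u30a4\u30eb"
encoded = sample.encode("utf-7")
uses_control_bytes = any(b < 0x20 for b in encoded)
print(encoded, uses_control_bytes)
```

Each set collapses to a single string, so knowing the code points does not tell you which bytes are on disk. The encoded output stays within printable ASCII.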

Control characters in file names

Posted Nov 29, 2010 10:09 UTC (Mon) by jamesh (guest, #1159)

Those arguments could equally be made against UTF-8, where there are different byte sequences that some UTF-8 parsers will consider equal while others will consider to be invalid (e.g. encoding a '\u0000' as '\xC0\x80'). The solution to this problem is to require that inputs be in a canonical form.
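As a quick check of how a strict decoder treats the overlong form mentioned above, here is a sketch using Python's standard `utf-8` codec (other decoders, especially older ones, may behave differently):

```python
overlong_nul = b"\xc0\x80"  # two-byte "overlong" spelling of U+0000

try:
    overlong_nul.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    # A strict decoder refuses the non-canonical sequence.
    rejected = True

canonical_nul = "\u0000".encode("utf-8")  # the only valid encoding
print(rejected, canonical_nul)
```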

Of course, once you start working with Unicode it isn't really enough to just require unique representations for each code point. You can have multiple sequences of Unicode code points that have the same meaning. So you really want a normalised code point sequence encoded in a canonical form.
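A minimal sketch of "normalised code point sequence encoded in a canonical form", assuming NFC as the normalization and UTF-8 as the encoding. Both choices, and the helper name `canonical_name_bytes`, are assumptions made for illustration:

```python
import unicodedata

def canonical_name_bytes(name: str) -> bytes:
    """Normalize to NFC (an arbitrary choice here), then encode as UTF-8."""
    return unicodedata.normalize("NFC", name).encode("utf-8")

precomposed = "caf\u00e9"   # "é" as one code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Without normalization the two spellings are different byte strings.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalization they map to the same bytes.
canonical_equal = canonical_name_bytes(precomposed) == canonical_name_bytes(decomposed)
print(raw_equal, canonical_equal)
```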

Control characters in file names

Posted Nov 29, 2010 18:18 UTC (Mon) by iabervon (subscriber, #722)

UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; while some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.

The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.
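To see why "normalized" alone is not enough, here is a sketch that runs one made-up name through the four standard normalization forms with Python's `unicodedata` module and compares the resulting UTF-8 bytes:

```python
import unicodedata

# Hypothetical name: "fi" ligature (U+FB01), "l", then "e" + combining acute.
name = "\ufb01le\u0301"

# UTF-8 bytes of the name under each normalization form.
forms = {
    form: unicodedata.normalize(form, name).encode("utf-8")
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, data in forms.items():
    print(form, data)

distinct = len(set(forms.values()))
print("distinct byte sequences:", distinct)
```

All four forms give a different byte sequence for this input, so two programs that each "normalize" but pick different forms would still disagree about the file's name.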

Control characters in file names

Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159)

I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).

My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...

