Control characters in file names

Posted Nov 23, 2010 19:48 UTC (Tue) by zlynx (subscriber, #2285)
In reply to: Control characters in file names by Yorick
Parent article: Ghosts of Unix past, part 4: High-maintenance designs

I think that Unix filesystems treating names as pure binary (excepting / and \0) is actually an advantage.

On other operating systems I have ended up with filenames that cannot be deleted. Windows NT with its POSIX layer can create names that Win32 can't handle. OSX HFS can also create names it can't handle.

That happens because the filesystem has to have huge complicated rulesets that provide binary to character mapping, character equivalency mapping and allowable characters. These rules have to duplicate the identical rules in user space. The rules often DON'T MATCH, which leads to all of the above problems.

A Linux with EXT2, on the other hand, can handle UTF-8 filenames even though UTF-8 wasn't in wide use when EXT2 was invented. And the shell can rename or delete a filename encoded in KOI8 even though the shell doesn't understand the encoding.
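A minimal sketch of that point, with Python standing in for the shell. It assumes a Linux filesystem that stores names as raw bytes (ext2, ext4, tmpfs); a filesystem that enforces an encoding may reject the KOI8-R name. The helper name and the file names are made up for illustration.

```python
import os
import tempfile


def roundtrip_names(names):
    """Create, rename and delete files whose names are arbitrary byte strings.

    Returns the names as the kernel reported them back after the rename.
    Assumes a filesystem that treats names as opaque bytes (e.g. ext2).
    """
    with tempfile.TemporaryDirectory() as tmp:
        d = os.fsencode(tmp)
        for name in names:
            # Passing bytes keeps Python from applying any encoding.
            with open(os.path.join(d, name), "wb"):
                pass
        for name in names:
            os.rename(os.path.join(d, name), os.path.join(d, name + b".bak"))
        seen = sorted(os.listdir(d))
        for name in seen:
            os.unlink(os.path.join(d, name))
        return seen


utf8_name = "файл".encode("utf-8")
koi8_name = "файл".encode("koi8-r")  # same word, but not valid UTF-8
print(roundtrip_names([utf8_name, koi8_name]))
```

Neither the kernel nor the script ever needs to know which encoding produced the bytes.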


Control characters in file names

Posted Nov 23, 2010 20:55 UTC (Tue) by Yorick (subscriber, #19241) [Link]

I'm in no way suggesting "huge complicated rulesets", only to expand the set of disallowed bytes from {0, '/'} to {0..31, '/'}. I believe the benefits of doing so would outweigh the disadvantages, of which precious few have been shown.
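As a rough sketch, the proposed rule fits in a few lines. The function name is hypothetical, and rejecting the empty name is an extra assumption (empty components are not valid today either).

```python
# Bytes 0..31 plus '/' (0x2F); NUL is byte 0 and so already included.
FORBIDDEN = frozenset(range(0x20)) | {ord("/")}


def is_allowed_component(name: bytes) -> bool:
    """Return True if `name` would pass the proposed rule for one path component."""
    return len(name) > 0 and not any(b in FORBIDDEN for b in name)


for candidate in (b"report 2010.txt", "файл".encode("utf-8"), b"a\nb", b"\x1b[31mred"):
    print(candidate, is_allowed_component(candidate))
```

Bytes from 0x80 upwards are untouched, so UTF-8 and legacy 8-bit names keep working under this rule.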

I'm also curious what file names can be created on OS X that "it can't handle", and how.
(Editing raw bytes on disk doesn't count - that way invalid file names could be created in ext2 as well.)

Control characters in file names

Posted Nov 23, 2010 21:26 UTC (Tue) by jzbiciak (✭ supporter ✭, #5246) [Link]

I wonder if this could just be controlled with a feature flag and tune2fs and/or a mount option? When set, just disallow creating files which have the troublesome characters. Still allow access to existing files with the troublesome characters, though, so you never have "files you can't get to."

Invariably, whenever I've created filenames with control characters in them, it's been through some strange fat-fingering. I would rather have the OS throw those files away. :-)

Control characters in file names

Posted Nov 24, 2010 0:40 UTC (Wed) by zlynx (subscriber, #2285) [Link]

An example from my Mac laptop. It was created by a recursive wget from the terminal. This file has been in my .Trash for a year now...

$ ls ShowXml.asp?user_group=5&user_path=user1%2F30462&userid=30462&blogname=ѩӣ֮%C0%E1

$ ls | xxd
0000000: 5368 6f77 586d 6c2e 6173 703f 7573 6572  ShowXml.asp?user
0000010: 5f67 726f 7570 3d35 2675 7365 725f 7061  _group=5&user_pa
0000020: 7468 3d75 7365 7231 2532 4633 3034 3632  th=user1%2F30462
0000030: 2675 7365 7269 643d 3330 3436 3226 626c  &userid=30462&bl
0000040: 6f67 6e61 6d65 3dd1 a9d0 b8cc 84d6 ae25  ogname=........%
0000050: 4330 2545 310a                           C0%E1.

I think that is just the representation the terminal sees and not what is actually in the filesystem because any attempt to delete with that name results in a file not found.

Control characters in file names

Posted Nov 24, 2010 11:08 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

Have you tried quoting that (with ', not ")? There are many characters special to the shell in there. Sometimes a "rm -i *" helps by giving the "correct" filename to the selection. In very recalcitrant cases, you could write a proggie that unlinks the file by hardcoded name...

Control characters in file names

Posted Nov 24, 2010 12:50 UTC (Wed) by Yorick (subscriber, #19241) [Link]

Interesting - I tried creating a file by that name in OS 10.5, but when reading out the resulting name, the last two combining characters (U+0304 and U+05ae, corresponding to cc 84 and d6 ae respectively) had been transposed, presumably for reasons of canonical order. I had no problems removing it afterwards.

This would explain why you had trouble removing the file but not how it came to be created in the first place. I have heard claims that the normalisation algorithm in OS X has changed between versions; perhaps you upgraded your system between the creation and removal of the file? It could also just be a plain bug, of course, or a RAM single-bit error, etc. If you can reproduce it, I'm sure Apple would like to know about the bug.
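The transposition can be illustrated with Python's unicodedata module. This is only a sketch: Python applies standard Unicode NFD, whereas OS X uses its own variant of decomposition, so the result is an illustration of canonical ordering rather than a reproduction of what HFS+ does.

```python
import unicodedata

# Bytes d1 a9 d0 b8 cc 84 d6 ae from the hex dump above, decoded as UTF-8.
raw = bytes.fromhex("d1a9d0b8cc84d6ae")
original = raw.decode("utf-8")


def describe(text):
    """List (code point, canonical combining class) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in text]


normalized = unicodedata.normalize("NFD", original)
print(describe(original))
print(describe(normalized))
print(normalized.encode("utf-8") == raw)   # do the bytes survive normalization?
```

Canonical ordering sorts adjacent combining marks by combining class, so a name stored in one order and looked up in the other no longer matches byte for byte.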

Control characters in file names

Posted Nov 24, 2010 17:01 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I'm in no way suggesting "huge complicated rulesets", only to expand the set of disallowed bytes from {0, '/'} to {0..31, '/'}. I believe the benefits of doing so would outweigh the disadvantages, of which precious few have been shown.

It seems to me that what is broken here is the shell language, not the filesystem, and encouraging people to use better languages is the right fix. For what it's worth, I have had occasion to miss the '/' character that you can't have in Unix filenames. I find it rather silly that that character is encoded into the low-level APIs (the kernel in this case, although having it in the libc API would amount to the same) instead of letting higher levels handle it.

Control characters in file names

Posted Nov 29, 2010 9:57 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Good luck with that. Unix shells won't change any time soon. It's hard enough to get filenames with whitespace working properly.

Forbidding control characters has an immediate upside and comes at almost zero cost. We could do it tomorrow, and nobody would notice except for increased robustness. Not doing that and instead pining for a perfect solution is just unrealistic.

Control characters in file names

Posted Nov 29, 2010 10:12 UTC (Mon) by michaeljt (subscriber, #39183) [Link]

> Good luck with that. Unix shells won't change any time soon.

I wouldn't be quite that pessimistic. Unix shells won't change, but now there is python which is being used for a lot of things that shell used to be, and things like upstart and systemd are also reducing the need for new shell code. Old shell code won't all go away, but problems in it can be fixed, and if less new shell code is written the problem is greatly reduced.

Control characters in file names

Posted Nov 29, 2010 18:24 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

Python has not replaced shell in many areas, and probably never will (just like Perl is used in a lot of places where shell used to be used, but will never replace shell)

Control characters in file names

Posted Dec 2, 2010 17:25 UTC (Thu) by Ross (subscriber, #4065) [Link]

I don't know anyone using Python as their login shell. That would be a pretty terrible idea IMHO. On the other hand shell does make it easy to handle filenames wrong (just leave out some quotes and it will still seem to work) so python scripts are more likely to do a good job. That's not really going to fix the problem though, no matter how popular Python becomes for scripting compared to shell.
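A small sketch of why a script in Python is less likely to mishandle such names: arguments are passed as a list and never go through word splitting. The file name is invented, and `str.split()` is only a rough stand-in for what an unquoted shell expansion does.

```python
import shlex
import subprocess
import sys

name = "holiday photos\n2010.jpg"  # space and newline, both legal on Unix

# Roughly what an unquoted $name expansion amounts to: split on whitespace.
unquoted_words = name.split()

# Passing an argument list bypasses the shell, so no splitting takes place.
child = "import sys; sys.stdout.write(str(len(sys.argv) - 1))"
result = subprocess.run(
    [sys.executable, "-c", child, name],
    capture_output=True, text=True, check=True,
)

print(len(unquoted_words))    # number of words after naive splitting
print(result.stdout)          # number of arguments the child actually saw
print(shlex.quote(name))      # a form that is safe to paste into sh
```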

Control characters in file names

Posted Dec 2, 2010 17:40 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

> I don't know anyone using Python as their login shell.

I was assuming that the biggest problem with filenames and shell script came from actual well-known script files that got exploited, not stuff typed in at the command line. I'm sure that can be a problem too of course.

Control characters in file names

Posted Nov 29, 2010 14:40 UTC (Mon) by nix (subscriber, #2304) [Link]

Handling directory separation at a higher level would be a classic That Hideous Name nightmare, converting paths from a simple string to a complex structure involving N components with associated lengths, which would *still* have to be somehow converted to a string a lot of the time: and if you can do that, you need a quoting mechanism to make it unambiguous, and if you have *that*, you could use the same quoting mechanism at input time, and still retain the /.

The current situation with /-and-no-quoting-characters is simplest of all, and eliminates the numerous attacks we have seen on SQL and other languages involving incorrectly processed quoting characters.
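To illustrate, a sketch of the current model with hypothetical helper names: because '/' can never occur inside a component, splitting and joining need no quoting layer at all.

```python
def split_path(path: bytes) -> list:
    """Split a relative path into components; empty components are dropped."""
    return [part for part in path.split(b"/") if part]


def join_path(components) -> bytes:
    """Join components with '/'. No escaping step exists or is needed,
    because a valid component can never contain the separator."""
    for part in components:
        if not part or b"/" in part or b"\0" in part:
            raise ValueError(f"invalid component: {part!r}")
    return b"/".join(components)


parts = [b"home", b"user", b"my file\n.txt"]
print(split_path(join_path(parts)) == parts)
```

Allowing '/' inside a component would force both functions to agree on an escaping scheme, which is exactly where quoting bugs tend to creep in.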

Control characters in file names

Posted Dec 2, 2010 17:27 UTC (Thu) by Ross (subscriber, #4065) [Link]

Thank you for the sanity :)

Yes, having to concoct paths with some helper function in some nasty encoding would not be an improvement. If people think it's too hard for scripts to handle spaces -- wait until filenames can have slashes in them too!

Control characters in file names

Posted Nov 24, 2010 18:41 UTC (Wed) by brother_rat (subscriber, #1895) [Link]

I don't know about "can't handle", but there are definitely quirks.

One quirk with OSX is that the GUI is consistent with earlier Macs that permitted / in filenames (when : was used as the folder separator), but as / is now restricted to be the folder separator the two characters are swapped over behind the scenes.

This causes very odd bugs with GUI tools that launch CLI utilities. For example, Hugin uses make to process photos, and make doesn't support : in filepaths. However many users put photos in folders with a date in the name, and the dates.
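The swap between the two layers can be sketched as a symmetric character mapping. The function names are hypothetical and the real translation happens inside OS X, not in user code; the folder name is an invented example.

```python
# Swap '/' and ':' in both directions; the mapping is its own inverse.
SWAP = str.maketrans({"/": ":", ":": "/"})


def gui_to_posix(name: str) -> str:
    """Map a name as typed in the GUI to the name seen through POSIX calls."""
    return name.translate(SWAP)


def posix_to_gui(name: str) -> str:
    """Inverse mapping (identical, since the swap is symmetric)."""
    return name.translate(SWAP)


folder = "Holiday 24/11/2010"   # hypothetical folder name typed in the GUI
print(gui_to_posix(folder))     # the name a CLI tool such as make would see
```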

Control characters in file names

Posted Nov 23, 2010 21:47 UTC (Tue) by iabervon (subscriber, #722) [Link]

All of the encodings I could think of consider byte values less than 0x20 to be either invalid or control characters in any context. In fact, I couldn't find any that disagree with ASCII on the interpretation of any valid byte less than 0x40, and only Shift-JIS seems to disagree with ASCII at all below 0x80 (and there only as the second byte of two-byte characters, aside from a few direct character replacements). So it should be viable to consider filenames to be a sequence of bytes with only 0x2F and 0x00 having special meanings, but 0x01-0x1F prohibited entirely. (I think 0x7F could be prohibited as well.) Unfortunately, there are also other control characters, in the 0x80-0x9F range, which cannot be recognized directly from bytes, where 0x9B is the interesting one, because it can start ANSI escape sequences.
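A short sketch of why 0x9B cannot be filtered at the byte level: the same byte is a control character in Latin-1 but an ordinary continuation byte in UTF-8, where the control character itself is spelled differently.

```python
CSI_BYTE = 0x9B

samples = {
    "U+009B in Latin-1": "\x9b".encode("latin-1"),
    "U+009B in UTF-8": "\x9b".encode("utf-8"),
    "U+045B (ћ) in UTF-8": "\u045b".encode("utf-8"),
}

for label, data in samples.items():
    print(f"{label:22} {data.hex(' ')}  contains 0x9b: {CSI_BYTE in data}")
```

Rejecting every name containing byte 0x9B would therefore also reject harmless UTF-8 letters, and deciding which case applies requires knowing the encoding.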

Control characters in file names

Posted Nov 23, 2010 22:06 UTC (Tue) by Simetrical (guest, #53439) [Link]

UTF-7, UTF-16, UTF-32, and EBCDIC all treat some byte values below 0x20 differently from ASCII.

Control characters in file names

Posted Nov 23, 2010 22:29 UTC (Tue) by foom (subscriber, #14868) [Link]

...and you can't use any of those as a locale encoding on an ASCII-centric UNIX system. It is expressly prohibited by POSIX.

(If you didn't have any ASCII locales, you could use an EBCDIC locale -- your system just needs to be self-consistent for all the characters in the Portable Character Set, across locales. UTF-7/16/32 are right out, though, since all characters in the Portable Character Set need to be encoded by a single byte.)

Control characters in file names

Posted Nov 25, 2010 16:19 UTC (Thu) by Spudd86 (guest, #51683) [Link]

UTF16 and UTF32 are out entirely since they would end up with nul bytes. You could conceivably use UTF7 to name a file and it would work; it just wouldn't show the correct name anywhere...

Control characters in file names

Posted Nov 25, 2010 21:03 UTC (Thu) by iabervon (subscriber, #722) [Link]

UTF-7 would be terrible, because the encoded form isn't even unique for a sequence of codepoints. (That is, even if you knew the character sequence for a filename and how it was decomposed and knew it was encoded as UTF-7 in the filesystem, you wouldn't know what sequence of bytes to ask the kernel for.) Also, encoders may not represent a '/' literally in between two blocks of characters outside the Latin-1 range, because it can be more efficient to use all 16 bits instead of the necessary padding to finish the encoded chunk.

In any case, it still wouldn't use bytes in the 0x00-0x1f range.
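The claims in this sub-thread can be checked with Python's built-in codecs. This is a sketch: the shifted UTF-7 form below is hand-built for illustration, and it relies on Python's decoder accepting directly-encodable characters inside a base64 shift sequence.

```python
text = "ab"

for codec in ("utf-16-le", "utf-32-le", "utf-7", "utf-8"):
    data = text.encode(codec)
    print(f"{codec:10} {data!r:40} NUL byte: {0 in data}")

# Two different UTF-7 byte strings that decode to the same text:
direct = b"ab"
shifted = b"+AGEAYg-"   # the same two characters inside a base64 shift sequence
print(direct.decode("utf-7") == shifted.decode("utf-7"))

# Non-ASCII text in UTF-7 still uses only printable ASCII bytes.
print(min("\u0469/\u0438".encode("utf-7")) >= 0x20)
```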

Control characters in file names

Posted Nov 29, 2010 10:09 UTC (Mon) by jamesh (guest, #1159) [Link]

Those arguments could equally be made against UTF-8, where there are different byte sequences that some UTF-8 parsers will consider equal while others will consider them invalid (e.g. encoding a '\u0000' as '\xC0\x80'). The solution to this problem is to require that inputs be in a canonical form.

Of course, once you start working with Unicode it isn't really enough to just require unique representations for each code point. You can have multiple sequences of unicode code points that have the same meaning. So you really want a normalised code point sequence encoded in a canonical form.
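A minimal example of two code point sequences with the same meaning, and of comparing them under one agreed normalization form. The helper name is hypothetical and the choice of NFC as the default is arbitrary.

```python
import unicodedata

composed = "caf\u00e9"        # é as one code point, U+00E9
decomposed = "cafe\u0301"     # e followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                       # compared as code points
print(composed.encode("utf-8"), decomposed.encode("utf-8"))


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying one agreed normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_name(composed, decomposed))
```

On a filesystem that stores raw bytes, both spellings can exist side by side as distinct files that look identical on screen.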

Control characters in file names

Posted Nov 29, 2010 18:18 UTC (Mon) by iabervon (subscriber, #722) [Link]

UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; while some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.
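For illustration, here is a deliberately lenient two-byte decoder next to Python's strict one. The lenient function is a made-up example of the faulty behaviour, not code from any real parser.

```python
def lenient_two_byte_decode(data: bytes) -> str:
    """Decode a single two-byte UTF-8 sequence WITHOUT rejecting overlong forms.

    Deliberately wrong: it mimics a parser that accepts non-canonical input.
    """
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(data: bytes):
    """Return the decoded text, or None if Python's strict decoder rejects it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


for overlong in (b"\xc0\x80", b"\xc0\xaf"):
    print(overlong, repr(lenient_two_byte_decode(overlong)), strict_decode(overlong))
```

The second sequence is the troublesome one for file names: a lenient decoder turns it into '/', so a check performed on the raw bytes never sees the separator.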

The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.

Control characters in file names

Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159) [Link]

I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).

My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...

Control characters in file names

Posted Nov 24, 2010 6:31 UTC (Wed) by error27 (subscriber, #8346) [Link]

If you restricted the filenames, you would do it per mount point and not in the VFS layer. You'd still be able to delete all the files on your network mounted NTFS directory. You just wouldn't be able to copy them to your home directory without a rename.

So you wouldn't have filenames that couldn't be deleted, you'd only have filenames that couldn't be created.
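A toy model of that asymmetry, purely illustrative: the class is an in-memory stand-in for a mounted directory, the flag is a hypothetical mount option, and the choice of PermissionError is an assumption (the thread does not say which error the kernel would return).

```python
class ToyDirectory:
    """In-memory stand-in for one mounted directory (purely illustrative)."""

    def __init__(self, restrict_new_names: bool, existing=()):
        self.restrict_new_names = restrict_new_names
        self.entries = set(existing)   # names already on disk, any bytes

    @staticmethod
    def _has_control_bytes(name: bytes) -> bool:
        return any(b < 0x20 for b in name)

    def create(self, name: bytes) -> None:
        # Only creation is policed, and only when the option is set.
        if self.restrict_new_names and self._has_control_bytes(name):
            raise PermissionError(f"name rejected by mount policy: {name!r}")
        self.entries.add(name)

    def unlink(self, name: bytes) -> None:
        # Lookup and removal never consult the policy.
        self.entries.remove(name)


home = ToyDirectory(restrict_new_names=True, existing=[b"old\nname"])
home.unlink(b"old\nname")      # an existing bad name can still be removed
home.create(b"new name.txt")   # ordinary names are unaffected
```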

Control characters in file names

Posted Nov 24, 2010 8:04 UTC (Wed) by nix (subscriber, #2304) [Link]

That's horribly nonorthogonal, but might be worthwhile nonetheless (as a mount option, probably on by default).

Control characters in file names

Posted Nov 25, 2010 16:23 UTC (Thu) by Spudd86 (guest, #51683) [Link]

Well part of the point is that such file names are hard to use, so hopefully you don't have any.

(They are the sort of names that make correctly handling file names in a shell script end up taking hundreds of lines, which means nobody EVER does it, which means pretty much nobody has files with that kind of name, except that the breakage from those names can sometimes be a security hole)

Control characters in file names

Posted Dec 2, 2010 19:07 UTC (Thu) by Ross (subscriber, #4065) [Link]

Why would it take hundreds of lines to handle them? Actually the shell doesn't so much care about control characters as characters found in $IFS which is space, tab, and newline by default. The proposal isn't to remove space, so it won't solve any problems for people writing shell scripts will it?

In any case, someone gave some examples of how to handle whitespace (and anything else) properly in shell scripts below. Use of arrays and proper quoting or find -print0/xargs -0 combinations aren't too complicated and work correctly. The problem is that if there is a mistake, it won't be obvious since it will work with most input.

Control characters in file names

Posted Dec 2, 2010 19:44 UTC (Thu) by cesarb (subscriber, #6266) [Link]

If you remove control characters, you remove tab and newline; just set IFS to tab and newline (removing space) and you can easily and safely deal with filenames with spaces.
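The three splitting strategies discussed here can be compared in a small simulation. Python stands in for the shell, and `bytes.split()` only approximates default $IFS splitting; the file names are invented.

```python
names = [b"plain.txt", b"with space.txt", b"with\nnewline.txt"]


def split_default_ifs(stream: bytes):
    """Split on whitespace, roughly like the default $IFS."""
    return stream.split()


def split_tab_newline(stream: bytes):
    """Split on tab and newline only, as with IFS set to those two."""
    return [w for w in stream.replace(b"\t", b"\n").split(b"\n") if w]


def split_nul(stream: bytes):
    """Split on NUL, the format produced by `find -print0`."""
    return [w for w in stream.split(b"\0") if w]


print(split_default_ifs(b"\n".join(names)) == names)          # spaces break it
print(split_tab_newline(b"\n".join(names[:2])) == names[:2])  # fine without newlines
print(split_tab_newline(b"\n".join(names)) == names)          # a newline breaks it
print(split_nul(b"\0".join(names)) == names)                  # NUL cannot occur in a name
```

With control characters forbidden, the third case could not arise, which is what makes the tab-and-newline IFS setting safe.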

Control characters in file names

Posted Nov 24, 2010 17:03 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I think that Unix filesystems treating names as pure binary (excepting / and \0) is actually an advantage.

I think that many people would appreciate having at least a hint as to the character encoding in use. Although in these days of Utf-8 it is less and less relevant of course.

Control characters in file names

Posted Nov 24, 2010 17:57 UTC (Wed) by ikm (subscriber, #493) [Link]

> I think that many people would appreciate having at least a hint as to the character encoding in use.

My guess is that unix systems just take the easiest approach here - treat the filename as a binary blob, and let userspace do the rest :) I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure out just what it is that you are allowed to use there for filenames. Then you'd basically just stick to base64 or percent-encoding, which would be the right thing to do in any case.

Control characters in file names

Posted Nov 24, 2010 19:15 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure out just what it is that you are allowed to use there for filenames.

There have been a number of complaints on this thread about filesystems that are encoding-aware and the problems that causes. But actually the filesystem could carry encoding hints without being encoding-aware itself. For example, it could tell user space that a file name is Utf-8 but still just treat the name as a binary blob. The hint would just tell applications how best to display the name.
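A sketch of how an application might use such a hint for display only. Everything here is hypothetical: no such hint exists in the kernel interface, and the fallback order is just one reasonable choice.

```python
def display_name(raw: bytes, hint=None) -> str:
    """Turn a raw file name into text for display only.

    `hint` is a hypothetical per-name encoding label supplied by the
    filesystem; the raw bytes remain the real name for all system calls.
    """
    for codec in (hint, "utf-8"):
        if codec is None:
            continue
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: show undecodable bytes as escapes rather than failing.
    return raw.decode("ascii", errors="backslashreplace")


raw = "файл".encode("koi8-r")
print(display_name(raw, "koi8-r"))   # decoded using the hint
print(display_name(raw))             # no hint: falls back to escapes
```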

Control characters in file names

Posted Nov 27, 2010 8:11 UTC (Sat) by cmccabe (guest, #60281) [Link]

> There have been a number of complaints on this thread about filesystems
> that are encoding-aware and the problems that causes. But actually the
> filesystem could carry encoding hints without being encoding-aware itself.
> For example, it could tell user space that a file name is Utf-8 but still
> just treat the name as a binary blob. The hint would just tell
> applications how best to display the name

Well, you could use an extended attribute to represent the encoding of the filename. However, it would be a huge amount of work to change all the applications to check this attribute and act appropriately.

I'm pretty far from being an expert in internationalization, but my understanding is that non-unicode character encodings are considered deprecated. Based on comments made elsewhere in this thread, MacOS and Windows have already decreed that all filenames should be unicode. So is it really worth rewriting all software that displays filenames in order to better support this legacy stuff? Especially when no other platforms support it at all? As Linus constantly points out, Linux-specific filesystem interfaces don't get used that much, even when they offer great benefits.

I think I agree with Spudd86's solution: there should be some kind of mount option that puts a ruleset in place for filenames. Probably nearly every Linux distribution would disallow filenames that were not UTF-8. A few people running special-purpose systems might mount their rootfs with more restrictive rulesets. Most system administrators already have an unwritten policy about filenames-- they don't create filenames with embedded control characters, crazy stuff like leading dashes, or embedded newlines. Letting system administrators turn their implicit policy into an explicit one would close a lot of security holes.

I wonder if it would be feasible to use the "escaping" option talked about on Wheeler's page. Basically, under this option, the kernel continues to treat filenames as binary blobs on the disk. But when presenting them to userspace, it escapes certain characters in a predictable way. I'm not sure whether this is really feasible, but it seems like the best choice if it is.
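One way such an escaping layer could look, as a sketch only. The percent scheme is an assumption chosen for illustration and is not necessarily what Wheeler's page proposes; the function names are hypothetical.

```python
def escape_name(raw: bytes) -> bytes:
    """Percent-escape control bytes (and '%' itself, to stay reversible)."""
    out = bytearray()
    for b in raw:
        if b < 0x20 or b == 0x7F or b == ord("%"):
            out += b"%%%02X" % b
        else:
            out.append(b)
    return bytes(out)


def unescape_name(escaped: bytes) -> bytes:
    """Inverse of escape_name."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == ord("%"):
            out.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(escaped[i])
            i += 1
    return bytes(out)


on_disk = b"report\n100%.txt"
print(escape_name(on_disk))
print(unescape_name(escape_name(on_disk)) == on_disk)
```

Escaping '%' as well is what keeps the mapping reversible; without it, a name that already contains "%0A" would be indistinguishable from an escaped newline.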

Control characters in file names

Posted Nov 30, 2010 1:39 UTC (Tue) by jamesh (guest, #1159) [Link]

As well as being a lot of work, using extended attributes introduces ambiguity. Some extra problems with that suggestion are:

  • You could have two files in a directory with the same sequence of unicode code points but different byte representations due to being encoded differently.
  • Applications might encounter paths like /latin1-part/utf8-part/sjis-part and need to check the encoding of each path component in order to display it to the user. Perhaps more difficult would be resolving a unicode path to something like this.
  • Extended attributes are associated with the file rather than the file name. What do you do if a file has two hard links with differently encoded file names?

Picking one encoding/normalisation is the only sane option, and it would be nice if the kernel would help enforce such a choice.

Control characters in file names

Posted Dec 2, 2010 18:22 UTC (Thu) by Wol (guest, #4433) [Link]

One problem with that ... (administrators enforcing policy, that is)

I've worked on a system where a file was composed of sub-files (Pr1mos). This was emulated on *nix by using a directory with "special" names inside, namely all the subfiles were "<space><backspace><number>". Because nobody is supposed to touch these subfiles directly.

So if you enforce a policy like that, you could bust a bunch of apps ...

Cheers,
Wol

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds