
Control characters in file names

Posted Nov 23, 2010 18:57 UTC (Tue) by Yorick (subscriber, #19241)
Parent article: Ghosts of Unix past, part 4: High-maintenance designs

Allowing control characters (0x01-0x1f) in file names is clearly a high-maintenance design that we have regretted ever since. I don't remember ever having seen a legitimate use of this liberty. On the other hand, that suggests the liberty could be removed, given some courage.

Spaces in file names also cause trouble but can be justified as long as file names are used by people to name documents.



Control characters in file names

Posted Nov 23, 2010 19:48 UTC (Tue) by zlynx (subscriber, #2285) [Link]

I think that Unix filesystems treating names as pure binary (excepting / and \0) is actually an advantage.

On other operating systems I have ended up with filenames that cannot be deleted. Windows NT with its POSIX layer can create names that Win32 can't handle. OSX HFS can also create names it can't handle.

That happens because the filesystem has to have huge complicated rulesets that provide binary to character mapping, character equivalency mapping and allowable characters. These rules have to duplicate the identical rules in user space. The rules often DON'T MATCH, leading to all of the above problems.

A Linux with EXT2 on the other hand, can handle UTF-8 filenames even though UTF-8 didn't exist in wide use when EXT2 was invented. And the shell can rename or delete a filename encoded in KOI8 even though the shell doesn't understand the encoding.

Control characters in file names

Posted Nov 23, 2010 20:55 UTC (Tue) by Yorick (subscriber, #19241) [Link]

I'm in no way suggesting "huge complicated rulesets", only to expand the set of disallowed bytes from {0, '/'} to {0..31, '/'}. I believe the benefits of doing so would outweigh the disadvantages, of which precious few have been shown.
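The rule proposed here is small enough to state as code. The following is only a sketch of the check, written in Python for illustration; the name `is_allowed_component` is hypothetical, and a real implementation would live in the kernel or filesystem rather than in user space.

```python
# Bytes that the proposal would forbid in a single path component:
# NUL and the control range 0x01-0x1f, plus the separator '/'.
FORBIDDEN = frozenset(range(0x20)) | {ord("/")}


def is_allowed_component(name: bytes) -> bool:
    """Return True if `name` passes the proposed stricter rule."""
    return len(name) > 0 and not any(b in FORBIDDEN for b in name)


# Multi-byte UTF-8 names are unaffected; only low bytes are rejected.
print(is_allowed_component("résumé.txt".encode("utf-8")))
print(is_allowed_component(b"bad\nname"))
```

Note that no encoding knowledge is needed: the check looks at individual byte values only.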

I'm also curious what file names can be created on OS X that "it can't handle", and how.
(Editing raw bytes on disk doesn't count - that way invalid file names could be created in ext2 as well.)

Control characters in file names

Posted Nov 23, 2010 21:26 UTC (Tue) by jzbiciak (subscriber, #5246) [Link]

I wonder if this could just be controlled with a feature flag and tune2fs and/or a mount option? When set, just disallow creating files which have the troublesome characters. Still allow access to existing files with the troublesome characters, though, so you never have "files you can't get to."

Invariably, whenever I've created filenames with control characters in them, it's been through some strange fat-fingering. I would rather have the OS throw those files away. :-)

Control characters in file names

Posted Nov 24, 2010 0:40 UTC (Wed) by zlynx (subscriber, #2285) [Link]

An example from my Mac laptop. It was created by a recursive wget from the terminal. This file has been in my .Trash for a year now...

$ ls ShowXml.asp?user_group=5&user_path=user1%2F30462&userid=30462&blogname=ѩӣ֮%C0%E1

$ ls | xxd
0000000: 5368 6f77 586d 6c2e 6173 703f 7573 6572  ShowXml.asp?user
0000010: 5f67 726f 7570 3d35 2675 7365 725f 7061  _group=5&user_pa
0000020: 7468 3d75 7365 7231 2532 4633 3034 3632  th=user1%2F30462
0000030: 2675 7365 7269 643d 3330 3436 3226 626c  &userid=30462&bl
0000040: 6f67 6e61 6d65 3dd1 a9d0 b8cc 84d6 ae25  ogname=........%
0000050: 4330 2545 310a                           C0%E1.

I think that is just the representation the terminal sees and not what is actually in the filesystem because any attempt to delete with that name results in a file not found.

Control characters in file names

Posted Nov 24, 2010 11:08 UTC (Wed) by vonbrand (guest, #4458) [Link]

Have you tried quoting that (with ', not ")? There are many characters special to the shell in there. Sometimes a "rm -i *" helps by giving the "correct" filename to the selection. In very recalcitrant cases, you could write a proggie that unlinks the file by hardcoded name...

Control characters in file names

Posted Nov 24, 2010 12:50 UTC (Wed) by Yorick (subscriber, #19241) [Link]

Interesting - I tried creating a file by that name in OS 10.5, but when reading out the resulting name, the last two combining characters (U+0304 and U+05ae, corresponding to cc 84 and d6 ae respectively) had been transposed, presumably for reasons of canonical order. I had no problems removing it afterwards.

This would explain why you had trouble removing the file but not how it came to be created in the first place. I have heard claims that the normalisation algorithm in OS X has changed between versions; perhaps you upgraded your system between the creation and removal of the file? It could also just be a plain bug, of course, or a RAM single-bit error, etc. If you can reproduce it, I'm sure Apple would like to know about the bug.
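The transposition described in this comment can be reproduced with Python's standard `unicodedata` module. This is only an illustration of Unicode canonical ordering, not of what HFS+ does internally; the base letter U+0438 is taken from the bytes `d0 b8` in the hex dump above.

```python
import unicodedata

# U+0438 (Cyrillic и) followed by the two combining marks from the dump,
# in the order they appear there: cc 84 (U+0304), then d6 ae (U+05AE).
name = "\u0438\u0304\u05ae"

for ch in name:
    print(f"U+{ord(ch):04X} combining class {unicodedata.combining(ch)}")

# Canonical ordering sorts adjacent combining marks by combining class,
# so the mark with the lower class moves in front of the other.
nfd = unicodedata.normalize("NFD", name)
print([f"U+{ord(ch):04X}" for ch in nfd])
print(name.encode("utf-8") == nfd.encode("utf-8"))
```

A byte-for-byte lookup of the original name fails once the stored form has been reordered, which matches the "file not found" symptom reported above.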

Control characters in file names

Posted Nov 24, 2010 17:01 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I'm in no way suggesting "huge complicated rulesets", only to expand the set of disallowed bytes from {0, '/'} to {0..31, '/'}. I believe the benefits of doing so would outweigh the disadvantages, of which precious few have been shown.

It seems to me that what is broken here is the shell language, not the filesystem, and encouraging people to use better languages is the right fix. For what it's worth, I have had occasion to miss the '/' character that you can't have in Unix filenames. I find it rather silly that that character is encoded into the low-level APIs (the kernel in this case, although having it in the libc API would amount to the same) instead of letting higher levels handle it.

Control characters in file names

Posted Nov 29, 2010 9:57 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Good luck with that. Unix shells won't change any time soon. It's hard enough to get filenames with whitespace working properly.

Forbidding control characters has an immediate upside and comes at almost zero cost. We could do it tomorrow, and nobody would notice except for increased robustness. Not doing that and instead pining for a perfect solution is just unrealistic.

Control characters in file names

Posted Nov 29, 2010 10:12 UTC (Mon) by michaeljt (subscriber, #39183) [Link]

> Good luck with that. Unix shells won't change any time soon.

I wouldn't be quite that pessimistic. Unix shells won't change, but now there is python which is being used for a lot of things that shell used to be, and things like upstart and systemd are also reducing the need for new shell code. Old shell code won't all go away, but problems in it can be fixed, and if less new shell code is written the problem is greatly reduced.

Control characters in file names

Posted Nov 29, 2010 18:24 UTC (Mon) by dlang (subscriber, #313) [Link]

Python has not replaced shell in many areas, and probably never will (just like Perl is used in a lot of places where shell used to be used, but will never replace shell)

Control characters in file names

Posted Dec 2, 2010 17:25 UTC (Thu) by Ross (guest, #4065) [Link]

I don't know anyone using Python as their login shell. That would be a pretty terrible idea IMHO. On the other hand shell does make it easy to handle filenames wrong (just leave out some quotes and it will still seem to work) so python scripts are more likely to do a good job. That's not really going to fix the problem though, no matter how popular Python becomes for scripting compared to shell.

Control characters in file names

Posted Dec 2, 2010 17:40 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

> I don't know anyone using Python as their login shell.

I was assuming that the biggest problem with filenames and shell script came from actual well-known script files that got exploited, not stuff typed in at the command line. I'm sure that can be a problem too of course.

Control characters in file names

Posted Nov 29, 2010 14:40 UTC (Mon) by nix (subscriber, #2304) [Link]

Handling directory separation at a higher level would be a classic That Hideous Name nightmare, converting paths from a simple string to a complex structure involving N components with associated lengths, which would *still* have to be somehow converted to a string a lot of the time: and if you can do that, you need a quoting mechanism to make it unambiguous, and if you have *that*, you could use the same quoting mechanism at input time, and still retain the /.

The current situation with /-and-no-quoting-characters is simplest of all, and eliminates the numerous attacks we have seen on SQL and other languages involving incorrectly processed quoting characters.

Control characters in file names

Posted Dec 2, 2010 17:27 UTC (Thu) by Ross (guest, #4065) [Link]

Thank you for the sanity :)

Yes, having to concoct paths with some helper function in some nasty encoding would not be an improvement. If people think it's too hard for scripts to handle spaces -- wait until filenames can have slashes in them too!

Control characters in file names

Posted Nov 24, 2010 18:41 UTC (Wed) by brother_rat (subscriber, #1895) [Link]

I don't know about "can't handle", but there are definitely quirks.

One quirk with OSX is that the GUI is consistent with earlier Macs that permitted / in filenames (when : was used as the folder separator), but as / is now restricted to be the folder separator the two characters are swapped over behind the scenes.

This causes very odd bugs with GUI tools that launch CLI utilities. For example, Hugin uses make to process photos, and make doesn't support : in filepaths. However many users put photos in folders with a date in the name, and the dates.

Control characters in file names

Posted Nov 23, 2010 21:47 UTC (Tue) by iabervon (subscriber, #722) [Link]

All of the encodings I could think of consider byte values less than 0x20 to be either invalid or control characters in any context. In fact, I couldn't find any that disagree with ASCII on the interpretation of any valid byte less than 0x40, and only Shift-JIS seems to disagree with ASCII at all below 0x80 (and there only as the second byte of two-byte characters, aside from a few direct character replacements). So it should be viable to consider filenames to be a sequence of bytes with only 0x2F and 0x00 having special meanings, but 0x01-0x1F prohibited entirely. (I think 0x7F could be prohibited as well.) Unfortunately, there are also other control characters, in the 0x80-0x9F range, which cannot be recognized directly from bytes, where 0x9B is the interesting one, because it can start ANSI escape sequences.
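The difficulty with the 0x80-0x9F range can be shown in a few lines of Python. This is a sketch for illustration only; the sample character U+021B was picked here because its UTF-8 encoding happens to contain the byte 0x9B.

```python
import unicodedata

# Interpreted as Latin-1, the single byte 0x9B is the C1 control
# character CSI, which some terminals treat like ESC [.
as_latin1 = b"\x9b".decode("latin-1")
print(unicodedata.category(as_latin1))

# In UTF-8 the same byte value appears as an ordinary continuation byte.
encoded = "\u021b".encode("utf-8")
print(encoded.hex(" "))
print(0x9B in encoded)
```

A byte-level filter therefore cannot ban 0x9B without also banning legitimate UTF-8 names, whereas 0x01-0x1F never occur inside a multi-byte UTF-8 sequence.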

Control characters in file names

Posted Nov 23, 2010 22:06 UTC (Tue) by Simetrical (guest, #53439) [Link]

UTF-7, UTF-16, UTF-32, and EBCDIC all treat some byte values below 0x20 differently from ASCII.

Control characters in file names

Posted Nov 23, 2010 22:29 UTC (Tue) by foom (subscriber, #14868) [Link]

...and you can't use any of those as a locale encoding on an ASCII-centric UNIX system. It is expressly prohibited by POSIX.

(If you didn't have any ASCII locales, you could use an EBCDIC locale -- your system just needs to be self-consistent for all the characters in the Portable Character Set, across locales. UTF-7/16/32 are right out, though, since all characters in the Portable Character Set need to be encoded by a single byte.)

Control characters in file names

Posted Nov 25, 2010 16:19 UTC (Thu) by Spudd86 (guest, #51683) [Link]

UTF16 and UTF32 are out entirely since they would end up with nul bytes, you could conceivably use UTF7 to name a file and it would work, it just wouldn't show the correct name anywhere...

Control characters in file names

Posted Nov 25, 2010 21:03 UTC (Thu) by iabervon (subscriber, #722) [Link]

UTF-7 would be terrible, because the encoded form isn't even unique for a sequence of codepoints. (That is, even if you knew the character sequence for a filename and how it was decomposed and knew it was encoded as UTF-7 in the filesystem, you wouldn't know what sequence of bytes to ask the kernel for.) Also, encoders may not represent a '/' literally in between two blocks of characters outside the Latin-1 range, because it can be more efficient to use all 16 bits instead of the necessary padding to finish the encoded chunk.

In any case, it still wouldn't use bytes in the 0x00-0x1f range.

Control characters in file names

Posted Nov 29, 2010 10:09 UTC (Mon) by jamesh (guest, #1159) [Link]

Those arguments could equally be made against UTF-8, where there are different byte sequences that some UTF-8 parsers will consider equal while others will consider to be invalid (e.g. encoding a '\u0000' as '\xC0\x80'). The solution to this problem is to require that inputs be in a canonical form.

Of course, once you start working with Unicode it isn't really enough to just require unique representations for each code point. You can have multiple sequences of unicode code points that have the same meaning. So you really want a normalised code point sequence encoded in a canonical form.

Control characters in file names

Posted Nov 29, 2010 18:18 UTC (Mon) by iabervon (subscriber, #722) [Link]

UTF-8 actually specifies only one valid byte sequence for a given sequence of code points; while some parsers will accept other sequences, only one is valid and therefore canonical. UTF-7, on the other hand, doesn't have a single valid byte sequence, and doesn't seem to have any obvious canonical form.
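As an illustration of the "only one valid byte sequence" point, a strict decoder such as the one in Python 3 rejects the overlong forms mentioned earlier in the thread. The helper name `strict_decode` is made up for this sketch.

```python
from typing import Optional


def strict_decode(data: bytes) -> Optional[str]:
    """Decode UTF-8 strictly; return None for invalid input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The canonical encoding of U+0000 is the single byte 0x00.
print(repr(strict_decode(b"\x00")))

# The two-byte overlong form C0 80 is not valid UTF-8.
print(strict_decode(b"\xc0\x80"))

# Likewise C0 AF, an overlong spelling of '/', is rejected.
print(strict_decode(b"\xc0\xaf"))
```

Lenient decoders that accepted such sequences are the source of the security bugs mentioned in this subthread, since a filter looking for the byte 0x2F would miss the overlong slash.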

The code point sequence issue is real (which is why I was careful not to say "character" anywhere), and unfortunately, there are multiple possible normalizations. So not only do you need a normalized code point sequence, you need one with a particular normalization that everything will agree on. (Also, since the availability of characters may affect the normalization, you might in principle have to specify the version of Unicode, although I think they're careful not to introduce new ways of getting the same character.) And, of course, you have to avoid using Apple products, because they silently rename your files to have a different normalization from what everybody else uses.

Control characters in file names

Posted Dec 1, 2010 2:32 UTC (Wed) by jamesh (guest, #1159) [Link]

I understand that the non-canonical sequences are invalid. However, when UTF-8 was new it was common for decoders to accept the alternative byte sequences (and this often led to security bugs).

My point was that if you picked a canonical representation for UTF-7, and required that file names used it, then it would work okay as a file name encoding. That said, it still isn't a very good idea ...

Control characters in file names

Posted Nov 24, 2010 6:31 UTC (Wed) by error27 (subscriber, #8346) [Link]

If you restricted the filenames, you would do it per mount point and not in the VFS layer. You'd still be able to delete all the files on your network mounted NTFS directory. You just wouldn't be able to copy them to your home directory without a rename.

So you wouldn't have filenames that couldn't be deleted, you'd only have filenames that couldn't be created.

Control characters in file names

Posted Nov 24, 2010 8:04 UTC (Wed) by nix (subscriber, #2304) [Link]

That's horribly nonorthogonal, but might be worthwhile nonetheless (as a mount option, probably on by default).

Control characters in file names

Posted Nov 25, 2010 16:23 UTC (Thu) by Spudd86 (guest, #51683) [Link]

Well part of the point is that such file names are hard to use, so hopefully you don't have any.

(They are the sort of names that make correctly handling file names in a shell script end up taking hundreds of lines, which means nobody EVER does it, which means pretty much nobody has files with that kind of name, except that the breakage from those names can sometimes be a security hole)

Control characters in file names

Posted Dec 2, 2010 19:07 UTC (Thu) by Ross (guest, #4065) [Link]

Why would it take hundreds of lines to handle them? Actually the shell doesn't so much care about control characters as characters found in $IFS which is space, tab, and newline by default. The proposal isn't to remove space, so it won't solve any problems for people writing shell scripts will it?

In any case, someone gave some examples of how to handle whitespace (and anything else) properly in shell scripts below. Use of arrays and proper quoting or find0/xargs0 combinations aren't too complicated and work correctly. The problem is that if there is a mistake, it won't be obvious since it will work with most input.

Control characters in file names

Posted Dec 2, 2010 19:44 UTC (Thu) by cesarb (subscriber, #6266) [Link]

If you remove control characters, you remove tab and newline; just set IFS to tab and newline (removing space) and you can easily and safely deal with filenames with spaces.

Control characters in file names

Posted Nov 24, 2010 17:03 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I think that Unix filesystems treating names as pure binary (excepting / and \0) is actually an advantage.

I think that many people would appreciate having at least a hint as to the character encoding in use. Although in these days of Utf-8 it is less and less relevant of course.

Control characters in file names

Posted Nov 24, 2010 17:57 UTC (Wed) by ikm (subscriber, #493) [Link]

> I think that many people would appreciate having at least a hint as to the character encoding in use.

My guess is that unix systems just take the easiest approach here - treat the filename as a binary blob, and let userspace do the rest :) I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure just what is that you are allowed to use there for filenames. Then you'd basically just stick to base64 or percent-encoding, which would be the right thing to do in any case.
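For the percent-encoding suggestion, a minimal sketch using Python's standard `urllib.parse` might look like this. The function names are hypothetical, and the sketch does not special-case the reserved names "." and "..".

```python
from urllib.parse import quote, unquote_to_bytes


def encode_name(raw: bytes) -> str:
    """Map arbitrary bytes to a name made of printable ASCII only."""
    # safe="" makes quote() escape '/' as well.
    return quote(raw, safe="")


def decode_name(name: str) -> bytes:
    """Recover the original bytes from an encoded name."""
    return unquote_to_bytes(name)


print(encode_name(b"a/b\n"))
print(decode_name(encode_name(b"a/b\n")))
```

The encoded form contains no slashes, control characters or high bytes, so it is acceptable to any of the filesystems discussed here, at the cost of longer names.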

Control characters in file names

Posted Nov 24, 2010 19:15 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

> I have got to admit that in practice there's less hassle with FSes which are Unicode-aware (think Microsoft), unless you actually start trying to figure just what is that you are allowed to use there for filenames.

There have been a number of complaints on this thread about filesystems that are encoding-aware and the problems that causes. But actually the filesystem could carry encoding hints without being encoding-aware itself. For example, it could tell user space that a file name is Utf-8 but still just treat the name as a binary blob. The hint would just tell applications how best to display the name.

Control characters in file names

Posted Nov 27, 2010 8:11 UTC (Sat) by cmccabe (guest, #60281) [Link]

> There have been a number of complaints on this thread about filesystems
> that are encoding-aware and the problems that causes. But actually the
> filesystem could carry encoding hints without being encoding-aware itself.
> For example, it could tell user space that a file name is Utf-8 but still
> just treat the name as a binary blob. The hint would just tell
> applications how best to display the name

Well, you could use an extended attribute to represent the encoding of the filename. However, it would be a huge amount of work to change all the applications to check this attribute and act appropriately.

I'm pretty far from being an expert in internationalization, but my understanding is that non-unicode character encodings are considered deprecated. Based on comments made elsewhere in this thread, MacOS and Windows have already decreed that all filenames should be unicode. So is it really worth rewriting all software that displays filenames in order to better support this legacy stuff? Especially when no other platforms support it at all? As Linus constantly points out, Linux-specific filesystem interfaces don't get used that much, even when they offer great benefits.

I think I agree with Spudd86's solution: there should be some kind of mount option that puts a ruleset in place for filenames. Probably nearly every Linux distribution would disallow filenames that were not UTF-8. A few people running special-purpose systems might mount their rootfs with more restrictive rulesets. Most system administrators already have an unwritten policy about filenames-- they don't create filenames with embedded control characters, crazy stuff like leading dashes, or embedded newlines. Letting system administrators turn their implicit policy into an explicit one would close a lot of security holes.

I wonder if it would be feasible to use the "escaping" option talked about on Wheeler's page. Basically, under this option, the kernel continues to treat filenames as binary blobs on the disk. But when presenting them to userspace, it escapes certain characters in a predictable way. I'm not sure whether this is really feasible, but it seems like the best choice if it is.

Control characters in file names

Posted Nov 30, 2010 1:39 UTC (Tue) by jamesh (guest, #1159) [Link]

As well as being a lot of work, using extended attributes introduces ambiguity. Some extra problems with that suggestion are:

  • You could have two files in a directory with the same sequence of unicode code points but different byte representations due to being encoded differently.
  • Applications might encounter paths like /latin1-part/utf8-part/sjis-part and need to check the encoding of each path component in order to display it to the user. Perhaps more difficult would be resolving a unicode path to something like this.
  • Extended attributes are associated with the file rather than the file name. What do you do if a file has two hard links with differently encoded file names?

Picking one encoding/normalisation is the only sane option, and it would be nice if the kernel would help enforce such a choice.

Control characters in file names

Posted Dec 2, 2010 18:22 UTC (Thu) by Wol (guest, #4433) [Link]

One problem with that ... (administrators enforcing policy, that is)

I've worked on a system where a file was composed of sub-files (Pr1mos). This was emulated on *nix by using a directory with "special" names inside, namely all the subfiles were "<space><backspace><number>". Because nobody is supposed to touch these subfiles directly.

So if you enforce a policy like that, you could bust a bunch of apps ...

Cheers,
Wol

Control characters in file names

Posted Nov 23, 2010 19:50 UTC (Tue) by vonbrand (guest, #4458) [Link]

Please don't. The "control characters" in the filenames could well be regular characters in other encodings, or be part of e.g. an UTF-8 character. "Not all the world's a VAXASCII"

Control characters in file names

Posted Nov 23, 2010 20:44 UTC (Tue) by Yorick (subscriber, #19241) [Link]

> Please don't. The "control characters" in the filenames could well be regular characters in other encodings, or be part of e.g. an UTF-8 character.

Since you ask me not to, please tell me exactly what encoding you are concerned about. Multi-byte UTF-8 characters do not contain bytes 0-127.
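The claim about multi-byte UTF-8 sequences is easy to check exhaustively. The following Python sketch encodes every code point outside ASCII and looks for any byte below 0x80; the function name is made up for the example.

```python
def multibyte_uses_only_high_bytes() -> bool:
    """Check that no multi-byte UTF-8 sequence contains a byte < 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


result = multibyte_uses_only_high_bytes()
print(result)
```

So banning bytes 0x01-0x1F cannot clip any character out of a UTF-8 file name; it only removes the ASCII control characters themselves.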

Control characters in file names

Posted Nov 23, 2010 21:37 UTC (Tue) by ballombe (subscriber, #9523) [Link]

Probably ISO2022, still widely used in Japan (fortunately less than it used to be).

Control characters in file names

Posted Nov 27, 2010 13:10 UTC (Sat) by Cato (subscriber, #7643) [Link]

ISO2022 is a truly horrible encoding that should never be used, and should certainly not be supported - it can embed normal ASCII characters within a "wide" character, making it very difficult to process.
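To make the objection concrete, here is a small Python sketch using the standard `iso2022_jp` codec. The sample string is an arbitrary choice for illustration.

```python
text = "日本語"
encoded = text.encode("iso2022_jp")

# The encoded form switches character sets with ESC sequences and then
# represents each kanji as two bytes that are all in the ASCII range.
print(encoded)
print(all(b < 0x80 for b in encoded))
print(0x1B in encoded)
```

Two things follow: a file name in this encoding needs the ESC byte (0x1B), so a blanket ban on 0x01-0x1F would break it, and the payload bytes are indistinguishable from printable ASCII without tracking the shift state.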

Having looked into many different encodings, I'd agree with the suggestion to use UTF-*, but in reality systems still need to support legacy 8-bit and 16-bit encodings - there are many filesystems out there with filenames in legacy encodings, and often a mix of encodings.

The ability to mix legacy encodings in a single filesystem is sometimes useful for applications but it creates major data conversion issues when users do this.

Generally I'd agree with banning control characters by default from pathnames in a new OS, but it's too late to do that now with Linux/Unix.

Putting the encoding into the filesystem is suspect, particularly considering the deep unpleasantness of Apple's use of their own two variants of Unicode normalisation form D (NFD) in HFS+ and other filesystems, whereas the rest of the world including Linux and the Web uses normalisation form C (NFC).

Control characters in file names

Posted Nov 29, 2010 10:03 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

> Generally I'd agree with banning control characters by default from pathnames in a new OS, but it's too late to do that now with Linux/Unix.
I don't think it's too late at all. The overwhelming majority of legitimate filenames do not contain characters in the range proposed for blacklisting. As an option that's turned on by default, forbidding control characters would present no practical problems whatsoever. Nobody relies on filenames containing ^V or newline.

Control characters in file names

Posted Nov 25, 2010 16:29 UTC (Thu) by Spudd86 (guest, #51683) [Link]

They are also nearly impossible to handle correctly in shell scripts, and you should be using UTF8 for file names.

No one is suggesting this be something done in a non-optional way, but the encodings it would actually break that are also in use on Linux systems are very few and far between (probably largely because EVERYTHING expects those to be control characters, and they break shell scripts, etc. Plus we have UTF8)

Control characters in file names

Posted Nov 23, 2010 20:11 UTC (Tue) by jengelh (subscriber, #33263) [Link]

Wheeler: "this lack of limitations [...] makes it impossible to consistently and accurately display filenames". So yeah.

Let's ban 0x80-0xFF too next to 0x01-0x1F, because they too cannot be accurately displayed (think of the byte sequence "0x20 0xC2 0x20" when used in contemporary Linux systems)!!11

Control characters in file names

Posted Nov 25, 2010 16:37 UTC (Thu) by Spudd86 (guest, #51683) [Link]

There are strong reasons to disallow 0x01-0x1F in file names (do you know how many lines of shell it takes to write something that can iterate over files and run one command on each when it must handle files that have those in the name? Hundreds, nobody does it so pretty much every shell script ever written will explode if it runs across such a file).

It has nothing to do with accurately displaying them, and everything to do with the fact that they cause actual problems and in a utf8 locale you gain NOTHING from being able to use 0x01-0x1F in file names (if you're going to bring up storing 'arbitrary' binary keys in file names, don't, you already CAN'T because you can't use '\0' or '/')

There's an article about exactly this somewhere, IIRC linked from LWN at some point in the past when the patch that allowed you to disable those chars came up I think.

Control characters in file names

Posted Nov 25, 2010 19:15 UTC (Thu) by jengelh (subscriber, #33263) [Link]

>do you know how many lines of shell it takes

Aha. So... everybody knows it is possible to have files with odd filenames, and everybody keeps on using shells or shell constructs that cannot deal with this properly? I can see the flaw in that.

>something that can iterate over files and run one command on each when it must handle files that have those in the name?

for i in *; do cmd "$i"; done;
find . -whatever -exec cmd {} \;
find . -whateverelse -print0 | xargs -0 cmd;

There are so many safe ways available. I am really not responsible for people doing UUOC or the like.

Control characters in file names

Posted Nov 25, 2010 20:31 UTC (Thu) by Spudd86 (guest, #51683) [Link]

The for loop won't work... the find examples only work if you want to run a single, non-shell command.

Control characters in file names

Posted Nov 25, 2010 21:48 UTC (Thu) by jengelh (subscriber, #33263) [Link]

In which case will the for loop not work? (Other than * not globbing files starting with a dot.)

Control characters in file names

Posted Nov 25, 2010 22:00 UTC (Thu) by Spudd86 (guest, #51683) [Link]

if there's a file that starts with - or has any sort of control character it will break.

see here: http://www.dwheeler.com/essays/filenames-in-shell.html and here: http://www.dwheeler.com/essays/fixing-unix-linux-filename... although for some reason I remember it being much worse than that, though being correct everywhere in your script could eventually be a pain.

Control characters in file names

Posted Dec 2, 2010 19:19 UTC (Thu) by Ross (guest, #4065) [Link]

Are you proposing to remove hyphens from filenames too, or is this getting off-topic? :)

Control characters in file names

Posted Nov 25, 2010 23:30 UTC (Thu) by cmccabe (guest, #60281) [Link]

> In which case will the for loop not work? (Other than * not globbing files
> starting with a dot.)

The for loop should be

for i in *; do cmd "./$i"; done;

In case one of the filenames begins with a dash.
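The same precaution applies when a script in another language hands file names to external commands. A minimal Python sketch of the `./` prefix trick follows; `safe_argument` is a hypothetical helper and assumes POSIX-style paths.

```python
def safe_argument(name: str) -> str:
    """Return `name` in a form that cannot be mistaken for an option."""
    if name.startswith("/"):
        return name  # absolute paths never start with '-'
    # Same idea as "./$i" in the shell loop above.
    return "./" + name


for candidate in ["-rf", "--help", "notes.txt", "/etc/passwd"]:
    print(safe_argument(candidate))
```

Passing the result as one element of an argument list (rather than through a shell command string) also avoids word splitting on spaces, tabs and newlines.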

Control characters in file names

Posted Nov 26, 2010 10:28 UTC (Fri) by Yorick (subscriber, #19241) [Link]

Of course file names can be handled safely in most languages, but that's not the point. Wheeler describes it better and in more detail, but briefly, the aim is:
  • Make it harder to make mistakes, brittle and/or exploitable code. Even flawless programmers are affected by other people's errors.
  • Eliminate a dangerous class of control character exploits, mainly when displaying file names on terminals.
  • Allow for more design options. Remember, restricting data formats can be a way to give the programmer more freedom, not less.
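As long as control characters remain legal, any program that prints file names to a terminal has to neutralise them itself. A minimal Python sketch of such a display filter is shown below; `display_name` is a made-up name, and the output is meant for display only, not for round-tripping back to the original bytes.

```python
def display_name(raw: bytes) -> str:
    """Render a raw file name so that it is safe to print on a terminal."""
    # Invalid UTF-8 bytes become \xNN instead of raising an error.
    text = raw.decode("utf-8", errors="backslashreplace")
    # Replace anything non-printable (ESC, newline, C1 controls, ...)
    # with a visible escape sequence.
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


print(display_name(b"report\x1b[2J.txt"))
print(display_name(b"two\nlines"))
```

With the proposed restriction in place, the ASCII control cases could never occur, which is the point of the second bullet.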

To illustrate the last point: The only possible delimiter for files names is currently the null byte, which is not very practical in many languages and in shell scripting in particular. Linefeeds would be much more natural and are supported by many more tools.
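The delimiter problem can be demonstrated without touching a filesystem. In this Python sketch the sample names are invented; one of them contains a linefeed, which is legal today.

```python
names = [b"plain.txt", b"two\nlines.txt", b"with space"]

# A newline-terminated list, as produced by tools like ls or find -print.
newline_list = b"".join(n + b"\n" for n in names)
parsed_nl = newline_list.split(b"\n")[:-1]

# A NUL-terminated list, as produced by find -print0.
nul_list = b"".join(n + b"\0" for n in names)
parsed_nul = nul_list.split(b"\0")[:-1]

print(len(parsed_nl), parsed_nl == names)
print(len(parsed_nul), parsed_nul == names)
```

The newline-delimited list comes back with the wrong number of entries, while the NUL-delimited one round-trips. If linefeeds were forbidden in names, the first form would be just as reliable as the second.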

The benefits are clear, and the costs appear to be very low. The only serious objection I have seen so far concerns existing file names using an ISO 2022-based encoding. There are several possible solutions: allowing the control character restriction to be lifted as a per-mount option (possibly only allowing ESC, SI and SO), or a mount option that recodes into UTF-8.

Control characters in file names

Posted Nov 29, 2010 16:30 UTC (Mon) by nix (subscriber, #2304) [Link]

The xargs only works if you have at least one matching file. You want -0r. (Of course this is totally GNU-only.)

Control characters in file names

Posted Dec 2, 2010 19:17 UTC (Thu) by Ross (guest, #4065) [Link]

You repeat at least three times in the thread that it takes hundreds of lines to handle control characters in shell scripts. That's just not true. But worse, it's a terrible argument even if it were true.

You aren't proposing to remove all the characters that make it difficult to write correct shell scripts. In fact tab and newline are the worst "offenders" in your list of control characters. Most shells don't care about control characters at all. This can't be an argument for implementing the character set limitation because implementing it won't fix the problem -- the same script would still be broken by files with spaces in them (and any number of shell metacharacters).

And even if it did, I'm not sure the features of Bourne shell should dictate how the filesystem interface should work. The existing kernel and shell were designed together -- if you want to redo the filename encoding in the kernel, you should consider how the shell could be changed and also how other tools besides the shell are affected. Only looking at the shell is just too much focus.

Control characters in file names

Posted Nov 23, 2010 20:24 UTC (Tue) by jreiser (subscriber, #11027) [Link]

> I don't remember ever having seen a legitimate use of this liberty.

My customers enjoy better performance at lower cost because of the difference (log2(254) - log2(223)), and I make money from that. [Hint: a database index encoded in filenames, accessed only by the database and the backup system.] If you wish to exclude [\x01-\x1f] from filenames that customarily are manipulated by your users and programs, then please write plugins/extensions/whatever to implement this constraint in the command-line shell programs of your choice.

Control characters in file names

Posted Nov 23, 2010 21:15 UTC (Tue) by Yorick (subscriber, #19241) [Link]

The very point of such a restriction would be that programs would not need to implement it themselves.

I'm somewhat surprised that you believe that 3% longer file names would make a noticeable difference in performance for your application; have you measured this? Most cases of data encoded in file names that I have come across would happily use something like base64, with the added benefit of portability and easier manipulation and inspection of the directories with standard tools.
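
For reference, the numbers being argued over can be worked out directly. This is only a back-of-the-envelope sketch: it assumes every byte of a name carries data, takes 254 as the count of bytes allowed today (all but NUL and '/'), 223 as the count after also excluding 0x01-0x1f, and 64 for a base64-style alphabet. (The standard base64 alphabet contains '/', so in practice a file-name-safe variant would be needed.) The constant and function names are made up for the example.

```python
import math

ALLOWED_NOW = 256 - 2          # every byte value except NUL and '/'
ALLOWED_RESTRICTED = 254 - 31  # additionally excluding 0x01-0x1f
BASE64_ALPHABET = 64           # six bits per character

bits_now = math.log2(ALLOWED_NOW)
bits_restricted = math.log2(ALLOWED_RESTRICTED)
bits_base64 = math.log2(BASE64_ALPHABET)

def relative_growth(bits_dense, bits_sparse):
    # Fractional increase in name length when the same data moves to the sparser alphabet.
    return bits_dense / bits_sparse - 1

print(f"bits per byte, current rules:    {bits_now:.3f}")
print(f"bits per byte, controls banned:  {bits_restricted:.3f}")
print(f"growth if controls are banned:   {relative_growth(bits_now, bits_restricted):.1%}")
print(f"growth if base64 is used:        {relative_growth(bits_now, bits_base64):.1%}")
```

Under these assumptions the restriction costs a little under 0.19 bits per byte, or names roughly 2.4% longer, while a 64-character alphabet makes them about a third longer.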

Control characters in file names

Posted Nov 25, 2010 16:47 UTC (Thu) by Spudd86 (guest, #51683) [Link]

Well since you don't want it you CAN just turn it off. Fact is that it's almost NEVER sane to have such a file name, and when it is you can disable the restriction.

No one who has proposed the restriction has ever proposed it as a non-optional, always-on thing. On by default, yes, but never as something you couldn't turn off. If you can't ask that people running your app disable a feature of that type, then your app is already broken and you're just waiting for someone to hit the brokenness.

Control characters in file names

Posted Nov 23, 2010 21:26 UTC (Tue) by ballombe (subscriber, #9523) [Link]

You have never seen a legitimate use? I thought this was the oldest trick in the book.

do
touch `printf "\01"`
chmod 000 `printf "\01"`
and now, if you ever do 'rm -f *' by mistake, you will get a chance to abort before any files are deleted.

This is also useful for naming image files: use a two-dimensional ASCII-art rendition of the image, which is much more intuitive than _dsc2919.jpg. After all, a picture is worth a thousand words... and in ASCII art it _is_ a thousand words!

Control characters in file names

Posted Nov 23, 2010 21:31 UTC (Tue) by jzbiciak (subscriber, #5246) [Link]

I'd hate to type filenames on your computer. Even tab expansion would be murder with your digital pictures.

Control characters in file names

Posted Nov 24, 2010 18:29 UTC (Wed) by ballombe (subscriber, #9523) [Link]

This is a minor quibble. Just put the picture number in the upper-left corner and add a subdirectory with symlinks from the numbers to the images.

Control characters in file names

Posted Nov 24, 2010 6:20 UTC (Wed) by mfedyk (guest, #55303) [Link]

oooh, I like.

now if the camera could write vfat long file names without patents...

Control characters in file names

Posted Nov 24, 2010 8:05 UTC (Wed) by nix (subscriber, #2304) [Link]

I don't understand why people would want cameras to write long filenames anyway. They all name their files in robotic and boring fashion and the filenames are essentially meaningless nonces: all the actual lookup is always done via EXIF tags from some reader application.

Control characters in file names

Posted Nov 24, 2010 17:10 UTC (Wed) by sorpigal (subscriber, #36106) [Link]

It seems like the point would be to make cameras save pictures with ascii-art names depicting their contents, which would be arguably more useful than the robotic, boring names they use now.

ASCII Art file names

Posted Nov 25, 2010 7:49 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

It would be great until you took two pictures that gave the same ASCII-art representation but were different in some other important way. For example, astrophotographers who want to do image stacking would find it very inconvenient. So would people who use some bracketing options, like color balance bracketing or even exposure bracketing with 1/3 stop increments. Sequence numbers may be boring and uninformative, but they stop you from accidentally overwriting an important picture.

Control characters in file names

Posted Nov 25, 2010 3:15 UTC (Thu) by mrshiny (subscriber, #4266) [Link]

You know, KDE and other modern shells can show you a tiny representation of the image on your disk without resorting to ascii-art filenames... Just sayin'.

Control characters in file names

Posted Dec 1, 2010 21:56 UTC (Wed) by cgwaldman (subscriber, #9061) [Link]

No, if you do 'rm -f *' it will delete every file. That's what the '-f' (force) flag does. This trick will only protect you from 'rm *', not the stronger 'rm -f *'. (Cute idea though...)

Control characters in file names

Posted Nov 25, 2010 3:44 UTC (Thu) by jthill (subscriber, #56558) [Link]

I once considered using .^A(name)^B(value)^C to implement bundles/streams/forks in a portable way, so .^Asourceurl^Bhttp://lwn.net/Articles/416824/^C might be a good thing to include for a saved copy of this web page. It's hard to misinterpret, unlikely to conflict or be accidentally damaged, and dead easy to implement.
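
A minimal sketch of how such names might be built and recognised, assuming ^A, ^B and ^C mean the bytes 0x01, 0x02 and 0x03. The function names are invented for the example, and the sketch simply rejects values it cannot store: since '/' and NUL can never appear in a file name, a URL like the one above would need its slashes escaped by some scheme the comment does not specify.

```python
SOH, STX, ETX = b"\x01", b"\x02", b"\x03"  # ^A, ^B, ^C

def make_tag_name(name: bytes, value: bytes) -> bytes:
    # Build a file name of the form .^A<name>^B<value>^C.
    for part in (name, value):
        if any(bad in part for bad in (b"/", b"\x00", SOH, STX, ETX)):
            raise ValueError("part contains a byte that cannot be stored unescaped")
    return b"." + SOH + name + STX + value + ETX

def parse_tag_name(filename: bytes):
    # Return (name, value) for a tag file name, or None for any other name.
    if not (filename.startswith(b"." + SOH) and filename.endswith(ETX)):
        return None
    body = filename[2:-1]
    name, sep, value = body.partition(STX)
    if not sep:
        return None
    return name, value

# Hypothetical tag: the value is a made-up, slash-free stand-in for a URL.
tag = make_tag_name(b"sourceurl", b"lwn.net-416824")
print(parse_tag_name(tag))           # (b'sourceurl', b'lwn.net-416824')
print(parse_tag_name(b"notes.txt"))  # None
```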

To keep from eating inodes you could just hardlink them all to a conventional spot, maybe .^ABUNDLE^BTAGINODE^C at the volume root. That would also make it possible to transport the trick in tar archives. Heh. Two of the subject design choices at once. I'm proud of myself.

Control characters in file names

Posted Nov 25, 2010 6:14 UTC (Thu) by cmccabe (guest, #60281) [Link]

Wow. The idea that displaying filenames on your terminal emulator could be a security hole is mindblowing-- but, apparently, true...

http://seclists.org/fulldisclosure/2003/Feb/att-341/Termu...
(from the Wheeler link)

Also, I suddenly don't feel so happy about using GNU screen all the time...

Control characters in file names

Posted Nov 25, 2010 16:52 UTC (Thu) by Spudd86 (guest, #51683) [Link]

Wait 'til you start running shell scripts on directories! (Handling file names with control characters in them correctly can take HUNDREDS of lines of code in shell. People frequently write scripts that break when you ask them to handle names with spaces, and that's EASY.)
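
For contrast, here is a sketch of the same kind of task outside the shell. It says nothing about how many lines a shell version needs; it only shows that a program which keeps names as raw bytes and never word-splits them has no special cases for spaces, newlines, control characters or leading hyphens. The function names and test file names are made up, and the demo assumes a Unix-like filesystem that accepts such names.

```python
import os
import tempfile

def list_names(directory):
    # Return directory entries as raw bytes, sorted, with no decoding step.
    return sorted(os.listdir(os.fsencode(directory)))

def remove_all(directory):
    # Delete every regular file in the directory; return how many were removed.
    removed = 0
    dir_bytes = os.fsencode(directory)
    for name in os.listdir(dir_bytes):
        path = os.path.join(dir_bytes, name)
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed

if __name__ == "__main__":
    awkward = [b"two words", b"line1\nline2", b"\x01", b"-rf", b"tab\there"]
    with tempfile.TemporaryDirectory() as tmp:
        for name in awkward:
            with open(os.path.join(os.fsencode(tmp), name), "wb"):
                pass  # create an empty file with the awkward name
        print(list_names(tmp) == sorted(awkward))
        print(remove_all(tmp))
```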

Control characters in file names

Posted Nov 25, 2010 23:20 UTC (Thu) by cmccabe (guest, #60281) [Link]

After reading that essay, I am convinced that we should ban control characters in filenames through one of the mechanisms described. UTF-8 never needs those bytes to encode any other character, and all human languages should be representable in UTF-8. So allowing control characters is just a pointless duplication of functionality, like supporting Pascal-style strings alongside C-style strings in the syscall API.
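
The UTF-8 point can be checked mechanically. A control character such as U+0001 is still encoded as the single byte 0x01, but no other character's encoding ever contains a byte below 0x80, so banning 0x01-0x1f would cost UTF-8 names nothing except the control characters themselves. The sketch below brute-forces that over every encodable code point; the function name is made up.

```python
def utf8_bytes_of_non_ascii():
    # Collect every byte value that appears when encoding any non-ASCII code point.
    seen = set()
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen

seen = utf8_bytes_of_non_ascii()
print(min(seen) >= 0x80)                       # True: no ASCII-range byte is ever used
print(any(b in seen for b in range(1, 0x20)))  # False: control bytes never appear
```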

Control characters in file names

Posted Dec 2, 2010 19:46 UTC (Thu) by Ross (guest, #4065) [Link]

Yeah, great link. People don't have enough fear about their terminals. Some of the more horrific terminal codes that do things like open files in your home directory have been removed from xterm and rxvt (no idea about others) but it's by no means safe to just allow random characters to be written to your screen, and it hasn't been safe even going back to physical terminals.

Allowing write to write to your terminal is a security problem, though I think talk filters out control characters. Running any program even as an unprivileged user with no filesystem access is a problem if the output is going to your terminal. Just cat-ing a file from an unknown source is an issue. Running a program on an unsanitized input might cause it to print error messages or other strings without stripping out special characters.

Basically the terminal is full of security issues because it obeys control characters no matter how they get there and traditionally lots of stuff gets written to your screen from unsanitized sources.

Instead of changing the filesystem to fix a very small part of that (and let's face it, if you have something writing out malicious filenames, it's probably writing out malicious file contents), there should be a more comprehensive approach. For example, there could be a mechanism to add a tty filter process which could sanitize the output for your specific terminal. Ideally the terminal program would set it up before starting the shell (console and remote logins would need to be handled too, and remote logins are harder because the terminal type isn't known until login, if ever). The hard part is that you want some control characters to get through -- and probably different ones from different sources (setting the xterm title in your shell prompt code for example). There would need to be a way to get different interfaces for the shell, trusted programs, and untrusted programs. How to do this without redesigning the shell and all the utilities? :(
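
The core of such a filter could be quite small; the hard parts are the ones listed above (hooking it into the tty and deciding who is trusted). The following is only a sketch of the byte-level step, with an invented function name and an assumed default policy of letting through tab and newline only. It ignores the 8-bit C1 controls (0x80-0x9f), which some terminals honour but which overlap with UTF-8 continuation bytes.

```python
# Assumed default policy: only tab and newline pass; callers may widen it per source.
ALLOWED_CONTROLS = frozenset({0x09, 0x0A})

def sanitize(data: bytes, allowed=ALLOWED_CONTROLS) -> bytes:
    # Replace control bytes (0x00-0x1f and 0x7f) that are not in `allowed` with '?'.
    out = bytearray()
    for byte in data:
        is_control = byte < 0x20 or byte == 0x7F
        out.append(byte if (not is_control or byte in allowed) else ord("?"))
    return bytes(out)

# Untrusted output trying to set the xterm title is defanged:
print(sanitize(b"\x1b]0;owned\x07name\n"))  # b'?]0;owned?name\n'

# A trusted source (say, the shell prompt) could be given ESC and BEL as well:
trusted = ALLOWED_CONTROLS | {0x1B, 0x07}
print(sanitize(b"\x1b]0;title\x07$ ", allowed=trusted) == b"\x1b]0;title\x07$ ")  # True
```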

Control characters in file names

Posted Dec 2, 2010 17:20 UTC (Thu) by Ross (guest, #4065) [Link]

Lots of characters cause trouble in filenames. Shells hate whitespace for example. Terminals hate control characters. Other things get confused by commas, quotes, etc. Lots of shell utilities hate files starting with a hyphen, some with a plus. If you want to copy files to a Windows system or an MP3 player you need to avoid lots of things like question mark, star, less than, greater than, colon, dollar sign.

I guess my point is you can't design around everything which wouldn't like filenames to have specific characters. In fact it's kind of nice that the kernel doesn't know about the encoding system except that it won't produce single bytes that have the same ASCII value as / or NUL. That's kind of the bare minimum it needs to know to be able to handle entire paths and not just path components.
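
To make the contrast concrete, here is a sketch that separates the one rule the kernel needs from the much longer list of things various tools dislike. The categories simply follow the examples in this comment, the labels and function names are invented, and the kernel check ignores length limits and the special entries . and .. for brevity.

```python
def kernel_accepts(component: bytes) -> bool:
    # The bare minimum: a non-empty path component with no '/' and no NUL.
    return bool(component) and b"/" not in component and b"\x00" not in component

def hazards(component: bytes) -> list:
    # List which kinds of tools are likely to object to this name.
    found = []
    if any(b < 0x20 or b == 0x7F for b in component):
        found.append("control")               # terminals
    if any(b in b" \t\n" for b in component):
        found.append("whitespace")            # shells
    if component.startswith((b"-", b"+")):
        found.append("leading-dash-or-plus")  # option parsing in utilities
    if any(b in b",'\"" for b in component):
        found.append("quote-or-comma")        # assorted other tools
    if any(b in b"?*<>:$" for b in component):
        found.append("export-unfriendly")     # Windows systems, MP3 players
    return found

for name in (b"report.txt", b"my file?.mp3", b"-i", b"a\nb"):
    print(name, kernel_accepts(name), hazards(name))
```

Every name in the loop passes kernel_accepts, yet only the first comes back with an empty hazard list.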

Sure, I hate seeing filenames with crazy characters too. They usually appear to have been created accidentally, or are the output of some terrible script which is trying to store too much information in filenames.

I don't think this qualifies as high-maintenance because most software pretty much seems to ignore it and treats filenames as sequences of bytes terminated by a NUL, without being very careful about whitespace or other unusual characters. I'd suggest that any kernel-level fix won't address enough of the issues to be a complete solution, and would require additional work in most applications and scripts to handle all filenames perfectly, just like now.

