
Wheeler: Fixing Unix/Linux/POSIX Filenames

David A. Wheeler says it's time to adopt tighter rules for file names to improve ease of use, robustness, and security. "In a well-designed system, simple things should be simple, and the 'obvious easy' way to do something should be the right way. I call this goal 'no sharp edges' - to use an analogy, if you're designing a wrench, don't put razor blades on the handles. The current POSIX filesystem fails this test - it does have sharp edges. Because it's hard to do things the 'right' way, many Unix/Linux programs simply assume that 'filenames are reasonable', even though the system doesn't guarantee that this is true. This leads to programs with errors that aren't immediately obvious."


At last, a hope of progress

Posted Mar 25, 2009 14:05 UTC (Wed) by epa (subscriber, #39769) [Link] (3 responses)

I thoroughly agree. If using a single character for end-of-line was the best design decision in UNIX, then allowing any character sequence in filenames (while at the same time including a shell and scripting environment that's easily tripped up by them) was the worst.

Look at the recent Python version that got tripped up by filenames that are not valid UTF-8. Currently on a Unix-like system you cannot assume anything more about filenames than that they're a string of bytes. This frustrates efforts to treat them as Unicode strings and cleanly allow international characters.

Or look at the whole succession of security holes in shell scripts, and even other languages, caused by control characters in filenames. My particular favourite is the way many innocuous-looking perl programs (containing 'while (<>)') can be induced to overwrite arbitrary files by creating filenames that begin with '>'.

A system-wide policy guaranteeing that only sane characters can appear in filenames would eliminate at a stroke a lot of tedious sanity-checking you have to do in userspace (not to mention the hidden bugs and security holes in many programs because the sanity-checking was not paranoid enough). Given the natural conservatism of developers, I can't be optimistic it will happen soon. But, like defaulting to relatime instead of updating atime on each access, it's a long-overdue spring clean to a particularly musty corner of the Unix way.
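The tedious sanity-checking in question looks something like this (a POSIX sh sketch; the whitelist and the no-leading-dash/no-leading-dot policy are illustrative assumptions, not any standard):

```shell
# Sketch: accept only ASCII letters, digits, dot, underscore and hyphen;
# reject empty names and names starting with "-" or "." (assumed policy).
is_sane_name() {
    case $1 in
        ('' | -* | .*)      return 1 ;;  # empty, option-like, or hidden
        (*[!A-Za-z0-9._-]*) return 1 ;;  # a byte outside the whitelist
        (*)                 return 0 ;;
    esac
}
```

Every robust script ends up carrying some variant of this; a system-wide guarantee would make it redundant.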

At last, a hope of progress

Posted Mar 25, 2009 16:52 UTC (Wed) by mjthayer (guest, #39183) [Link] (2 responses)

Actually, I think that the shell and the scripting environment are greater problems than the permissive file names. The fact that everything is a text string to the shell is the source of so many security holes. But of course, the file names are probably far easier to fix.

At last, a hope of progress

Posted Mar 25, 2009 20:02 UTC (Wed) by mjthayer (guest, #39183) [Link] (1 responses)

Actually, the shell could help a bit. One thing it could do would be to ignore files starting with a dash when expanding '*', the same way it ignores files starting with a dot. I don't know if that would be POSIX-compliant, but there are far more bad uses of that sort of expansion than good ones. Recognising ASCII NUL as a separator for file names in a text stream might also be useful, although I have no idea what other implications that would have, and it would probably fail under many circumstances.
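For what it's worth, a NUL-separated filename stream already exists at the tool level: GNU and BSD find/xargs can delimit names with NUL bytes, which survives newlines and tabs in names (a sketch; -print0 and -0 are extensions, not POSIX):

```shell
# Count regular files robustly, even when names contain newlines:
# each name is NUL-terminated, so counting NULs counts names.
find . -type f -print0 | tr -cd '\0' | wc -c

# Act on each file safely, NUL-delimited end to end:
find . -type f -print0 | xargs -0 ls -ld
```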

At last, a hope of progress

Posted Mar 29, 2009 0:01 UTC (Sun) by mikachu (guest, #5333) [Link]

On days when I'm feeling paranoid I always say ./* instead of just *, especially when talking to /bin/rm. On the other hand, touch -- -i in directories where you have important files is a nice trick too.
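The hazard those two tricks guard against is easy to reproduce in a throwaway directory (a sketch; assumes C-locale glob ordering, and rm's exact option handling varies slightly between GNU and BSD):

```shell
cd "$(mktemp -d)"
touch -- '-rf' victim

rm *      # the glob puts '-rf' first, so this runs `rm -rf victim`:
          # victim is deleted and the file named '-rf' survives

touch victim
rm ./*    # './-rf' cannot be parsed as an option, so both files go
```

`rm -- *` works just as well, when you remember it.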

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 15:12 UTC (Wed) by rsidd (subscriber, #2582) [Link] (13 responses)

Yet another topic that was extensively discussed in the Unix Haters Handbook.

One gotcha that is not covered by limiting the allowed character set in filenames is this: how do you remove all your configuration files and directories (that begin with a ".")? "rm -rf .*" will have very undesirable results. And yes, I have done this to myself -- luckily the system had nightly backups.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 15:32 UTC (Wed) by drag (guest, #31333) [Link] (2 responses)

Uh. Wow.

Maybe if Linux folks keep identifying and fixing legacy Unix usability issues, people will start referring to Linux as 'the way Unix should have been' or 'Unix done right'.

I am always really paranoid about file names in Linux when doing scripting or whatnot. Dealing with them is always an "oh god, I hope I did the escaping right" sort of deal. I know I can have a script that I use for lots of stuff, but some night I may make a goobered-up filename that ends up destroying data unless I did things exactly correctly in my scripts.

Filenames with ~ in them, or < > or () or all sorts of odd things I make by mistake.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:20 UTC (Wed) by epa (subscriber, #39769) [Link] (1 responses)

Maybe if Linux folks keep identifying and fixing legacy Unix usability issues, people will start referring to Linux as 'the way Unix should have been' or 'Unix done right'.
I thought that was Plan 9...

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 22:14 UTC (Wed) by jordanb (guest, #45668) [Link]

I'm not entirely sure what "Unix done right" would be, but I *am* sure that a system where you get to choose between ed, sam, and acme isn't it.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 15:42 UTC (Wed) by gnb (subscriber, #5132) [Link]

True, but that's not really a problem with the choice of names. The fact of
having named files in a directory that are actually links to the directory
and its parent necessarily gives rise to such a hazard unless you put
special-case magic in globbing to exclude them from wildcard matches, in
which case it doesn't really matter what names are picked for them as long as
they're standard. Specifying a wildcard that doesn't catch them is no harder
with the current names, the problem is remembering to do it.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 18:43 UTC (Wed) by danielthaler (guest, #24764) [Link] (2 responses)

I just tested this, as I remember doing it a few times without any catastrophic removal of "..". I got:

rm: cannot remove directory `.'
rm: cannot remove directory `..'

so something did prevent it from happening...

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 21:43 UTC (Wed) by mjthayer (guest, #39183) [Link] (1 responses)

I believe that GNU rm does a few additional sanity checks.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 22:46 UTC (Wed) by nix (subscriber, #2304) [Link]

A *few*? Have a read of coreutils-7.1/src/remove.c one of these days. 50K
of mostly sanity checks and fantastically elaborate code for portably and
securely recursing down directory trees, even those longer than PATH_MAX.

GNU coreutils, like gnulib, is a goldmine of fantastic tricks (and evil
ones that it's best to, ahem, admire from a distance: e.g. the less said
about NEED_REWIND and the need for rm running everywhere to work around a
MacOS X bug, the better).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 22:15 UTC (Wed) by csigler (subscriber, #1224) [Link]

> how do you remove all your configuration files and directories (that begin
> with a ".")? "rm -rf .*" will have very undesirable results.

IIRC, you can do something like "rm -fr .[^.]*" -- at least it works for me when I check with "ls -Fa .[^.]*".

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 23:05 UTC (Wed) by Jonno (subscriber, #49613) [Link]

> How do you remove all your configuration files and directories
> (that begin with a ".")?

`rm -rf .??*` is a good start. It will miss dotfiles with only one
character after the dot (the trailing '*' can match the empty string,
so two-character names like '.ab' are still caught), but I've not
found any program using such names yet...

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 1:32 UTC (Thu) by bojan (subscriber, #14302) [Link]

Not sure if that's entirely correct, but I usually use "find -mindepth 1 -name '.*'" to find such files. It doesn't seem to find "." or "..". YMMV and other warnings apply.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:34 UTC (Thu) by k8to (guest, #15413) [Link]

You can do horrible things with GLOBIGNORE in bash.

Personally I just walk the list and filter out . and ..

Yes, it sucks.
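For reference, the bash-only GLOBIGNORE route looks like this (a sketch, to be run under bash; bash documents that once GLOBIGNORE is set and non-null, `.` and `..` are never matched by globs):

```shell
# bash only: keep . and .. out of glob expansion
GLOBIGNORE='.:..'
cd "$(mktemp -d)"
touch .a .b visible
echo .*        # under bash this expands to: .a .b
```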

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 4:58 UTC (Thu) by shimei (guest, #54776) [Link]

An interesting side effect of this proposal to make sensible file names is that a hack I use to prevent "rm -rf *" from deleting everything won't work. I keep a file named "-i" in / so if I accidentally try to rm everything, it will go into interactive mode and stop me. Of course, I think that isn't a great loss if it fixes other things.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 27, 2009 1:28 UTC (Fri) by no_treble (guest, #49534) [Link]

rm -rf .[!.]*
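To spell out what the portable patterns do and don't catch (`[!.]` is the POSIX form of what bash also accepts as `[^.]`), a quick sketch:

```shell
cd "$(mktemp -d)"
touch .a .ab ..weird .normal regular

echo .[!.]*    # .a .ab .normal  -- a dot, one non-dot, then anything
echo ..?*      # ..weird         -- two dots plus at least one more
               #                    char, so it never matches ".."
```

So the belt-and-braces spelling is `rm -rf .[!.]* ..?*` (with the usual caveat that an unmatched pattern is passed through literally).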

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:22 UTC (Wed) by jreiser (subscriber, #11027) [Link] (15 responses)

Keep those filename rules out of my filesystems, please. Some of my programs use such "bad" filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them. For instance, one claim of section 5, "Yet you must be able to display filenames", is false. There are whole worlds where filenames are touched only by application-specific programs and the OS (and the backup+restore system.)

If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:42 UTC (Wed) by epa (subscriber, #39769) [Link] (14 responses)

Some of my programs use such "bad" filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them.
Can you give an example?

There is a certain old-school appeal in just being able to use the filesystem as a key-value store with no restrictions on what bytes can appear in the key. But it's spoiled a bit by the prohibition of NUL and / characters, and trivially you can adapt such code to base64-encode the key into a sanitized filename. It may look a bit uglier, but if only application-specific programs and the OS access the files anyway, that does not matter.
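That encoding is only a few lines of shell (a sketch assuming GNU coreutils base64; the '/' and '+' characters base64 can emit are remapped so the result is always a single, glob-quiet path component):

```shell
# Encode an arbitrary byte-string key into a safe filename, and back.
key_to_name() { printf '%s' "$1" | base64 | tr '+/' '-_' | tr -d '\n'; }
name_to_key() { printf '%s' "$1" | tr '_-' '/+' | base64 -d; }
```

A name like this still isn't length-proof (base64 expands keys by a third, and NAME_MAX still applies), but it sidesteps every character problem at once.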

If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
It's also possible for an iso9660 CD-ROM to have filenames containing the / character, or at least, I possess such a disc. This shows that in general there is a need for Linux to sanitize filenames coming from alien filesystems.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:46 UTC (Wed) by nix (subscriber, #2304) [Link] (9 responses)

I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata: e.g. '-creator-blah' is the ID of the object that triggered creation of field 'blah'.

(I could equally use a directory full of stuff here, but it too would need a name that's hard to type. I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)

And if I've done it, I guarantee you that lots and lots of other people have done it too.

David's proposed constraints on filenames are constraints which can never be imposed by default, at the very least. The semantics of Unix filesystems have been fixed de facto for many years: nobody expects files with odd characters to work on FAT, but nobody expects a Unix system to use a FAT filesystem as a primary datastore either.

Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?

You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.

But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 21:56 UTC (Wed) by dwheeler (guest, #1216) [Link] (8 responses)

A few thoughts based on nix's comments...

I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:

Such a key-value store will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary keys, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.

I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)

That's exactly my point. Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.

The semantics of Unix filesystems have been fixed de facto for many years...

"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.

Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?

Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly. As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way. As far as "no control characters" goes, there's no need to do anything locale-dependent; excluding 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?). Control characters unique to other locales don't bite people the way these characters do.
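Auditing an existing tree for exactly that range is a one-liner (a sketch: fnmatch bracket ranges over raw bytes, with LC_ALL=C pinning the byte semantics; the range covers ASCII 1-31 plus DEL):

```shell
# List names containing any ASCII control character (1-31) or DEL (127):
LC_ALL=C find . -name "$(printf '*[\001-\037\177]*')" -print

# ...or count them (NUL-terminated, so a newline-bearing name counts once):
LC_ALL=C find . -name "$(printf '*[\001-\037\177]*')" -print0 |
    tr -cd '\0' | wc -c
```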

You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.

I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place. I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.

But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.

That's a very interesting idea, I like it! In fact, there's already a mechanism in the Linux kernel that might do this job already: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 3:07 UTC (Thu) by drag (guest, #31333) [Link]

Mac OS X did away with case-sensitive filenames by default, and while that's annoying if you're porting software, I have not heard many complaints about it, and software has been fixed.

And that is much more extreme than a filesystem mount option to stop tabs and newlines being used in file names.

It'll be future-proof too, as much as that matters. You don't make a whitelist of allowed characters; you make a blacklist of troublesome characters and allow everything else. If more troublesome characters ever appear, which is very unlikely, you can add them to the blacklist. Any new characters or new encodings will just be allowed to slide on through.

I mean, if a future encoding scheme conflicts with a previously established and well-known encoding such as ASCII, then it is just too dumb to be supported by anybody.

-----------------

Here is a challenge:

Somebody write me a script that will go and count all the uses of tabs, <, >, and newlines in their file names on their systems...
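Taking up the challenge, a first stab (assumes find with the -print0 extension, and printf putting a literal tab and newline inside the bracket expression):

```shell
# Count file names containing tab, newline, '<' or '>' under a root
# directory (default: the current directory).
root=${1:-.}
LC_ALL=C find "$root" -name "$(printf '*[<>\t\n]*')" -print0 |
    tr -cd '\0' | wc -c
```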

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 22:12 UTC (Thu) by nix (subscriber, #2304) [Link]

David, thanks for responding.
Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
I claim mental block: this solution became obvious to me a day or so back. (Rather, since I already use . as a metacharacter to mean 'private', use .. to mean 'extra-private: metadata'. Yes, this too is bizarre, but at least it's not dash-prepended.)

But I have seen a system in production use at Big Banks (first saw it yesterday, or first noticed it, probably thanks to this conversation) that uses the filesystem as a base-254-encoded key-to-value store. It's gross, but it's sometimes done.

But then, we know how competent Big Banks are. (Especially this one, did you but know who it was.)

As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved).
Yes, but this only works if you can mandate a no-encoding transparent view of filenames! As soon as you start to automatically encode them, this sort of transcoding is impossible.

I have no objection to making the things you propose options. What I object to is making them mandatory, because this would make some things impossible. (Strange things, but still.)

In fact, there's already a mechanism in the Linux kernel that might do this job already: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
It depends how harsh the limits are. I'd say that 'no control characters' is certainly reasonable to have only the superuser lift. Perhaps a less harsh constraint to impose is that regular users cannot set this attribute on directories readable by 'other', and that chmodding a directory after the fact strips this attribute off it. Now users cannot dump landmines in that directory for users outside their group (root is assumed to know what he's doing).

I'd say that setting this attribute flips a pathconf-viewable attribute as well, so that other POSIX-compliant systems can adopt the same approach and applications can portably query it without needing to implement/depend on all of the ACL machinery.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 28, 2009 15:36 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (5 responses)

It's always worth telling people this, because it tends to make them rock back on their heels if they've been (wrongly) believing that NT is doing something special here.

NT (the kernel API in Windows NT, 2000, XP and etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

And the consequence is the same thing being lamented in this article - badly written Windows programs crash or do insane things when faced with filenames that don't look like the ones the poor third rate programmer who wrote the code was familiar with. In the absence of defensive programming this software also doesn't like leap years, or leap seconds, or files that are more than 2GB long, or... you could go on all day, badly written programs suck.

On encodings - I encourage you to use UTF-8. I encourage people with other encodings to migrate to UTF-8, but using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things. People who can't keep them separate are doing a bad job, whether handling filenames or displaying email.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 29, 2009 14:36 UTC (Sun) by epa (subscriber, #39769) [Link] (3 responses)

NT (the kernel API in Windows NT, 2000, XP and etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still, it sounds fun.
using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.
Indeed. Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 10:55 UTC (Mon) by nye (subscriber, #51576) [Link] (2 responses)

>Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?

Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.

Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.

(I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)

http://en.wikipedia.org/wiki/Native_API

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 15:13 UTC (Mon) by foom (subscriber, #14868) [Link] (1 responses)

>> Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?
> Yes. This is what the POSIX subsystems for NT do

You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for CreateFile. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it.

(the SFU installer asks if you want to do this, but even SFU has no special dispensation)

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 15, 2009 0:06 UTC (Sun) by yuhong (guest, #57183) [Link]

Another trick you can use with CreateFile is to start the filename with \\.\. If that is done, the only processing done on the filename before CreateFile calls NtCreateFile is that \\.\ is replaced with \??\, which is an alias of \DosDevices\.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 14, 2009 23:58 UTC (Sat) by yuhong (guest, #57183) [Link]

"files that are more than 2GB long"
Yep, NT had supported both files and disks larger than 2GB from the first
version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of
the GetDiskFreeSpace and SetFilePointer APIs (look them up using MSDN or
Google), both of which has existed since NT 3.1. Which one was so much more
error-prone that the versions of Windows released in 1996 had to cap the
result to 2GB, even though older versions of NT supported returning more than
2GB using it, and why?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:57 UTC (Wed) by jd (guest, #26381) [Link]

I can see the argument for non-printable filenames. However, you'd need to distinguish between generic non-printable filenames (ie: any character array that can be used as a filename), tokens used synonymously with filenames (ie: a short fixed-length ID that denotes the file, regardless of the logical name or logical directory) and hashes used synonymously with filenames (ie: a long fixed-length ID that serves the same role as a token but can be generated rather than looked up).

IMHO, the different roles all speak to different problems and all have their limitations outside of the problems they're meant for. The first step in finding a solution is to define the problem, but a filesystem solves a very wide range of problems, making a definition less clear-cut.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 23:27 UTC (Wed) by jreiser (subscriber, #11027) [Link] (1 responses)

As nix says, the filename encodes a key to what the file contains. The encoding is radix-254 (NUL and '/' excluded.) This fully utilizes the ASCII control characters [\x01-\x1f] and also the sequences such as subsets of [\xfc-\xff]* which are disallowed by UTF-8. Radix-254 is almost 2 bits per byte more dense than the proposed radix-65 (26 upper case, 26 lower case, 10 digits, dot hyphen underscore). The OS imposes an upper bound on the length of a filename, and there are critical points at various shorter lengths where there are jumps in space*time costs. Enough utility is discarded by radix-65 (as opposed to radix-254) that customers complain.
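The arithmetic behind the density comparison, for the curious (bits of key carried per filename byte are log2 of the radix; a quick check with awk):

```shell
# log2(254) vs log2(65): bits of key carried per filename byte
awk 'BEGIN { printf "%.2f vs %.2f bits/byte\n",
             log(254)/log(2), log(65)/log(2) }'
# prints: 7.99 vs 6.02 bits/byte
```

The difference is about 1.97 bits per byte, hence "almost 2 bits per byte more dense".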

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:44 UTC (Thu) by dwheeler (guest, #1216) [Link]

I never proposed radix-65. Radix-65 (26 upper case, 26 lower case, 10 digits, dot hyphen underscore) is what the POSIX standard ALREADY says is all you can depend on; nothing else is portable by that spec.

I want to be able to count on more than what the POSIX spec says; I want to be able to use the entire Unicode character set, minus the control chars and a few additional constraints to prevent lots of problems for the general-purpose user.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 13:38 UTC (Thu) by Wol (subscriber, #4433) [Link]

A system I'm playing with copies something called PI/Open.

A file inside this system is actually stored as a directory at the OS level, and the system creates filenames of the form <space><backspace>nnn.

I copied this, and found that Midnight Commander was great at managing the resulting files :-) It's done so that people can't tamper - corrupting one of the (many) OS-level files would do serious damage to the PI file.

Cheers,
Wol

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:31 UTC (Wed) by mgross (guest, #38112) [Link] (6 responses)

I'm sure I'll get burned for asking, but how hard would it be to implement the kernel code to reject the creation of wonky file names? It seems like a simple thing to implement. (I guess it would add code to the file create path and perhaps slow down some benchmarks)

Also, this seems like a pretty sensible idea, why hasn't it been implemented already?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:50 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

Typically, the simpler and more obvious the idea, the longer it takes to overcome resistance and be implemented. (See for example making relatime the default, or the .desktop file security problems discussed a little while back on LWN.)

Some will argue that the answer is user education (teach your users not to use bad characters in filenames), and perhaps even a cron job you can run on your PDP-11 overnight to look for filenames containing these characters and send a message via local mail to the user responsible. Furthermore, if it was good enough for V7 UNIX, it's good enough for us now. (Note that in Plan 9, there are sensible restrictions on characters in filenames; but it's common for followers of a particular system or language to become rabidly conservative, even when the original designers of the system have moved on.)

In other words it is sheer inertia, and reluctance by any one Unix-like system to add such a feature when the others do not. You can bet that if Linux added a filename character check, it would immediately be branded 'broken' by many BSD or Solaris enthusiasts - not all, but certainly those that make the most noise online.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:19 UTC (Thu) by dirtyepic (guest, #30178) [Link] (1 responses)

Also see: "If it was such a good idea, someone else would have done it already. Therefore it must be flawed".

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 15:23 UTC (Thu) by dwheeler (guest, #1216) [Link]

"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." (George Bernard Shaw)

I'm well aware that this is different than the historical past. But that doesn't make past decisions correct for the present. So, let's chat about the pros and cons; I believe that the cons for "anything goes" now outweigh the pros.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:33 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (2 responses)

It would be a really nasty, expensive change, or else, it would be a token effort that's worthless for the very things it's supposed to fix.

To actually make this work, in the kernel (where you're perf critical and this is all unwanted overhead that's costing everyone who uses your "improved" system) you need to absolutely, as a matter of "Linus will veto if you don't" policy:

* Validate every filename to check that it conforms. This has to be done either at mount time, or when syscalls interact with the filenames (e.g. directory reading, and opening files). As a network file system client the OS must either screen every filename going over the network, or else punt and rely on promises from the server (if available).

* When you find an invalid filename, you need to deal with it, it's not clear what the kernel should or even could do. Perhaps the file should just not exist as far as userspace is concerned, and fsck would unlink it?

Meanwhile application developers get no benefit for many years because of compatibility considerations. It could be a decade before it makes any sense to write a program which assumes one of the restrictions, and that's if EVERY SINGLE OS fixes this tomorrow. Wheeler mistakenly believes this is a POSIX problem, but it isn't: the problem exists everywhere that filenames are treated as opaque, which in fact includes Windows (and I have my doubts about OS X, but its API documentation promises they aren't opaque, so app developers who rely on that promise would be entitled to scream blue murder when someone finds a way to get non-Unicode into an OS X filename...*)

Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless? Once you've done this, you'll have the right mindset to tackle initial hyphen, control characters and so on from the same angle, rather than screwing the poor kernel into doing your dirty work and making everybody (including those of us for whom opaque filenames are just dandy) pay.

* Something that should make you pause: OS X's approach to filenames as Unicode strings makes Unicode composition/decomposition into an OS ABI feature. It had been doing this for years before Unicode actually pledged to stop changing the decomposition rules (ie until that happened new versions of OS X made previously legal filenames illegal and vice versa, with no warning...)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 14:31 UTC (Sun) by epa (subscriber, #39769) [Link]

Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place).

Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.)

I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.)
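
As a rough illustration of how cheap the check is, the same validation can be sketched from the shell with iconv, which exits non-zero when its input contains an invalid byte sequence. (is_valid_utf8 is a made-up helper name for this sketch; the kernel would do the equivalent byte-level scan in C.)

```shell
# Hypothetical helper: succeeds only if the argument is valid UTF-8.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

is_valid_utf8 'café' && echo 'valid'
is_valid_utf8 "$(printf 'bad\377name')" || echo 'invalid'   # 0xff is never valid UTF-8
```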

Meanwhile application developers get no benefit for many years because of compatibility considerations.
Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code.

Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless?
This is true in principle but in thirty years of Unix, essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem.

OS X is something of a special case because of case insensitivity. If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 31, 2009 5:00 UTC (Tue) by njs (subscriber, #40338) [Link]

I think you're overcomplicating things -- I wouldn't implement UTF-8 requirements at the VFS level (it just doesn't make sense, since there manifestly exist filesystems where you don't know the encoding, both from pre-existing Linux installs and with "foreign" filesystems). I'd make it a filesystem feature -- a flag in the ext2/3/4 header that's set at mkfs time, say. That removes all the issues about translating invalid filenames -- if that flag is set and a filename is invalid, then *your filesystem is corrupt*. fsck can check for such corruption if it feels like it.

Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs who want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.

(It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)

TALPA?

Posted Mar 25, 2009 16:54 UTC (Wed) by dmarti (subscriber, #11625) [Link]

Hey, cool! A use for TALPA!

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:00 UTC (Wed) by adobriyan (subscriber, #30858) [Link]

And some blame Unix for taking slash out of the characters usable in a filename. :-)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:51 UTC (Wed) by ajb (subscriber, #9694) [Link]

I think this is a sensible idea. It should be possible to make the transition relatively painless:

- add a new inheritable process capability, 'BADFILENAMES', without which processes can't see or create files with bad names.
- add a command 'access_bad_filenames' which creates a shell with the capability.
- /bin/ls also needs the capability, but should not display bad filenames unless an additional option is passed.

That way, most processes can run happily in the ignorance of any bad filenames. If you need to access one, you run the commands you need to access it with under the special shell.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 18:09 UTC (Wed) by mrshiny (guest, #4266) [Link] (23 responses)

You can pry my spaces from my filenames out of my cold dead fingers. But frankly spaces are no different than other shell meta-characters. If a filename is properly handled for spaces, doesn't it automatically work for all the other chars? If not, it should be easy enough to fix the SHELLS in this case.

Mr. Wheeler makes a mistake in the article as well. Windows has no problem with files starting with a dot. It's only Explorer and a handful of other tools that have problems. Otherwise Cygwin would be pretty annoying to use.

Overall, however, I like the idea of restricting certain things, especially the character encoding. The sooner the other encodings can die, the sooner I can be happy.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 19:52 UTC (Wed) by emk (subscriber, #1128) [Link] (3 responses)

If a filename is properly handled for spaces, doesn't it automatically work for all the other chars?

Unfortunately, no. One example mentioned in the article is files with names like "-rf", which will appear at the start of any glob list. To deal with this, you generally need to add "--" before any globs, but different commands behave differently, and not all commands support "--".

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 1:12 UTC (Thu) by mrshiny (guest, #4266) [Link] (1 responses)

I was actually referring to the other special characters that cause problems, such as shell control characters. The dash is a different case because it's actually the programs (not the shell) that are interpreting certain strings as filenames and others as arguments. There can't really be a generic solution to this because of the way file globbing works: the globbing happens outside the program so it has no input into the command line that is passed in. If filenames can't start with a dash, but a command was ported from DOS and uses backslash as its option separator, shell globbing will confuse that program too.

Not that preventing files like '-rf' is a bad idea; I think it would prevent a number of mistakes.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 30, 2009 16:41 UTC (Mon) by Hawke (guest, #6978) [Link]

I don't think any DOS applications use backslash for their option marker. Some use dash, and most use slash. But I'm pretty sure that practically none, if any, use backslash.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 15:38 UTC (Thu) by dwheeler (guest, #1216) [Link]

Actually, there is a general solution for the dash: whenever you glob in the current directory, stick "./" in front of the glob. So always use "cat ./*" instead of "cat *". I do mention that in my article.

Problem is, nobody does that. It's too easy to use "*", it's what all the documents say, and it's what all the users actually do. You have to train GUI programs to do this, too. So instead of constantly trying to get developers to do something "unnatural", let's change the system so the "obvious" way is always correct.
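
A minimal sketch of the difference, in a scratch directory (assuming GNU cat, which rejects the bogus options):

```shell
cd "$(mktemp -d)"                # scratch directory
touch -- -rf a.txt               # "--" so touch itself isn't fooled by the leading dash

cat *      # expands to: cat -rf a.txt   ("-rf" is parsed as options; cat errors out)
cat ./*    # expands to: cat ./-rf ./a.txt   (both arguments are unambiguously filenames)
```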

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 22:45 UTC (Wed) by epa (subscriber, #39769) [Link] (12 responses)

If not, it should be easy enough to fix the SHELLS in this case.
Three decades of unhappy experience says otherwise. Nobody has a reasonable proposal to fix all the shells, all the scripting languages and all the user applications so that they don't make unsafe assumptions about filenames (e.g. assuming a filename can never begin with - or never contain the \n character).

On the other hand, a kernel-level check for bad characters is simple to implement and obviously solves these problems at a stroke.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 1:16 UTC (Thu) by mrshiny (guest, #4266) [Link] (7 responses)

I was actually thinking more along the lines of:

1. Prevent files that start with dash (technically not a shell problem)
2. Prevent files that contain control characters (newline included)
3. Make the shells easy to use in the face of filenames with spaces, semi-colons, colons, quotes, punctuation, etc.

The first item is more of an interaction between programs and the shell and not specifically a shell problem. If a program doesn't support -- then it can never be used securely.

The second item seems like an obvious step to take with no downside.

The third item is what I meant by fixing the shells: shells should make it braindead-easy to manipulate filenames without them turning into commands or other nonsense. Once a filename is loaded into a variable you shouldn't have to worry about characters in the name turning into shell commands. Once that's in place we can start fixing scripts. Maybe an environment variable can determine how that instance of the shell works: in secure mode or legacy mode.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:45 UTC (Thu) by mjthayer (guest, #39183) [Link] (6 responses)

One thing that would help make the shell more solid would be treating -* as hidden files and skipping over them when expanding wildcards.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 15:08 UTC (Thu) by mjthayer (guest, #39183) [Link] (5 responses)

It could also recognise the null character as an argument separator as in 'find -print0'. It could even set some environment variable to tell tools like find that this is supported so that they can use it by default when not outputting to the console. And when substituting environment variables and backticked commands to the arguments for other commands, it could sanitise out anything starting with a hyphen. While this would break a few things, it would probably fix many more. While on that subject, the shell could enforce that substitutions that resolve to the arguments for other commands are not allowed to spill over (e.g. VAR='myfile; rm -rf /'; ls $VAR).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 19:49 UTC (Thu) by dwheeler (guest, #1216) [Link] (3 responses)

[The shell] could also recognise the null character as an argument separator as in 'find -print0'. It could even set some environment variable to tell tools like find that this is supported so that they can use it by default when not outputting to the console.

Yes, I already added the "shell could recognize null as separator". And you're right, adding an environment variable could help (though it could also backfire on older scripts!).

While on that subject, the shell could enforce that substitutions that resolve to the arguments for other commands are not allowed to spill over (e.g. VAR='myfile; rm -rf /'; ls $VAR).

This particular example doesn't do quite what you think; it just passes to ls several values: "myfile;", "rm", "-rf", and "/", and you end up with some error messages and a listing of "/". But with more tweaking, you can definitely get some exploits out of this approach. Which is why removing the space character from IFS is a big help - then VAR would become a single parameter again.
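
The splitting is easy to see by capturing the words the shell actually produces; a minimal sketch using set -- to inspect the resulting argument list:

```shell
VAR='myfile; rm -rf /'

set -- $VAR                 # unquoted: the shell word-splits the value on IFS
printf '%d args\n' "$#"     # 4 args
printf '<%s> ' "$@"; echo   # <myfile;> <rm> <-rf> </>

set -- "$VAR"               # quoted: one literal argument, the whole string
printf '%d arg\n' "$#"      # 1 arg
```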

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:11 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

bash implemented an environment variable to tell subprocesses where
arguments began and ended at one point.

It was removed, but I can't remember why: some sort of compatibility
problem?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 31, 2009 7:47 UTC (Tue) by mjthayer (guest, #39183) [Link] (1 responses)

I was wondering now whether to ask about this on the Bash mailing lists. Just out of interest, are you involved with the development of Bash/the GNU tools in any way? You seem well informed about them.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 31, 2009 19:28 UTC (Tue) by nix (subscriber, #2304) [Link]

I've contributed fixes now and then, but I just read a lot. :) The
projects are public, after all.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Apr 3, 2009 18:49 UTC (Fri) by anton (subscriber, #25547) [Link]

It could also recognise the null character as an argument separator as in 'find -print0'.
A few weeks ago I wanted to process my .ogg files which contain all kinds of characters that are treated as meta-characters by the shell or other programs I use in shell scripts. I eventually ended up writing a new shell dumbsh that uses NUL as argument separator, and feeding it from find, with some intermediate processing in awk (which is quite flexible about meta-characters).
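
For anyone who wants the same effect without writing a custom shell: bash's read can consume NUL-separated output from find directly. A sketch, assuming bash (read -d '' is a bashism):

```shell
# Process every .ogg file safely, whatever bytes its name contains
# (newlines, quotes, spaces): names travel NUL-separated end to end.
find . -name '*.ogg' -print0 |
while IFS= read -r -d '' f; do
    printf 'processing: %s\n' "$f"
done
```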

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 21:11 UTC (Thu) by explodingferret (guest, #57530) [Link] (3 responses)

Well, this is a good point. There are basically two uses of shell scripts:

1) Portable scripts (of a kind), init scripts, and build scripts. In all these cases the scripts need to have #!/bin/sh at the top, and contain just about every fix for every problem ever, including [ "x$var" = x ] and ${1:-"${@}"} and various other monstrosities.

In these scripts, the quotes around variables; ./ in front of filenames; IFS= for read; and filename=`foo; printf x`; filename="${filename%x}" crap will *always* have to be there. So no point trying to fix anything for those.

2) The other use is scripts that are used on either one system (personal scripts) or one "class" of system, like "only Debian GNU/Linux".

These scripts can use a particular shell like #!/bin/bash and assume the existence of -print0 and -printf to find and -d '' to read and all the other little conveniences which make a lot of the problem go away.

Well, other than newlines at the end of filenames. That's the only case that I refuse to take account of in my scripts, unless security issues might arise.

----

I'm not saying that I disagree with the ideas in this article (although I'd like to keep spaces and shell special characters in my filenames, actually). I'm just saying that as far as shell scripting is concerned, it may not actually help all that much. The main gain for me would be the security fixes and less typing in my interactive shell. Even though I'm pretty sure I don't have any newlines or control characters in any of my filenames, I just can't bring myself to write bad scripts, and that's kinda sad.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:18 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

At work I co-maintain scripts in a third class: scripts that come with
an 'appliance' (for this purpose, a set of software which is the raison
d'etre of the hardware for which it is bought: this could be a tiny
embedded system or a giant bank database or simulation box). In this case,
they can dictate whatever shell they damn well like.

I dictated zsh 4, simply because for this application C was far too
unpleasant, ksh was too buggy (thanks, Linux, for pdksh, with its broken
propagation of variables out of loops-with-redirection), and there was no
hope of getting the clients' systems people to install Perl: but they were
perfectly happy to install a recent zsh: fewer dependencies and no scary
modules (well, actually zsh *does* have a module system but they didn't
realise that!)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Nov 15, 2009 1:06 UTC (Sun) by yuhong (guest, #57183) [Link] (1 responses)

"ksh was too buggy (thanks, Linux, for pdksh, with its broken
propagation of variables out of loops-with-redirection)"
Was ksh93 tried?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Nov 15, 2009 13:15 UTC (Sun) by nix (subscriber, #2304) [Link]

ksh93 was too sodding hard to require because building it was a nightmare.
At the time it wasn't free enough either.

dot files in Windows and

Posted Mar 25, 2009 23:09 UTC (Wed) by pr1268 (guest, #24648) [Link] (4 responses)

Windows has no problem with files starting with a dot.

Oddly enough, Windows will not allow the name of a directory to end in a dot. I discovered this when, back in my Windows days, I had to name an artist directory R.E.M without a final dot. Windows wouldn't allow me to put that trailing dot in the file name. Go figure. Linux doesn't have any issue with it (and since I've abandoned Windows on my home computers, I was able to rename the directory to include that dot).

Going off on a tangent: here are some files in my music directory which would make Mr. Wheeler cringe:

  • Beatles/Help!/01_-_Help!.ogg ('!' in directory and file names)
  • Donald_Fagen/The_Nightfly/01_-_I.G.Y..ogg (two dots before the file extension - not really an issue but interesting)
  • Sugar_Ray/14:59/ (':' in directory name)
  • Coldplay/X&Y/ ('&' in directory name)
  • John_Cage/4'33".ogg (single- and double-quotes - never mind that this is a really quiet song :) )
  • Radiohead/Hail_To_The_Theif/01_-_2_+_2_=_5_(The_Lukewarm.).ogg (A whole bunch of issues here)

In a digital forensics class the professor had us searching through a filesystem that contained directories named "..." (minus quotes). Good times...

dot files in Windows and

Posted Mar 25, 2009 23:42 UTC (Wed) by dwheeler (guest, #1216) [Link] (1 responses)

No cringe. I didn't see any control characters there, nor leading dashes. And you don't seem to require non-UTF-8. If we could get those done, the rest are gravy.

dot files in Windows and

Posted Mar 26, 2009 0:07 UTC (Thu) by pr1268 (guest, #24648) [Link]

Wow, thanks for the reply! And thank you for the original article--I found myself nodding in agreement many times while reading it.

Of course, even with your non-cringing approval, I certainly had lots of shell escaping to do with these files (and many others--my collection is approaching 10,000 audio files from almost 900 music CDs).

dot files in Windows and

Posted Mar 26, 2009 1:04 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

Hah, that's nothing: I saw a directory called '.. ' a while back, while
looking at an attacked system's disk image (for fun, I have no life).

dot files in Windows and

Posted Mar 26, 2009 10:21 UTC (Thu) by mjj29 (guest, #49653) [Link]

I've seen ... in that context too

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 30, 2009 19:36 UTC (Mon) by rickmoen (subscriber, #6943) [Link]

mrshiny wrote:

You can pry my spaces from my filenames out of my cold dead fingers.

ObMenInBlack: "Your offer is acceptable."

(I remember having to write AppleScript to recurse through directories cleaning up files created on network shares by MacOS-using munchkins who put space characters at the ends of filenames, in order for them to become valid filenames when seen by MS-Windows-using employees looking at the same network shares. The converse problem was files, from MS-Windows users, with names containing colon, which is a reserved character in MacOS file namespace. What a pain in the tochis.)

Rick Moen
rick@linuxmafia.com

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 18:36 UTC (Wed) by njs (subscriber, #40338) [Link] (18 responses)

I pretty much agree with all dwheeler's points (not sure about banning shell metacharacters).

The section on Unicode-in-the-filesystem seemed quite incomplete. We know this can work, since the most widely used Unix *already* does it. OS X basically extends POSIX to say "all those char * pathnames you give me, those are UTF-8". However, there are a lot of complexities not mentioned here -- you need to worry about Unicode normalization (whether or not to allow different files to have names containing the same characters but with different bytestring representations), if there is any normalization then you need a new API to say "hey filesystem, what did you actually call that file I just opened?" (OS X has this, but it's very well hidden), and so on.
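
The composition issue is concrete. A sketch, assuming a Linux filesystem that stores names as raw bytes: the two spellings of 'é' below look identical on screen but create two distinct files, whereas a normalizing filesystem like HFS+ would treat them as the same name.

```shell
cd "$(mktemp -d)"
touch "$(printf 'caf\303\251')"     # U+00E9, precomposed form (NFC)
touch "$(printf 'cafe\314\201')"    # 'e' + U+0301 combining acute (NFD)
ls | wc -l                          # 2: identical on screen, distinct on disk
```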

But these problems all exist now, they're just overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets. I really dislike many things about Apple, but in this case we could do worse than to sit down and steal (with appropriate modification) most of the stuff in http://developer.apple.com/technotes/tn/tn1150.html#Unico...

Maybe the ext4 folks could add unicode filenames as a mount option -- they haven't done anything controversial lately ;-).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 20:57 UTC (Wed) by clugstj (subscriber, #4020) [Link] (16 responses)

Needing this wacky "hey filesystem, what did you actually call that file I just opened" feature in OS X is an improvement? I don't see how.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 21:29 UTC (Wed) by foom (subscriber, #14868) [Link] (14 responses)

That's not the feature, it's a side effect. The feature is that both u'\N{latin small letter a with acute}' and u'a\N{combining acute accent}' refer to the same file. Also, more obviously, 'a' and 'A'.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 22:36 UTC (Wed) by nix (subscriber, #2304) [Link] (11 responses)

You do know that case conversion rules are necessarily locale-dependent,
right? Case-insensitivity in filesystems is thus an astoundingly awful
idea.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:25 UTC (Thu) by njs (subscriber, #40338) [Link] (7 responses)

Too true. OS X solves this by defining a bespoke case normalization rule, the "HFS-Plus case-insensitive string comparison algorithm". They have their own big table and everything, you can download it at the first URL I gave. Awesome, huh?

I'd be happy if we could just make a rule that filenames are valid UTF-8. Unicode normalization (composing characters and all that) is probably a good idea, but reasonable people could disagree. I'm just as happy without case normalization (though the arguments for it aren't entirely without merit, even if it can't be done perfectly). And *any* of these would be better than what we have now...

(The "so what did you call this file?" API is also useful if your system ever deals with case-insensitive or unicode-normalizing filesystems. Which Linux does, whether it becomes common for the root filesystem or not.)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 13:42 UTC (Thu) by clugstj (subscriber, #4020) [Link] (6 responses)

"And *any* of these would be better than what we have now"

Why? How is the current condition so bad that we should run headlong into any of these "solutions" without knowing what the eventual outcome will be?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 17:46 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

We're talking about the possible outcomes. You're telling us we shouldn't discuss the possible problems and solutions because we don't know the problems yet? That's bunk.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 18:59 UTC (Thu) by njs (subscriber, #40338) [Link] (4 responses)

...I started a thread by pointing to technical documentation on how this has worked for the last 8 years in the world's most widely-deployed Unix, and your response is that this is crazy to even consider because we have no way to know what will happen? C'mon... engage the actual arguments. I don't think it's obvious what the technically best solution is, but just because you haven't thought about the relevant details doesn't mean they aren't knowable.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 21:27 UTC (Sun) by clugstj (subscriber, #4020) [Link] (1 responses)

I'm sorry, but when you said that any of these propositions is better than the current situation, I HAD to disagree. In what way is the current situation so bad that any proposal is better than the current situation?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 30, 2009 0:07 UTC (Mon) by njs (subscriber, #40338) [Link]

You cannot, in general, convert a filename to text. That's the fundamental problem that any of the proposals would solve.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 21:30 UTC (Sun) by clugstj (subscriber, #4020) [Link] (1 responses)

OS X is trivial to handle. It only has to continue to work in a compatible way with the previous Mac OS - which wasn't UNIX. So using it as an example of how to "fix" these problems is not a good idea if you care about supporting 40+ years of UNIX programs - which is why this is difficult to change.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 22:07 UTC (Sun) by foom (subscriber, #14868) [Link]

Eh...but OSX *does* run 40+ years of UNIX programs. It's pretty clear that the change to require UTF-8 (and even the change to be case insensitive!) didn't bother most programs.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:23 UTC (Thu) by mjthayer (guest, #39183) [Link] (2 responses)

The case I always hear made for the difficulty of case-insensitivity is French lower-case accents and the Turkish i. Are these really such an issue? If we are already treating a as being identical to A, can't we just treat É as being identical to E, and i as being identical to ı? In French it would not really cause problems (although the pedants would complain :) ), though I don't know how many words Turkish has that only differ by a dot over the i...

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:00 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

German ß is problematic too. Whether 'SS' turns into ß or not on downcasing is *context-dependent* and to a certain extent a matter of controversy and thus taste (this wasn't always true, but successive waves of largely-failed spelling reforms have introduced a nice steaming heap of uncertainty into this part of the written language).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Apr 2, 2009 15:54 UTC (Thu) by forthy (guest, #1525) [Link]

It is actually not that bad. As a collating sequence, ß=ss (i.e. Mass and Maß sort into the same bin). Except for Austrian telephone books, where ß follows ss, but comes before st (though St. follows Sankt ;-).

However, there's a huge mess in the CJK part of UCS: short and long forms of the same character (sometimes even a special variant for the Japanese character). This should never have happened; the different forms of the same character should be encoded in fonts, not in UCS. So far, not even Mac OS X normalizes these characters, but it is obvious that a mainland China file called "中国" and a Taiwan file called "中國" not only mean the same, but also refer to the same word, and can be interchanged at will (see for example the Chinese Wikipedia entry: the lemma is the short form, the headline is the long form). And it is not easy to access both long and short forms with the usual input methods (mainland China: Pinyin; Canton: Cantonese Pinyin, which gives traditional characters, but you need to know Cantonese; etc.).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 13:40 UTC (Thu) by clugstj (subscriber, #4020) [Link] (1 responses)

Yes, and until you know what all of the unfortunate side-effects are of enforcing these limitations, YOU SHOULD NOT DO IT.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 19:52 UTC (Thu) by leoc (guest, #39773) [Link]

Cool, you've solved the halting problem!

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 21:39 UTC (Wed) by njs (subscriber, #40338) [Link]

The alternative is you have files with character-identical names that the filesystem considers different. Historically this leads to wacky security holes, files that some programs can't access (what's *your* UI for typing a precomposed vs. decomposed o-with-umlaut?), stupid stuff like that.
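A quick illustration of the precomposed/decomposed problem (a sketch, assuming bash, GNU coreutils, and a non-normalizing Linux filesystem such as ext3):

```shell
# Two byte sequences that both render as "ö" create two distinct files:
dir=$(mktemp -d) && cd "$dir"
touch "$(printf 'o\314\210')"   # decomposed: 'o' + U+0308 combining diaeresis (bytes cc 88)
touch "$(printf '\303\266')"    # precomposed: U+00F6 (bytes c3 b6)
ls | wc -l                      # → 2: one rendered name, two files
```

On HFS+, by contrast, filenames are normalized to (a variant of) decomposed form at creation time, so the same two touch commands there would collide into a single file.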

I don't *like* either alternative much, but I doubt you're going to get everyone to switch back to ASCII, either. The problem isn't going away.

So... we can whine about how unfair it is that character systems are complicated and ignore the problem, or we can hold our noses and pick a least-bad option. The latter is probably more productive (though inertia suggests the former is most likely).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 16:23 UTC (Thu) by dlau (guest, #4540) [Link]

If you normalize filenames, you don't need a new API the way OS X apparently does. Just reject unnormalized filenames just as you would reject invalid UTF-8.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 19:50 UTC (Wed) by szh (guest, #23558) [Link] (1 responses)

I hope I'll never come across $'\t' , $'\n' , '\' in filenames in my life!

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 0:52 UTC (Thu) by tbrownaw (guest, #45457) [Link]

Heh.

Our incoming ftp server at work once got a file named "C:\something_or_other.zip". Which was perfectly fine, until someone tried to open it in Windows using the samba share. It actually did show up, but with a completely garbled name.

I also accidentally generated a file whose name had a leading '\r' (carriage return). That was a lot of fun to track down and fix: it looked perfectly normal in 'ls' unless you noticed that it wasn't in proper alphabetical order and that the rest of the row was one character out of alignment.
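For anyone bitten the same way, a sketch (GNU tools assumed) of how such a name hides in plain sight and how to expose it:

```shell
dir=$(mktemp -d) && cd "$dir"
touch "$(printf '\rhidden.txt')"   # name begins with a carriage return
ls                                 # looks deceptively normal
ls | od -c | head -n 2             # the \r byte is now plainly visible
ls -b                              # GNU ls escapes it: \rhidden.txt
```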

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 20:50 UTC (Wed) by clugstj (subscriber, #4020) [Link] (12 responses)

I would have to disagree.

Most of the things he wants to force on everyone are already available by convention. What is the benefit of disallowing other usages? If you want to imagine that all your filenames are UTF-8, go ahead, who's stopping you! The UNIX kernel contains as little policy as possible, which keeps it simpler than it would otherwise be. Yes, this is a double-edged sword, but doing the things he suggests is not an automatic win.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 23:34 UTC (Wed) by dwheeler (guest, #1216) [Link]

Sure, almost all files already follow these conventions. Except when they don't. And when they don't, millions of programs subtly stop working. Everyone who does "find . blah | stuff" is writing bad code, because filenames can contain newlines deep in the directory tree. If we get rid of the nonsense, it becomes easy to write correct programs; today, it takes herculean effort, and few people make it. It's a double-edged sword, but users get cut by both edges. I have yet to find a real use case for control characters in filenames, for example, but there are plenty of reasons why they should never appear.
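The failure mode is easy to reproduce (a sketch, assuming bash and GNU findutils):

```shell
dir=$(mktemp -d) && cd "$dir"
touch "$(printf 'evil\nname')"   # ONE file, with an embedded newline
find . -type f | wc -l           # → 2: the pipe sees two bogus "names"
# NUL-terminated output keeps the name intact:
find . -type f -print0 | tr -dc '\0' | wc -c   # → 1: one file after all
```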

Conventions are great! Let's go back to FAT!

Posted Mar 26, 2009 8:18 UTC (Thu) by khim (subscriber, #9252) [Link] (3 responses)

Most of the things he wants to force on everyone are already available by convention. What is the benefit of disallowing other usages?

What's the benefit of all these ACLs? Traditional Unix permissions, capabilities, POSIX ACLs, memory protection, etc. You could just use conventions for all of that. And if someone violates a convention, he or she can be fired.

This is your proposal in a nutshell - and it just does not work.

Conventions are great! Let's go back to FAT!

Posted Mar 26, 2009 13:38 UTC (Thu) by clugstj (subscriber, #4020) [Link] (2 responses)

Wow, it does not work? Apparently UNIX is completely broken. And ACL's are so complicated and a drain on performance as to be nearly useless - which is why they are not used much.

Shell scripts are where this is the biggest problem. I do shell scripting for a living and don't see this issue as being anywhere near as big a problem as Mr. Wheeler thinks it is.

Also, I'm completely confused by your title. I suggest conventions and then you suggest, perhaps facetiously, FAT (which is not a convention, but enforcement of a very stupidly limited set of possible filenames).

Conventions are great! Let's go back to FAT!

Posted Mar 26, 2009 14:07 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> Wow, it does not work?

Nope.

> Apparently UNIX is completely broken.

Nope. UNIX is not broken. Your head, on the other hand, is.

> And ACL's are so complicated and a drain on performance as to be nearly useless - which is why they are not used much.

Traditional Unix permissions are used on most systems - and they ARE ACLs too. They are quite limited but often adequate - that's why the other forms are not used much. Still, they are deficient in many situations, and the other forms are being used more and more.

> Shell scripts are where this is the biggest problem. I do shell scripting for a living and don't see this issue as being anywhere near as big a problem as Mr. Wheeler thinks it is.

The number of correct scripts is not the important metric. The number of bad scripts is. And it's MUCH higher than warranted: I've fixed tons of scripts that failed on names with spaces, files with a dash in the first position, etc. If such files were excluded from the start, life would be much easier.

> Also, I'm completely confused by your title. I suggest conventions and then you suggest, perhaps facetiously, FAT (which is not a convention, but enforcement of a very stupidly limited set of possible filenames).

I propose FAT as a way to get rid of these pesky ACLs. It's one of the few filesystems today without any form of access control (except the read-only flag). We could extend it to allow all forms of filenames - it's not hard. Or we could just run all programs with UID==0 - that would give us the same flexibility. Somehow no one wants to go in this direction, though.

Conventions are great! Let's go back to FAT!

Posted Mar 29, 2009 21:44 UTC (Sun) by clugstj (subscriber, #4020) [Link]

"UNIX is not broken. Your head, on the other hand, is"

Wow, childish personal attacks. How droll.

"Number of correct scripts is not important metric. Number of bad scripts is"

I would think that the percentage of each would (possibly) be a useful metric. But what is the damage from these "bad scripts"? If you are writing shell scripts that MUST be absolutely bullet-proof against bad input, perhaps because they run setuid-root, then you are already making a much worse mistake than any possible bugs in the script.

Still don't understand the FAT reference. Sorry, maybe I'm just slow.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 9:51 UTC (Thu) by epa (subscriber, #39769) [Link] (6 responses)

'By convention' files do not contain control characters. The problem is that you cannot rely on convention when writing robust, secure software. Either you put in endless sanity checks which cruft up your code and are liable to be forgotten, or you end up with subtle bugs that are tickled by the existence of files called '>foo' or '|/bin/sh' or countless other variations.

Such bugs are made more insidious by the fact that 'by convention', they cannot ever be triggered. But for someone trying to make a working exploit, or widen a small security hole into a larger one, convention is no barrier.

If you want to have certainty that your code works correctly, 100% of the time, no ifs and no buts - rather than just waving your hands and hoping that everyone else in the world makes filenames that follow the same convention as you - then you need the assumptions you make to be guaranteed true.

> If you want to imagine that all your filenames are UTF-8, go ahead, who's stopping you!
You could equally well say that disk quotas are not needed; if you want to limit yourself to use 100 megabytes of space, who's stopping you? Indeed what is the point of file permissions - if you want to pretend that all your files are read-only, who's stopping you? And why should the kernel forbid hard links to directories - surely it should be up to the user to decide whether their filesystem is a tree or a general DAG, and the kernel should not enforce this policy.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 27, 2009 19:23 UTC (Fri) by drag (guest, #31333) [Link] (5 responses)

> 'By convention' files do not contain control characters. The problem is that you cannot rely on convention when writing robust, secure software. Either you put in endless sanity checks which cruft up your code and are liable to be forgotten, or you end up with subtle bugs that are tickled by the existence of files called '>foo' or '|/bin/sh' or countless other variations.

YA.

All I want is for the system to reject malicious filename characters and things that obviously make little sense. Stuff like control characters, newlines, etc.

As for the encoding stuff... meh. Treating filenames as a string of bytes mostly makes sense, except in a few special cases.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 11:45 UTC (Sat) by epa (subscriber, #39769) [Link] (4 responses)

Of course no existing software treats filenames purely as a string of bytes - that is just rhetoric. At the very least, filenames are treated as ASCII-encoded text and displayed to the user as such. Of course, this breaks down when a filename contains control characters.

If Unix really did treat filenames as merely 'a string of bytes', with no implied character set or encoding, and displayed them to the user as a hex dump or something, then it would be truly encoding-agnostic and would have no difficulties with arbitrary byte values in filenames. Of course, it would also have been a total failure that nobody uses. For a filesystem to be useful, it needs to have some amount of meaning (or 'policy' if you will) attached to the filenames it stores. The question is how much: is the current situation of 'ASCII for characters below 128, and above that you're on your own' the best one?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:53 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (3 responses)

The two major pieces of in-house software I develop both treat filenames purely as a string of bytes. The names chosen happen to be meaningful to the programmers, but they are of no importance to the program or its users.

I'd be surprised if the /majority/ of programs other than shell scripts aren't like this. Even in the majority of GUI software, what's needed isn't a revision of the kernel API (in fact that would barely help) but only a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display. Such a function is nearly inevitable anyway - to deal with dozens of other issues unrelated to Wheeler's thesis. And such functions exist today (I can't say if they're bug-free, of course).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 14:43 UTC (Sun) by epa (subscriber, #39769) [Link] (2 responses)

> a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display
Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever.

Imagine removing the character encoding headers from the http protocol. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 21:58 UTC (Sun) by clugstj (subscriber, #4020) [Link] (1 responses)

No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be. This doesn't represent a security problem.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 22:37 UTC (Sun) by epa (subscriber, #39769) [Link]

> No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.
Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible in all cases to return the correct filename to the user, matching the original one chosen by the user. If you pick a known encoding everywhere (UTF-8 being the obvious choice) then the problem goes away.
> This doesn't represent a security problem.
Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate to the issue of how to encode characters that don't fit in ASCII.

a filename issue or a shell issue?

Posted Mar 25, 2009 21:10 UTC (Wed) by renox (guest, #23785) [Link] (3 responses)

As the biggest problem is linked to text interpretation, I think that a 'PowerShell' kind of shell, which uses objects instead of text for communication between programs, would be more robust against 'weird' characters in filenames.

a filename issue or a shell issue?

Posted Mar 25, 2009 21:53 UTC (Wed) by alecs1 (guest, #46699) [Link] (1 responses)

noatime/relatime discussion
accelerated indirect rendering and combinations
kernel mode setting
"better than POSIX"
restrictions on filenames
Microsoft shell competes with Unix shell

Keep them coming :)

a filename issue or a shell issue?

Posted Mar 26, 2009 18:30 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

By making hell freeze over, we can win the global warming battle. :-)

a filename issue or a shell issue?

Posted Mar 25, 2009 23:37 UTC (Wed) by dwheeler (guest, #1216) [Link]

Perhaps, but even systems which have objects get burned. As noted earlier, the Python developers have had a hard time.... they've switched to Unicode as their main text representation, but Unix filenames aren't text... they are essentially binary blobs! If filenames were always UTF-8, there'd be no problems. Similarly, perl programs get trashed if filenames begin with <.

Case in point

Posted Mar 25, 2009 21:11 UTC (Wed) by proski (subscriber, #104) [Link]

One of the reasons git replaced many shell scripts with C code was support for weird file names. C is better at handling them. In the absence of such issues, many commands would have remained shell scripts, which are easier to improve.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:17 UTC (Thu) by k8to (guest, #15413) [Link] (22 responses)

Simplicity is better than complexity.

This proposal is a whole lot of complexity in kernel code and the API.
UNWANT.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:22 UTC (Thu) by k8to (guest, #15413) [Link] (11 responses)

As for find, GNU find already has -print0 and xargs is compatible. The scripting side is thus trivially solved.

Sure, some find implementations don't have it. Fix them.
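The pattern in question, sketched with GNU findutils (the filenames below are made up for the demo):

```shell
dir=$(mktemp -d) && cd "$dir"
touch -- "$(printf -- '-weird\nname')" 'plain.txt'   # hostile names
# NUL-delimited, so spaces, newlines, and leading dashes all survive:
find . -type f -print0 | xargs -0 sh -c 'echo "$# files handled"' sh
# → 2 files handled
```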

Simplicity is better than complexity.

Posted Mar 26, 2009 2:29 UTC (Thu) by k8to (guest, #15413) [Link] (9 responses)

As evidence for my position, here are some real-world filenames that my software needed to create to correctly archive some digital music history of the personal computer as an instrument.

jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦*
d_    .it  d_   .it  d_  .it  d_ .it  d_1151.it  d_1152.it  d_1153.it  d_1154.it  ¦¦¦¦¯¯Ì_.it
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦* |xxd
0000000: 645f 2020 2020 2e69 740a 645f 2020 202e  d_    .it.d_   .
0000010: 6974 0a64 5f20 202e 6974 0a64 5f20 2e69  it.d_  .it.d_ .i
0000020: 740a 645f 3131 3531 2e69 740a 645f 3131  t.d_1151.it.d_11
0000030: 3532 2e69 740a 645f 3131 3533 2e69 740a  52.it.d_1153.it.
0000040: 645f 3131 3534 2e69 740a a6a6 a6a6 afaf  d_1154.it.......
0000050: cc5f 2e69 740a                           ._.it.

These files are handled by a combination of Python and shell scripts, and one piece of C code (wrapping a library which knew how to read certain binary formats). All of these pieces can handle newlines, tabs, spaces, control characters, leading dashes, and so on. I'm not really that smart. It wasn't much work. If shell scripts are 5-second hackjobs, then they will always fail in some cases: strange filenames, permissions problems, etc. If you take a few minutes to apply correct safeguards, then things work fine.
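Two of the cheap safeguards being alluded to, sketched in shell (GNU coreutils behavior assumed: rm with options but no operands reports an error): end option parsing with `--`, or prefix relative names with `./`.

```shell
dir=$(mktemp -d) && cd "$dir"
touch -- '-r'              # a filename that looks like an option
rm '-r' 2>/dev/null || echo "rm read it as an option"   # fails; file survives
rm -- '-r'                 # safe: '--' terminates option processing
touch -- '-r'
rm ./-r                    # equally safe: the name no longer starts with '-'
```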

Simplicity is better than complexity.

Posted Mar 26, 2009 2:31 UTC (Thu) by k8to (guest, #15413) [Link]

Hah, the HTML markup I wasn't familiar with (needed to prevent the forum from mangling the listing) led me to mangle my post. Case in point: bad input, bad output. Shrug, learn, move on.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:39 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)

> needed to create

I find it very hard to believe that your software *needed* to create unintelligible filenames. And if it did, I'd claim it needs to be fixed.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:48 UTC (Thu) by k8to (guest, #15413) [Link] (2 responses)

If someone else cares about the design constraints that led to this necessity, let me know. Foom: I've watched you refuse to converse in a reasonable way across many threads. You have no clue about my software or the project, but you claim to know what is correct and incorrect. Shut up.

Simplicity is better than complexity.

Posted Mar 26, 2009 3:40 UTC (Thu) by foom (subscriber, #14868) [Link]

I claim that I'd find software that creates filenames like that on my disk to be irritating. So I'd certainly prefer if no software actually did so, and probably wouldn't mind if it was impossible to do so.

If, in some alternative universe, it was already impossible to create those filenames, I have little doubt you could still have created working software which didn't require the impossible.

Sorry I come off as unreasonable to you. *hugs*

Do you know difference between two words: "need" and "want"?

Posted Mar 26, 2009 8:41 UTC (Thu) by khim (subscriber, #9252) [Link]

> you have no clue about my software or the project but you claim to know what is correct and incorrect.

I don't have a clue. And I don't need to know anything about your project to know you are lying. Any project can be implemented with exactly two filenames: "0" and "1". You'd need an infinite depth of directory structure to do so, true, but thankfully there are no practical limitations in Linux. Is it feasible? Probably not. Is it possible? Of course. And once we start from the position that your software does not need these filenames, but your current design does, you suddenly have a much weaker argument: you are reducing the complexity of your software by increasing the complexity of everyone else's software. Is that a good trade-off? Maybe yes, maybe no. But it's a weak argument at best - no matter what your project is and what it needs to do.

Simplicity is better than complexity.

Posted Mar 26, 2009 10:01 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Surely base64-encoding the filenames would not be too much hardship? Personally, I would be inclined to do that anyway, because trying to use shell commands to manipulate a file called ¦¦¦¦¯¯Ì_.it will be painful.

I know this is a matter of taste, and merely trying to impose one person's tastes on everyone is not a reason to change the kernel. But on the other hand, the marginal extra disk space saving (ten bytes?) from being able to put arbitrary binary stuff in filenames without encoding does not outweigh the many good reasons that Wheeler gave for changing.
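A sketch of the base64 round-trip being suggested (bash and GNU coreutils assumed; '/' and '+' are remapped, since '/' cannot appear in a filename):

```shell
name=$(printf 'd_\t\t.it')                          # an awkward byte-string name
enc=$(printf '%s' "$name" | base64 | tr '/+' '_-')  # filesystem-safe ASCII
dec=$(printf '%s' "$enc" | tr '_-' '/+' | base64 -d)
[ "$dec" = "$name" ] && echo "round-trips cleanly"
```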

Simplicity is better than complexity.

Posted Mar 26, 2009 13:54 UTC (Thu) by clugstj (subscriber, #4020) [Link]

"manipulate a file called ¦¦¦¦¯¯Ì_"

Put quotes around it?

Simplicity is better than complexity.

Posted Mar 26, 2009 10:27 UTC (Thu) by mjj29 (guest, #49653) [Link]

Would there be any problem with uuencoding those filenames?

Simplicity is better than complexity.

Posted Mar 26, 2009 21:22 UTC (Thu) by explodingferret (guest, #57530) [Link]

If we can do ANYTHING to prevent you from writing programs which try to store information in filenames this way, I'm all for it.

Are you able to make the source of your shell scripts available? I'm sure I can find something in them that is breakable. :-)

DANGER! DANGER! DANGER! HYPOCRISY LEVEL IS OVER 9000!!!

Posted Mar 26, 2009 8:30 UTC (Thu) by khim (subscriber, #9252) [Link]

This argument:

> Simplicity is better than complexity.

plus this one:

> As for find, GNU find already has -print0 and xargs is compatible.

equals hypocrisy.

And you can not even claim "we already solved this problem, so it's just old code vs. new code". A lot of programs just don't work with the current approach (especially scripts). You would need to write and fix literally millions of lines of code, versus a few thousand in the kernel.

Sorry, but you are advocating a more complex solution while preaching simplicity.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:49 UTC (Thu) by njs (subscriber, #40338) [Link] (1 responses)

Have you ever written a program where it was important to handle filename character sets Right? Were you able to do so?

(My opinion on this issue is strongly influenced by writing the filesystem interaction code for a VCS. Users in different locales may want to work on the same project, but they write the same filenames differently, and some charsets may not be able to even represent filenames created in other locales, and...)

There are arguments for the current system, but "simplicity" is really not one of them.

Simplicity is better than complexity.

Posted Mar 27, 2009 0:46 UTC (Fri) by nix (subscriber, #2304) [Link]

But handling multiple charsets is so simple!

(As in, I looked at XEmacs/MULE's code and my brain dribbled out of my ears, following which I was simple.)

Simplicity is better than complexity.

Posted Mar 26, 2009 5:34 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (3 responses)

Simplicity is precisely the goal of this proposal. And how complex, really, is it to check a filename and return an error if it contains a disallowed character? Really, it isn't.

Simplicity is better than complexity.

Posted Mar 26, 2009 13:49 UTC (Thu) by clugstj (subscriber, #4020) [Link] (2 responses)

Yes, it is simple to reject a filename. Now every program in the world has to be changed to handle this new error case. Not sure how this is simplicity.

Simplicity is better than complexity.

Posted Mar 26, 2009 13:54 UTC (Thu) by epa (subscriber, #39769) [Link]

You are right, it is an error case to handle; but then creating or renaming a file is already allowed to fail for all sorts of reasons, so all programs already have to check it succeeded and handle errors gracefully. Besides, EINVAL is already returned for bad characters if the filesystem happens to be one that disallows them (like FAT or indeed most other non-Unix-native filesystems), so apps already have to handle that error case too.

Simplicity is better than complexity.

Posted Mar 26, 2009 16:02 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

But a major point of Wheeler's argument is that existing programs, filesystems, and indeed operating systems already assume these restrictions as a matter of convention, yet do not necessarily do anything to enforce them. Existing software already rejects, or fails to properly handle, filenames which violate these conventions, and the vast majority of existing files are named according to them; at the very least, filenames with leading dashes, tab or newline characters, or shell control characters are very rare, and probably accidental or malicious.

So the entire point is that the changes required to existing software would be minimal, and existing software which can break today on filenames that don't obey these restrictions would no longer have a problem once the OS enforced them.

Is kernel the whole world?

Posted Mar 26, 2009 8:24 UTC (Thu) by khim (subscriber, #9252) [Link]

> Simplicity is better than complexity.

Oh so very true.

> This proposal is a whole lot of complexity in kernel code and the API.

It also removes a whole bunch of code from other places. Even more important: it removes the need to write and fix a bunch of code.

Simplicity is better than complexity.

Posted Mar 26, 2009 9:56 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

Have you ever written a reasonably complex but still non-buggy shell script - one which accepts arbitrary filenames (as is the Unix way) and doesn't do random buggy things?

If you've ever tried such an exercise, you would not believe that allowing control characters in the middle of filenames and leaving userspace to deal with the resulting mess could ever be called 'simplicity'.

Wheeler's suggestion would greatly simplify a lot of code, or if you prefer, fix many hidden bugs and security holes in code that is currently buggy.

Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace. It is hardly adding significant complexity.

Simplicity is better than complexity.

Posted Mar 28, 2009 17:03 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (1 responses)

“Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace.”

Show me the money. Five lines, plus one per syscall. Not a lot of work to support such a broad and sweeping claim. Write those lines carefully; we wouldn't want you to be hand-waving and to have missed 99.9% of the complexity of the problem...

Simplicity is better than complexity.

Posted Mar 29, 2009 15:03 UTC (Sun) by epa (subscriber, #39769) [Link]

To check for control characters:

for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
if (*c < 32) return EINVAL;

(The cast to unsigned char matters: with a plain signed char, bytes above 127 would compare as negative and be wrongly rejected as "control characters" too.)

Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):

for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
if (*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
if (filename[0] == '-') return EINVAL;

To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. (RFC 3629 restricts UTF-8 to at most four bytes per character, so the five- and six-byte branches below could simply be dropped.) The code would be something like

/* First pad the filename with 5 extra NUL bytes at the end. Then: */
int is_cont(unsigned char c) { return 128 <= c && c < 192; }
const unsigned char *p = (const unsigned char *)filename;
while (*p) {
if (*p < 128) p++;
else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2])) p += 3;
else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3])) p += 4;
else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4])) p += 5;
else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
else return EINVAL;
}

For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.
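Userspace can already approximate this validity check today; a sketch using iconv (glibc behavior assumed: converting UTF-8 to UTF-8 fails with a nonzero exit status on malformed input):

```shell
printf '\303\251' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  && echo valid     # bytes c3 a9 ("é"): well-formed
printf '\303(' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  || echo invalid   # c3 starts a 2-byte sequence, '(' is no continuation byte
```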

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:44 UTC (Thu) by explodingferret (guest, #57530) [Link] (1 responses)

Nice article. I spend a lot of time on the #Bash IRC channel on freenode, so I have to deal with a lot of issues to do with quoting, word splitting, etc.

Here are some problems I noticed in your article:

1) "These restrictions only apply to Windows - Linux, for example, allows use of " * : < > ? \ / | even in NTFS." -- is "/" supposed to be in that list?

2) You state that changing IFS and banning newlines and tabs in filenames would make things like 'cat $file' safer, but you should also state that shell glob characters would also need to be removed (namely *?[]).

3) You state (or at least imply) that there is no way to reliably use filenames from find, but there is a POSIX compliant and known portable method:

find . -type f -exec somecommand {} \;
or for more complex cases:
find . -type f -exec sh -c 'if true; then somecommand "$1"; fi' -- {} \;

For xargs fans, on all but files with newlines, you can do
find . -type f | sed -e 's/./\\&/g' | xargs somecommand
This is a feature of xargs and is specified by POSIX. It disables various quoting problems with xargs that you don't mention.

4) Your setting of IFS to a value of tab and newline is overly complicated. Simply use IFS=`printf \\n\\t`. It is only trailing newlines that are removed. If the different behaviour this causes with "$*" is not desired, one can set IFS=`printf \\t\\n\\t`. I know of no tool or POSIX restriction that says characters may not be repeated in IFS.

Otherwise great article! It really would be so nice to use line-separated commands in `` and not have to worry about things breaking. And although most of the thoughts expressed here are well known to me, the idea of getting the kernel to check the validity of UTF-8 filenames is fantastic!

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 19:50 UTC (Sat) by dwheeler (guest, #1216) [Link]

Thanks for your comments! In particular, you're absolutely right about swapping the order of \t and \n in IFS - that makes it MUCH simpler. I prefer IFS=`printf '\n\t'` because then it's immediately obvious that \n and \t are the new values. I've put that into the document, with credit.

Parentheses

Posted Mar 26, 2009 4:52 UTC (Thu) by eru (subscriber, #2753) [Link]

I don't think you could ban "()". They frequently appear in names in Windows-originated directories, apparently because some common programs generate file names containing "(1)", "(2)", ... to avoid collisions or indicate file versions.

In general, the only shell metacharacters that could be banned without causing interoperability problems are those that are special also in That Other OS.

Not A System Problem

Posted Mar 26, 2009 9:56 UTC (Thu) by ldo (guest, #40946) [Link] (5 responses)

I don’t think there should be any limitations on filenames at all. Even the usual Unix/POSIX prohibition against slash and null should go. The only problems happen with userspace languages and APIs that do interpretation of special characters when you’re trying to pass them a pathname, therefore all such languages/APIs should be fixed to provide ways of bypassing such interpretations, perhaps by an escape syntax. That’s the right way to solve things.

Not A System Problem

Posted Mar 27, 2009 0:56 UTC (Fri) by nix (subscriber, #2304) [Link] (4 responses)

Um, if you remove the prohibition on nulls, how do you end the filename?
This isn't Pascal.

And if you remove the prohibition on slashes, how do you distinguish
between a file called foo/bar and a file called bar in a subdirectory foo?

These limitations are there because the semantics of the filesystem itself
depends on them.

Re: Not A System Problem

Posted Mar 29, 2009 10:30 UTC (Sun) by ldo (guest, #40946) [Link] (3 responses)

nix wrote:

Um, if you remove the prohibition on nulls, how do you end the filename? This isn't Pascal.

Nothing to do with Pascal. C is perfectly capable of dealing with arbitrary data bytes, otherwise large parts of both kernel and userland code wouldn’t work.

And if you remove the prohibition on slashes, how do you distinguish between a file called foo/bar and a file called bar in a subdirectory foo?

Simple. The kernel-level filesystem calls will not take a full pathname. Instead, they will take a parent directory ID and the name of an item within that directory. Other OSes, like VMS and old MacOS, were doing this sort of thing decades ago.

Full pathname parsing becomes a function of the userland runtime. The kernel no longer cares what the pathname separator, or even what the pathname syntax, might be.

Re: Not A System Problem

Posted Mar 29, 2009 13:54 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

What you're describing is not POSIX anymore. Every single POSIX app would
need rewriting, for essentially zero gain (ooh, you can't have nulls in
filenames: that's why UTF-8 is *defined* to avoid nulls in filenames).

I'm sure users would love not being able to type in pathnames anymore,
too.

Good luck getting anyone to do it.

Re: Not A System Problem

Posted Mar 29, 2009 19:47 UTC (Sun) by ldo (guest, #40946) [Link] (1 responses)

nix wrote:

What you're describing is not POSIX anymore.

Nothing to do with POSIX. POSIX is a userland API, it doesn’t dictate how the kernel should work.

Re: Not A System Problem

Posted Mar 29, 2009 22:32 UTC (Sun) by nix (subscriber, #2304) [Link]

You don't get it. In order to permit / and \0 as valid filename
characters, syscalls like open() must change. Library calls like fopen()
have to change, because they too accept a \0-terminated string, with /s
separating path components. Every single call in every library that
accepts pathnames has to change. Probably the very notion of a string has
to change to something non-\0-terminated.

So whatever you're describing, userspace cannot any longer use standard
POSIX calls: in fact, it can't any longer use ANSI C calls! I suspect that
such a system would be almost unusable with C, simply because you couldn't
use C string literals for anything.

If you want VMS, you know where to find it.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 10:45 UTC (Thu) by kerolasa (guest, #56089) [Link]

Doing exactly what D. Wheeler suggests is wrong. Problems of representation and interpretation are not filesystem problems. A better approach would be to apply punycode to file names, which is the same thing IDN uses.

http://en.wikipedia.org/wiki/Internationalized_domain_name

That would mean that there are encoded and unencoded versions of filenames — two representations matching the same inode. The version you want to see depends on an environment variable or perhaps a command line option. To me this sounds like a libc hack, and the guys making changes to that are conservative (thank god they are — who'd want an unstable libc anyway). Even though the safe mode sounds like a good idea, I don't expect to see it for the next couple of years. Well, we'll see.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 12:23 UTC (Thu) by jpetso (subscriber, #36230) [Link] (1 responses)

I totally agree with control characters and uncool stuff like trailing
spaces, but I disagree with cutting down on the number of "regular"
characters like exclamation/question marks, parentheses etc. just because
we don't cope with them. Amongst the obvious usefulness of parentheses, those
characters often occur in music tracks that I'm ripping from my CDs. It
would be a shame to lose the ability of naming them with their actual name.

Why don't we instead fix the mechanism that transports strings in the bash?
Like, "All glob expansions are automatically enclosed in strings", "If it's
a string then don't f*cking interpret it as an option", and maybe even
"Here's an array of return values" instead of "If the viewer is a program
then split by newlines, if the viewer is a user then make a table". Type
safety ftw?

I'm all for reasonable defaults and constraining unnecessary stuff, but
when there's an actual *sensible* use case then that use case should not be
made impossible just because the implementation is crappy.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 12:26 UTC (Thu) by jpetso (subscriber, #36230) [Link]

Oops, LWN swallows less-than & Co. even in plain-text mode... whatever, imagine exclamation/question marks, parentheses etc. on the second line. Plus some suffixed text that says I disagree that we disallow those characters just because we don't cope with them.

Leading spaces are common, actually

Posted Mar 26, 2009 13:17 UTC (Thu) by barryn (subscriber, #5996) [Link] (10 responses)

There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic
sort. (For instance, a program might create a menu at run time in lexicographic order based on
the contents of a directory, or you may want to force a file to appear near the beginning of a
listing.) This is especially common in the Mac world.

I was originally going to argue that leading spaces are necessary since people still have data
from Mac OS 9.x and earlier systems, but it turns out that this practice is far more common on
modern Mac OS X than I expected:

# find / -name ' *'
/Applications/Microsoft Office 2004/Clipart/Animations/ Animations Clip Package
/Applications/Microsoft Office 2004/Clipart/Clip Art/ Clip Art Clip Package
/Applications/Microsoft Office 2004/Clipart/Photos/ Photos Clip Package
/Applications/Microsoft Office 2004/Office/Entourage First Run/Entourage Script Menu Items/ About This Menu...
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic/ Default.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Ambient/ Ambient Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Classical/ Classical Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Dance/ Dance Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Hip Hop/ Hip Hop Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Jazz/ Jazz Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Pop/ Pop Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Rock/ Rock Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Stadium Rock/ Stadium Rock Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Band Instruments/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Basic Track/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Bass/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Drums/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Effects/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Guitars/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Podcasting/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Vocals/ No Effects.cst
/Library/Scripts/Mail Scripts/Rule Actions/ Help With Rule Actions.scpt

Leading spaces are common, actually

Posted Mar 26, 2009 19:21 UTC (Thu) by njs (subscriber, #40338) [Link] (7 responses)

There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic sort.

Are you sure? I've seen this in real-world uses too, but I thought that all the common systems were fixed to do human-style (non-ASCIIbetical) sorting a few years ago. I don't have any proprietary systems around to test, but I'll be *really* amused if the OS X Finder is missing this usability feature of GNU ls:

~/t$ touch "a" " b" "c"
~/t$ ls -l
total 0
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 a
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16  b
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 c

Leading spaces are common, actually

Posted Mar 26, 2009 21:23 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)

Finder does not ignore spaces. I'm quite glad, because I use the space-prefix trick rather regularly. I am occasionally annoyed at how GNU ls sorts "A B" between "AA" and "AC" instead of before them: that's certainly not how my brain sorts.

Finder does however sort numbers like this, which GNU ls does not: 1 2 10

I don't really see what the point of the "human-style" sorting is if it can't even sort numbers. That seems kind of basic to me.

Leading spaces are common, actually

Posted Mar 28, 2009 1:21 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

Sorting numerically in GNU ls is done by 'ls -v'.

(By default, despite comments elsewhere in this thread, ls sorts
ASCIIbetically, so " 2" comes before "1".)

Leading spaces are common, actually

Posted Mar 28, 2009 1:57 UTC (Sat) by foom (subscriber, #14868) [Link] (1 responses)

> Sorting numerically in GNU ls is done by 'ls -v'.

Huh, never knew that, interesting! Never would have found that from the man page, which says "-v sort by version". That seems a remarkably poor description of what it actually does.

> (By default, despite comments elsewhere in this thread, ls sorts ASCIIbetically, so " 2" comes before "1".)

Well, not exactly: GNU ls has a default sort which depends on the locale's collation settings, and most systems default to a locale like en_US.UTF-8, so most people have it sorting in a case/accent-insensitive manner by default on their systems.

Leading spaces are common, actually

Posted Mar 28, 2009 20:36 UTC (Sat) by nix (subscriber, #2304) [Link]

It's called 'sort by version' because the function it calls (strverscmp())
was designed to sort version numbers, and because the expected use of
ls -v was sorting a directory full of version-named directories in version
order.

(And you're right on the collation sort thing: I spoke carelessly.)

Leading spaces are common, actually

Posted Mar 27, 2009 4:41 UTC (Fri) by barryn (subscriber, #5996) [Link] (2 responses)

Behavior of ls in Mac OS X 10.5.6 build 9G55:
$ pwd
/Library/Application Support/GarageBand/Instrument Library/Track Settings
$ ls -l Master | head
total 0
drwxrwxrwx   3 root  admin  102 May  3  2008  Basic
drwxrwxrwx   6 root  admin  204 May  3  2008 Ambient
drwxrwxrwx   6 root  admin  204 May  3  2008 Classical
drwxrwxrwx  11 root  admin  374 May  3  2008 Dance
drwxrwxrwx   5 root  admin  170 May  3  2008 Hip Hop
drwxrwxrwx   5 root  admin  170 May  3  2008 Jazz
drwxrwxrwx   7 root  admin  238 May  3  2008 Pop
drwxrwxrwx   7 root  admin  238 May  3  2008 Rock
drwxrwxrwx   5 root  admin  170 May  3  2008 Stadium Rock
$ 
And this matches the Finder's behavior. BTW, if the Finder behaved any other way, it would be more difficult to properly recover broken Mac OS 9.x or earlier installations using OS X -- Classic Mac OS loads files in /System Folder/Extensions in lexicographic order, and the load order matters, and the leading space trick is used very frequently there. Mac OS X 10.5 can dual-boot with Mac OS 9.x, so this still matters for some users.

Leading spaces are common, actually

Posted Mar 27, 2009 15:45 UTC (Fri) by foom (subscriber, #14868) [Link]

>Behavior of ls in Mac OS X 10.5.6 build 9G55

Well, OSX's "ls" is actually just doing a traditional strcmp sort, not anything fancy (note that it puts all uppercase characters before all lowercase).

But the Finder's sort routine is fancy. It seems to be a sort order based on Unicode TR10.

Leading spaces are common, actually

Posted Nov 15, 2009 0:32 UTC (Sun) by yuhong (guest, #57183) [Link]

"Classic Mac OS loads files in /System Folder/Extensions in lexicographic
order, and the load order matters, and the leading space trick is used very
frequently there. "
Yep, look at what they had to do about this when Apple introduced HFS+ in Mac
OS 8.1:
http://developer.apple.com/legacy/mac/library/technotes/t...
http://developer.apple.com/legacy/mac/library/technotes/t...

Leading spaces are common, actually

Posted Mar 27, 2009 5:36 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

The users on a filesystem I administer use six or seven levels of leading space to sort their common jobs-in-progress directory. I've long since given up on getting them to move to a hierarchical setup.

The way I see it, if a program can correctly work with filenames containing spaces, it can work with a filename containing leading spaces.

It's most important to eliminate newlines and control characters in filenames. The second most important consideration is specifying UTF-8 as the preferred filename encoding. Let's not get caught up in all sorts of other wishes that will just encourage endless debate and prevent these very important changes from getting made at all.

Leading spaces are common, actually

Posted Mar 28, 2009 9:21 UTC (Sat) by explodingferret (guest, #57530) [Link]

Leading and trailing spaces *do* have particular problems in shell scripts, because the 'read' command (which is needed to read line-separated data from commands) strips leading and trailing spaces/tabs unless IFS is set and does not contain spaces/tabs.

perl also has problems with leading spaces in filenames, unless you use the right kind of open command (perldoc -f open).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:29 UTC (Thu) by kenjennings (guest, #57559) [Link] (2 responses)

If you had a petition I'd sign it. I agree with all six of your fixes at the end of your article.

Having been working with computers since 1979 and subject to the various limitations of dozens of file systems, I automatically exercise self-restraint and never put any of those characters into filenames.

People should not be using filenames as data storage.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:01 UTC (Sat) by nix (subscriber, #2304) [Link]

Uh, what do you think filenames *are* if not data? What you mean
is 'people should only be storing data with certain restrictions in
filenames', but that's a circular argument.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 1:06 UTC (Sat) by pr1268 (guest, #24648) [Link]

People should not be using filenames as data storage.

How about metadata storage? In my PC troubleshooting days, I came across a Windows XP box with a folder of videos of adult content whose file names were lengthy and explicit descriptions of the activities portrayed in the videos. Just reading the directory listing alone conjured up many vivid and disturbing thoughts. Fortunately this wasn't Windows Vista—its Explorer even creates video thumbnails. :-o

Meta-discussion

Posted Mar 26, 2009 18:37 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

I bet there's a rather strong correlation between the side a person takes on the fsync debate and how he feels about this issue.

(For the record, I support all the proposed restrictions on filenames except for a ban on shell meta-characters.)

Meta-discussion

Posted Mar 28, 2009 22:21 UTC (Sat) by man_ls (guest, #15091) [Link]

Hmmm, I'm not so sure. I feel strongly about ext4 losing data, but I don't have a strong opinion about this issue. Really. Not for lack of sensitivity to the problem -- I've had an administrator at work erase a whole directory of files because of a leading space (so that 'rm -rf /dir/file' became 'rm -rf /dir/ file'). But there are advantages and disadvantages, and I cannot pick a side.

Bojan has only posted once, and his message contains the words "not sure". I would say that this debate attracts a different subset of (opinionated) people.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 19:34 UTC (Thu) by az (guest, #46701) [Link] (1 responses)

I would agree with everything in the article except for the parts which suggest that forbidding punctuation in filenames is acceptable.

Filenames need to be usable for describing the contents of files. You sure as hell don't need newlines and tabs for that, but you certainly should be able to use the same punctuation you can use in a sentence - in English, that's !@#$%&()-:;"',.? but in other languages more characters may be needed (though they would be no different from any other non-ASCII UTF-8 character). I think the requirement that a filename can't start or end with certain characters is acceptable - you don't expect a sentence to start with most of them - but being forbidden from using them inside the string would be very constraining.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 19:36 UTC (Thu) by az (guest, #46701) [Link]

(and space, of course)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 23:04 UTC (Thu) by zooko (guest, #2589) [Link] (5 responses)

Thanks, David Wheeler! Good note, and timely. I recently delved into this topic for the tahoe system and came to a rather depressing conclusion:

"When reading in a filename, what we really want is to get the unicode of that filename without risk of corruption due to false decoding. ... Unfortunately this isn't possible except on Windows and Macintosh."

http://allmydata.org/pipermail/tahoe-dev/2009-March/00137...

Hopefully Linux folks will follow D. Wheeler's lead on this and make it so that some future version of Tahoe can reliably get filenames from Linux just as it currently can from Windows and Mac OS X.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 23:35 UTC (Thu) by zooko (guest, #2589) [Link] (4 responses)

By the way, when writing on the tahoe-dev list about your proposal, I realized that a future version
of Tahoe, which wanted to take advantage of the reliably-encoded filenames of a future version of
Linux, would have to have some way to reliably distinguish between old-fashioned-linux (where
you get a string of bytes and some "suggested" encoding which may or may not correctly decode
those bytes), and new-fashioned-linux (where, like on Windows and Mac, you get unicode
filenames).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:00 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (3 responses)

The guarantee you're relying on in Windows doesn't exist.

There, I said it.

It really doesn't exist. You, up in Win32 land, are forbidden from creating certain filenames, but everybody else running on the same NT kernel and sharing a filesystem with you is allowed to continue doing as they please, and so the APIs you're using /explicitly/ don't promise what you're relying on.

The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too.

In both cases you will /usually/ not see sequences that are crazy and undisplayable or not legal in some (non-filename) context, but you might, and when you do the OS vendor will say "I never promised otherwise". So you need defensive coding.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:41 UTC (Sat) by zooko (guest, #2589) [Link] (2 responses)

"The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too."

I don't think this is true. The bytes in the filenames in NT are defined to be UTF-16 encodings of characters. The bytes in the filenames in Mac are defined to be UTF-8 encodings. The bytes in the filenames in Linux are not defined to be any particular encoding. It isn't just a definitional issue -- the result is that reading a filename from the Windows or Mac filesystem can't cause you to lose information -- the filename you get in your application is guaranteed to be the same as the filename that is stored in the filesystem. On the other hand, when you read a filename from the filesystem in Linux, then you need to decide how to attempt to decode it, and there is no way to guarantee that your attempt won't corrupt the filename.

Please correct me if I'm wrong, because I'm intending to make the Tahoe p2p disk-sharing app depend on this guarantee from Windows (and from Mac), and to make it painfully work around the lack of this guarantee in Linux.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 17:45 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

To be quite clear about what I'm saying and what I'm not saying:

* I am saying that you can't guarantee that the filenames Windows gives you are all legal UTF-16 Unicode strings. Windows makes no such promise. Non-Win32 programs (including Win32 programs which also use native low-level APIs) may create files which don't obey the convention, and filenames on disk or from a network filesystem are not checked to see if they are valid UTF-16.

* I am NOT saying that there are people running Windows whose filenames are all in SJIS or ISO-8859-8 or even Windows codepage 1252. That would be silly because those encodings (and indeed practically all legacy encodings) are 8-bit and all filenames in Windows are 16-bit. When a Windows filename "means something" at all, the meaning will be encoded as UTF-16, or perhaps if you're really unlucky, UCS-2.

So if your problem is "People keep running my program with crazy locale settings and legacy encodings of filenames" well you have my sympathy, and yes you will need to handle this for Linux (even if only by writing a FAQ entry telling them to switch to UTF-8) and might get away without on Windows.

But if the problem is "My program blindly assumes filenames are legal Unicode strings" then you're in a bad way, stop doing that because it's a bug at least on Linux and Windows, and IMO most likely on Mac OS X too (though their documentation claims otherwise).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 18:54 UTC (Sat) by foom (subscriber, #14868) [Link]

> The bytes in the filenames in NT are defined to be UTF-16 encodings of characters.

That's not actually true. The windows APIs take arrays of 16-bit "things". Those are supposed to be
UTF-16, but none of the APIs will check that. So, you can easily create invalid surrogate pair
sequences. Now, it's a *lot* easier to ignore this issue on windows than on linux, because:

a) The set of invalid sequences in UTF-16 is a lot smaller than in UTF-8.
b) Nobody creates those by accident. It won't happen just because you set your LOCALE wrong.
c) the windows Unicode APIs are all 16-bit unicode, so they never try decoding the surrogate pair
sequences anyways
d) Even UTF-16->UTF-32 decoders often decode a lone surrogate pair in UTF-16 into a lone
surrogate pair value in UTF-32 (even though it's theoretically not supposed to do that).

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 27, 2009 7:10 UTC (Fri) by janpla (guest, #11093) [Link]

Isn't this just another complaint about not being able to do everything the way "I" want to do it? Depending on who this "I" is, the natural and obvious way to achieve something will be different, and I don't think it is reasonable to expect any technology to be everything for everybody.

The important thing in any theoretical framework is that it is orthogonal and logically complete; because then all it takes is clarity of mind to figure out the correct way to do things. A basic, and in my opinion very sound, principle in UNIX is that the system provides functionality, not policy. This means that unfortunately you can potentially do incredibly stupid things, but it also means that you are not excluded from doing incredibly clever things either.

UNIX is an operating system for adults who take responsibility and at least try to think before they jump. If you want a Fisher-Price interface where nothing can harm you, there are other options available.

Bad understanding of UTF-8

Posted Mar 27, 2009 22:56 UTC (Fri) by spitzak (guest, #4593) [Link] (7 responses)

The biggest error is blaming UTF-8 for what is really a stupid design decision by Posix.

An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.

The problem is the stupid Python guys who believe in magic fairy land where all UTF-8 is valid. This is also causing havoc with using Python3 for URLs and HTML. No, I'm sorry, if a file contains UTF-8, it is going to have invalid sequences. They need to get their heads out of their *** and do something so that invalid UTF-8 is preserved ALL THE TIME and never throws an exception, unless you specifically call "throw_exception_if_not_valid_utf8()".

Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution. This substitution does not really fix things because it does not do a round trip clean conversion. Supporting round-trip means your system cannot name invalid UTF-16 file names, and if you think those don't exist you are really living in a fantasy world!

I think therefore the escape character can easily be the UTF-8 encoding of 0xDCxx. This will not conflict with the above because all the escaped characters do not have the high bit set. This will survive a translation to UTF-16 and thus provides a way to put the exact same filenames on Windows UTF-16 filesystems.

His proposed rules for disallowed bytes seem pretty reasonable though I would not disallow any printing characters in the interior of the filename, backslash escaping works pretty good in there.

Bad understanding of UTF-8

Posted Mar 28, 2009 3:40 UTC (Sat) by njs (subscriber, #40338) [Link] (5 responses)

An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.

Are you sure?

Bad understanding of UTF-8

Posted Mar 30, 2009 16:08 UTC (Mon) by spitzak (guest, #4593) [Link] (4 responses)

Yes I am sure.

The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).

The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.

Bad understanding of UTF-8

Posted Mar 31, 2009 4:49 UTC (Tue) by njs (subscriber, #40338) [Link] (3 responses)

> Yes I am sure.

So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".

> The first two references are about programs failing to recognize overlong encodings as being invalid.

Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.

> But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).

I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.

> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.

Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.

Bad understanding of UTF-8

Posted Mar 31, 2009 17:59 UTC (Tue) by spitzak (guest, #4593) [Link] (2 responses)

I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!

I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.

I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.
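[The identity loss spitzak objects to is easy to demonstrate. A small sketch, using Python's lossy 'replace' error handler as the stand-in for "silently changed": two different invalid byte strings (each with a lone high-bit continuation byte) decode to the same text, while byte-level comparison keeps them distinct.]

```python
# Two *different* invalid byte strings: each embeds a stray
# continuation byte (high bit set) that is not valid UTF-8.
a = b"file\x80name"
b = b"file\x81name"

assert a != b  # as raw bytes, their identity is preserved

# A lossy decoder maps both to the same replacement-character
# string, so distinct filenames become indistinguishable.
assert a.decode("utf-8", errors="replace") == \
       b.decode("utf-8", errors="replace")
```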

Bad understanding of UTF-8

Posted Apr 1, 2009 5:12 UTC (Wed) by njs (subscriber, #40338) [Link] (1 responses)

> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.

Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".

> An overlong encoding consists of a leading byte with the high bit set. This is an error.

All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.

(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)

Bad understanding of UTF-8

Posted Apr 1, 2009 16:38 UTC (Wed) by spitzak (guest, #4593) [Link]

A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simpler and thus more reliable and less likely to have bugs.
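[A sketch of the byte-classing approach described above: a tokenizer that never decodes UTF-8, it simply lumps every byte >= 0x80 (valid or invalid alike) into the "word character" class, so invalid sequences pass through untouched. The function name and the exact character classes are illustrative, not from the original.]

```python
def split_words(data: bytes):
    """Split a byte string into words. Any byte with the high bit
    set is treated as a word character without ever decoding it,
    so invalid UTF-8 survives unchanged in the output."""
    words, current = [], bytearray()
    for byte in data:
        if byte >= 0x80 or chr(byte).isalnum() or byte == ord("_"):
            current.append(byte)          # word byte: keep accumulating
        elif current:
            words.append(bytes(current))  # separator: flush current word
            current = bytearray()
    if current:
        words.append(bytes(current))
    return words

# The invalid sequence 0xC0 0xAF comes out exactly as it went in.
print(split_words(b"ls \xc0\xaf tmp"))  # [b'ls', b'\xc0\xaf', b'tmp']
```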

Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DoS error. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.

And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.

Bad understanding of UTF-8

Posted Apr 15, 2009 10:38 UTC (Wed) by epa (subscriber, #39769) [Link]

> Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution.
They cannot 'leave the data in UTF-8' because it is not in UTF-8 to start with! If it contains invalid bytes then by definition it's not UTF-8. It is just a string of arbitrary bytes and certainly, yes, the application can treat it as such. That does make life difficult when you want to display the filename to the user or otherwise treat it as human-readable text.

And indeed, the Python developers are living in a magic fairy land where filenames are sanely encoded and are always human-readable text, but wouldn't it be better to change things so that this situation is no longer wishful thinking, but part of the ordinary things userspace can rely on? That is what Wheeler is proposing.
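[The 0xDCxx mapping mentioned above is what Python later standardized as PEP 383's 'surrogateescape' error handler: each invalid byte 0xNN becomes the lone surrogate U+DCNN, and re-encoding restores the original bytes exactly. A small demonstration of the lossless round trip:]

```python
raw = b"report\xff.txt"  # not valid UTF-8: stray 0xFF byte

# Each invalid byte 0xNN is smuggled through as U+DCNN...
name = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in name

# ...so encoding back recovers the original bytes, unlike
# 'replace' (lossy) or 'strict' (throws an error).
assert name.encode("utf-8", errors="surrogateescape") == raw
```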

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 11:04 UTC (Sat) by magnus (subscriber, #34778) [Link] (1 responses)

In my opinion, the problems described are not caused by which characters are allowed in filenames, but by the fragile process of converting file references to text strings and back again. This creates other types of problems as well, such as subtle race conditions.

There ought to be a different way of passing file references between programs and from programs to the kernel, so that conversion from text to file reference is only ever needed for hand-written file names.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 31, 2009 5:14 UTC (Tue) by njs (subscriber, #40338) [Link]

We have that -- that's what file descriptors are. It would be nice if programs passed them back and forth more often, but my guess is that they mostly get used where they should, and to make their use more ubiquitous you'd need to radically re-architect a lot of stuff. (If one wanted to be provocative, one could claim that the whole goal of EROS/Coyotos is to figure out what that re-architecting looks like.)
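[A sketch of the point above: once you hold a descriptor, the path string no longer matters. Here a directory fd anchors an openat-style lookup (Python's `dir_fd` parameter, POSIX-only), so the file is still reachable after the original path has been renamed away. The temporary-directory setup is illustrative.]

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "data"), "w") as f:
    f.write("hello")

dfd = os.open(tmp, os.O_RDONLY)   # a file reference, not a string
os.rename(tmp, tmp + ".moved")    # the old path string now dangles

# openat-style lookup relative to the held fd still finds the file,
# immune to the rename -- a stored path string would have broken.
fd = os.open("data", os.O_RDONLY, dir_fd=dfd)
content = os.read(fd, 5)
os.close(fd)
os.close(dfd)
print(content)  # b'hello'
```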


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds