LWN: Comments on "Wheeler: Fixing Unix/Linux/POSIX Filenames" https://lwn.net/Articles/325304/ This is a special feed containing comments posted to the individual LWN article titled "Wheeler: Fixing Unix/Linux/POSIX Filenames". en-us Tue, 11 Nov 2025 04:50:07 +0000 Tue, 11 Nov 2025 04:50:07 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/362018/ https://lwn.net/Articles/362018/ nix <div class="FormattedComment"> ksh93 was too sodding hard to require because building it was a nightmare. <br> At the time it wasn't free enough either.<br> <p> </div> Sun, 15 Nov 2009 13:15:57 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/362006/ https://lwn.net/Articles/362006/ yuhong <div class="FormattedComment"> "ksh was too buggy (thanks, Linux, for pdksh, with its broken <br> propagation of variables out of loops-with-redirection)"<br> Was ksh93 tried?<br> </div> Sun, 15 Nov 2009 01:06:04 +0000 Leading spaces are common, actually https://lwn.net/Articles/362000/ https://lwn.net/Articles/362000/ yuhong <div class="FormattedComment"> "Classic Mac OS loads files in /System Folder/Extensions in lexicographic <br> order, and the load order matters, and the leading space trick is used very <br> frequently there. 
"<br> Yep, look at what they had to do about this when Apple introduced HFS+ in Mac <br> OS 8.1:<br> <a rel="nofollow" href="http://developer.apple.com/legacy/mac/library/technotes/tn/tn1121.html#HFSPlu">http://developer.apple.com/legacy/mac/library/technotes/t...</a><br> s<br> <a rel="nofollow" href="http://developer.apple.com/legacy/mac/library/technotes/tn/tn1123.html">http://developer.apple.com/legacy/mac/library/technotes/t...</a><br> <p> </div> Sun, 15 Nov 2009 00:32:25 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/361998/ https://lwn.net/Articles/361998/ yuhong <div class="FormattedComment"> Another trick you can use with CreateFile is to start the filename with \\.\. <br> If that is done, the only processing done on the filename before CreateFile <br> calls NtCreateFile with the name is that \\.\ is replace with \??\, which is <br> an alias of \DosDevices\.<br> </div> Sun, 15 Nov 2009 00:06:30 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/361997/ https://lwn.net/Articles/361997/ yuhong <div class="FormattedComment"> "files that are more than 2GB long"<br> Yep, NT had supported both files and disks larger than 2GB from the first <br> version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of <br> the GetDiskFreeSpace and SetFilePointer APIs (look them up using MSDN or <br> Google), both of which has existed since NT 3.1. 
Which one was so much more <br> error-prone that the versions of Windows released in 1996 had to cap the <br> result to 2GB, even though older versions of NT supported returning more than <br> 2GB using it, and why?<br> </div> Sat, 14 Nov 2009 23:58:04 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/328525/ https://lwn.net/Articles/328525/ epa <blockquote>Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution.</blockquote>They cannot 'leave the data in UTF-8' because it is not in UTF-8 to start with! If it contains invalid bytes then by definition it's not UTF-8. It is just a string of arbitrary bytes and certainly, yes, the application can treat it as such. That does make life difficult when you want to display the filename to the user or otherwise treat it as human-readable text. <p> And indeed, the Python developers are living in a magic fairy land where filenames are sanely encoded and are always human-readable text, but wouldn't it be better to change things so that this situation is no longer wishful thinking, but part of the ordinary things userspace can rely on? That is what Wheeler is proposing. Wed, 15 Apr 2009 10:38:01 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/327173/ https://lwn.net/Articles/327173/ anton <blockquote>It could also recognise the null character as an argument separator as in 'find -print0'.</blockquote> A few weeks ago I wanted to process my .ogg files which contain all kinds of characters that are treated as meta-characters by the shell or other programs I use in shell scripts. 
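The NUL-separator technique quoted above ('find -print0') can be illustrated in a few lines. The sketch below uses Python and made-up hostile filenames; it shows why NUL is the only safe separator: it is the one byte that can never occur in a filename.

```python
import os
import tempfile

# Made-up hostile names: a space, a quote, and an embedded newline.
d = tempfile.mkdtemp()
for name in ["a b.ogg", "it's.ogg", "new\nline.ogg"]:
    open(os.path.join(d, name), "w").close()

# A newline-separated listing is ambiguous: the name containing "\n"
# splits into two bogus entries.
newline_stream = b"\n".join(os.fsencode(n) for n in os.listdir(d))

# A NUL-separated stream (the framing `find -print0` emits) round-trips
# exactly, because NUL cannot appear inside a filename.
nul_stream = b"\0".join(os.fsencode(n) for n in os.listdir(d))

print(len(newline_stream.split(b"\n")))  # 4 (one name was split in two)
print(len(nul_stream.split(b"\0")))      # 3 (all names recovered intact)
```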
I eventually ended up writing a new shell <a href="http://www.complang.tuwien.ac.at/forth/programs/dumbsh.fs">dumbsh</a> that uses NUL as argument separator, and feeding it from find, with some intermediate processing in awk (which is quite flexible about meta-characters). Fri, 03 Apr 2009 18:49:33 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326932/ https://lwn.net/Articles/326932/ forthy <p>It is actually not that bad. As collating sequence, ß=ss (i.e. Mass and Maß sort to the same bin). Except for Austrian telephone books, where ß follows ss, but comes before st (though St. follows Sankt ;-).</p> <p>However, there's a huge mess in the CJK part of UCS: short and long forms of the same character (sometimes even a special variant for the Japanese character). This should never have happened; the different forms of the same character should be encoded in fonts, not in UCS. So far, not even Mac OS X normalizes these characters, but it is obvious that a mainland China file called "&#20013;&#22269;" and a Taiwan file called "&#20013;&#22283;" not only mean the same, but they also refer to the same word, and can be interchanged at will (see for example the Chinese wikipedia entry: the lemma is the short form, the headline is the long form). And it is not easy to access long and short forms with usual input methods (mainland China: Pinyin, Canton: Cantonese Pinyin (gives traditional characters, but you need to know Cantonese), etc.).</p> Thu, 02 Apr 2009 15:54:06 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326695/ https://lwn.net/Articles/326695/ spitzak <div class="FormattedComment"> A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. 
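A short demonstration of the property this relies on (Python for brevity): every byte of a multi-byte UTF-8 sequence has the high bit set, so an ASCII delimiter such as '/' can never be a fragment of a multi-byte character, and byte-oriented splitting is safe even without decoding.

```python
s = "naïve/日本".encode("utf-8")

# Every byte of the non-ASCII characters has the high bit set...
assert all(b >= 0x80 for b in "ï日本".encode("utf-8"))

# ...so splitting the raw bytes on an ASCII byte never cuts a
# multi-byte character in half.
left, right = s.split(b"/")
assert left.decode("utf-8") == "naïve"
assert right.decode("utf-8") == "日本"
```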
This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simpler and thus more reliable and less likely to have bugs.<br> <p> Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DoS. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.<br> <p> And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.<br> <p> <p> </div> Wed, 01 Apr 2009 16:38:38 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326538/ https://lwn.net/Articles/326538/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.</font><br> <p> Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".<br> <p> <font class="QuotedText">&gt; An overlong encoding consists of a leading byte with the high bit set. This is an error.</font><br> <p> All characters with codepoint &gt;= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. 
In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.<br> <p> (You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)<br> </div> Wed, 01 Apr 2009 05:12:40 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326499/ https://lwn.net/Articles/326499/ nix <div class="FormattedComment"> I've contributed fixes now and then, but I just read a lot. :) The <br> projects are public, after all.<br> <p> <p> </div> Tue, 31 Mar 2009 19:28:56 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326477/ https://lwn.net/Articles/326477/ spitzak <div class="FormattedComment"> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.<br> <p> An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!<br> <p> I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.<br> <p> I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. 
This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.<br> <p> </div> Tue, 31 Mar 2009 17:59:20 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326395/ https://lwn.net/Articles/326395/ mjthayer <div class="FormattedComment"> I was wondering now whether to ask about this on the Bash mailing lists. Just out of interest, are you involved with the development of Bash/the GNU tools in any way? You seem well informed about them.<br> </div> Tue, 31 Mar 2009 07:47:30 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326380/ https://lwn.net/Articles/326380/ njs <div class="FormattedComment"> We have that -- that's what file descriptors are. It would be nice if programs passed them back and forth more often, but my guess is that they mostly get used where they should, and to make their use more ubiquitous you'd need to radically re-architect a lot of stuff. (If one wanted to be provocative, one could claim that the whole goal of EROS/Coyotos is to figure out what that re-architecting looks like.)<br> <p> </div> Tue, 31 Mar 2009 05:14:50 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326376/ https://lwn.net/Articles/326376/ njs <div class="FormattedComment"> I think you're overcomplicating things -- I wouldn't implement UTF-8 requirements at the VFS level (it just doesn't make sense, since there manifestly exist filesystems where you don't know the encoding, both from pre-existing Linux installs and with "foreign" filesystems). I'd make it a filesystem feature -- a flag in the ext2/3/4 header that's set at mkfs time, say. That removes all the issues about translating invalid filenames -- if that flag is set and a filename is invalid, then *your filesystem is corrupt*. 
fsck can check for such corruption if it feels like it.<br> <p> Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs who want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.<br> <p> (It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)<br> <p> </div> Tue, 31 Mar 2009 05:00:40 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326375/ https://lwn.net/Articles/326375/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Yes I am sure.</font><br> <p> So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".<br> <p> <font class="QuotedText">&gt; The first two references are about programs failing to recognize overlong encodings as being invalid.</font><br> <p> Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.<br> <p> <font class="QuotedText">&gt; But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).</font><br> <p> I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). 
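For concreteness, the over-long sequences at issue are easy to exhibit; the sketch below uses Python's built-in UTF-8 codec, which does perform the check. 0xC0 0xAF is an over-long (and therefore invalid) encoding of '/': a decoder that accepts it lets a slash sneak past path-sanitizing filters.

```python
overlong_slash = b"\xc0\xaf"  # over-long encoding of "/" (U+002F)

# A strict decoder must reject this sequence rather than decode it to "/".
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)  # True
```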
The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.<br> <p> <font class="QuotedText">&gt; The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.</font><br> <p> Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.<br> <p> </div> Tue, 31 Mar 2009 04:49:07 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326312/ https://lwn.net/Articles/326312/ rickmoen mrshiny wrote: <p><em>You can pry my spaces from my filenames out of my cold dead fingers.</em> <p>ObMenInBlack: "Your offer is acceptable." <p>(I remember having to write AppleScript to recurse through directories cleaning up files created on network shares by MacOS-using munchkins who put space characters at the <em>ends</em> of filenames, in order for them to become valid filenames when seen by MS-Windows-using employees looking at the same network shares. The converse problem was files, from MS-Windows users, with names containing colon, which is a reserved character in MacOS file namespace. What a pain in the tochis.) <p>Rick Moen<br> rick@linuxmafia.com Mon, 30 Mar 2009 19:36:47 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326227/ https://lwn.net/Articles/326227/ Hawke <div class="FormattedComment"> I don't think any DOS applications use backslash for their option marker. Some use dash, and most use slash. 
But I'm pretty sure that practically none, if any, use backslash.<br> </div> Mon, 30 Mar 2009 16:41:21 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326202/ https://lwn.net/Articles/326202/ spitzak <div class="FormattedComment"> Yes I am sure.<br> <p> The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoder's error; a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).<br> <p> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.<br> <p> <p> </div> Mon, 30 Mar 2009 16:08:23 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326190/ https://lwn.net/Articles/326190/ foom <font class="QuotedText">&gt;&gt; Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? <br>&gt; Yes. This is what the POSIX subsystems for NT do </font> <p> You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for <a href="http://msdn.microsoft.com/en-us/library/aa363858(VS.85).aspx">CreateFile</a>. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. 
You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it. <p> (the SFU installer asks if you want to do this, but even SFU has no special dispensation) Mon, 30 Mar 2009 15:13:36 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326161/ https://lwn.net/Articles/326161/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?</font><br> <p> Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.<br> <p> Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.<br> <p> (I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)<br> <p> <a href="http://en.wikipedia.org/wiki/Native_API">http://en.wikipedia.org/wiki/Native_API</a><br> </div> Mon, 30 Mar 2009 10:55:30 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326129/ https://lwn.net/Articles/326129/ njs <div class="FormattedComment"> You cannot, in general, convert a filename to text. 
That's the fundamental problem that any of the proposals would solve.<br> <p> </div> Mon, 30 Mar 2009 00:07:22 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326127/ https://lwn.net/Articles/326127/ epa <blockquote>No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.</blockquote>Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible to in all cases return the correct filename to the user, matching the original one chosen by the user. If you pick a known encoding everywhere (UTF-8 being the obvious choice) then the problem goes away. <blockquote>This doesn't represent a security problem.</blockquote>Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate to the issue of how to encode characters that don't fit in ASCII. Sun, 29 Mar 2009 22:37:31 +0000 Re: Not A System Problem https://lwn.net/Articles/326126/ https://lwn.net/Articles/326126/ nix <div class="FormattedComment"> You don't get it. In order to permit / and \0 as valid filename <br> characters, syscalls like open() must change. Library calls like fopen() <br> have to change, because they too accept a \0-terminated string, with /s <br> separating path components. Every single call in every library that <br> accepts pathnames has to change. Probably the very notion of a string has <br> to change to something non-\0-terminated.<br> <p> So whatever you're describing, userspace cannot any longer use standard <br> POSIX calls: in fact, it can't any longer use ANSI C calls! 
I suspect that <br> such a system would be almost unusable with C, simply because you couldn't <br> use C string literals for anything.<br> <p> If you want VMS, you know where to find it.<br> <p> </div> Sun, 29 Mar 2009 22:32:28 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326123/ https://lwn.net/Articles/326123/ foom <div class="FormattedComment"> Eh...but OSX *does* run 40+ years of UNIX programs. It's pretty clear that the change to require <br> UTF-8 (and even the change to be case insensitive!) didn't bother most programs.<br> <p> </div> Sun, 29 Mar 2009 22:07:27 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326122/ https://lwn.net/Articles/326122/ clugstj <div class="FormattedComment"> No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be. This doesn't represent a security problem.<br> </div> Sun, 29 Mar 2009 21:58:46 +0000 Conventions are great! Let's go back to FAT! https://lwn.net/Articles/326121/ https://lwn.net/Articles/326121/ clugstj <div class="FormattedComment"> "UNIX is not broken. Your head, on the other hand, is"<br> <p> Wow, childish personal attacks. How droll.<br> <p> "Number of correct scripts is not important metric. Number of bad scripts is"<br> <p> I would think that the percentage of each would (possibly) be a useful metric. But, what is the damage from these "bad scripts"? If you are writing shell scripts that MUST be absolutely bullet-proof from bad input, perhaps because they run setuid-root, then you are already making a much worse mistake than the possible bugs in the script.<br> <p> Still don't understand the FAT reference. 
Sorry, maybe I'm just slow.<br> </div> Sun, 29 Mar 2009 21:44:21 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326120/ https://lwn.net/Articles/326120/ clugstj <div class="FormattedComment"> OS X is trivial to handle. It only has to continue to work in a compatible way with the previous Mac OS - which wasn't UNIX. So using it as an example of how to "fix" these problems is not a good idea if you care about supporting 40+ years of UNIX programs - which is why this is difficult to change.<br> </div> Sun, 29 Mar 2009 21:30:44 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326119/ https://lwn.net/Articles/326119/ clugstj <div class="FormattedComment"> I'm sorry, but when you said that any of these propositions is better than the current situation, I HAD to disagree. In what way is the current situation so bad that any proposal is better than the current situation?<br> </div> Sun, 29 Mar 2009 21:27:08 +0000 Re: Not A System Problem https://lwn.net/Articles/326109/ https://lwn.net/Articles/326109/ ldo <P>nix wrote:</P> <BLOCKQUOTE><FONT STYLE="color : #0000FF"><P>What you're describing is not POSIX anymore.</P></FONT></BLOCKQUOTE> <P>Nothing to do with POSIX. POSIX is a userland API, it doesn’t dictate how the kernel should work.</P> Sun, 29 Mar 2009 19:47:30 +0000 Simplicity is better than complexity. https://lwn.net/Articles/326092/ https://lwn.net/Articles/326092/ epa <div class="FormattedComment"> To check for control characters (using unsigned char, so that bytes above 127 are not misread as negative)<br> <p> for (const unsigned char *c = (const unsigned char *)filename; *c; c++)<br> if (*c &lt; 32) return EINVAL;<br> <p> Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):<br> <p> for (const unsigned char *c = (const unsigned char *)filename; *c; c++)<br> if (*c &lt; 32 || *c == '&lt;' || *c == '&gt;' || *c == '|') return EINVAL;<br> if (filename[0] == '-') return EINVAL;<br> <p> To check valid UTF-8 is a little more complex, but not much. 
You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See &lt;<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">http://www.cl.cam.ac.uk/~mgk25/unicode.html</a>&gt; for a list of valid byte sequences. The code would be something like<br> <p> /* First pad the filename with 5 extra NUL bytes at the end. Then, */<br> int is_cont(unsigned char c) { return 128 &lt;= c &amp;&amp; c &lt; 192; }<br> const unsigned char *p = (const unsigned char *)filename;<br> while (*p) {<br> if (*p &lt; 128) ++p;<br> else if (192 &lt;= *p &amp;&amp; *p &lt; 224 &amp;&amp; is_cont(p[1])) p += 2;<br> else if (224 &lt;= *p &amp;&amp; *p &lt; 240 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])) p += 3;<br> else if (240 &lt;= *p &amp;&amp; *p &lt; 248 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3])) p += 4;<br> else if (248 &lt;= *p &amp;&amp; *p &lt; 252 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3]) &amp;&amp; is_cont(p[4])) p += 5;<br> else if (252 &lt;= *p &amp;&amp; *p &lt; 254 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3]) &amp;&amp; is_cont(p[4]) &amp;&amp; is_cont(p[5])) p += 6;<br> else return EINVAL;<br> }<br> <p> For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.<br> </div> Sun, 29 Mar 2009 15:03:37 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326091/ https://lwn.net/Articles/326091/ epa <blockquote>a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display</blockquote>Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever. 
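That ambiguity is easy to make concrete (Python sketch; the byte values are purely illustrative): the same two bytes are one CJK character under Shift-JIS, two Latin-1 code points under Latin-1, and not decodable as UTF-8 at all.

```python
raw = b"\x93\xfa"  # some filename's bytes; the encoding is unknown

as_sjis = raw.decode("shift_jis")   # one character: 日
as_latin1 = raw.decode("latin-1")   # two characters of mojibake

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                 # not valid UTF-8 at all

print(as_sjis, len(as_latin1), utf8_ok)
```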
<p> Imagine removing the character encoding headers from the http protocol. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere. Sun, 29 Mar 2009 14:43:25 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326090/ https://lwn.net/Articles/326090/ epa <blockquote> NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit. <p> Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.</blockquote>Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun. <blockquote>using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.</blockquote>Indeed. 
Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers. Sun, 29 Mar 2009 14:36:20 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326087/ https://lwn.net/Articles/326087/ epa Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place). <p> Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.) <p> I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.) 
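The translation step such a mount parameter implies can be sketched in a few lines (Python; 'latin-1' stands in for whatever legacy codec the mount specifies):

```python
def to_utf8(on_disk_name: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Re-encode a legacy-encoded on-disk name as UTF-8."""
    return on_disk_name.decode(legacy_encoding).encode("utf-8")

# é is 0xE9 in Latin-1 but 0xC3 0xA9 in UTF-8; ASCII passes through unchanged.
assert to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
assert to_utf8(b"plain") == b"plain"
```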
<blockquote>Meanwhile application developers get no benefit for many years because of compatibility considerations.</blockquote>Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code. <p> <blockquote>Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless ?</blockquote> This is true in principle but in thirty years of Unix, essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem. <p> OS X is something of a special case because of case insensitivity. 
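The composition wrinkle behind OS X's special status (HFS+ famously stores names in a variant of Unicode's decomposed form) can be seen directly with Python's unicodedata module; an illustrative sketch:

```python
import unicodedata

precomposed = "caf\u00e9"  # 'é' as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' + combining accent

# Identical to a human, but different code point sequences -- so a plain
# byte comparison of the two filenames would treat them as distinct.
assert precomposed != decomposed
assert len(precomposed) == 4 and len(decomposed) == 5
assert unicodedata.normalize("NFC", decomposed) == precomposed
```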
If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt. Sun, 29 Mar 2009 14:31:15 +0000 Re: Not A System Problem https://lwn.net/Articles/326088/ https://lwn.net/Articles/326088/ nix <div class="FormattedComment"> What you're describing is not POSIX anymore. Every single POSIX app would <br> need rewriting, for essentially zero gain (ooh, you can't have nulls in <br> filenames: that's why UTF-8 is *defined* to avoid nulls in filenames).<br> <p> I'm sure users would love not being able to type in pathnames anymore, <br> too.<br> <p> Good luck getting anyone to do it.<br> <p> </div> Sun, 29 Mar 2009 13:54:54 +0000 Re: Not A System Problem https://lwn.net/Articles/326082/ https://lwn.net/Articles/326082/ ldo <P>nix wrote:</P> <BLOCKQUOTE><FONT STYLE="color : #0000C0"><P>Um, if you remove the prohibition on nulls, how do you end the filename? This isn't Pascal.</P></FONT></BLOCKQUOTE> <P>Nothing to do with Pascal. C is perfectly capable of dealing with arbitrary data bytes, otherwise large parts of both kernel and userland code wouldn’t work.</P> <BLOCKQUOTE><FONT STYLE="color : #0000C0"><P>And if you remove the prohibition on slashes, how do you distinguish between a file called foo/bar and a file called bar in a subdirectory foo?</P></FONT></BLOCKQUOTE> <P>Simple. The kernel-level filesystem calls will not take a full pathname. Instead, they will take a parent directory ID and the name of an item within that directory. Other OSes, like VMS and old MacOS, were doing this sort of thing decades ago.</P> <P>Full pathname parsing becomes a function of the userland runtime. 
The kernel no longer cares what the pathname separator, or even what the pathname syntax, might be.</P> Sun, 29 Mar 2009 10:30:45 +0000 At last, a hope of progress https://lwn.net/Articles/326059/ https://lwn.net/Articles/326059/ mikachu <div class="FormattedComment"> On days when I'm feeling paranoid I always say ./* instead of just *, especially when talking to /bin/rm. On the other hand, touch -- -i in directories where you have important files is a nice trick too.<br> </div> Sun, 29 Mar 2009 00:01:58 +0000 Meta-discussion https://lwn.net/Articles/326057/ https://lwn.net/Articles/326057/ man_ls Hmmm, I'm not so sure. I feel strongly about ext4 losing data, but I don't have a strong opinion about this issue. Really. Not for lack of sensitivity to the problem -- I've had an administrator at work erase a whole directory of files because of a leading space (so that 'rm -rf /dir/file' became 'rm -rf /dir/ file'). But there are advantages and disadvantages, and I cannot pick a side. <p> Bojan has only posted once, and his message contains the words "not sure". I would say that this debate attracts a different subset of (opinionated) people. Sat, 28 Mar 2009 22:21:03 +0000 Leading spaces are common, actually https://lwn.net/Articles/326052/ https://lwn.net/Articles/326052/ nix <div class="FormattedComment"> It's called 'sort by version' because the function it calls (strverscmp()) <br> was designed to sort version numbers, and because the expected use of <br> ls -v was sorting a directory full of version-named directories in version <br> order.<br> <p> (And you're right on the collation sort thing: I spoke carelessly.)<br> <p> </div> Sat, 28 Mar 2009 20:36:45 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326049/ https://lwn.net/Articles/326049/ dwheeler Thanks for your comments! In particular, you're absolutely right about swapping the order of \t and \n in IFS - that makes it MUCH simpler. 
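For readers who haven't seen the tricks above: once the shell expands a glob, a filename beginning with a dash is indistinguishable from an option, which is exactly what `touch -- -i` exploits defensively (a file named `-i` turns a stray `rm *` into an interactive `rm -i`). A small demonstration in a throwaway directory:

```shell
cd "$(mktemp -d)"
touch -- -rf important.txt

# Dangerous: the glob expands to "-rf important.txt", and rm parses
# "-rf" as options rather than as a filename to delete.
# rm *

# Safe: every expansion starts with "./", which can never be mistaken
# for an option, so both files (including the one named "-rf") go away.
rm ./*
```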
I prefer IFS=`printf '\n\t'` because then it's immediately obvious that \n and \t are the new values. I've put that into the document, with credit. Sat, 28 Mar 2009 19:50:37 +0000
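The setting can be demonstrated directly. Note why the `\n` must come first: command substitution strips trailing newlines, so `IFS=\`printf '\t\n'\`` would leave IFS holding only a tab, while a trailing tab survives intact:

```shell
IFS=`printf '\n\t'`   # IFS is now newline + tab; space no longer separates

list="one file.txt
another file.txt"

# With the default IFS this loop would split each name at the space;
# with newline+tab, whole lines survive word splitting intact.
for f in $list; do
  printf '<%s>\n' "$f"
done
# prints:
# <one file.txt>
# <another file.txt>
```

The usual caveat applies: this tames spaces, but filenames containing embedded newlines or tabs still split, which is part of why the article argues for banning them outright.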