
Wheeler: Fixing Unix/Linux/POSIX Filenames


Posted Mar 25, 2009 16:31 UTC (Wed) by mgross (guest, #38112)
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

I'm sure I'll get burned for asking, but how hard would it be to implement kernel code that rejects the creation of wonky file names? It seems like a simple thing to implement. (I guess it would add code to the file-create path and perhaps slow down some benchmarks.)

Also, this seems like a pretty sensible idea; why hasn't it been implemented already?



Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:50 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

Typically, the simpler and more obvious the idea, the longer it takes to overcome resistance and be implemented. (See for example making relatime the default, or the .desktop file security problems discussed a little while back on LWN.)

Some will argue that the answer is user education (teach your users not to use bad characters in filenames), and perhaps even a cron job you can run on your PDP-11 overnight to look for filenames containing these characters and send a message via local mail to the user responsible. Furthermore, if it was good enough for V7 UNIX, it's good enough for us now. (Note that in Plan 9, there are sensible restrictions on characters in filenames; but it's common for followers of a particular system or language to become rabidly conservative, even when the original designers of the system have moved on.)

In other words it is sheer inertia, and reluctance by any one Unix-like system to add such a feature when the others do not. You can bet that if Linux added a filename character check, it would immediately be branded 'broken' by many BSD or Solaris enthusiasts - not all, but certainly those that make the most noise online.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 2:19 UTC (Thu) by dirtyepic (guest, #30178) [Link] (1 responses)

Also see: "If it was such a good idea, someone else would have done it already. Therefore it must be flawed".

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 15:23 UTC (Thu) by dwheeler (guest, #1216) [Link]

"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." (George Bernard Shaw)

I'm well aware that this is a departure from historical practice. But that doesn't make past decisions correct for the present. So, let's chat about the pros and cons; I believe that the cons of "anything goes" now outweigh the pros.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 16:33 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (2 responses)

It would be a really nasty, expensive change, or else, it would be a token effort that's worthless for the very things it's supposed to fix.

To actually make this work, in the kernel (where you're perf critical and this is all unwanted overhead that's costing everyone who uses your "improved" system) you need to absolutely, as a matter of "Linus will veto if you don't" policy:

* Validate every filename to check that it conforms. This has to be done either at mount time, or when syscalls interact with the filenames (e.g. directory reading, and opening files). As a network file system client the OS must either screen every filename going over the network, or else punt and rely on promises from the server (if available).

* When you find an invalid filename, you need to deal with it, but it's not clear what the kernel should or even could do. Perhaps the file should just not exist as far as userspace is concerned, and fsck would unlink it?

Meanwhile application developers get no benefit for many years because of compatibility considerations. It could be a decade before it makes any sense to write a program which assumes one of the restrictions, and that's if EVERY SINGLE OS fixes this tomorrow. Wheeler mistakenly believes this is a POSIX problem, but it isn't; the problem exists everywhere that filenames are treated as opaque, which in fact includes Windows. (I have my doubts about OS X too, but its API documentation promises that filenames aren't opaque, so app developers who rely on that promise would be entitled to scream blue murder when someone finds a way to get non-Unicode into an OS X filename...*)

Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless? Once you've done this, you'll have the right mindset to tackle initial hyphens, control characters and so on from the same angle, rather than screwing the poor kernel into doing your dirty work and making everybody (including those of us for whom opaque filenames are just dandy) pay.

* Something that should make you pause: OS X's approach to filenames as Unicode strings makes Unicode composition/decomposition into an OS ABI feature. It had been doing this for years before Unicode actually pledged to stop changing the decomposition rules (i.e., until that happened, new versions of OS X made previously legal filenames illegal and vice versa, with no warning...)

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 29, 2009 14:31 UTC (Sun) by epa (subscriber, #39769) [Link]

Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place).
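To give a feel for how small such a check is, here is a minimal userspace sketch in Python (not kernel code, and not anything proposed in the article). The function name `is_sane_filename` is made up for illustration, and "control characters" is assumed to mean the C0 range 0x00-0x1F plus DEL (0x7F); whether C1 controls should also be rejected is a separate policy question.

```python
def is_sane_filename(name: bytes) -> bool:
    """Return True if a single path component is valid UTF-8
    and contains no C0 control characters or DEL.

    Hypothetical helper for illustration only.
    """
    # Empty names, '/' and NUL are already impossible in a path component.
    if not name or b"/" in name or b"\x00" in name:
        return False
    try:
        # Strict decoding rejects truncated, overlong and surrogate sequences.
        text = name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Assumption: "control character" means 0x00-0x1F or 0x7F.
    return not any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text)
```

The whole check is one linear pass over the name, which is the basis for the claim that it is cheap next to the cost of the open or create itself.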

Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.)

I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.)
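As a rough illustration of that policy, here is a userspace sketch in Python. The function `visible_names` and its `legacy_encoding` parameter (standing in for the mount option) are hypothetical names; a real implementation would live in the filesystem driver. One caveat the sketch does not solve: some legacy byte sequences happen to be valid UTF-8 and would be passed through unchanged.

```python
from typing import Iterable, List


def visible_names(raw_names: Iterable[bytes],
                  legacy_encoding: str = "latin-1") -> List[str]:
    """Apply the listing policy sketched above to raw directory entries.

    - valid UTF-8 names are shown as-is;
    - other names are reinterpreted in the configured legacy encoding;
    - names containing control characters are silently dropped.
    """
    shown = []
    for raw in raw_names:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            try:
                # Stand-in for the hypothetical mount parameter.
                text = raw.decode(legacy_encoding)
            except UnicodeDecodeError:
                continue  # not decodable in either encoding: hide it
        # Assumption: control characters are 0x00-0x1F and 0x7F.
        if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text):
            continue
        shown.append(text)
    return shown
```

For example, the Latin-1 bytes `b"caf\xe9"` are not valid UTF-8, so with the default setting they would be listed as `café`.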

> Meanwhile application developers get no benefit for many years because of compatibility considerations.
Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code.

> Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless?
This is true in principle but in thirty years of Unix, essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem.
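For what it's worth, this is roughly what "coping with all legal filenames" looks like once you leave the shell. The sketch below is in Python; both helper names are invented for illustration. It assumes names arrive NUL-delimited (as `find -print0` produces them) and that a leading hyphen is defused by prefixing `./` before the name is handed to another program.

```python
from typing import List


def split_nul(data: bytes) -> List[bytes]:
    """Split NUL-delimited output (e.g. from `find -print0`) into names.

    NUL is the only byte that cannot appear in a path, so spaces,
    newlines and other control characters survive intact.
    """
    return [part for part in data.split(b"\0") if part]


def guard_leading_dash(name: bytes) -> bytes:
    """Prefix './' so a relative name starting with '-' is not
    mistaken for a command-line option by the program receiving it."""
    return b"./" + name if name.startswith(b"-") else name
```

Names are kept as bytes throughout, so nothing here depends on the filename encoding. The point is not that this is hard, but that every script and tool has to remember to do it.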

OS X is something of a special case because of case insensitivity. If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 31, 2009 5:00 UTC (Tue) by njs (subscriber, #40338) [Link]

I think you're overcomplicating things -- I wouldn't implement UTF-8 requirements at the VFS level (it just doesn't make sense, since there manifestly exist filesystems where you don't know the encoding, both from pre-existing Linux installs and with "foreign" filesystems). I'd make it a filesystem feature -- a flag in the ext2/3/4 header that's set at mkfs time, say. That removes all the issues about translating invalid filenames -- if that flag is set and a filename is invalid, then *your filesystem is corrupt*. fsck can check for such corruption if it feels like it.

Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs that want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.
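To make the shape of that concrete, here is a toy model in Python. Everything in it is hypothetical: no such flag or query call exists in ext2/3/4 today, and `NamePolicy` and `utf8_only` are names invented purely for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamePolicy:
    """Toy model of a per-filesystem naming policy.

    `utf8_only` stands in for the hypothetical superblock flag
    that would be set at mkfs time.
    """
    utf8_only: bool

    def allows(self, name: bytes) -> bool:
        """Would this filesystem accept `name` at create time?"""
        if not self.utf8_only:
            return True  # today's behaviour: names are opaque bytes
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True
```

A program could then ask the policy object (the stand-in for the "is this filesystem utf8-only?" call) before assuming anything about the names it reads, and filesystems without the flag behave exactly as before.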

(It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)

