
Wheeler: Fixing Unix/Linux/POSIX Filenames


Posted Mar 25, 2009 21:56 UTC (Wed) by dwheeler (guest, #1216)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by nix
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

A few thoughts based on nix's comments...

I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:

Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
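One possible encoding, as a sketch (the helper name is mine, not from the article): hex-encode every byte of the key, so "/", control characters, and leading dashes can never reach the filesystem layer.

```shell
# Hex-encode each byte of a key into a filename-safe form.
# "a/b" becomes "%61%2f%62"; decoding is the trivial inverse.
encode_key() {
    printf '%s' "$1" | od -An -tx1 -v | tr -d ' \n' | sed 's/../%&/g'
}
```

This trades readability for safety; a scheme that escapes only the unsafe bytes would keep directory listings more debuggable.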

I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. (There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)

That's exactly my point. Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.
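The standard defense, for what it's worth (GNU find and tr assumed), is to NUL-terminate names so an embedded newline cannot split one filename into two records:

```shell
# One file whose name contains a newline...
dir=$(mktemp -d)
nl=$(printf '\nx'); nl=${nl%x}   # $(printf '\n') alone would strip the newline
: > "$dir/bad${nl}name"

# ...looks like two files to line-oriented tools:
naive=$(find "$dir" -type f | wc -l)                       # 2

# NUL termination counts it correctly:
safe=$(find "$dir" -type f -print0 | tr -dc '\0' | wc -c)  # 1
```

Most users running a casual "find | xargs" over such a directory will not have used -print0/-0, which is the grief in question.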

The semantics of Unix filesystems have been fixed de facto for many years...

"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.

Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?

Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly. As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way. As far as "no control characters" goes, there's no need to do anything locale-dependent; excluding 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?). Control characters unique to other locales don't bite people the way these characters do.
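A quick audit of that range is easy to script (a sketch; GNU find and grep are assumed, and the helper name is mine):

```shell
# Count filenames under a directory that contain bytes 1-31 or 127.
# grep -z splits input on NUL, so names with embedded newlines are handled.
bad_count() {
    find "${1:-.}" -mindepth 1 -print0 2>/dev/null |
        LC_ALL=C grep -cz '[[:cntrl:]]'
}
```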

You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.

I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place. I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.

But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.

That's a very interesting idea, I like it! In fact, there's already a mechanism in the Linux kernel that might do this job: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
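The check such a policy would apply is simple; as a sketch (the directory attribute itself is hypothetical, and name_ok is my name for the check):

```shell
# Reject names that begin with "-" or contain control characters,
# the default "prevent bad filenames" rule suggested above.
name_ok() {
    case $1 in
        -*)            return 1 ;;  # could later be parsed as an option
        *[[:cntrl:]]*) return 1 ;;  # bytes 1-31 and 127 in the C locale
        *)             return 0 ;;
    esac
}
```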



Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 3:07 UTC (Thu) by drag (guest, #31333)

Mac OS X did away with case-sensitive filenames by default, and while that's annoying if you're porting software, I have not heard many complaints about it, and software has been fixed.

And that is much more extreme than having a filesystem mount option to stop tabs and newlines from being used in file names.

It'll also be future-proof, as much as that matters. You don't make a whitelist of allowed characters; you make a blacklist of troublesome characters and allow everything else. If more troublesome characters are ever created, which is very unlikely, you can add them to the blacklist (and even if that does happen it will be exceedingly rare). Any new characters that get made, or any new encodings, will just be allowed to slide on through.

I mean, if you have a future encoding scheme that conflicts with a previously established and well-known encoding such as ASCII, then it is just too dumb to be supported by anybody.

-----------------

Here is a challenge:

Somebody write me a script that will go and count all the uses of tabs, <, >, and newlines in their file names on their systems...
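Taking up the challenge, here's one way to do it as a sketch (GNU or BSD find assumed for -mindepth; the function name is mine):

```shell
#!/bin/sh
# Count filenames containing a tab, "<", ">", or a newline.
tab=$(printf '\t')
nl=$(printf '\nx'); nl=${nl%x}   # printf alone would strip a trailing newline

count_bad_names() {
    # Emit one "x" per matching name and count bytes, so an embedded
    # newline in a name cannot corrupt the count.
    find "${1:-.}" -mindepth 1 \
        \( -name "*${tab}*" -o -name '*<*' -o -name '*>*' -o -name "*${nl}*" \) \
        -exec printf x \; | wc -c
}
```

Run it as `count_bad_names "$HOME"` (or point it at / and go get coffee).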

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 22:12 UTC (Thu) by nix (subscriber, #2304)

David, thanks for responding.
Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
I claim mental block: this solution became obvious to me a day or so back. (Rather, since I already use . as a metacharacter to mean 'private', use .. to mean 'extra-private: metadata'. Yes, this too is bizarre, but at least it's not dash-prepended.)

But I have seen a system in production use at Big Banks (first saw it yesterday, or first noticed it, probably thanks to this conversation) that uses the filesystem as a base-254 key-to-value store. It's gross, but it's sometimes done.

But then, we know how competent Big Banks are. (Especially this one, did you but know who it was.)

As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved).
Yes, but this only works if you can mandate a no-encoding transparent view of filenames! As soon as you start to automatically encode them, this sort of transcoding is impossible.

I have no objection to making the things you propose options. What I object to is making them mandatory, because this would make some things impossible. (Strange things, but still.)

In fact, there's already a mechanism in the Linux kernel that might do this job: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
It depends how harsh the limits are. I'd say that 'no control characters' is certainly reasonable to have only the superuser lift. Perhaps a less harsh constraint to impose is that regular users cannot set this attribute on directories readable by 'other', and that chmodding a directory after the fact strips this attribute off it. Now users cannot dump landmines in that directory for users outside their group (root is assumed to know what he's doing).

I'd say that setting this attribute flips a pathconf-viewable attribute as well, so that other POSIX-compliant systems can adopt the same approach and applications can portably query it without needing to implement/depend on all of the ACL machinery.
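Existing pathconf() variables are already visible from the shell via getconf, which shows how such a flag could surface to scripts (the restriction flag itself is hypothetical; NAME_MAX is just an existing example):

```shell
# Query a per-path pathconf() limit; a "restricted filenames" flag
# could be exposed the same way.
getconf NAME_MAX /tmp   # commonly 255 on Linux filesystems
```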

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 28, 2009 15:36 UTC (Sat) by tialaramex (subscriber, #21167)

It's always worth telling people this, because it tends to make them rock back on their heels if they've been (wrongly) believing that NT is doing something special here.

NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, while on Linux they're obviously 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

And the consequence is the same thing being lamented in this article - badly written Windows programs crash or do insane things when faced with filenames that don't look like the ones the poor third-rate programmer who wrote the code was familiar with. In the absence of defensive programming this software also doesn't like leap years, or leap seconds, or files that are more than 2GB long, or... you could go on all day; badly written programs suck.

On encodings - I encourage you to use UTF-8. I encourage people with other encodings to migrate to UTF-8, but using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things. People who can't keep them separate are doing a bad job, whether handling filenames or displaying email.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 29, 2009 14:36 UTC (Sun) by epa (subscriber, #39769)

NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, while on Linux they're obviously 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun.
using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.
Indeed. Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers.
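The check itself is cheap, for what it's worth; a sketch using iconv (the helper name is mine):

```shell
# Validate that a byte string is well-formed UTF-8 by round-tripping
# it through iconv; the exit status is the verdict.
is_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```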

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 10:55 UTC (Mon) by nye (guest, #51576)

>Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?

Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.

Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.

(I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)

http://en.wikipedia.org/wiki/Native_API

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 15:13 UTC (Mon) by foom (subscriber, #14868)

>> Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?
> Yes. This is what the POSIX subsystems for NT do

You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for CreateFile. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it.

(the SFU installer asks if you want to do this, but even SFU has no special dispensation)

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 15, 2009 0:06 UTC (Sun) by yuhong (guest, #57183)

Another trick you can use with CreateFile is to start the filename with \\.\. If that is done, the only processing done on the filename before CreateFile calls NtCreateFile with the name is that \\.\ is replaced with \??\, which is an alias of \DosDevices\.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 14, 2009 23:58 UTC (Sat) by yuhong (guest, #57183)

"files that are more than 2GB long"
Yep, NT had supported both files and disks larger than 2GB from the first
version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of
the GetDiskFreeSpace and SetFilePointer APIs (look them up using MSDN or
Google), both of which has existed since NT 3.1. Which one was so much more
error-prone that the versions of Windows released in 1996 had to cap the
result to 2GB, even though older versions of NT supported returning more than
2GB using it, and why?


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds