Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:56 UTC (Wed) by dwheeler (guest, #1216)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by nix
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
A few thoughts based on nix's comments...
I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:
Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
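To make the "you already have to encode anyway" point concrete, here is a minimal sketch of one possible encoding, assuming percent-encoding is acceptable for the store. The function names key_to_filename and filename_to_key are made up for illustration; they are not part of nix's system or of any existing tool.

```python
from urllib.parse import quote, unquote


def key_to_filename(key: str) -> str:
    """Encode an arbitrary key as a single, unsurprising path component."""
    if key == "":
        raise ValueError("an empty key has no filename")
    # safe="" makes quote() escape "/" as well; control characters, spaces
    # and non-ASCII characters all become %XX sequences.
    name = quote(key, safe="")
    # quote() leaves "-" and "." untouched, so escape them by hand when
    # they lead the name (option lookalikes, hidden files, "." and "..").
    if name[0] in "-.":
        name = "%{:02X}".format(ord(name[0])) + name[1:]
    return name


def filename_to_key(name: str) -> str:
    """Invert key_to_filename()."""
    return unquote(name)
```

With this scheme the key "-rf" is stored as "%2Drf" and "a/b" as "a%2Fb", so nothing in the directory starts with "-" or contains a control character. The sketch ignores the filesystem's name-length limit, which percent-encoding makes easier to hit.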
I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
That's exactly my point. Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.
The semantics of Unix filesystems have been fixed de facto for many years...
"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.
Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?
Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly. As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way. As far as "no control characters", there's no need to do anything locale-dependent; excluding 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?). Control characters unique to other locales don't bite people the way these characters do.
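The rule proposed here (no bytes 1-31, no 127, names are UTF-8) is small enough to sketch as a checker. This is only an illustration under those assumptions; is_portable_name and find_bad_names are hypothetical names, not an existing API.

```python
import os


def is_portable_name(name: bytes) -> bool:
    """True if a filename component has no ASCII control bytes and is valid UTF-8."""
    if not name:
        return False
    # Bytes 1-31 and 127 are the 7-bit ASCII control characters.
    # Byte 0 is tested too, although it can never occur in a real filename.
    if any(b < 32 or b == 127 for b in name):
        return False
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_bad_names(root: bytes):
    """Yield the path of every entry under root whose name fails the check."""
    # Passing bytes to os.walk() makes it return the raw names as bytes,
    # so undecodable names are seen exactly as stored.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_portable_name(name):
                yield os.path.join(dirpath, name)
```

Running find_bad_names(b".") before a convmv pass would list the names that still need attention. Note the check is purely byte-based and needs no locale information.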
You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.
I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place. I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
That's a very interesting idea, and I like it! In fact, the Linux kernel already has a mechanism that might do this job: extended attributes, which you read and set with getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
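Since no kernel implements this today, the following is only an in-memory model of the semantics I have in mind, not real filesystem code. The attribute name trusted.allow_bad_filenames, the Dir class and the privileged flag (standing in for CAP_SYS_ADMIN) are all assumptions made up for the sketch.

```python
from dataclasses import dataclass, field

# Hypothetical attribute name; nothing in Linux defines it.
ATTR = "trusted.allow_bad_filenames"


def is_bad(name: str) -> bool:
    """Leading "-" or any ASCII control character (1-31, 127)."""
    return name.startswith("-") or any(ord(c) < 32 or ord(c) == 127 for c in name)


@dataclass
class Dir:
    xattrs: dict = field(default_factory=dict)
    entries: dict = field(default_factory=dict)

    def setfattr(self, name: str, value: str, privileged: bool = False) -> None:
        # Model of the trusted.* namespace: only privileged callers may write.
        if name.startswith("trusted.") and not privileged:
            raise PermissionError("setting %s needs privilege" % name)
        self.xattrs[name] = value

    def _check(self, name: str) -> None:
        # Default policy: bad names are refused unless the directory opts in.
        if is_bad(name) and self.xattrs.get(ATTR) != "1":
            raise OSError("bad filename rejected: %r" % name)

    def create(self, name: str) -> None:
        self._check(name)
        self.entries[name] = None

    def mkdir(self, name: str) -> "Dir":
        self._check(name)
        # A new directory inherits the policy attribute of its parent.
        inherited = {k: v for k, v in self.xattrs.items() if k == ATTR}
        child = Dir(xattrs=inherited)
        self.entries[name] = child
        return child
```

In this model an ordinary directory refuses "-rf", while a store directory that root has marked with the attribute accepts it, and so do its subdirectories.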
