
Wheeler: Fixing Unix/Linux/POSIX Filenames

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 16:42 UTC (Wed) by epa (subscriber, #39769)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by jreiser
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

Some of my programs use such "bad" filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them.
Can you give an example?

There is a certain old-school appeal in just being able to use the filesystem as a key-value store with no restrictions on what bytes can appear in the key. But it's spoiled a bit by the prohibition of NUL and / characters, and trivially you can adapt such code to base64-encode the key into a sanitized filename. It may look a bit uglier, but if only application-specific programs and the OS access the files anyway, that does not matter.
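A sketch of that base64 adaptation in POSIX shell, assuming `base64(1)` and `tr(1)` are available (the `tr` step remaps '/' and '+', the two base64 alphabet characters that are awkward in filenames, to '_' and '-'):

```shell
#!/bin/sh
# Turn an arbitrary key into a filename-safe string. NUL can't survive
# a shell variable anyway; '/' in the key is handled by the encoding.
encode_key() {
    printf '%s' "$1" | base64 | tr -d '\n' | tr '/+' '_-'
}

encode_key 'path/with/slashes'   # -> cGF0aC93aXRoL3NsYXNoZXM=
echo
```

The trailing '=' padding is legal in a filename; decoding is the same pipeline in reverse (undo the `tr` mapping, then `base64 -d` with GNU coreutils).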

If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
It's also possible for an iso9660 CD-ROM to have filenames containing the / character, or at least, I possess such a disc. This shows that in general there is a need for Linux to sanitize filenames coming from alien filesystems.



Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:46 UTC (Wed) by nix (subscriber, #2304) [Link] (9 responses)

I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata: e.g. '-creator-blah' is the ID of the object that triggered creation of field 'blah'.

(I could equally use a directory full of stuff here, but it too would need a name that's hard to type. I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
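The "hard to use" property is easy to demonstrate: to nearly every tool a leading dash looks like an option, so such a file can only be reached with a `--` separator or a `./` prefix (the field name here is hypothetical):

```shell
#!/bin/sh
cd "$(mktemp -d)"

touch -- '-creator-blah'   # without "--", touch parses this as options
ls ./-creator-blah         # a "./" prefix defuses the dash as well
rm -- '-creator-blah'
```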

And if I've done it, I guarantee you that lots and lots of other people have done it too.

David's proposed constraints on filenames are constraints which can never be imposed by default, at the very least. The semantics of Unix filesystems have been fixed de facto for many years: nobody expects files with odd characters to work on FAT, but nobody expects a Unix system to use a FAT filesystem as a primary datastore either.

Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?

You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.

But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 21:56 UTC (Wed) by dwheeler (guest, #1216) [Link] (8 responses)

A few thoughts based on nix's comments...

I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:

Such a key-value store will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.

I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)

That's exactly my point. Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.
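The usual defence for scripts that must walk such a directory is to keep every name NUL-delimited end to end, NUL being the one byte a name cannot contain. A sketch, assuming GNU or busybox userland for `-print0`:

```shell
#!/bin/sh
cd "$(mktemp -d)"
touch "$(printf 'with\nnewline')" "$(printf 'with\ttab')"

# Line-oriented counting sees the embedded newline as a record break:
find . -type f | wc -l                          # reports 3 "names"

# NUL-delimited records count the two files correctly:
find . -type f -print0 | tr -cd '\0' | wc -c    # reports 2
```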

The semantics of Unix filesystems have been fixed de facto for many years...

"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.

Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?

Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly. As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way. As far as "no control characters" goes, there's no need to do anything locale-dependent; excluding bytes 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?). Control characters unique to other locales don't bite people the way these characters do.
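That exclusion is cheap to check. A sketch in POSIX shell: a name passes only if deleting bytes 1-31 and 127 leaves it unchanged (the exact range, taken from the proposal above, is an assumption rather than an agreed standard):

```shell
#!/bin/sh
# Succeed (exit 0) iff the name contains no ASCII control characters.
is_clean() {
    [ "$(printf '%s' "$1" | LC_ALL=C tr -d '\001-\037\177')" = "$1" ]
}

is_clean 'ordinary-name.txt'     && echo accepted
is_clean "$(printf 'bad\tname')" || echo rejected
```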

You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.

I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place. I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.

But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.

That's a very interesting idea, I like it! In fact, there's already a mechanism in the Linux kernel that might do this job: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 3:07 UTC (Thu) by drag (guest, #31333) [Link]

Mac OS X did away with case-sensitive filenames by default, and while that is annoying if you're porting software, I have not heard many complaints about it, and software has been fixed.

And that is much more extreme than a filesystem mount option that stops tabs and newlines from being used in filenames.

It'll also be future-proof, as much as that matters. You don't make a whitelist of allowed characters; you make a blacklist of troublesome characters and allow everything else. If new troublesome characters are created, which is very unlikely, you can add them to the blacklist (and even if that happens it will be exceedingly rare). Any new characters that get made, or any new encodings, will just be allowed to slide on through.

I mean, if a future encoding scheme conflicts with a previously established and well-known encoding such as ASCII, then it is just too dumb to be supported by anybody.

-----------------

Here is a challenge:

Somebody write me a script that will go and count all the uses of tabs, <, >, and newlines in their file names on their systems...

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 22:12 UTC (Thu) by nix (subscriber, #2304) [Link]

David, thanks for responding.
Such a key-value store will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
I claim mental block: this solution became obvious to me a day or so back. (Rather, since I already use . as a metacharacter to mean 'private', use .. to mean 'extra-private: metadata'. Yes, this too is bizarre, but at least it's not dash-prepended.)

But I have seen a system in production use at Big Banks (first saw it yesterday, or first noticed it, probably thanks to this conversation) that uses the filesystem as a key-value store with base-254-encoded keys. It's gross but it's sometimes done.

But then, we know how competent Big Banks are. (Especially this one, did you but know who it was.)

As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved).
Yes, but this only works if you can mandate a no-encoding transparent view of filenames! As soon as you start to automatically encode them, this sort of transcoding is impossible.

I have no objection to making the things you propose options. What I object to is making them mandatory, because this would make some things impossible. (Strange things, but still.)

In fact, there's already a mechanism in the Linux kernel that might do this job already: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
It depends how harsh the limits are. I'd say that 'no control characters' is certainly reasonable to have only the superuser lift. Perhaps a less harsh constraint to impose is that regular users cannot set this attribute on directories readable by 'other', and that chmodding a directory after the fact strips this attribute off it. Now users cannot dump landmines in that directory for users outside their group (root is assumed to know what he's doing).

I'd say that setting this attribute flips a pathconf-viewable attribute as well, so that other POSIX-compliant systems can adopt the same approach and applications can portably query it without needing to implement/depend on all of the ACL machinery.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 28, 2009 15:36 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (5 responses)

It's always worth telling people this, because it tends to make them rock back on their heels if they've been (wrongly) believing that NT is doing something special here.

NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from Linux's is that NT's arbitrary sequences of non-zero code units used for filenames are 16-bit code units, while on Linux they're obviously 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

And the consequence is the same thing being lamented in this article - badly written Windows programs crash or do insane things when faced with filenames that don't look like the ones the poor third rate programmer who wrote the code was familiar with. In the absence of defensive programming this software also doesn't like leap years, or leap seconds, or files that are more than 2GB long, or... you could go on all day, badly written programs suck.

On encodings - I encourage you to use UTF-8. I encourage people with other encodings to migrate to UTF-8, but using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things. People who can't keep them separate are doing a bad job, whether handling filenames or displaying email.
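The legality half of that check is mechanical; a sketch using `iconv(1)`, which exits non-zero on input that is not well-formed UTF-8 (display-safety, i.e. screening out control characters, bidi overrides and the like, still needs its own pass):

```shell
#!/bin/sh
# Exit 0 iff stdin is well-formed UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

printf 'caf\303\251' | is_utf8 && echo 'valid'     # e-acute as two UTF-8 bytes
printf 'caf\351'     | is_utf8 || echo 'invalid'   # lone Latin-1 0xE9 byte
```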

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 29, 2009 14:36 UTC (Sun) by epa (subscriber, #39769) [Link] (3 responses)

NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from Linux's is that NT's arbitrary sequences of non-zero code units used for filenames are 16-bit code units, while on Linux they're obviously 8-bit.

Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.

Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun.
using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.
Indeed. Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 10:55 UTC (Mon) by nye (guest, #51576) [Link] (2 responses)

>Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?

Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.

Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.

(I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)

http://en.wikipedia.org/wiki/Native_API

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Mar 30, 2009 15:13 UTC (Mon) by foom (subscriber, #14868) [Link] (1 responses)

>> Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?
> Yes. This is what the POSIX subsystems for NT do

You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for CreateFile. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it.

(the SFU installer asks if you want to do this, but even SFU has no special dispensation)

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 15, 2009 0:06 UTC (Sun) by yuhong (guest, #57183) [Link]

Another trick you can use with CreateFile is to start the filename with \\.\. If that is done, the only processing done on the filename before CreateFile calls NtCreateFile with the name is that \\.\ is replaced with \??\, which is an alias of \DosDevices\.

NT (Windows kernel) doesn't care about filenames any more than Linux

Posted Nov 14, 2009 23:58 UTC (Sat) by yuhong (guest, #57183) [Link]

"files that are more than 2GB long"
Yep, NT has supported both files and disks larger than 2GB from the first version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of the GetDiskFreeSpace and SetFilePointer APIs (look them up on MSDN or Google), both of which have existed since NT 3.1. Which one was so much more error-prone that the versions of Windows released in 1996 had to cap the result to 2GB, even though older versions of NT supported returning more than 2GB with it, and why?

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 17:57 UTC (Wed) by jd (guest, #26381) [Link]

I can see the argument for non-printable filenames. However, you'd need to distinguish between generic non-printable filenames (i.e. any character array that can be used as a filename), tokens used synonymously with filenames (i.e. a short fixed-length ID that denotes the file, regardless of the logical name or logical directory), and hashes used synonymously with filenames (i.e. a long fixed-length ID that serves the same role as a token but can be generated rather than looked up).

IMHO, the different roles all speak to different problems and all have their limitations outside of the problems they're meant for. The first step in finding a solution is to define the problem, but a filesystem solves a very wide range of problems, making a definition less clear-cut.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 25, 2009 23:27 UTC (Wed) by jreiser (subscriber, #11027) [Link] (1 responses)

As nix says, the filename encodes a key to what the file contains. The encoding is radix-254 (NUL and '/' excluded). This fully utilizes the ASCII control characters [\x01-\x1f] and also sequences such as subsets of [\xfc-\xff]*, which are disallowed by UTF-8. Radix-254 is almost 2 bits per byte denser than the proposed radix-65 (26 upper case, 26 lower case, 10 digits, dot, hyphen, underscore). The OS imposes an upper bound on the length of a filename, and there are critical points at various shorter lengths where there are jumps in space*time costs. Enough utility is discarded by radix-65 (as opposed to radix-254) that customers complain.
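The density claim is easy to verify: a filename byte carries log2(254) ≈ 7.99 bits of key under radix-254 but only log2(65) ≈ 6.02 bits under radix-65, a difference of just under 2 bits per byte.

```shell
#!/bin/sh
# Bits of key material carried per filename byte at each radix.
awk 'BEGIN {
    b254 = log(254) / log(2)
    b65  = log(65)  / log(2)
    printf "radix-254:    %.2f bits/byte\n", b254
    printf "radix-65:     %.2f bits/byte\n", b65
    printf "density gain: %.2f bits/byte\n", b254 - b65
}'
```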

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 14:44 UTC (Thu) by dwheeler (guest, #1216) [Link]

I never proposed radix-65. Radix-65 (26 upper case, 26 lower case, 10 digits, dot hyphen underscore) is what the POSIX standard ALREADY says is all you can depend on; nothing else is portable by that spec.

I want to be able to count on more than what the POSIX spec says; I want to be able to use the entire Unicode character set, minus the control chars and a few additional constraints to prevent lots of problems for the general-purpose user.

Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 26, 2009 13:38 UTC (Thu) by Wol (subscriber, #4433) [Link]

A system I'm playing with copies something called PI/Open.

A file inside this system is actually stored as a directory at the OS level, and it creates filenames of the form <space><backspace>nnn.

I copied this, and found that Midnight Commander was great at managing the resulting files :-) It's done so that people can't tamper - corrupting one of the (many) OS-level files would do serious damage to the PI file.

Cheers,
Wol


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds