At last, a hope of progress
Posted Mar 25, 2009 14:05 UTC (Wed) by epa (subscriber, #39769) [Link]
Look at the recent Python version that got tripped up by filenames that are not valid UTF-8. Currently on a Unix-like system you cannot assume anything more about filenames than that they're a string of bytes. This frustrates efforts to treat them as Unicode strings and cleanly allow international characters.
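To make the "string of bytes" point concrete, here is a minimal Python sketch that checks whether a raw filename decodes as UTF-8. The helper name `is_valid_utf8` and the sample names are illustrative assumptions, not part of any library.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A Latin-1 encoded "café" (b"caf\xe9") is a legal Unix filename
# but is not valid UTF-8; the UTF-8 form is b"caf\xc3\xa9".
samples = [b"caf\xc3\xa9", b"caf\xe9", b"plain.txt"]
results = {s: is_valid_utf8(s) for s in samples}
```

Passing a bytes path to `os.listdir()` (for example `os.listdir(b".")`) returns names as bytes, so a check like this can be applied to names that would not decode.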
Or look at the whole succession of security holes in shell scripts and even other languages caused by control characters in filenames. My particular favourite is the way many innocuous-looking perl programs (containing 'while (<>)') can be induced to overwrite random files by making filenames beginning '>'.
A system-wide policy guaranteeing that only sane characters can appear in filenames would eliminate at a stroke a lot of tedious sanity-checking you have to do in userspace (not to mention the hidden bugs and security holes in many programs because the sanity-checking was not paranoid enough). Given the natural conservatism of developers, I can't be optimistic it will happen soon. But, like defaulting to relatime instead of updating atime on each access, it's a long-overdue spring clean to a particularly musty corner of the Unix way.
At last, a hope of progress
Posted Mar 25, 2009 16:52 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
At last, a hope of progress
Posted Mar 25, 2009 20:02 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
At last, a hope of progress
Posted Mar 29, 2009 0:01 UTC (Sun) by mikachu (guest, #5333) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 15:12 UTC (Wed) by rsidd (subscriber, #2582) [Link]
Yet another topic that was extensively discussed in the Unix Haters Handbook.
One gotcha that is not covered by limiting the allowed character set in filenames is this: how do you remove all your configuration files and directories (that begin with a ".")? "rm -rf .*" will have very undesirable results. And yes, I have done this to myself -- luckily the system had nightly backups.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 15:32 UTC (Wed) by drag (subscriber, #31333) [Link]
Maybe if Linux folks keep identifying and fixing legacy Unix usability issues people will start referring to Linux as 'The way Unix should have been' or 'Unix done right'.
I am always really paranoid about file names in Linux when doing scripting or whatnot. It's difficult, and dealing with them is always an "oh god I hope I did the escaping right" sort of deal. Because I know I can have a script that I use for lots of stuff, but sometime I may make a goober'd up filename late at night or something that can end up destroying data unless I did things exactly correct in my scripts.
Filenames with ~ in them, or < > or () or all sorts of odd things I make by mistake.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 16:20 UTC (Wed) by epa (subscriber, #39769) [Link]
Maybe if Linux folks keep identifying and fixing legacy Unix usability issues people will start referring to Linux as 'The way Unix should have been' or 'Unix done right'.
I thought that was Plan 9...
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 22:14 UTC (Wed) by jordanb (guest, #45668) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 15:42 UTC (Wed) by gnb (subscriber, #5132) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 18:43 UTC (Wed) by danielthaler (guest, #24764) [Link]
rm: cannot remove directory `.'
rm: cannot remove directory `..'
so something did prevent it from happening...
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:43 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 22:46 UTC (Wed) by nix (subscriber, #2304) [Link]
GNU coreutils, like gnulib, is a goldmine of fantastic tricks (and evil ones that it's best to, ahem, admire from a distance: e.g. the less said about NEED_REWIND and the need for rm running everywhere to work around a MacOS X bug, the better).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 22:15 UTC (Wed) by csigler (subscriber, #1224) [Link]
IIRC, you can do something like "rm -fr .[^.]*" -- at least this WFM in the command "ls -Fa .[^.]*"
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 23:05 UTC (Wed) by Jonno (subscriber, #49613) [Link]
`rm -rf .??*` is a good start. It will miss configuration files with only one or two characters after the dot, but I've not found any program using such yet...
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 1:32 UTC (Thu) by bojan (subscriber, #14302) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 2:34 UTC (Thu) by k8to (subscriber, #15413) [Link]
Personally I just walk the list and filter out . and ..
Yes, it sucks.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 4:58 UTC (Thu) by shimei (guest, #54776) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 27, 2009 1:28 UTC (Fri) by no_treble (guest, #49534) [Link]
rm -rf .[!.]*
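The patterns suggested in this subthread can be compared side by side. The sketch below uses Python's `fnmatch` module, which only approximates shell globbing (real shells differ in details, and `fnmatch` does not treat `[^...]` as negation, so only the `[!.]` spelling is shown). The directory listing is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical directory contents, including the two special entries.
names = [".", "..", ".a", ".ab", ".bashrc", "..foo", "notes.txt"]

patterns = [".*", ".??*", ".[!.]*"]


def matches(pattern):
    """Return the names from the sample listing that the pattern selects."""
    return [n for n in names if fnmatchcase(n, pattern)]


for p in patterns:
    print(f"{p:8} -> {matches(p)}")
```

Under these assumptions `.*` selects `.` and `..` (the dangerous case), `.??*` skips one-character dotfiles such as `.a`, and `.[!.]*` skips names such as `..foo`, so none of the three is a complete answer on its own.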
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 16:22 UTC (Wed) by jreiser (subscriber, #11027) [Link]
If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 16:42 UTC (Wed) by epa (subscriber, #39769) [Link]
Some of my programs use such "bad" filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them.
Can you give an example?
There is a certain old-school appeal in just being able to use the filesystem as a key-value store with no restrictions on what bytes can appear in the key. But it's spoiled a bit by the prohibition of NUL and / characters, and trivially you can adapt such code to base64-encode the key into a sanitized filename. It may look a bit uglier, but if only application-specific programs and the OS access the files anyway, that does not matter.
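A minimal sketch of that base64 adaptation, in Python. The function names and the `k_` prefix are assumptions made for illustration; the prefix is there because the URL-safe base64 alphabet includes `-`, which would otherwise allow an encoded name to begin with a dash.

```python
import base64

PREFIX = "k_"  # assumed prefix so an encoded name never starts with '-'


def key_to_filename(key: bytes) -> str:
    """Encode an arbitrary byte key as a filename with no '/' or NUL."""
    # The urlsafe variant uses '-' and '_' instead of '+' and '/'.
    return PREFIX + base64.urlsafe_b64encode(key).decode("ascii")


def filename_to_key(name: str) -> bytes:
    """Recover the original key from a name made by key_to_filename()."""
    if not name.startswith(PREFIX):
        raise ValueError("not an encoded key")
    return base64.urlsafe_b64decode(name[len(PREFIX):])
```

For example, `key_to_filename(b"03/25/09")` yields a name containing only letters, digits, `-`, `_` and `=`. Long keys can still exceed the per-name length limit of the filesystem (commonly 255 bytes), which this sketch does not handle.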
If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
It's also possible for an iso9660 CD-ROM to have filenames containing the / character, or at least, I possess such a disc. This shows that in general there is a need for Linux to sanitize filenames coming from alien filesystems.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 17:46 UTC (Wed) by nix (subscriber, #2304) [Link]
(I could equally use a directory full of stuff here, but it too would need a name that's hard to type. I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
And if I've done it, I guarantee you that lots and lots of other people have done it too.
David's proposed constraints on filenames are constraints which can never be imposed by default, at the very least. The semantics of Unix filesystems have been fixed de facto for many years: nobody expects files with odd characters to work on FAT, but nobody expects a Unix system to use a FAT filesystem as a primary datastore either.
Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?
You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:56 UTC (Wed) by dwheeler (guest, #1216) [Link]
A few thoughts based on nix's comments...
I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:
Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
That's exactly my point. Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.
The semantics of Unix filesystems have been fixed de facto for many years...
"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.
Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?
Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly. As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way. As far as "no control characters", there's no need to do anything locale-dependent; excluding 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?). Control characters unique to other locales don't bite people the way these characters do.
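The locale-independent rule described above (reject bytes 1-31 and 127) is small enough to sketch in a few lines of Python. The function name is hypothetical, and this is a userspace illustration rather than the kernel-side check under discussion.

```python
def has_ascii_control(name: bytes) -> bool:
    """True if any byte is below 0x20 or equal to 0x7F (DEL).

    NUL (0) is also caught by the first comparison, though it can never
    occur inside a Unix filename. UTF-8 multi-byte sequences use only
    bytes >= 0x80, so this byte-level test does not misfire on them.
    """
    return any(b < 0x20 or b == 0x7F for b in name)
```

For example, `has_ascii_control(b"a\tb")` is true, while a UTF-8 name such as `b"caf\xc3\xa9"` passes.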
You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.
I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place. I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
That's a very interesting idea, I like it! In fact, there's already a mechanism in the Linux kernel that might do this job already: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 3:07 UTC (Thu) by drag (subscriber, #31333) [Link]
And that is much more extreme than having a filesystem mount option to stop tabs and newlines being used to define files.
It'll be future proof also, as much as that matters. You don't make a whitelist of allowed characters, you make a blacklist of troublesome characters and allow everything else. If you create more troublesome characters, which is very unlikely, you can add them to the black list. (and even if it is going to happen it will be exceedingly rare) Any new characters that get made, or any new encodings, then they will just be allowed to slide on through.
I mean if you have a future encoding scheme that conflicts with a previously established and well known encoding such as ascii, then it is just too dumb to be supported by anybody.
-----------------
Here is a challenge:
Somebody write me a script that will go and count all the uses of tabs, <, >, and newlines in their file names on their systems...
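One possible take on that challenge, sketched in Python rather than shell so that the script itself is not tripped up by the names it is counting. The function names are illustrative; the walk uses a bytes path so that names which are not valid in the current locale are still returned.

```python
import os
from collections import Counter

# Characters named in the challenge, mapped to printable labels.
TROUBLESOME = {b"\t": "tab", b"\n": "newline", b"<": "<", b">": ">"}


def count_troublesome(names):
    """Count how many names contain each troublesome character."""
    counts = Counter()
    for name in names:
        for char, label in TROUBLESOME.items():
            if char in name:
                counts[label] += 1
    return counts


def walk_names(root: bytes):
    """Yield every directory and file name (as bytes) below root."""
    # A bytes root makes os.walk return bytes names, so no decoding occurs.
    for _dirpath, dirnames, filenames in os.walk(root):
        yield from dirnames
        yield from filenames
```

A call such as `count_troublesome(walk_names(b"/home"))` returns a `Counter` keyed by label. Each name is counted at most once per character, however many times the character appears in it.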
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 22:12 UTC (Thu) by nix (subscriber, #2304) [Link]
David, thanks for responding.
Such a key-value storage will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding. Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
I claim mental block: this solution became obvious to me a day or so back. (Rather, since I already use . as a metacharacter to mean 'private', use .. to mean 'extra-private: metadata'. Yes, this too is bizarre, but at least it's not dash-prepended.)
But I have seen a system in production use at Big Banks (first saw it yesterday, or first noticed it, probably thanks to this conversation) that uses the filesystem as a base-254-of-key to value store. It's gross but it's sometimes done.
But then, we know how competent Big Banks are. (Especially this one, did you but know who it was.)
As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved).
Yes, but this only works if you can mandate a no-encoding transparent view of filenames! As soon as you start to automatically encode them, this sort of transcoding is impossible.
I have no objection to making the things you propose options. What I object to is making them mandatory, because this would make some things impossible. (Strange things, but still.)
In fact, there's already a mechanism in the Linux kernel that might do this job already: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
It depends how harsh the limits are. I'd say that 'no control characters' is certainly reasonable to have only the superuser lift. Perhaps a less harsh constraint to impose is that regular users cannot set this attribute on directories readable by 'other', and that chmodding a directory after the fact strips this attribute off it. Now users cannot dump landmines in that directory for users outside their group (root is assumed to know what he's doing).
I'd say that setting this attribute flips a pathconf-viewable attribute as well, so that other POSIX-compliant systems can adopt the same approach and applications can portably query it without needing to implement/depend on all of the ACL machinery.
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Mar 28, 2009 15:36 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
NT (the kernel API in Windows NT, 2000, XP and etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit.
Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.
And the consequence is the same thing being lamented in this article - badly written Windows programs crash or do insane things when faced with filenames that don't look like the ones the poor third rate programmer who wrote the code was familiar with. In the absence of defensive programming this software also doesn't like leap years, or leap seconds, or files that are more than 2GB long, or... you could go on all day, badly written programs suck.
On encodings - I encourage you to use UTF-8. I encourage people with other encodings to migrate to UTF-8, but using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things. People who can't keep them separate are doing a bad job, whether handling filenames or displaying email.
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Mar 29, 2009 14:36 UTC (Sun) by epa (subscriber, #39769) [Link]
NT (the kernel API in Windows NT, 2000, XP and etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit.
Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun.
Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.
using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.
Indeed. Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers.
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Mar 30, 2009 10:55 UTC (Mon) by nye (guest, #51576) [Link]
Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.
Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.
(I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Mar 30, 2009 15:13 UTC (Mon) by foom (subscriber, #14868) [Link]
>> Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?
You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for CreateFile. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it.
(the SFU installer asks if you want to do this, but even SFU has no special dispensation)
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Nov 15, 2009 0:06 UTC (Sun) by yuhong (guest, #57183) [Link]
NT (Windows kernel) doesn't care about filenames any more than Linux
Posted Nov 14, 2009 23:58 UTC (Sat) by yuhong (guest, #57183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 17:57 UTC (Wed) by jd (guest, #26381) [Link]
IMHO, the different roles all speak to different problems and all have their limitations outside of the problems they're meant for. The first step in finding a solution is to define the problem, but a filesystem solves a very wide range of problems, making a definition less clear-cut.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 23:27 UTC (Wed) by jreiser (subscriber, #11027) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 14:44 UTC (Thu) by dwheeler (guest, #1216) [Link]
I never proposed radix-65. Radix-65 (26 upper case, 26 lower case, 10 digits, dot hyphen underscore) is what the POSIX standard ALREADY says is all you can depend on; nothing else is portable by that spec.
I want to be able to count on more than what the POSIX spec says; I want to be able to use the entire Unicode character set, minus the control chars and a few additional constraints to prevent lots of problems for the general-purpose user.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 13:38 UTC (Thu) by Wol (guest, #4433) [Link]
A file inside this system is actually stored as a directory at the OS level, and it created filenames of the form <space><backspace>nnn.
I copied this, and found that Midnight Commander was great at managing the resulting files :-) It's done so that people can't tamper - corrupting one of the (many) OS-level files would do serious damage to the PI file.
Cheers,
Wol
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 16:31 UTC (Wed) by mgross (subscriber, #38112) [Link]
Also, this seems like a pretty sensible idea, why hasn't it been implemented already?
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 16:50 UTC (Wed) by epa (subscriber, #39769) [Link]
Some will argue that the answer is user education (teach your users not to use bad characters in filenames), and perhaps even a cron job you can run on your PDP-11 overnight to look for filenames containing these characters and send a message via local mail to the user responsible. Furthermore, if it was good enough for V7 UNIX, it's good enough for us now. (Note that in Plan 9, there are sensible restrictions on characters in filenames; but it's common for followers of a particular system or language to become rabidly conservative, even when the original designers of the system have moved on.)
In other words it is sheer inertia, and reluctance by any one Unix-like system to add such a feature when the others do not. You can bet that if Linux added a filename character check, it would immediately be branded 'broken' by many BSD or Solaris enthusiasts - not all, but certainly those that make the most noise online.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 2:19 UTC (Thu) by dirtyepic (subscriber, #30178) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 15:23 UTC (Thu) by dwheeler (guest, #1216) [Link]
"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." (George Bernard Shaw)
I'm well aware that this is different than the historical past. But that doesn't make past decisions correct for the present. So, let's chat about the pros and cons; I believe that the cons for "anything goes" now outweigh the pros.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 16:33 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
To actually make this work, in the kernel (where you're perf critical and this is all unwanted overhead that's costing everyone who uses your "improved" system) you need to absolutely, as a matter of "Linus will veto if you don't" policy:
* Validate every filename to check that it conforms. This has to be done either at mount time, or when syscalls interact with the filenames (e.g. directory reading, and opening files). As a network file system client the OS must either screen every filename going over the network, or else punt and rely on promises from the server (if available).
* When you find an invalid filename, you need to deal with it, it's not clear what the kernel should or even could do. Perhaps the file should just not exist as far as userspace is concerned, and fsck would unlink it?
Meanwhile application developers get no benefit for many years because of compatibility considerations. It could be a decade before it makes any sense to write a program which assumes one of the restrictions, and that's if EVERY SINGLE OS fixes this tomorrow. Wheeler mistakenly believes this is a POSIX problem, but it isn't, the problem exists everywhere that filenames are treated as opaque, which in fact includes Windows (and I have my doubts about OS X, but its API documentation promises they aren't opaque, so app developers who rely on that promise would be entitled to scream blue murder when someone finds a way to get non-Unicode into an OS X filename...*)
Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless? Once you've done this, you'll have the right mindset to tackle initial hyphen, control characters and so on from the same angle, rather than screwing the poor kernel into doing your dirty work and making everybody (including those of us for whom opaque filenames are just dandy) pay.
* Something that should make you pause, OS X's approach to filenames as Unicode strings makes Unicode composition/ decomposition into an OS ABI feature. It had been doing this for years before Unicode actually pledged to stop changing the decomposition rules (ie until that happened new versions of OS X made previously legal filenames illegal and vice versa, with no warning...)
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 14:31 UTC (Sun) by epa (subscriber, #39769) [Link]
Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place).
Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.)
I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.)
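The translation step might look roughly like the following Python sketch. It is only an illustration of the idea: the function name is hypothetical, and the `legacy` argument stands in for the mount parameter mentioned above.

```python
def to_utf8_name(raw: bytes, legacy: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it.

    Names that are not valid UTF-8 are assumed to be in the legacy
    encoding and are re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return raw.decode(legacy).encode("utf-8")
```

One limitation of this approach: a legacy-encoded name that happens to also be valid UTF-8 is passed through unchanged, so the result is a heuristic rather than a guaranteed-correct conversion.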
Meanwhile application developers get no benefit for many years because of compatibility considerations.Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code.
Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless?
This is true in principle, but in thirty years of Unix essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem.
OS X is something of a special case because of case insensitivity. If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 31, 2009 5:00 UTC (Tue) by njs (guest, #40338) [Link]
Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs that want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.
(It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)
TALPA?
Posted Mar 25, 2009 16:54 UTC (Wed) by dmarti (subscriber, #11625) [Link]
Hey, cool! A use for TALPA!
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 17:00 UTC (Wed) by adobriyan (guest, #30858) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 17:51 UTC (Wed) by ajb (guest, #9694) [Link]
- add a new inheritable process capability, 'BADFILENAMES', without which processes can't see or create files with bad names.
- add a command 'access_bad_filenames' which creates a shell with the capability.
- /bin/ls also needs the capability, but should not display bad filenames
unless an additional option is passed.
That way, most processes can run happily in the ignorance of any bad filenames. If you need to access one, you run the commands you need to access it with under the special shell.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 18:09 UTC (Wed) by mrshiny (subscriber, #4266) [Link]
Mr. Wheeler makes a mistake in the article as well. Windows has no problem with files starting with a dot. It's only Explorer and a handful of other tools that have problems. Otherwise Cygwin would be pretty annoying to use.
Overall, however, I like the idea of restricting certain things, especially the character encoding. The sooner the other encodings can die, the sooner I can be happy.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 19:52 UTC (Wed) by emk (subscriber, #1128) [Link]
If a filename is properly handled for spaces, doesn't it automatically work for all the other chars?
Unfortunately, no. One example mentioned in the article is files with names like "-rf", which will appear at the start of any glob list. To deal with this, you generally need to add "--" before any globs, but different commands behave differently, and not all commands support "--".
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 1:12 UTC (Thu) by mrshiny (subscriber, #4266) [Link]
Not that preventing files like '-rf' is a bad idea. I think it would prevent a number of mistakes.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 30, 2009 16:41 UTC (Mon) by Hawke (guest, #6978) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 15:38 UTC (Thu) by dwheeler (guest, #1216) [Link]
Actually, there is a general solution for the dash: whenever you glob in the current directory, stick "./" in front of the glob. So always use "cat ./*" instead of "cat *". I do mention that in my article.
Problem is, nobody does that. It's too easy to use "*", it's what all the documents say, and it's what all the users actually do. You have to train GUI programs to do this, too. So instead of constantly trying to get developers to do something "unnatural", let's change the system so the "obvious" way is always correct.
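For programs that expand globs themselves, the "./" habit described above can be sketched in Python as follows. The helper name `safe_glob` is made up for illustration, and the sketch assumes a POSIX system where "/" is the path separator.

```python
import glob
import os


def safe_glob(pattern: str) -> list:
    """Expand a glob so that no result can begin with '-'.

    Relative patterns get a './' prefix, so a file called '-rf' is
    returned as './-rf' and cannot be mistaken for an option.
    """
    if os.path.isabs(pattern) or pattern.startswith("./"):
        return sorted(glob.glob(pattern))
    return sorted(glob.glob("./" + pattern))
```

In a directory containing `-rf` and `a.txt`, `safe_glob("*")` yields `./-rf` and `./a.txt`, both of which are safe to hand to a command as arguments.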
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 22:45 UTC (Wed) by epa (subscriber, #39769) [Link]
If not, it should be easy enough to fix the SHELLS in this case.
Three decades of unhappy experience says otherwise. Nobody has a reasonable proposal to fix all the shells, all the scripting languages and all the user applications so that they don't make unsafe assumptions about filenames (e.g. assuming a filename can never begin with - or never contain the \n character).
On the other hand, a kernel-level check for bad characters is simple to implement and obviously solves these problems at a stroke.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 1:16 UTC (Thu) by mrshiny (subscriber, #4266) [Link]
1. Prevent files that start with dash (technically not a shell problem)
2. Prevent files that contain control characters (newline included)
3. Make the shells easy to use in the face of filenames with spaces, semi-colons, colons, quotes, punctuation, etc.
The first item is more of an interaction between programs and the shell and not specifically a shell problem. If a program doesn't support -- then it can never be used securely.
The second item seems like an obvious step to take with no downside.
The third item is what I meant by fixing the shells: shells should make it braindead-easy to manipulate filenames without them turning into commands or other nonsense. Once a filename is loaded into a variable you shouldn't have to worry about characters in the name turning into shell commands. Once that's in place we can start fixing scripts. Maybe an environment variable can determine how that instance of the shell works: in secure mode or legacy mode.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 14:45 UTC (Thu) by michaeljt (subscriber, #39183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 15:08 UTC (Thu) by michaeljt (subscriber, #39183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 19:49 UTC (Thu) by dwheeler (guest, #1216) [Link]
[The shell] could also recognise the null character as an argument separator as in 'find -print0'. It could even set some environment variable to tell tools like find that this is supported so that they can use it by default when not outputting to the console.
Yes, I already added the "shell could recognize null as separator". And you're right, adding an environment variable could help (though it could also backfire on older scripts!).
While on that subject, the shell could enforce that substitutions that resolve to the arguments for other commands are not allowed to spill over (e.g. VAR='myfile; rm -rf /'; ls $VAR).
This particular example doesn't do quite what you think; it just passes to ls several values: "myfile;", "rm", "-rf", and "/", and you end up with some error messages and a listing of "/". But with more tweaking, you can definitely get some exploits out of this approach. Which is why removing the space character from IFS is a big help - then VAR would become a single parameter again.
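The splitting behaviour described above can be approximated in Python. This is only a sketch of field splitting: it ignores globbing and quoting, and the two helper names are invented for the example. The first models the default IFS (space, tab, newline), the second an IFS with the space removed.

```python
import re


def split_default_ifs(value: str) -> list:
    """Approximate unquoted $VAR expansion with the default IFS."""
    # str.split() with no argument splits on runs of whitespace.
    return value.split()


def split_ifs_without_space(value: str) -> list:
    """Approximate expansion when IFS contains only tab and newline."""
    return [field for field in re.split(r"[\t\n]+", value) if field]


var = "myfile; rm -rf /"
print(split_default_ifs(var))        # four separate arguments
print(split_ifs_without_space(var))  # one argument again
```

With the default IFS the value becomes the four arguments "myfile;", "rm", "-rf" and "/", matching the description above; with the space removed from IFS it stays a single parameter.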
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 1:11 UTC (Sat) by nix (subscriber, #2304) [Link]
It was removed, but I can't remember why: some sort of compatibility
problem?
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 31, 2009 7:47 UTC (Tue) by michaeljt (subscriber, #39183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 31, 2009 19:28 UTC (Tue) by nix (subscriber, #2304) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Apr 3, 2009 18:49 UTC (Fri) by anton (subscriber, #25547) [Link]
It could also recognise the null character as an argument separator as in 'find -print0'.
A few weeks ago I wanted to process my .ogg files, whose names contain all kinds of characters that are treated as meta-characters by the shell or other programs I use in shell scripts. I eventually ended up writing a new shell dumbsh that uses NUL as argument separator, and feeding it from find, with some intermediate processing in awk (which is quite flexible about meta-characters).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 21:11 UTC (Thu) by explodingferret (guest, #57530) [Link]
1) Portable scripts (of a kind), init scripts, and build scripts. In all these cases the scripts need to have #!/bin/sh at the top, and contain just about every fix for every problem ever, including [ "x$var" = x ] and ${1:-"${@}"} and various other monstrosities.
In these scripts, the quotes around variables; ./ in front of filenames; IFS= for read; and filename=`foo; printf x`; filename="${filename%x}" crap will *always* have to be there. So no point trying to fix anything for those.
2) The other use is scripts that are used on either one system (personal scripts) or one "class" of system, like "only Debian GNU/Linux".
These scripts can use a particular shell like #!/bin/bash and assume the existence of -print0 and -printf to find and -d '' to read and all the other little conveniences which make a lot of the problem go away.
Well, other than newlines at the end of filenames. That's the only case that I refuse to take account of in my scripts, unless security issues might arise.
----
I'm not saying that I disagree with the ideas in this article (although I'd like to keep spaces and shell special characters in my filenames, actually). I'm just saying that as far as shell scripting is concerned, it may not actually help all that much. The main gain for me would be the security fixes and less typing in my interactive shell. Even though I'm pretty sure I don't have any newlines or control characters in any of my filenames, I just can't bring myself to write bad scripts, and that's kinda sad.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 1:18 UTC (Sat) by nix (subscriber, #2304) [Link]
I dictated zsh 4, simply because for this application C was far too
unpleasant, ksh was too buggy (thanks, Linux, for pdksh, with its broken
propagation of variables out of loops-with-redirection), and there was no
hope of getting the clients' systems people to install Perl: but they were
perfectly happy to install a recent zsh: fewer dependencies and no scary
modules (well, actually zsh *does* have a module system but they didn't
realise that!)
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Nov 15, 2009 1:06 UTC (Sun) by yuhong (guest, #57183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Nov 15, 2009 13:15 UTC (Sun) by nix (subscriber, #2304) [Link]
dot files in Windows and
Posted Mar 25, 2009 23:09 UTC (Wed) by pr1268 (subscriber, #24648) [Link]
Windows has no problem with files starting with a dot.
Oddly enough, Windows will not allow the name of a directory to end in a dot. I discovered this when, back in my Windows days, I had to name an artist directory R.E.M without a final dot. Windows wouldn't allow me to put that trailing dot in the file name. Go figure. Linux doesn't have any issue with it (and since I've abandoned Windows on my home computers, I was able to rename the directory to include that dot).
Going off on a tangent: here are some files in my music directory which would make Mr. Wheeler cringe:
In a digital forensics class the professor had us searching through a filesystem that contained directories named "..." (minus quotes). Good times...
dot files in Windows and
Posted Mar 25, 2009 23:42 UTC (Wed) by dwheeler (guest, #1216) [Link]
No cringe. I didn't see any control characters there, nor leading dashes. And you don't seem to require non-UTF-8. If we could get those done, the rest are gravy.
dot files in Windows and
Posted Mar 26, 2009 0:07 UTC (Thu) by pr1268 (subscriber, #24648) [Link]
Wow, thanks for the reply! And thank you for the original article--I found myself nodding in agreement many times while reading it.
Of course, even with your non-cringing approval, I certainly had lots of shell escaping to do with these files (and many others--my collection is approaching 10,000 audio files from almost 900 music CDs).
dot files in Windows and
Posted Mar 26, 2009 1:04 UTC (Thu) by nix (subscriber, #2304) [Link]
dot files in Windows and
Posted Mar 26, 2009 10:21 UTC (Thu) by mjj29 (guest, #49653) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 30, 2009 19:36 UTC (Mon) by rickmoen (subscriber, #6943) [Link]
mrshiny wrote: You can pry my spaces from my filenames out of my cold dead fingers.
ObMenInBlack: "Your offer is acceptable."
(I remember having to write AppleScript to recurse through directories cleaning up files created on network shares by MacOS-using munchkins who put space characters at the ends of filenames, in order for them to become valid filenames when seen by MS-Windows-using employees looking at the same network shares. The converse problem was files, from MS-Windows users, with names containing colon, which is a reserved character in MacOS file namespace. What a pain in the tochis.)
Rick Moen
rick@linuxmafia.com
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 18:36 UTC (Wed) by njs (guest, #40338) [Link]
The section on Unicode-in-the-filesystem seemed quite incomplete. We know this can work, since the most widely used Unix *already* does it. OS X basically extends POSIX to say "all those char * pathnames you give me, those are UTF-8". However, there are a lot of complexities not mentioned here -- you need to worry about Unicode normalization (whether or not to allow different files to have names containing the same characters but with different bytestring representations), if there is any normalization then you need a new API to say "hey filesystem, what did you actually call that file I just opened?" (OS X has this, but it's very well hidden), and so on.
But these problems all exist now, they're just overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets. I really dislike many things about Apple, but in this case we could do worse than to sit down and steal (with appropriate modification) most of the stuff in http://developer.apple.com/technotes/tn/tn1150.html#Unico...
Maybe the ext4 folks could add unicode filenames as a mount option -- they haven't done anything controversial lately ;-).
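To make the normalization point above concrete: the same visible name can have two different byte representations. A small Python sketch using the standard `unicodedata` module (the helper `same_name` is just an illustration, not any filesystem's actual rule):

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as one code point (NFC form)
decomposed = "cafe\u0301"   # 'e' plus COMBINING ACUTE ACCENT (NFD form)


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings display identically but differ as byte strings,
# so a byte-oriented filesystem treats them as two distinct files.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
print(same_name(composed, decomposed))
```

A filesystem that normalizes has to pick one form, which is why a "what did you actually call that file?" call becomes necessary.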
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 20:57 UTC (Wed) by clugstj (subscriber, #4020) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:29 UTC (Wed) by foom (subscriber, #14868) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 22:36 UTC (Wed) by nix (subscriber, #2304) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 2:25 UTC (Thu) by njs (guest, #40338) [Link]
I'd be happy if we could just make a rule that filenames are valid UTF-8. Unicode normalization (composing characters and all that) is probably a good idea, but reasonable people could disagree. I'm just as happy without case normalization (though the arguments for it aren't entirely without merit, even if it can't be done perfectly). And *any* of these would be better than what we have now...
(The "so what did you call this file?" API is also useful if your system ever deals with case-insensitive or unicode-normalizing filesystems. Which Linux does, whether it becomes common for the root filesystem or not.)
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 13:42 UTC (Thu) by clugstj (subscriber, #4020) [Link]
Why? How is the current condition so bad that we should run headlong into any of these "solutions" without knowing what the eventual outcome will be?
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 17:46 UTC (Thu) by quotemstr (subscriber, #45331) [Link]
We're talking about the possible outcomes. You're telling us we shouldn't discuss the possible problems and solutions because we don't know the problems yet? That's bunk.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 18:59 UTC (Thu) by njs (guest, #40338) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 21:27 UTC (Sun) by clugstj (subscriber, #4020) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 30, 2009 0:07 UTC (Mon) by njs (guest, #40338) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 21:30 UTC (Sun) by clugstj (subscriber, #4020) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 22:07 UTC (Sun) by foom (subscriber, #14868) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 14:23 UTC (Thu) by michaeljt (subscriber, #39183) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 1:00 UTC (Sat) by nix (subscriber, #2304) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Apr 2, 2009 15:54 UTC (Thu) by forthy (guest, #1525) [Link]
It is actually not that bad. As collating sequence, ß=ss (i.e. Mass and Maß sort to the same bin). Except for Austrian telephone books, where ß follows ss, but comes before st (though St. follows Sankt ;-).
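As a rough sketch of the "same bin" idea: Python's `str.casefold()` applies Unicode full case folding, which happens to map ß to ss. This is case folding rather than a real locale-aware collation, so it only illustrates the equivalence, not the telephone-book ordering; the helper name is made up.

```python
def same_bin(a: str, b: str) -> bool:
    """True if two words are equal under Unicode full case folding."""
    return a.casefold() == b.casefold()


print(same_bin("Mass", "Maß"))
print("Maß".lower() == "Mass".lower())  # plain lower() keeps ß distinct
```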
However, there's a huge mess in the CJK part of UCS: short and long forms of the same character (sometimes even a special variant for the Japanese character). This should never have happened; the different forms of the same character should be encoded in fonts, not in UCS. So far, not even Mac OS X normalizes these characters, but it is obvious that a mainland China file called "中国" and a Taiwan file called "中國" not only mean the same, but they also refer to the same word, and can be interchanged at will (see for example the Chinese wikipedia entry: the lemma is the short form, the headline is the long form). And it is not easy to access long and short forms with usual input methods (mainland China: Pinyin, Canton: Cantonese Pinyin (gives traditional characters, but you need to know Cantonese), etc.).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 13:40 UTC (Thu) by clugstj (subscriber, #4020) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 19:52 UTC (Thu) by leoc (guest, #39773) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 21:39 UTC (Wed) by njs (guest, #40338) [Link]
I don't *like* either alternative much, but I doubt you're going to get everyone to switch back to ASCII, either. The problem isn't going away.
So... we can whine about how unfair it is that character systems are complicated and ignore the problem, or we can hold our noses and pick a least-bad option. The latter is probably more productive (though inertia suggests the former is most likely).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 16:23 UTC (Thu) by dlau (guest, #4540) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 19:50 UTC (Wed) by szh (guest, #23558) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 0:52 UTC (Thu) by tbrownaw (guest, #45457) [Link]
Our incoming ftp server at work once got a file named "C:\something_or_other.zip". Which was perfectly fine, until someone tried to open it in Windows using the samba share. It actually did show up, but with a completely garbled name.
I also accidentally generated a file where the name had a leading '\r' (carriage return). That was a lot of fun to track down and fix, it looked perfectly normal in 'ls' unless you noticed that it wasn't in proper alphabetical order and the rest of the row was one character out of alignment.
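One way to make such a name stand out is to escape control characters before printing it. A minimal Python sketch (the helper name `visible` is invented here; "control character" is taken to mean Unicode category Cc):

```python
import unicodedata


def visible(name: str) -> str:
    """Return name with control characters shown as backslash escapes."""
    out = []
    for ch in name:
        if unicodedata.category(ch) == "Cc":
            # e.g. carriage return becomes the two characters '\' and 'r'
            out.append(ch.encode("unicode_escape").decode("ascii"))
        else:
            out.append(ch)
    return "".join(out)


print(visible("\rnotes.txt"))
```

The leading carriage return is printed as a literal backslash followed by "r", so the odd file no longer looks like its neighbours.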
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 20:50 UTC (Wed) by clugstj (subscriber, #4020) [Link]
Most of the things he wants to force on everyone are already available by convention. What is the benefit of disallowing other usages? If you want to imagine that all your filenames are UTF-8, go ahead, who's stopping you! The UNIX kernel contains as little policy as possible. This results in it being simpler than it would be otherwise. Yes, this is a double-edged sword, but doing the things he suggests is not an automatic win.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 25, 2009 23:34 UTC (Wed) by dwheeler (guest, #1216) [Link]
Sure, almost all files already follow these conventions. Except when they don't. And when they don't, millions of programs subtly stop working. Everyone who does "find . blah | stuff" is writing bad code, because filenames can contain newlines deep in the directory tree. If we get rid of the nonsense, then it's easy to write correct programs; today, it takes herculean effort, and few people do so. It's a double-edged sword, but users get cut by both sides. I have yet to find a real use case for including control characters in filenames, for example, but plenty of reasons why it shouldn't ever happen.
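The newline problem is easy to reproduce without touching a real filesystem. The Python sketch below simulates the output of a plain `find .` (newline-terminated) and of `find . -print0` (NUL-terminated) for two made-up names, one of which contains a newline, and then parses both streams.

```python
names = [b"./a.txt", b"./odd\nname.txt"]  # second name contains a newline

# Simulated output streams: newline-terminated vs NUL-terminated.
newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)


def parse_lines(data: bytes) -> list:
    """What a naive 'one filename per line' consumer sees."""
    return data.split(b"\n")[:-1]


def parse_nul(data: bytes) -> list:
    """What a NUL-separated consumer sees."""
    return data.split(b"\0")[:-1]


print(len(parse_lines(newline_stream)))  # more entries than files
print(parse_nul(nul_stream) == names)
```

The line-based parser reports three "files" for two real ones, while the NUL-based parser recovers the original names exactly, since NUL is the one byte that cannot appear in a filename.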
Conventions are great! Let's go back to FAT!
Posted Mar 26, 2009 8:18 UTC (Thu) by khim (subscriber, #9252) [Link]
Most of the things he wants to force on everyone are already available by convention. What is the benefit of disallowing other usages?
What's the benefit of all these ACL's? Traditional Unix permissions, capabilities, POSIX Acls, memory protection, etc. You can just use conventions for that. And if someone violates convention he or she can be fired.
This is your proposal in a nutshell - and it just does not work.
Conventions are great! Let's go back to FAT!
Posted Mar 26, 2009 13:38 UTC (Thu) by clugstj (subscriber, #4020) [Link]
Shell scripts are where this is the biggest problem. I do shell scripting for a living and don't see this issue as being anywhere near as big a problem as Mr. Wheeler thinks it is.
Also, I'm completely confused by your title. I suggest conventions and then you suggest, perhaps facetiously, FAT (which is not a convention, but enforcement of a very stupidly limited set of possible filenames).
Conventions are great! Let's go back to FAT!
Posted Mar 26, 2009 14:07 UTC (Thu) by khim (subscriber, #9252) [Link]
Wow, it does not work?
Nope.
Apparently UNIX is completely broken.
Nope. UNIX is not broken. Your head, on the other hand, is.
And ACL's are so complicated and a drain on performance as to be nearly useless - which is why they are not used much.
Traditional Unix permissions are used on most systems - and they ARE ACLs too. They are quite limited but often adequate - that's why other forms are not used much. Still, they are deficient in many situations, and other forms are used more and more.
Shell scripts are where this is the biggest problem. I do shell scripting for a living and don't see this issue as being anywhere near as big a problem as Mr. Wheeler thinks it is.
Number of correct scripts is not important metric. Number of bad scripts is. And it's MUCH higher than warranted: I've fixed tons of scripts which failed on names with spaces, files with a dash in first position, etc. If such files are excluded from the start, life will be much easier.
Also, I'm completely confused by your title. I suggest conventions and then you suggest, perhaps facetiously, FAT (which is not a convention, but enforcement of a very stupidly limited set of possible filenames).
I propose FAT as a way to get rid of these pesky ACLs. It's one of the few filesystems today without any form of access control (except the read-only flag). We can extend it to allow all forms of filenames - it's not hard. Or we can just run all programs with UID==0 - it'll give us the same flexibility. Somehow no one wants to go in this direction, though.
Conventions are great! Let's go back to FAT!
Posted Mar 29, 2009 21:44 UTC (Sun) by clugstj (subscriber, #4020) [Link]
Wow, childish personal attacks. How droll.
"Number of correct scripts is not important metric. Number of bad scripts is"
I would think that the percentage of each would (possibly) be a useful metric. But, what is the damage from these "bad scripts"? If you are writing shell scripts that MUST be absolutely bullet-proof from bad input, perhaps because they run setuid-root, then you are already making a much worse mistake than the possible bugs in the script.
Still don't understand the FAT reference. Sorry, maybe I'm just slow.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 9:51 UTC (Thu) by epa (subscriber, #39769) [Link]
'By convention' files do not contain control characters. The problem is that you cannot rely on convention when writing robust, secure software. Either you put in endless sanity checks which cruft up your code and are liable to be forgotten, or you end up with subtle bugs that are tickled by the existence of files called '>foo' or '|/bin/sh' or countless other variations. Such bugs are made more insidious by the fact that 'by convention', they cannot ever be triggered. But for someone trying to make a working exploit, or widen a small security hole into a larger one, convention is no barrier.
If you want to have certainty that your code works correctly, 100% of the time, no ifs and no buts - rather than just waving your hands and hoping that everyone else in the world makes filenames that follow the same convention as you - then you need a guarantee that the assumptions you make are guaranteed to be true.
If you want to imagine that all your filenames are UTF-8, go ahead, who's stopping you!
You could equally well say that disk quotas are not needed; if you want to limit yourself to use 100 megabytes of space, who's stopping you? Indeed what is the point of file permissions - if you want to pretend that all your files are read-only, who's stopping you? And why should the kernel forbid hard links to directories - surely it should be up to the user to decide whether their filesystem is a tree or a general DAG, and the kernel should not enforce this policy.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 27, 2009 19:23 UTC (Fri) by drag (subscriber, #31333) [Link]
YA.
All I want is for the system to cancel out malicious filename characters and things that obviously make little sense. Stuff like control characters, newlines, etc etc.
As for the encoding stuff... meh. Filenames being treated as a string of bytes mostly makes sense, except in a few special cases.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 11:45 UTC (Sat) by epa (subscriber, #39769) [Link]
If Unix really did treat filenames as merely 'a string of bytes', with no implied character set or encoding, and displayed them to the user as a hex dump or something, then it would be truly encoding-agnostic and would have no difficulties with arbitrary byte values in filenames. Of course, it would also have been a total failure that nobody uses. For a filesystem to be useful, it needs to have some amount of meaning (or 'policy' if you will) attached to the filenames it stores. The question is how much: is the current situation of 'ASCII for characters below 128, and above that you're on your own' the best one?
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 16:53 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
I'd be surprised if the /majority/ of programs other than shell scripts aren't like this. Even in the majority of GUI software, what's needed isn't a revision of the kernel API (in fact that will barely help) but only a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display. Such a function is nearly inevitable anyway - to deal with dozens of other issues unrelated to Wheeler's thesis. And such functions exist today (I can't say if they're bug free of course)
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 14:43 UTC (Sun) by epa (subscriber, #39769) [Link]
a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display
Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever.
Imagine removing the character encoding headers from the http protocol. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere.
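The ambiguity is easy to demonstrate: the same bytes are a perfectly valid name under more than one encoding, and nothing in the bytes themselves says which was meant. A short Python sketch with a made-up filename:

```python
raw = b"caf\xc3\xa9"  # bytes as stored in the directory entry

as_utf8 = raw.decode("utf-8")      # reads as 'café'
as_latin1 = raw.decode("latin-1")  # reads as 'cafÃ©'

# Both decodes succeed, so a display function cannot tell which one the
# file's creator intended without out-of-band information.
print(as_utf8)
print(as_latin1)
```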
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 21:58 UTC (Sun) by clugstj (subscriber, #4020) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 29, 2009 22:37 UTC (Sun) by epa (subscriber, #39769) [Link]
No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.
Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible in all cases to return the correct filename to the user, matching the original one chosen by the user. If you pick a known encoding everywhere (UTF-8 being the obvious choice) then the problem goes away.
This doesn't represent a security problem.
Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate from the issue of how to encode characters that don't fit in ASCII.
a filename issue or a shell issue?
Posted Mar 25, 2009 21:10 UTC (Wed) by renox (subscriber, #23785) [Link]
a filename issue or a shell issue?
Posted Mar 25, 2009 21:53 UTC (Wed) by alecs1 (guest, #46699) [Link]
Keep them coming :)
a filename issue or a shell issue?
Posted Mar 26, 2009 18:30 UTC (Thu) by quotemstr (subscriber, #45331) [Link]
a filename issue or a shell issue?
Posted Mar 25, 2009 23:37 UTC (Wed) by dwheeler (guest, #1216) [Link]
Perhaps, but even systems which have objects get burned. As noted earlier, the Python developers have had a hard time.... they've switched to Unicode as their main text representation, but Unix filenames aren't text... they are essentially binary blobs! If filenames were always UTF-8, there'd be no problems. Similarly, perl programs get trashed if filenames begin with <.
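For what it's worth, one workaround for the "binary blob" problem is to smuggle the undecodable bytes through the string type. The sketch below uses the `surrogateescape` error handler, which is an assumption about the environment: it exists in Python 3.1 and later, not in Python 3.0. The filename is made up.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Undecodable bytes are mapped to lone surrogate code points...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and mapped back to the original bytes on the way out.
back = text.encode("utf-8", errors="surrogateescape")

print(back == raw)
```

The round trip preserves the bytes, but the intermediate string is still not displayable text, so this hides the problem from the program rather than solving it for the user.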
Case in point
Posted Mar 25, 2009 21:11 UTC (Wed) by proski (subscriber, #104) [Link]
One of the reasons git replaced many shell scripts with C code was support for weird file names. C is better at handling them. In absence of such issues, many commands would have remained shell scripts, which are easier to improve.
Simplicity is better than complexity.
Posted Mar 26, 2009 2:17 UTC (Thu) by k8to (subscriber, #15413) [Link]
This proposal is a whole lot of complexity in kernel code and the API.
UNWANT.
Simplicity is better than complexity.
Posted Mar 26, 2009 2:22 UTC (Thu) by k8to (subscriber, #15413) [Link]
Sure, some find implementations don't have it. Fix them.
Simplicity is better than complexity.
Posted Mar 26, 2009 2:29 UTC (Thu) by k8to (subscriber, #15413) [Link]
As evidence for my position, here are some real-world filenames that my software needed to create to correctly archive some digital music history of the personal computer as an instrument.
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦*
d_    .it
d_   .it
d_  .it
d_ .it
d_1151.it
d_1152.it
d_1153.it
d_1154.it
¦¦¦¦¯¯Ì_.it
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦* |xxd
0000000: 645f 2020 2020 2e69 740a 645f 2020 202e  d_    .it.d_   .
0000010: 6974 0a64 5f20 202e 6974 0a64 5f20 2e69  it.d_  .it.d_ .i
0000020: 740a 645f 3131 3531 2e69 740a 645f 3131  t.d_1151.it.d_11
0000030: 3532 2e69 740a 645f 3131 3533 2e69 740a  52.it.d_1153.it.
0000040: 645f 3131 3534 2e69 740a a6a6 a6a6 afaf  d_1154.it.......
0000050: cc5f 2e69 740a                           ._.it.
These files are handled by a combination of python and shellscripts, and one piece of C code (wrapping a library which knew how to read certain binary formats.) All of these pieces can handle newlines, tabs, spaces, control characters, leading dashes, and so on. I'm not really that smart. It wasn't much work. If shellscripts are 5 second hackjobs, then they will always fail in some cases: strange filenames, permissions problems, etc. If you take a few minutes to apply correct safeguards, then things work fine.
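As a rough illustration of the kind of safeguard meant here (a sketch only, not k8to's actual code): when a script hands filenames to another program, passing them as separate argv entries avoids shell word splitting, and prefixing "./" keeps a leading dash from being read as an option. The helper name `safe_argv` is made up for this example.

```python
def safe_argv(command, filenames):
    """Build an argv list in which no filename can be mistaken for an option.

    `command` is a list such as ["rm", "-f"]; `filenames` are relative or
    absolute names. Names starting with "-" get a "./" prefix, which refers
    to the same file but no longer looks like an option.
    """
    argv = list(command)
    for name in filenames:
        if name.startswith("-"):
            argv.append("./" + name)
        else:
            argv.append(name)
    return argv


if __name__ == "__main__":
    # Newlines, spaces and tabs need no escaping, because no shell is involved
    # when this list is handed to something like subprocess.run(argv).
    print(safe_argv(["ls", "-l"], ["-rf", "two words", "line\nbreak"]))
```

The list can then be passed to `subprocess.run` without `shell=True`, so whitespace and control characters in the names are never reinterpreted.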
Simplicity is better than complexity.
Posted Mar 26, 2009 2:31 UTC (Thu) by k8to (subscriber, #15413) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 2:39 UTC (Thu) by foom (subscriber, #14868) [Link]
I find it very hard to believe that your software *needed* to create unintelligible filenames. And if it
did, I'd claim it needs to be fixed.
Simplicity is better than complexity.
Posted Mar 26, 2009 2:48 UTC (Thu) by k8to (subscriber, #15413) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 3:40 UTC (Thu) by foom (subscriber, #14868) [Link]
I claim that I'd find software that creates filenames like that on my disk to be irritating. So I'd certainly prefer if no software actually did so, and probably wouldn't mind if it was impossible to do so.
If, in some alternative universe, it was already impossible to create those filenames, I have little doubt you could still have created working software which didn't require the impossible.
Sorry I come off as unreasonable to you. *hugs*
Do you know difference between two words: "need" and "want"?
Posted Mar 26, 2009 8:41 UTC (Thu) by khim (subscriber, #9252) [Link]
> you have no clue about my software or the project but you claim to know what is correct and incorrect.
I don't have a clue. And I don't need to know anything about your project to know you are lying. Any project can be implemented with exactly two filenames: "0" and "1". You'll need infinite depth of directory structure to do so, true, but thankfully there are no practical limitations in Linux. Is it feasible? Probably not. Is it possible? Of course. And if we start from the position that your software does not need these filenames but your current design does, suddenly you have a much weaker argument: you are reducing the complexity of your software by increasing the complexity of everyone else's software. Is it a good trade-off? Maybe yes, maybe no. But it's a weak argument at best, no matter what your project is and what it needs to do.
Simplicity is better than complexity.
Posted Mar 26, 2009 10:01 UTC (Thu) by epa (subscriber, #39769) [Link]
I know this is a matter of taste, and merely trying to impose one person's tastes on everyone is not a reason to change the kernel. But on the other hand, the marginal extra disk space saving (ten bytes?) from being able to put arbitrary binary stuff in filenames without encoding does not outweigh the many good reasons that Wheeler gave for changing.
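To make the "encoding" alternative concrete, here is a small sketch of storing arbitrary bytes in a plain ASCII filename with percent-encoding. It assumes Python's `urllib.parse` helpers; the function names `encode_name` and `decode_name` are invented for the example, and nothing here is taken from the software under discussion.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_name(raw: bytes) -> str:
    """Map arbitrary bytes to an ASCII name; every unsafe byte becomes %XX."""
    # safe="" means "/" is escaped too. Letters, digits and "_.-~" stay as-is.
    return quote_from_bytes(raw, safe="")


def decode_name(name: str) -> bytes:
    """Recover the original bytes from an encoded name."""
    return unquote_to_bytes(name)


if __name__ == "__main__":
    original = b"\xa6\xa6\xa6\xa6\xaf\xaf\xcc_.it"
    stored = encode_name(original)
    print(stored, decode_name(stored) == original)
```

The cost is a few extra bytes per escaped character. Note that this particular encoding leaves a leading "-" untouched, so a real scheme would have to handle that case separately.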
Simplicity is better than complexity.
Posted Mar 26, 2009 13:54 UTC (Thu) by clugstj (subscriber, #4020) [Link]
Put quotes around it?
Simplicity is better than complexity.
Posted Mar 26, 2009 10:27 UTC (Thu) by mjj29 (guest, #49653) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 21:22 UTC (Thu) by explodingferret (guest, #57530) [Link]
Are you able to make the source of your shell scripts available? I'm sure I can find something in them that is breakable. :-)
DANGER! DANGER! DANGER! HYPOCRISY LEVEL IS OVER 9000!!!
Posted Mar 26, 2009 8:30 UTC (Thu) by khim (subscriber, #9252) [Link]
This argument:
> Simplicity is better than complexity.
plus this one
> As for the find, gnu find already has -print0 and xargs is compatible.
equals hypocrisy.
And you can not even claim that "we already solved this problem so it's old code vs new code". A lot of programs just don't work with the current approach (especially scripts). You need to write and fix literally millions of lines of code vs a few thousand in the kernel.
Sorry, but you are advocating more complex solution while preaching simplicity.
Simplicity is better than complexity.
Posted Mar 26, 2009 2:49 UTC (Thu) by njs (guest, #40338) [Link]
(My opinion on this issue is strongly influenced by writing the filesystem interaction code for a VCS. Users in different locales may want to work on the same project, but they write the same filenames differently, and some charsets may not be able to even represent filenames created in other locales, and...)
There are arguments for the current system, but "simplicity" is really not one of them.
Simplicity is better than complexity.
Posted Mar 27, 2009 0:46 UTC (Fri) by nix (subscriber, #2304) [Link]
(As in, I looked at XEmacs/MULE's code and my brain dribbled out of my
ears, following which I was simple.)
Simplicity is better than complexity.
Posted Mar 26, 2009 5:34 UTC (Thu) by flewellyn (subscriber, #5047) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 13:49 UTC (Thu) by clugstj (subscriber, #4020) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 13:54 UTC (Thu) by epa (subscriber, #39769) [Link]
Simplicity is better than complexity.
Posted Mar 26, 2009 16:02 UTC (Thu) by flewellyn (subscriber, #5047) [Link]
But a major point of Wheeler's argument is that existing programs, filesystems, and indeed operating systems already assume that these restrictions are the case, as a matter of convention, but do not necessarily do anything to ensure that they are enforced. Existing software already rejects or fails to properly handle filenames which would violate these conventions, and the vast majority of existing files are named according to these conventions; at the very least, filenames with leading dashes, tab or newline characters, or shell control characters are very rare, and probably accidental or malicious.
So the entire point is that the changes required to existing software would be minimal, and existing software which could break on filenames that don't obey these restrictions when they're not enforced by the OS, would no longer have a problem.
Is kernel the whole world?
Posted Mar 26, 2009 8:24 UTC (Thu) by khim (subscriber, #9252) [Link]
> Simplicity is better than complexity.
Oh so very true.
> This proposal is a whole lot of complexity in kernel code and the API.
It also removes a whole bunch of code from other places. Even more important: it removes the need to write and fix a bunch of code.
Simplicity is better than complexity.
Posted Mar 26, 2009 9:56 UTC (Thu) by epa (subscriber, #39769) [Link]
If you've ever tried such an exercise, you would not believe that allowing control characters in the middle of filenames and leaving userspace to deal with the resulting mess could ever be called 'simplicity'.
Wheeler's suggestion would greatly simplify a lot of code, or if you prefer, fix many hidden bugs and security holes in code that is currently buggy.
Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace. It is hardly adding significant complexity.
Simplicity is better than complexity.
Posted Mar 28, 2009 17:03 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
Show me the money. Five lines, plus one per syscall. Not a lot of work to support such a broad and sweeping claim. Write those lines carefully, we wouldn't want you to be hand-waving and have missed 99.9% of the complexity of the problem...
Simplicity is better than complexity.
Posted Mar 29, 2009 15:03 UTC (Sun) by epa (subscriber, #39769) [Link]
/* unsigned char, so bytes >= 128 do not compare as negative */
for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
if (*c < 32) return EINVAL;
Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):
for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
if (*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
if (filename[0] == '-') return EINVAL;
To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. The code would be something like
/* No padding is needed: && short-circuits, and the terminating NUL is never a continuation byte. */
/* This follows the original 31-bit UTF-8 layout (RFC 3629 stops at 4 bytes); overlong forms are not rejected. */
static int is_cont(unsigned char c) { return 128 <= c && c < 192; }
const unsigned char *p = (const unsigned char *)filename;
while (*p) {
if (*p < 128) ++p;
else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2])) p += 3;
else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3])) p += 4;
else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4])) p += 5;
else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
else return EINVAL;
}
For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.
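For anyone who wants to experiment with the same policy in userspace, the three checks above can be combined into one function. This is only a sketch in Python, not kernel code; the name `filename_ok` and the `BAD_BYTES` set are made up here. One assumption to flag: Python's strict UTF-8 decoder is tighter than the C sketch, since it also rejects overlong forms, encoded surrogates and the old 5- and 6-byte sequences.

```python
# Bytes rejected anywhere in a name, mirroring the fixed list in the C sketch.
BAD_BYTES = frozenset(b"<>|")


def filename_ok(name: bytes) -> bool:
    """Return True if `name` passes the control-character, bad-character,
    leading-dash and UTF-8 checks described in this comment."""
    if name.startswith(b"-"):
        return False
    # Iterating over bytes yields integers 0..255, so no signedness issues.
    if any(byte < 32 or byte in BAD_BYTES for byte in name):
        return False
    try:
        name.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for candidate in (b"notes.txt", b"a\nb", b"-rf", b">out", b"\xff"):
        print(candidate, filename_ok(candidate))
```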
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 2:44 UTC (Thu) by explodingferret (guest, #57530) [Link]
Here are some problems I noticed in your article:
1) "These restrictions only apply to Windows - Linux, for example, allows
use of " * : < > ? \ / | even in NTFS." -- is "/" supposed to be in that list?
2) You state that changing IFS and banning newlines and tabs in filenames would make things like 'cat $file' safer, but you should also state that shell glob characters would also need to be removed (namely *?[]).
3) You state (or at least imply) that there is no way to reliably use filenames from find, but there is a POSIX compliant and known portable method:
find . -type f -exec somecommand {} \;
or for more complex cases:
find . -type f -exec sh -c 'if true; then somecommand "$1"; fi' -- {} \;
For xargs fans, on all but files with newlines, you can do
find . -type f | sed -e 's/./\\&/g' | xargs somecommand
This is a feature of xargs and is specified by POSIX. It disables various quoting problems with xargs that you don't mention.
4) Your setting of IFS to a value of tab and newline is overly complicated. Simply use IFS=`printf \\n\\t`. It is only trailing newlines that are removed. If the different behaviour this causes with "$*" is not desired, one can set IFS=`printf \\t\\n\\t`. I know of no tool or POSIX restriction that says characters may not be repeated in IFS.
Otherwise great article! It really would be so nice to use line-separated commands in `` and not have to worry about things breaking. And although most of the thoughts expressed here are well known to me, the idea of getting the kernel to check the validity of UTF-8 filenames is fantastic!
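The line-separated case mentioned above is exactly where things break today, so a script that wants to consume a file list safely has to use a different separator. As a sketch (assuming the list comes from something like GNU find's -print0, which is mentioned elsewhere in this thread), NUL-separated output can be split without any quoting rules at all. `split_nul` is a hypothetical helper name.

```python
def split_nul(data: bytes):
    """Split NUL-terminated filename output into a list of names.

    NUL is the one byte that cannot appear inside a Unix filename, so
    newlines, tabs, spaces and leading dashes all survive unchanged.
    """
    return [name for name in data.split(b"\0") if name]


if __name__ == "__main__":
    sample = b"./plain\0./two words\0./line\nbreak\0"
    for name in split_nul(sample):
        print(repr(name))
```

Splitting the same data on newlines would report "./line" and "break" as two separate files.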
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 19:50 UTC (Sat) by dwheeler (guest, #1216) [Link]
Thanks for your comments! In particular, you're absolutely right about swapping the order of \t and \n in IFS - that makes it MUCH simpler. I prefer IFS=`printf '\n\t'` because then it's immediately obvious that \n and \t are the new values. I've put that into the document, with credit.
Parentheses
Posted Mar 26, 2009 4:52 UTC (Thu) by eru (subscriber, #2753) [Link]
I don't think you could ban "()". They frequently appear in names in Windows-originated directories, apparently because some common programs generate file names containing "(1)", "(2)", ... to avoid collisions or indicate file versions.
In general, the only shell metacharacters that could be banned without causing interoperability problems are those that are special also in That Other OS.
Not A System Problem
Posted Mar 26, 2009 9:56 UTC (Thu) by ldo (guest, #40946) [Link]
Not A System Problem
Posted Mar 27, 2009 0:56 UTC (Fri) by nix (subscriber, #2304) [Link]
And if you remove the prohibition on slashes, how do you distinguish
between a file called foo/bar and a file called bar in a subdirectory foo?
These limitations are there because the semantics of the filesystem itself
depends on them.
Re: Not A System Problem
Posted Mar 29, 2009 10:30 UTC (Sun) by ldo (guest, #40946) [Link]
nix wrote:
> Um, if you remove the prohibition on nulls, how do you end the filename? This isn't Pascal.
Nothing to do with Pascal. C is perfectly capable of dealing with arbitrary data bytes, otherwise large parts of both kernel and userland code wouldn't work.
> And if you remove the prohibition on slashes, how do you distinguish between a file called foo/bar and a file called bar in a subdirectory foo?
Simple. The kernel-level filesystem calls will not take a full pathname. Instead, they will take a parent directory ID and the name of an item within that directory. Other OSes, like VMS and old MacOS, were doing this sort of thing decades ago.
Full pathname parsing becomes a function of the userland runtime. The kernel no longer cares what the pathname separator, or even what the pathname syntax, might be.
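A loose present-day analogue of "parent directory plus item name" is the openat() family, which Python exposes through the `dir_fd` parameter of `os.open`. The sketch below only illustrates that calling style; it is not the design described above, because today's kernels still interpret "/" inside the name. The helper names are made up, and `dir_fd` support is assumed (it is available on Linux).

```python
import os
import tempfile


def create_in_dir(dir_fd, name, data):
    """Create `name` inside the directory referred to by `dir_fd`."""
    fd = os.open(name, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600, dir_fd=dir_fd)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def read_in_dir(dir_fd, name):
    """Read up to 64 KiB from `name` inside the directory `dir_fd`."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 65536)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            create_in_dir(dir_fd, b"odd\nname", b"hello")
            print(read_in_dir(dir_fd, b"odd\nname"))
        finally:
            os.close(dir_fd)
```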
Re: Not A System Problem
Posted Mar 29, 2009 13:54 UTC (Sun) by nix (subscriber, #2304) [Link]
I'm sure users would love not being able to type in pathnames anymore,
too.
Good luck getting anyone to do it.
Re: Not A System Problem
Posted Mar 29, 2009 19:47 UTC (Sun) by ldo (guest, #40946) [Link]
nix wrote:
> What you're describing is not POSIX anymore.
Nothing to do with POSIX. POSIX is a userland API, it doesn't dictate how the kernel should work.
Re: Not A System Problem
Posted Mar 29, 2009 22:32 UTC (Sun) by nix (subscriber, #2304) [Link]
So whatever you're describing, userspace cannot any longer use standard
POSIX calls: in fact, it can't any longer use ANSI C calls! I suspect that
such a system would be almost unusable with C, simply because you couldn't
use C string literals for anything.
If you want VMS, you know where to find it.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 10:45 UTC (Thu) by kerolasa (guest, #56089) [Link]
http://en.wikipedia.org/wiki/Internationalized_domain_name
That would mean that there are encoded and unencoded versions of filenames, or two representations, matching the same inode. The version you want to see depends on an environment variable or perhaps a command line option. To me this sounds like a libc hack, and the guys making changes to that are conservative (thank god they are, who wants an unstable libc anyway). Even though the safe mode sounds like a good idea, I don't expect to see it for the next couple of years. Well, we'll see.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 12:23 UTC (Thu) by jpetso (subscriber, #36230) [Link]
Why don't we instead fix the mechanism that transports strings in the bash?
Like, "All glob expansions are automatically enclosed in strings", "If it's
a string then don't f*cking interpret it as an option", and maybe even
"Here's an array of return values" instead of "If the viewer is a program
then split by newlines, if the viewer is a user then make a table". Type
safety ftw?
I'm all for reasonable defaults and constraining unnecessary stuff, but
when there's an actual *sensible* use case then that use case should not be
made impossible just because the implementation is crappy.
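Until the shell itself behaves that way, a program that has to build a shell command line can approximate the first two wishes by quoting every word and ending option parsing with "--". A minimal sketch, assuming Python's `shlex.quote` and a target program that honours the conventional "--" marker (most POSIX utilities do, but not all); `shell_line` is an invented helper name.

```python
import shlex


def shell_line(program, filenames):
    """Build a shell command line in which filenames are inert strings.

    Each name is quoted so globs, spaces and quotes are not interpreted,
    and "--" tells the program that everything after it is an operand.
    """
    words = [shlex.quote(program), "--"]
    words += [shlex.quote(name) for name in filenames]
    return " ".join(words)


if __name__ == "__main__":
    print(shell_line("cat", ["-n", "a b*?[x].txt", "it's"]))
```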
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 12:26 UTC (Thu) by jpetso (subscriber, #36230) [Link]
Oops, LWN swallows less-than & Co. even in plain-text mode... whatever, imagine exclamation/question marks, parentheses etc. on the second line. Plus some suffixed text that says I disagree that we disallow those characters just because we don't cope with them.
Leading spaces are common, actually
Posted Mar 26, 2009 13:17 UTC (Thu) by barryn (subscriber, #5996) [Link]
I was originally going to argue that leading spaces are necessary since people still have data
from Mac OS 9.x and earlier systems, but it turns out that this practice is far more common on
modern Mac OS X than I expected:
# find / -name ' *'
/Applications/Microsoft Office 2004/Clipart/Animations/ Animations Clip Package
/Applications/Microsoft Office 2004/Clipart/Clip Art/ Clip Art Clip Package
/Applications/Microsoft Office 2004/Clipart/Photos/ Photos Clip Package
/Applications/Microsoft Office 2004/Office/Entourage First Run/Entourage Script Menu Items/ About This Menu...
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic/ Default.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Ambient/ Ambient Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Classical/ Classical Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Dance/ Dance Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Hip Hop/ Hip Hop Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Jazz/ Jazz Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Pop/ Pop Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Rock/ Rock Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Stadium Rock/ Stadium Rock Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Band Instruments/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Basic Track/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Bass/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Drums/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Effects/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Guitars/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Podcasting/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Vocals/ No Effects.cst
/Library/Scripts/Mail Scripts/Rule Actions/ Help With Rule Actions.scpt
Leading spaces are common, actually
Posted Mar 26, 2009 19:21 UTC (Thu) by njs (guest, #40338) [Link]
> There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic sort.
Are you sure? I've seen this in real-world uses too, but I thought that all the common systems were fixed to do human-style (non-ASCIIbetical) sorting a few years ago. I don't have any proprietary systems around to test, but I'll be *really* amused if the OS X Finder is missing this usability feature of GNU ls:
~/t$ touch "a" " b" "c"
~/t$ ls -l
total 0
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 a
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16  b
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 c
Leading spaces are common, actually
Posted Mar 26, 2009 21:23 UTC (Thu) by foom (subscriber, #14868) [Link]
Finder does not ignore spaces. I'm quite glad, because I use the space-prefix trick rather regularly. I am occasionally annoyed at how GNU ls sorts "A B" between "AA" and "AC" instead of before them: that's certainly not how my brain sorts.
Finder does however sort numbers like this, which GNU ls does not: 1 2 10
I don't really see what the point of the "human-style" sorting is if it can't even sort numbers. That seems kind of basic to me.
Leading spaces are common, actually
Posted Mar 28, 2009 1:21 UTC (Sat) by nix (subscriber, #2304) [Link]
(By default, despite comments elsewhere in this thread, ls sorts
ASCIIbetically, so " 2" comes before "1".)
Leading spaces are common, actually
Posted Mar 28, 2009 1:57 UTC (Sat) by foom (subscriber, #14868) [Link]
> Sorting numerically in GNU ls is done by 'ls -v'.
Huh, never knew that, interesting! Never would have found that from the man page, which says "-v sort by version". That seems a remarkably poor description of what it actually does.
> (By default, despite comments elsewhere in this thread, ls sorts ASCIIbetically, so " 2" comes before "1".)
Well, not exactly: GNU ls has a default sort which depends on the locale's collation settings, and most systems default to a locale like en_US.UTF-8, so most people have it sorting in a case/accent-insensitive manner by default on their systems.
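The difference between the orderings being discussed is easy to see with a few lines of code. This is only a sketch of the general idea: `natural_key` is a made-up helper that compares digit runs as numbers, and it does not claim to reproduce what `ls -v`, the locale collation or the Finder actually implement.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically and text case-insensitively."""
    parts = re.split(r"(\d+)", name)
    # With one capture group, odd indices are always the digit runs, so
    # integers are only ever compared with integers.
    return [int(part) if index % 2 else part.casefold()
            for index, part in enumerate(parts)]


if __name__ == "__main__":
    names = ["10", "2", "1"]
    print(sorted(names))                   # plain code-point order
    print(sorted(names, key=natural_key))  # digit runs compared as numbers
    print(sorted(["a", " b", "c"]))        # a leading space sorts first
```

In plain code-point order the space (0x20) sorts before every letter and digit, which is why the leading-space trick works wherever names are compared byte by byte.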
Leading spaces are common, actually
Posted Mar 28, 2009 20:36 UTC (Sat) by nix (subscriber, #2304) [Link]
(And you're right on the collation sort thing: I spoke carelessly.)
Leading spaces are common, actually
Posted Mar 27, 2009 4:41 UTC (Fri) by barryn (subscriber, #5996) [Link]
Behavior of ls in Mac OS X 10.5.6 build 9G55:
$ pwd
/Library/Application Support/GarageBand/Instrument Library/Track Settings
$ ls -l Master | head
total 0
drwxrwxrwx 3 root admin 102 May 3 2008  Basic
drwxrwxrwx 6 root admin 204 May 3 2008 Ambient
drwxrwxrwx 6 root admin 204 May 3 2008 Classical
drwxrwxrwx 11 root admin 374 May 3 2008 Dance
drwxrwxrwx 5 root admin 170 May 3 2008 Hip Hop
drwxrwxrwx 5 root admin 170 May 3 2008 Jazz
drwxrwxrwx 7 root admin 238 May 3 2008 Pop
drwxrwxrwx 7 root admin 238 May 3 2008 Rock
drwxrwxrwx 5 root admin 170 May 3 2008 Stadium Rock
$
And this matches the Finder's behavior. BTW, if the Finder behaved any other way, it would be more difficult to properly recover broken Mac OS 9.x or earlier installations using OS X -- Classic Mac OS loads files in /System Folder/Extensions in lexicographic order, and the load order matters, and the leading space trick is used very frequently there. Mac OS X 10.5 can dual-boot with Mac OS 9.x, so this still matters for some users.
Leading spaces are common, actually
Posted Mar 27, 2009 15:45 UTC (Fri) by foom (subscriber, #14868) [Link]
> Behavior of ls in Mac OS X 10.5.6 build 9G55
Well, OSX's "ls" is actually just doing a traditional strcmp sort, not anything fancy (note that it puts all uppercase characters before all lowercase).
But the Finder's sort routine is fancy. It seems to use a sort order based on Unicode TR10.
Leading spaces are common, actually
Posted Nov 15, 2009 0:32 UTC (Sun) by yuhong (guest, #57183) [Link]
Leading spaces are common, actually
Posted Mar 27, 2009 5:36 UTC (Fri) by quotemstr (subscriber, #45331) [Link]
The users on a filesystem I administer use six or seven levels of leading space to sort their common jobs-in-progress directory. I've long since given up on getting them to move to a hierarchical setup.
The way I see it, if a program can correctly work with filenames containing spaces, it can work with a filename containing leading spaces.
It's most important to eliminate newlines and control characters in filenames. The second most important consideration is specifying UTF-8 as the preferred filename encoding. Let's not get caught up in all sorts of other wishes that will just encourage endless debate and prevent these very important changes from getting made at all.
Leading spaces are common, actually
Posted Mar 28, 2009 9:21 UTC (Sat) by explodingferret (guest, #57530) [Link]
perl also has problems with leading spaces in filenames, unless you use the right kind of open command (perldoc -f open).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 14:29 UTC (Thu) by kenjennings (guest, #57559) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 1:01 UTC (Sat) by nix (subscriber, #2304) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 1:06 UTC (Sat) by pr1268 (subscriber, #24648) [Link]
People should not be using filenames as data storage.
How about metadata storage? In my PC troubleshooting days, I came across a Windows XP box with a folder of videos of adult content whose file names were lengthy and explicit descriptions of the activities portrayed in the videos. Just reading the directory listing alone conjured up many vivid and disturbing thoughts. Fortunately this wasn't Windows Vista—its Explorer even creates video thumbnails. :-o
Meta-discussion
Posted Mar 26, 2009 18:37 UTC (Thu) by quotemstr (subscriber, #45331) [Link]
(For the record, I support all the proposed restrictions on filenames except for a ban on shell meta-characters.)
Meta-discussion
Posted Mar 28, 2009 22:21 UTC (Sat) by man_ls (guest, #15091) [Link]
Hmmm, I'm not so sure. I feel strongly about ext4 losing data, but I don't have a strong opinion about this issue. Really. Not for lack of sensitivity to the problem -- I've had an administrator at work erase a whole directory of files because of a leading space (so that 'rm -rf /dir/file' became 'rm -rf /dir/ file'). But there are advantages and disadvantages, and I cannot pick a side.
Bojan has only posted once, and his message contains the words "not sure". I would say that this debate attracts a different subset of (opinionated) people.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 19:34 UTC (Thu) by az (guest, #46701) [Link]
Filenames need to be usable for describing the contents of files. You sure as hell don't need newlines and tabs for that, but you certainly should be able to use the same punctuation you can use in a sentence - in English, that's !@#$%&()-:;"',.? but in other languages more characters may be needed (but they would be no different from any other non-ascii utf-8 character). I think the requirement that a filename can't start or end with certain characters is acceptable - you don't expect a sentence to start with most of them - but inside the string being forbidden from using them would be very constraining.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 19:36 UTC (Thu) by az (guest, #46701) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 23:04 UTC (Thu) by zooko (guest, #2589) [Link]
"When reading in a filename, what we really want is to get the unicode of that filename without risk of corruption due to false decoding. ... Unfortunately this isn't possible except on Windows and Macintosh."
http://allmydata.org/pipermail/tahoe-dev/2009-March/00137...
Hopefully Linux folks will follow D. Wheeler's lead on this and make it so that some future version of Tahoe can reliably get filenames from Linux just as it currently can from Windows and Mac OS X.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 26, 2009 23:35 UTC (Thu) by zooko (guest, #2589) [Link]
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 16:00 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
There, I said it.
It really doesn't exist. You, up in Win32 land, are forbidden from creating certain filenames, but everybody else running on the same NT kernel and sharing a filesystem with you is allowed to continue doing as they please, and so the APIs you're using /explicitly/ don't promise what you're relying on.
The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too.
In both cases you will /usually/ not see sequences that are crazy and undisplayable or not legal in some (non-filename) context, but you might, and when you do the OS vendor will say "I never promised otherwise". So you need defensive coding.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 16:41 UTC (Sat) by zooko (guest, #2589) [Link]
"The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too."
I don't think this is true. The bytes in the filenames in NT are defined to be UTF-16 encodings of characters. The bytes in the filenames in Mac are defined to be UTF-8 encodings. The bytes in the filenames in Linux are not defined to be any particular encoding. It isn't just a definitional issue -- the result is that reading a filename from the Windows or Mac filesystem can't cause you to lose information -- the filename you get in your application is guaranteed to be the same as the filename that is stored in the filesystem. On the other hand, when you read a filename from the filesystem in Linux, then you need to decide how to attempt to decode it, and there is no way to guarantee that your attempt won't corrupt the filename.
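The decoding problem on Linux can be shown without touching a filesystem at all. In this sketch the same bytes are decoded under two guessed encodings; `decode_guesses` is a hypothetical helper, and the two encodings are just examples of what a user's locale might have been when the file was created.

```python
def decode_guesses(raw: bytes):
    """Decode `raw` under a few candidate encodings; None marks a failure."""
    guesses = {}
    for encoding in ("utf-8", "latin-1"):
        try:
            guesses[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            guesses[encoding] = None
    return guesses


if __name__ == "__main__":
    # The same name written by a UTF-8 user and by a Latin-1 user.
    print(decode_guesses(b"caf\xc3\xa9"))
    print(decode_guesses(b"caf\xe9"))
```

Nothing in the bytes says which guess is right: the UTF-8 name decodes "successfully" as Latin-1 into different text, and the Latin-1 name is not valid UTF-8 at all.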
Please correct me if I'm wrong, because I'm intending to make the Tahoe p2p disk-sharing app depend on this guarantee from Windows (and from Mac), and to make it painfully work around the lack of this guarantee in Linux.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 17:45 UTC (Sat) by tialaramex (subscriber, #21167) [Link]
* I am saying that you can't guarantee that the filenames Windows gives you are all legal UTF-16 Unicode strings. Windows makes no such promise. Non-Win32 programs (including Win32 programs which also use native low-level APIs) may create files which don't obey the convention, and filenames on disk or from a network filesystem are not checked to see if they are valid UTF-16.
* I am NOT saying that there are people running Windows whose filenames are all in SJIS or ISO-8859-8 or even Windows codepage 1252. That would be silly because those encodings (and indeed practically all legacy encodings) are 8-bit and all filenames in Windows are 16-bit. When a Windows filename "means something" at all, the meaning will be encoded as UTF-16, or perhaps if you're really unlucky, UCS-2.
So if your problem is "People keep running my program with crazy locale settings and legacy encodings of filenames" well you have my sympathy, and yes you will need to handle this for Linux (even if only by writing a FAQ entry telling them to switch to UTF-8) and might get away without on Windows.
But if the problem is "My program blindly assumes filenames are legal Unicode strings" then you're in a bad way, stop doing that because it's a bug at least on Linux and Windows, and IMO most likely on Mac OS X too (though their documentation claims otherwise).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 18:54 UTC (Sat) by foom (subscriber, #14868) [Link]
That's not actually true. The windows APIs take arrays of 16-bit "things". Those are supposed to be
UTF-16, but none of the APIs will check that. So, you can easily create invalid surrogate pair
sequences. Now, it's a *lot* easier to ignore this issue on windows than on linux, because:
a) The set of invalid sequences in UTF-16 is a lot smaller than in UTF-8.
b) Nobody creates those by accident. It won't happen just because you set your LOCALE wrong.
c) the windows Unicode APIs are all 16-bit unicode, so they never try decoding the surrogate pair
sequences anyways
d) Even UTF-16->UTF-32 decoders often decode a lone surrogate in UTF-16 into a lone
surrogate value in UTF-32 (even though they're theoretically not supposed to do that).
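For code that wants to detect the case rather than ignore it, checking a sequence of 16-bit units for unpaired surrogates takes only a few lines. This is a sketch with an invented name, `is_well_formed_utf16`, operating on a list of integers; how the units are obtained from a given platform API is left out.

```python
def is_well_formed_utf16(units):
    """Return True if the 16-bit code units contain no unpaired surrogates."""
    index = 0
    while index < len(units):
        unit = units[index]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if index + 1 < len(units) and 0xDC00 <= units[index + 1] <= 0xDFFF:
                index += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return False
        index += 1
    return True


if __name__ == "__main__":
    print(is_well_formed_utf16([0x0041, 0xD83D, 0xDE00]))  # letter + valid pair
    print(is_well_formed_utf16([0x0041, 0xD83D]))          # lone high surrogate
```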
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 27, 2009 7:10 UTC (Fri) by janpla (guest, #11093) [Link]
The important thing in any theoretical framework is that it is orthogonal and logically complete; because then all it takes is clarity of mind to figure out the correct way to do things. A basic, and in my opinion very sound, principle in UNIX is that the system provides functionality, not policy. This means that unfortunately you can potentially do incredibly stupid things, but it also means that you are not excluded from doing incredibly clever things either.
UNIX is an operating system for adults who take responsibility and at least try to think before they jump. If you want a Fisher-Price interface where nothing can harm you, there are other options available.
Bad understanding of UTF-8
Posted Mar 27, 2009 22:56 UTC (Fri) by spitzak (guest, #4593) [Link]
An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.
The problem is the stupid Python guys who believe in magic fairy land where all UTF-8 is valid. This is also causing havoc with using Python3 for URLs and HTML. No, I'm sorry, if a file contains UTF-8, it is going to have invalid sequences. They need to get their heads out of their *** and do something so that invalid UTF-8 is preserved ALL THE TIME and never throws an exception, unless you specifically call "throw_exception_if_not_valid_utf8()".
Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution. This substitution does not really fix things because it does not do a round trip clean conversion. Supporting round-trip means your system cannot name invalid UTF-16 file names, and if you think those don't exist you are really living in a fantasy world!
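For reference, the 0xDCxx translation mentioned above is what Python 3's "surrogateescape" error handler (PEP 383) does. The sketch below only shows the bytes -> text -> bytes direction; the helper names and the sample filename are invented for illustration.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: U+DCNN escapes turn back into the raw byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# A hypothetical Latin-1 filename: 0xE9 is not valid UTF-8 in this position.
name = b"caf\xe9.txt"
text = to_text(name)          # 'caf\udce9.txt'
restored = to_bytes(text)     # b'caf\xe9.txt' again
```

Starting from bytes, the original name comes back unchanged. The reverse direction (arbitrary text containing U+DCxx, to bytes, and back) is where the mapping stops being one-to-one.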
I think therefore the escape character can easily be the UTF-8 encoding of 0xDCxx. This will not conflict with the above because all the escaped characters do not have the high bit set. This will survive a translation to UTF-16 and thus provides a way to put the exact same filenames on Windows UTF-16 filesystems.
His proposed rules for disallowed bytes seem pretty reasonable, though I would not disallow any printing characters in the interior of the filename; backslash escaping works pretty well in there.
Bad understanding of UTF-8
Posted Mar 28, 2009 3:40 UTC (Sat) by njs (guest, #40338) [Link]
> An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.
Bad understanding of UTF-8
Posted Mar 30, 2009 16:08 UTC (Mon) by spitzak (guest, #4593) [Link]
The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoder's error; a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ASCII characters which are unchanged and thus cannot cause a security hole).
The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.
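A minimal Python sketch of the "more than one different string maps to the same one" problem, using the standard "replace" error handler. The two filenames are made up for illustration.

```python
def lossy_decode(raw: bytes) -> str:
    """Decode as UTF-8, replacing every invalid byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Two different byte strings; 0xFF and 0xFE never appear in valid UTF-8.
first = b"report\xff.txt"
second = b"report\xfe.txt"

distinct_as_bytes = first != second                            # True
collide_as_text = lossy_decode(first) == lossy_decode(second)  # True
```

Compared as bytes the two names stay distinct; after lossy decoding they are indistinguishable.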
Bad understanding of UTF-8
Posted Mar 31, 2009 4:49 UTC (Tue) by njs (guest, #40338) [Link]
So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".
> The first two references are about programs failing to recognize overlong encodings as being invalid.
Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.
> But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).
I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.
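To make the missing check concrete, here is a hedged sketch in Python of a two-byte UTF-8 decoder with and without the overlong test. Both functions are illustrative only and are not taken from any of the programs discussed.

```python
def naive_decode_2byte(b0: int, b1: int) -> str:
    """Decode a two-byte sequence with no overlong check (the buggy variant)."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def checked_decode_2byte(b0: int, b1: int) -> str:
    """Decode a two-byte sequence, rejecting malformed and overlong forms."""
    if not 0xC0 <= b0 <= 0xDF or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    codepoint = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if codepoint < 0x80:
        # Anything below U+0080 must be encoded as a single byte.
        raise ValueError("overlong encoding")
    return chr(codepoint)
```

The naive version turns the bytes 0xC0 0xAF into "/", which is how a path check done on the raw bytes can be bypassed; the checked version raises ValueError for the same input and still decodes 0xC3 0xA9 as "é".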
> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.
Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.
Bad understanding of UTF-8
Posted Mar 31, 2009 17:59 UTC (Tue) by spitzak (guest, #4593) [Link]
An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!
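One way to see this per-byte view of errors is to ask Python's strict UTF-8 decoder which bytes it rejects. This is a sketch that relies on the standard "surrogateescape" handler; the helper name is hypothetical.

```python
def error_bytes(raw: bytes) -> list:
    """Return the byte values that a strict UTF-8 decoder rejects in raw.

    The surrogateescape handler maps each rejected byte 0xNN to U+DCNN,
    so the rejected bytes can be read back out of the decoded text.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    return [ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
```

For b"\xc0/etc" this reports only 0xC0 and leaves "/etc" as ordinary ASCII; for the overlong pair b"\xc0\xaf" it reports both bytes. In both cases every reported byte is 0x80 or above.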
I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.
I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.
Bad understanding of UTF-8
Posted Apr 1, 2009 5:12 UTC (Wed) by njs (guest, #40338) [Link]
Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".
> An overlong encoding consists of a leading byte with the high bit set. This is an error.
All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.
(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)
Bad understanding of UTF-8
Posted Apr 1, 2009 16:38 UTC (Wed) by spitzak (guest, #4593) [Link]
Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DOS error. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.
And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.
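As a sketch of the kind of reusable routine being asked for, Python 3.5 and later can produce a drawable string from arbitrary bytes without raising, and equality can be tested on the raw bytes. The function names here are invented for the example.

```python
def display_form(raw: bytes) -> str:
    """Text suitable for showing to a user; never raises on bad UTF-8.

    Invalid bytes are rendered as visible \\xNN escapes instead of being
    dropped or collapsed into a single replacement character.
    """
    return raw.decode("utf-8", errors="backslashreplace")


def same_name(a: bytes, b: bytes) -> bool:
    """Identity comparison on the raw bytes: no decoding, no exceptions."""
    return a == b
```

Here display_form(b"caf\xe9.txt") yields the text caf\xe9.txt with a literal backslash, while valid UTF-8 is decoded normally.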
Bad understanding of UTF-8
Posted Apr 15, 2009 10:38 UTC (Wed) by epa (subscriber, #39769) [Link]
> Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution.
They cannot 'leave the data in UTF-8' because it is not in UTF-8 to start with! If it contains invalid bytes then by definition it's not UTF-8. It is just a string of arbitrary bytes and certainly, yes, the application can treat it as such. That does make life difficult when you want to display the filename to the user or otherwise treat it as human-readable text.
And indeed, the Python developers are living in a magic fairy land where filenames are sanely encoded and are always human-readable text, but wouldn't it be better to change things so that this situation is no longer wishful thinking, but part of the ordinary things userspace can rely on? That is what Wheeler is proposing.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 28, 2009 11:04 UTC (Sat) by magnus (subscriber, #34778) [Link]
There ought to be a different way of passing file references between programs, and from programs to the kernel, so that conversion from text to a file reference is only ever needed for hand-written file names.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Posted Mar 31, 2009 5:14 UTC (Tue) by njs (guest, #40338) [Link]
Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds