Wheeler: Fixing Unix/Linux/POSIX Filenames
In a well-designed system, simple things should be simple, and the 'obvious easy' way to do something should be the right way. I call this goal 'no sharp edges' - to use an analogy, if you're designing a wrench, don't put razor blades on the handles. The current POSIX filesystem fails this test - it does have sharp edges. Because it's hard to do things the 'right' way, many Unix/Linux programs simply assume that 'filenames are reasonable', even though the system doesn't guarantee that this is true. This leads to programs with errors that aren't immediately obvious.
Posted Mar 25, 2009 14:05 UTC (Wed)
by epa (subscriber, #39769)
[Link] (3 responses)
Look at the recent Python version that got tripped up by filenames that are not valid UTF-8. Currently on a Unix-like system you cannot assume anything more about filenames than that they're a string of bytes. This frustrates efforts to treat them as Unicode strings and cleanly allow international characters.
Or look at the whole succession of security holes in shell scripts and even other languages caused by control characters in filenames. My particular favourite is the way many innocuous-looking perl programs (containing 'while (<>)') can be induced to overwrite random files by making filenames beginning with '>'.
A system-wide policy guaranteeing that only sane characters can appear in filenames would eliminate at a stroke a lot of tedious sanity-checking you have to do in userspace (not to mention the hidden bugs and security holes in many programs because the sanity-checking was not paranoid enough). Given the natural conservatism of developers, I can't be optimistic it will happen soon. But, like defaulting to relatime instead of updating atime on each access, it's a long-overdue spring clean to a particularly musty corner of the Unix way.
Posted Mar 25, 2009 16:52 UTC (Wed)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Mar 25, 2009 20:02 UTC (Wed)
by mjthayer (guest, #39183)
[Link] (1 responses)
Posted Mar 29, 2009 0:01 UTC (Sun)
by mikachu (guest, #5333)
[Link]
Posted Mar 25, 2009 15:12 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (13 responses)
Yet another topic that was extensively discussed in the Unix Haters Handbook.
One gotcha that is not covered by limiting the allowed character set in filenames is this: how do you remove all your configuration files and directories (that begin with a ".")? "rm -rf .*" will have very undesirable results. And yes, I have done this to myself -- luckily the system had nightly backups.
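To make the gotcha concrete, here is a minimal sketch (plain POSIX sh in a throwaway directory; modern GNU rm refuses to remove "." and "..", but relying on that is unwise) of the problem and one safe workaround:

```shell
# ".*" matches "." and ".." as well as real dotfiles, which is why
# "rm -rf .*" is so dangerous: old systems would recurse into the parent.
tmp=$(mktemp -d)
cd "$tmp"
touch .config .hidden visible

# Safer: expand the same glob, but skip "." and ".." explicitly.
for f in .*; do
    case "$f" in .|..) continue ;; esac
    rm -rf -- "$f"
done

remaining=$(ls -A)   # only "visible" should survive
cd /
rm -rf "$tmp"
```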
Posted Mar 25, 2009 15:32 UTC (Wed)
by drag (guest, #31333)
[Link] (2 responses)
Maybe if Linux folks keep identifying and fixing legacy Unix usability issues people will start referring to Linux as 'The way Unix should have been' or 'Unix done right'.
I am always really paranoid about file names in Linux when doing scripting or whatnot. Dealing with them is always an "oh god I hope I did the escaping right" sort of deal: I know I can have a script that I use for lots of stuff, but some night I may make a goobered-up filename that can end up destroying data unless I did things exactly right in my scripts.
Filenames with ~ in them, or < > or () or all sorts of odd things I make by mistake.
Posted Mar 25, 2009 16:20 UTC (Wed)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Mar 25, 2009 22:14 UTC (Wed)
by jordanb (guest, #45668)
[Link]
Posted Mar 25, 2009 15:42 UTC (Wed)
by gnb (subscriber, #5132)
[Link]
Posted Mar 25, 2009 18:43 UTC (Wed)
by danielthaler (guest, #24764)
[Link] (2 responses)
rm: cannot remove directory `.'
so something did prevent it from happening...
Posted Mar 25, 2009 21:43 UTC (Wed)
by mjthayer (guest, #39183)
[Link] (1 responses)
Posted Mar 25, 2009 22:46 UTC (Wed)
by nix (subscriber, #2304)
[Link]
GNU coreutils, like gnulib, is a goldmine of fantastic tricks (and evil
Posted Mar 25, 2009 22:15 UTC (Wed)
by csigler (subscriber, #1224)
[Link]
IIRC, you can do something like "rm -fr .[^.]*" -- at least this WFM in the command "ls -Fa .[^.]*"
Posted Mar 25, 2009 23:05 UTC (Wed)
by Jonno (subscriber, #49613)
[Link]
`rm -rf .??*` is a good start. It will miss configuration files with only a single character after the dot, though.
Posted Mar 26, 2009 1:32 UTC (Thu)
by bojan (subscriber, #14302)
[Link]
Posted Mar 26, 2009 2:34 UTC (Thu)
by k8to (guest, #15413)
[Link]
Personally I just walk the list and filter out . and ..
Yes, it sucks.
Posted Mar 26, 2009 4:58 UTC (Thu)
by shimei (guest, #54776)
[Link]
Posted Mar 27, 2009 1:28 UTC (Fri)
by no_treble (guest, #49534)
[Link]
rm -rf .[!.]*
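For the record, the globs proposed in this subthread each miss a different corner case; a quick sketch in a scratch directory (plain POSIX sh) shows which:

```shell
tmp=$(mktemp -d)
cd "$tmp"
touch .a .ab ..b   # one-char, two-char, and double-dot-prefixed names

# ".??*"  needs two or more characters after the dot: it misses ".a".
# ".[!.]*" needs a non-dot second character: it misses "..b".
# Neither ever matches "." or "..", and together they cover all three:
count=$(printf '%s\n' .[!.]* .??* | sort -u | grep -c .)

cd /
rm -rf "$tmp"
```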
Posted Mar 25, 2009 16:22 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (15 responses)
If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
Posted Mar 25, 2009 16:42 UTC (Wed)
by epa (subscriber, #39769)
[Link] (14 responses)
There is a certain old-school appeal in just being able to use the filesystem as a key-value store with no restrictions on what bytes can appear in the key. But it's spoiled a bit by the prohibition of NUL and / characters, and trivially you can adapt such code to base64-encode the key into a sanitized filename. It may look a bit uglier, but if only application-specific programs and the OS access the files anyway, that does not matter.
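The encoding suggested above can be sketched in a few lines of shell (GNU coreutils base64 assumed; note that plain base64 output may contain "/", which is illegal in a filename component, hence the tr into a base64url-style alphabet):

```shell
store=$(mktemp -d)

key='any/key: with spaces, a *, and ~'
value='hello'

# Encode the key into a filename-safe string ("/" and "+" swapped out).
fname=$(printf '%s' "$key" | base64 | tr -d '\n' | tr '/+' '_-')
printf '%s' "$value" > "$store/$fname"

# Round trip: decode the filename back into the original key.
recovered=$(printf '%s' "$fname" | tr '_-' '/+' | base64 -d)
stored=$(cat "$store/$fname")

rm -rf "$store"
```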
Posted Mar 25, 2009 17:46 UTC (Wed)
by nix (subscriber, #2304)
[Link] (9 responses)
(I could equally use a directory full of stuff here, but it too would need a name that's hard to type. I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
And if I've done it, I guarantee you that lots and lots of other people have done it too.
David's proposed constraints on filenames are constraints which can never be imposed by default, at the very least. The semantics of Unix filesystems have been fixed de facto for many years: nobody expects files with odd characters to work on FAT, but nobody expects a Unix system to use a FAT filesystem as a primary datastore either.
Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?
You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
Posted Mar 25, 2009 21:56 UTC (Wed)
by dwheeler (guest, #1216)
[Link] (8 responses)
A few thoughts based on nix's comments...
Such a key-value store will have trouble with "/" in the key, since it's the directory separator. So if you truly need arbitrary keys, you already have to do some encoding anyway - so why not encode to something more convenient? If you don't need arbitrary encoding, then let's find some reasonable limits that stop the worst of the bleeding.
Also, there's no need to have all those weird filenames merged with other stuff in the same directory; you could create a single directory with "." as the first character in the name, and create the key-value store in that subdirectory.
That's exactly my point.
Even in your case, filenames with \n are a pain. And let's say that a user runs a "find" that traverses your directory... if the filenames are troublesome (e.g., include \n or \t) you'll almost certainly cause the user grief, even if they had no idea that you implemented a keystore. And even if you don't want users (or their programs) going into these directories, people WILL need to do so, to do debugging.
"We've always done it that way" may be true, but that doesn't justify the status quo. The status quo is causing pain, for little gain. Let's fix it.
Lots of filesystems ALREADY mandate specific on-disk encodings; I believe all Windows and MacOS filesystems already specify them. The problem is that the system doesn't know how to map them to the userspace API. So, let's define the userspace API, so that people can actually do the mapping correctly.
As far as "forever" goes, the program "convmv" does mass file renames for encoding; you can use it to convert a given filesystem from whatever encoding you've been using to UTF-8 (problem solved). The distros are already moving this way.
As far as "no control characters" goes, there's no need to do anything locale-dependent; excluding 1-31 would be adequate, and I'd also exclude 127 to be complete for 7-bit ASCII (how do you print DEL in a GUI anyway?!?).
Control characters unique to other locales don't bite people the way these characters do.
I completely agree that this limitation cannot be applied everywhere. In fact, my article said, "I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems." But on some systems, I do know what shells are in use, and their metacharacters, and the system is never supposed to be creating filenames with metacharacters in the first place.
I'd like to be able to install a "special exclusion list", just like I can install SELinux today to create additional limitations on what this particular system can do.
That's a very interesting idea, I like it!
In fact, there's already a mechanism in the Linux kernel that might do this job: getfattr(1)/setfattr(1). If it were implemented this way, I'd suggest that by default directories would "prevent bad filenames" (e.g., control chars and leading "-"); you could then use "setfattr" on directories to permit badness. New directories could then inherit the state of their parent. I would make those "trusted extended attributes" - you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create such directories.
Posted Mar 26, 2009 3:07 UTC (Thu)
by drag (guest, #31333)
[Link]
And that is much more extreme than having a filesystem mount option to stop tabs and newlines being used in filenames.
It'll be future-proof too, as much as that matters. You don't make a whitelist of allowed characters, you make a blacklist of troublesome characters and allow everything else. If anyone creates more troublesome characters - which is very unlikely, and even then would be exceedingly rare - you can add them to the blacklist. Any new characters that get made, or any new encodings, will just be allowed to slide on through.
I mean if you have a future encoding scheme that conflicts with a previously established and well known encoding such as ascii, then it is just too dumb to be supported by anybody.
-----------------
Here is a challenge:
Somebody write me a script that will go and count all the uses of tabs, <, >, and newlines in their file names on their systems...
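One stab at that challenge (a sketch, GNU find assumed; -print0 keeps the offending names from corrupting the count, and the x-suffix trick works around command substitution stripping trailing newlines):

```shell
tab=$(printf '\t')
nl=$(printf '\nx'); nl=${nl%x}

count_bad_names() {
    find "$1" \( -name "*${tab}*" -o -name "*${nl}*" \
                 -o -name '*<*' -o -name '*>*' \) -print0 \
        | tr -cd '\0' | wc -c | tr -d ' '
}

# Quick demonstration on a scratch tree:
tmp=$(mktemp -d)
touch "$tmp/plain" "$tmp/has${tab}tab" "$tmp/has<lt" \
      "$tmp/has>gt" "$tmp/has${nl}newline"
count=$(count_bad_names "$tmp")
echo "suspicious filenames: $count"   # 4
rm -rf "$tmp"
```

Point it at / (or $HOME) to answer drag's question for a real system.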
Posted Mar 26, 2009 22:12 UTC (Thu)
by nix (subscriber, #2304)
[Link]
But I have seen a system in production use at Big Banks (first saw it yesterday, or first noticed it, probably thanks to this conversation) that uses the filesystem as a base-254 key-to-value store. It's gross but it's sometimes done.
But then, we know how competent Big Banks are. (Especially this one, did you but know who it was.)
I have no objection to making the things you propose options. What I object to is making them mandatory, because this would make some things impossible. (Strange things, but still.)
I'd say that setting this attribute flips a pathconf-viewable attribute as well, so that other POSIX-compliant systems can adopt the same approach and applications can portably query it without needing to implement/depend on all of the ACL machinery.
Posted Mar 28, 2009 15:36 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (5 responses)
NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that distinguishes NT's attitude to such things from Linux's is that NT's filenames are arbitrary sequences of non-zero 16-bit code units, where on Linux they're obviously 8-bit.
Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.
And the consequence is the same thing being lamented in this article - badly written Windows programs crash or do insane things when faced with filenames that don't look like the ones the poor third rate programmer who wrote the code was familiar with. In the absence of defensive programming this software also doesn't like leap years, or leap seconds, or files that are more than 2GB long, or... you could go on all day, badly written programs suck.
On encodings - I encourage you to use UTF-8. I encourage people with other encodings to migrate to UTF-8, but using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things. People who can't keep them separate are doing a bad job, whether handling filenames or displaying email.
Posted Mar 29, 2009 14:36 UTC (Sun)
by epa (subscriber, #39769)
[Link] (3 responses)
Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.
Posted Mar 30, 2009 10:55 UTC (Mon)
by nye (subscriber, #51576)
[Link] (2 responses)
Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.
Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.
(I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)
Posted Mar 30, 2009 15:13 UTC (Mon)
by foom (subscriber, #14868)
[Link] (1 responses)
You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for CreateFile.
However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it. (The SFU installer asks if you want to do this, but even SFU has no special dispensation.)
Posted Nov 15, 2009 0:06 UTC (Sun)
by yuhong (guest, #57183)
[Link]
Posted Nov 14, 2009 23:58 UTC (Sat)
by yuhong (guest, #57183)
[Link]
Posted Mar 25, 2009 17:57 UTC (Wed)
by jd (guest, #26381)
[Link]
IMHO, the different roles all speak to different problems and all have their limitations outside of the problems they're meant for. The first step in finding a solution is to define the problem, but a filesystem solves a very wide range of problems, making a definition less clear-cut.
Posted Mar 25, 2009 23:27 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (1 responses)
Posted Mar 26, 2009 14:44 UTC (Thu)
by dwheeler (guest, #1216)
[Link]
I want to be able to count on more than what the POSIX spec says; I want to be able to use the entire Unicode character set, minus the control chars and a few additional constraints to prevent lots of problems for the general-purpose user.
Posted Mar 26, 2009 13:38 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
A file inside this system is actually stored as a directory at the OS level, and it created filenames of the form <space><backspace>nnn.
I copied this, and found that Midnight Commander was great at managing the resulting files :-) It's done so that people can't tamper - corrupting one of the (many) OS-level files would do serious damage to the PI file.
Cheers,
Posted Mar 25, 2009 16:31 UTC (Wed)
by mgross (guest, #38112)
[Link] (6 responses)
Also, this seems like a pretty sensible idea, why hasn't it been implemented already?
Posted Mar 25, 2009 16:50 UTC (Wed)
by epa (subscriber, #39769)
[Link] (2 responses)
Some will argue that the answer is user education (teach your users not to use bad characters in filenames), and perhaps even a cron job you can run on your PDP-11 overnight to look for filenames containing these characters and send a message via local mail to the user responsible. Furthermore, if it was good enough for V7 UNIX, it's good enough for us now. (Note that in Plan 9, there are sensible restrictions on characters in filenames; but it's common for followers of a particular system or language to become rabidly conservative, even when the original designers of the system have moved on.)
In other words it is sheer inertia, and reluctance by any one Unix-like system to add such a feature when the others do not. You can bet that if Linux added a filename character check, it would immediately be branded 'broken' by many BSD or Solaris enthusiasts - not all, but certainly those that make the most noise online.
Posted Mar 26, 2009 2:19 UTC (Thu)
by dirtyepic (guest, #30178)
[Link] (1 responses)
Posted Mar 26, 2009 15:23 UTC (Thu)
by dwheeler (guest, #1216)
[Link]
"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." (George Bernard Shaw)
I'm well aware that this is different from the historical past. But that doesn't make past decisions correct for the present. So, let's chat about the pros and cons; I believe that the cons for "anything goes" now outweigh the pros.
Posted Mar 28, 2009 16:33 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
To actually make this work, in the kernel (where you're perf critical and this is all unwanted overhead that's costing everyone who uses your "improved" system) you need to absolutely, as a matter of "Linus will veto if you don't" policy:
* Validate every filename to check that it conforms. This has to be done either at mount time, or when syscalls interact with the filenames (e.g. directory reading, and opening files). As a network file system client the OS must either screen every filename going over the network, or else punt and rely on promises from the server (if available).
* When you find an invalid filename, you need to deal with it, and it's not clear what the kernel should or even could do. Perhaps the file should just not exist as far as userspace is concerned, and fsck would unlink it?
Meanwhile application developers get no benefit for many years because of compatibility considerations. It could be a decade before it makes any sense to write a program which assumes one of the restrictions, and that's if EVERY SINGLE OS fixes this tomorrow. Wheeler mistakenly believes this is a POSIX problem, but it isn't: the problem exists everywhere that filenames are treated as opaque, which in fact includes Windows (and I have my doubts about OS X, but its API documentation promises they aren't opaque, so app developers who rely on that promise would be entitled to scream blue murder when someone finds a way to get non-Unicode into an OS X filename...*)
Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless? Once you've done this, you'll have the right mindset to tackle initial hyphen, control characters and so on from the same angle, rather than screwing the poor kernel into doing your dirty work and making everybody (including those of us for whom opaque filenames are just dandy) pay.
* Something that should make you pause: OS X's approach to filenames as Unicode strings makes Unicode composition/decomposition into an OS ABI feature. It had been doing this for years before Unicode actually pledged to stop changing the decomposition rules (i.e. until that happened, new versions of OS X made previously legal filenames illegal and vice versa, with no warning...)
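In that spirit, a sketch of the quoting discipline that makes spaces harmless (plain POSIX sh, scratch directory):

```shell
tmp=$(mktemp -d)
touch "$tmp/two words" "$tmp/plain"

# Broken: unquoted command substitution word-splits "two words" in two.
set -- $(ls "$tmp")
unquoted_count=$#      # 3 arguments from 2 files

# Robust: glob expansion, with the loop variable always quoted.
n=0
for f in "$tmp"/*; do
    n=$((n + 1))       # "$f" holds the intact name, spaces and all
done
glob_count=$n          # 2

rm -rf "$tmp"
```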
Posted Mar 29, 2009 14:31 UTC (Sun)
by epa (subscriber, #39769)
[Link]
Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.)
I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.)
OS X is something of a special case because of case insensitivity. If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt.
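That byte-sequence check is cheap to sketch in userspace too (GNU iconv assumed: converting UTF-8 to UTF-8 fails on malformed input):

```shell
# Validity check: iconv exits non-zero if the input is not legal UTF-8.
is_utf8() { printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

good=$(printf 'caf\303\251')   # "café" as well-formed UTF-8
bad=$(printf 'caf\351')        # Latin-1 0xE9: malformed as UTF-8

is_utf8 "$good" && r1=valid || r1=invalid
is_utf8 "$bad"  && r2=valid || r2=invalid
```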
Posted Mar 31, 2009 5:00 UTC (Tue)
by njs (subscriber, #40338)
[Link]
Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs who want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.
(It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)
Posted Mar 25, 2009 16:54 UTC (Wed)
by dmarti (subscriber, #11625)
[Link]
Posted Mar 25, 2009 17:00 UTC (Wed)
by adobriyan (subscriber, #30858)
[Link]
Posted Mar 25, 2009 17:51 UTC (Wed)
by ajb (subscriber, #9694)
[Link]
- add a new inheritable process capability, 'BADFILENAMES', without which processes can't see or create files with bad names.
That way, most processes can run happily in the ignorance of any bad filenames. If you need to access one, you run the commands you need to access it with under the special shell.
Posted Mar 25, 2009 18:09 UTC (Wed)
by mrshiny (guest, #4266)
[Link] (23 responses)
Mr. Wheeler makes a mistake in the article as well. Windows has no problem with files starting with a dot. It's only Explorer and a handful of other tools that have problems. Otherwise Cygwin would be pretty annoying to use.
Overall, however, I like the idea of restricting certain things, especially the character encoding. The sooner the other encodings can die, the sooner I can be happy.
Posted Mar 25, 2009 19:52 UTC (Wed)
by emk (subscriber, #1128)
[Link] (3 responses)
If a filename is properly handled for spaces, doesn't it automatically work for all the other chars? Unfortunately, no. One example mentioned in the article is files with names like "-rf", which will appear at the start of any glob list. To deal with this, you generally need to add "--" before any globs, but different commands behave differently, and not all commands support "--".
Posted Mar 26, 2009 1:12 UTC (Thu)
by mrshiny (guest, #4266)
[Link] (1 responses)
Not that preventing files like '-rf' is a bad idea. I think it would prevent a number of mistakes.
Posted Mar 30, 2009 16:41 UTC (Mon)
by Hawke (guest, #6978)
[Link]
Posted Mar 26, 2009 15:38 UTC (Thu)
by dwheeler (guest, #1216)
[Link]
Actually, there is a general solution for the dash: whenever you glob in the current directory, stick "./" in front of the glob. So always use "cat ./*" instead of "cat *". I do mention that in my article.
Problem is, nobody does that. It's too easy to use "*", it's what all the documents say, and it's what all the users actually do. You have to train GUI programs to do this, too. So instead of constantly trying to get developers to do something "unnatural", let's change the system so the "obvious" way is always correct.
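The trick is easy to demonstrate in a scratch directory (a sketch; the file name "-n" is chosen because cat would otherwise parse it as its line-numbering flag):

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'hello\n' > '-n'   # redirection targets are never parsed as options

# "cat *" would expand to "cat -n" and misread the name as a flag;
# "cat ./*" expands to "cat ./-n", which is unambiguously a file.
out=$(cat ./*)

cd /
rm -rf "$tmp"
```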
Posted Mar 25, 2009 22:45 UTC (Wed)
by epa (subscriber, #39769)
[Link] (12 responses)
On the other hand, a kernel-level check for bad characters is simple to implement and obviously solves these problems at a stroke.
Posted Mar 26, 2009 1:16 UTC (Thu)
by mrshiny (guest, #4266)
[Link] (7 responses)
1. Prevent files that start with dash (technically not a shell problem)
The first item is more of an interaction between programs and the shell and not specifically a shell problem. If a program doesn't support -- then it can never be used securely.
The second item seems like an obvious step to take with no downside.
The third item is what I meant by fixing the shells: shells should make it braindead-easy to manipulate filenames without them turning into commands or other nonsense. Once a filename is loaded into a variable you shouldn't have to worry about characters in the name turning into shell commands. Once that's in place we can start fixing scripts. Maybe an environment variable can determine how that instance of the shell works: in secure mode or legacy mode.
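For what it's worth, the shell already has that property as long as every expansion is quoted; a sketch (the hostile-looking name is purely illustrative):

```shell
tmp=$(mktemp -d)
evil='$(echo pwned); rm -rf ~; `reboot`'
touch "$tmp/$evil"        # the file really has all of that in its name

# Quoted expansions pass the bytes straight to the syscall; none of the
# embedded command syntax is ever evaluated.
found=no
[ -f "$tmp/$evil" ] && found=yes
size=$(wc -c < "$tmp/$evil" | tr -d ' ')   # the file is empty

rm -rf "$tmp"
```

The danger is entirely in the unquoted uses, which is exactly what a "secure mode" would have to forbid.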
Posted Mar 26, 2009 14:45 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (6 responses)
Posted Mar 26, 2009 15:08 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (5 responses)
Posted Mar 26, 2009 19:49 UTC (Thu)
by dwheeler (guest, #1216)
[Link] (3 responses)
Yes, I already added the "shell could recognize null as separator". And you're right, adding an environment variable could help (though it could also backfire on older scripts!).
This particular example doesn't do quite what you think; it just passes to ls several values: "myfile;", "rm", "-rf", and "/", and you end up with some error messages and a listing of "/". But with more tweaking, you can definitely get some exploits out of this approach. Which is why removing the space character from IFS is a big help - then VAR would become a single parameter again.
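The IFS point can be sketched directly (POSIX sh; the x-suffix guards against command substitution stripping the trailing newline):

```shell
VAR='my file.txt'

set -- $VAR                 # default IFS (space, tab, newline)
with_space=$#               # 2 words

nl=$(printf '\nx'); nl=${nl%x}
tab=$(printf '\t')
IFS="$nl$tab"               # newline and tab only: no space
set -- $VAR
without_space=$#            # a single parameter again
unset IFS                   # restore default splitting behaviour
```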
Posted Mar 28, 2009 1:11 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
It was removed, but I can't remember why: some sort of compatibility
Posted Mar 31, 2009 7:47 UTC (Tue)
by mjthayer (guest, #39183)
[Link] (1 responses)
Posted Mar 31, 2009 19:28 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Apr 3, 2009 18:49 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 26, 2009 21:11 UTC (Thu)
by explodingferret (guest, #57530)
[Link] (3 responses)
1) Portable scripts (of a kind), init scripts, and build scripts. In all these cases the scripts need to have #!/bin/sh at the top, and contain just about every fix for every problem ever, including [ "x$var" = x ] and ${1:-"${@}"} and various other monstrosities.
In these scripts, the quotes around variables; ./ in front of filenames; IFS= for read; and filename=`foo; printf x`; filename="${filename%x}" crap will *always* have to be there. So no point trying to fix anything for those.
2) The other use is scripts that are used on either one system (personal scripts) or one "class" of system, like "only Debian GNU/Linux".
These scripts can use a particular shell like #!/bin/bash and assume the existence of -print0 and -printf to find and -d '' to read and all the other little conveniences which make a lot of the problem go away.
Well, other than newlines at the end of filenames. That's the only case that I refuse to take account of in my scripts, unless security issues might arise.
----
I'm not saying that I disagree with the ideas in this article (although I'd like to keep spaces and shell special characters in my filenames, actually). I'm just saying that as far as shell scripting is concerned, it may not actually help all that much. The main gain for me would be the security fixes and less typing in my interactive shell. Even though I'm pretty sure I don't have any newlines or control characters in any of my filenames, I just can't bring myself to write bad scripts, and that's kinda sad.
Posted Mar 28, 2009 1:18 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
I dictated zsh 4, simply because for this application C was far too
Posted Nov 15, 2009 1:06 UTC (Sun)
by yuhong (guest, #57183)
[Link] (1 responses)
Posted Nov 15, 2009 13:15 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Mar 25, 2009 23:09 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (4 responses)
Windows has no problem with files starting with a dot. Oddly enough, Windows will not allow the name of a directory to end in a dot. I discovered this when, back in my Windows days, I had to name an artist directory R.E.M without a final dot. Windows wouldn't allow me to put that trailing dot in the file name. Go figure. Linux doesn't have any issue with it (and since I've abandoned Windows on my home computers, I was able to rename the directory to include that dot).
Going off on a tangent: here are some files in my music directory which would make Mr. Wheeler cringe:
In a digital forensics class the professor had us searching through a filesystem that contained directories named "..." (minus quotes). Good times...
Posted Mar 25, 2009 23:42 UTC (Wed)
by dwheeler (guest, #1216)
[Link] (1 responses)
No cringe. I didn't see any control characters there, nor leading dashes. And you don't seem to require non-UTF-8. If we could get those done, the rest are gravy.
Posted Mar 26, 2009 0:07 UTC (Thu)
by pr1268 (guest, #24648)
[Link]
Wow, thanks for the reply! And thank you for the original article--I found myself nodding in agreement many times while reading it. Of course, even with your non-cringing approval, I certainly had lots of shell escaping to do with these files (and many others--my collection is approaching 10,000 audio files from almost 900 music CDs).
Posted Mar 26, 2009 1:04 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Mar 26, 2009 10:21 UTC (Thu)
by mjj29 (guest, #49653)
[Link]
Posted Mar 30, 2009 19:36 UTC (Mon)
by rickmoen (subscriber, #6943)
[Link]
You can pry my spaces from my filenames out of my cold dead fingers.
ObMenInBlack: "Your offer is acceptable."
(I remember having to write AppleScript to recurse through directories cleaning up files created
on network shares by MacOS-using munchkins who put space characters at the ends
of filenames, in order for them to become valid filenames when seen by MS-Windows-using
employees looking at the same network shares. The converse problem was files, from MS-Windows
users, with names containing colon, which is a reserved character in MacOS file namespace.
What a pain in the tochis.)
Rick Moen
Posted Mar 25, 2009 18:36 UTC (Wed)
by njs (subscriber, #40338)
[Link] (18 responses)
The section on Unicode-in-the-filesystem seemed quite incomplete. We know this can work, since the most widely used Unix *already* does it. OS X basically extends POSIX to say "all those char * pathnames you give me, those are UTF-8". However, there are a lot of complexities not mentioned here -- you need to worry about Unicode normalization (whether or not to allow different files to have names containing the same characters but with different bytestring representations), if there is any normalization then you need a new API to say "hey filesystem, what did you actually call that file I just opened?" (OS X has this, but it's very well hidden), and so on.
But these problems all exist now, they're just overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets. I really dislike many things about Apple, but in this case we could do worse than to sit down and steal (with appropriate modification) most of the stuff in http://developer.apple.com/technotes/tn/tn1150.html#Unico...
Maybe the ext4 folks could add unicode filenames as a mount option -- they haven't done anything controversial lately ;-).
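The normalization ambiguity is easy to reproduce from a shell, assuming a byte-preserving filesystem like ext4 (a sketch; HFS+ on OS X would instead normalize both spellings to a single name):

```shell
# Two spellings of "café": precomposed U+00E9 (NFC) vs
# 'e' plus combining acute U+0301 (NFD). A byte-preserving
# filesystem treats them as two distinct files.
cd "$(mktemp -d)"
touch "$(printf 'caf\303\251')"    # NFC form
touch "$(printf 'cafe\314\201')"   # NFD form
ls | wc -l                         # 2 on ext4; a normalizing FS would give 1
```

Both names render identically in most terminals, which is exactly why a "what did you actually call that file?" API becomes necessary once any normalization is in play.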
Posted Mar 25, 2009 20:57 UTC (Wed)
by clugstj (subscriber, #4020)
[Link] (16 responses)
Posted Mar 25, 2009 21:29 UTC (Wed)
by foom (subscriber, #14868)
[Link] (14 responses)
Posted Mar 25, 2009 22:36 UTC (Wed)
by nix (subscriber, #2304)
[Link] (11 responses)
Posted Mar 26, 2009 2:25 UTC (Thu)
by njs (subscriber, #40338)
[Link] (7 responses)
I'd be happy if we could just make a rule that filenames are valid UTF-8. Unicode normalization (composing characters and all that) is probably a good idea, but reasonable people could disagree. I'm just as happy without case normalization (though the arguments for it aren't entirely without merit, even if it can't be done perfectly). And *any* of these would be better than what we have now...
(The "so what did you call this file?" API is also useful if your system ever deals with case-insensitive or unicode-normalizing filesystems. Which Linux does, whether it becomes common for the root filesystem or not.)
Posted Mar 26, 2009 13:42 UTC (Thu)
by clugstj (subscriber, #4020)
[Link] (6 responses)
Why? How is the current condition so bad that we should run headlong into any of these "solutions" without knowing what the eventual outcome will be?
Posted Mar 26, 2009 17:46 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
Posted Mar 26, 2009 18:59 UTC (Thu)
by njs (subscriber, #40338)
[Link] (4 responses)
Posted Mar 29, 2009 21:27 UTC (Sun)
by clugstj (subscriber, #4020)
[Link] (1 responses)
Posted Mar 30, 2009 0:07 UTC (Mon)
by njs (subscriber, #40338)
[Link]
Posted Mar 29, 2009 21:30 UTC (Sun)
by clugstj (subscriber, #4020)
[Link] (1 responses)
Posted Mar 29, 2009 22:07 UTC (Sun)
by foom (subscriber, #14868)
[Link]
Posted Mar 26, 2009 14:23 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Mar 28, 2009 1:00 UTC (Sat)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 2, 2009 15:54 UTC (Thu)
by forthy (guest, #1525)
[Link]
It is actually not that bad. As a collating sequence, ß=ss (i.e. Mass
and Maß sort to the same bin). Except for Austrian telephone books, where
ß follows ss, but comes before st (though St. follows Sankt ;-).

However, there's a huge mess in the CJK part of UCS: short and long
forms of the same character (sometimes even a special variant for the
Japanese character). This should never have happened; the different forms
of the same character should be encoded in fonts, not in UCS. So far, not
even Mac OS X normalizes these characters, but it is obvious that a
mainland China file called "中国" and a Taiwan file called "中國" not only
mean the same, but also refer to the same word, and can be interchanged
at will (see for example the Chinese Wikipedia entry: the lemma is the
short form, the headline is the long form). And it is not easy to access
long and short forms with the usual input methods (mainland China:
Pinyin; Canton: Cantonese Pinyin (gives traditional characters, but you
need to know Cantonese); etc.).
Posted Mar 26, 2009 13:40 UTC (Thu)
by clugstj (subscriber, #4020)
[Link] (1 responses)
Posted Mar 26, 2009 19:52 UTC (Thu)
by leoc (guest, #39773)
[Link]
Posted Mar 25, 2009 21:39 UTC (Wed)
by njs (subscriber, #40338)
[Link]
I don't *like* either alternative much, but I doubt you're going to get everyone to switch back to ASCII, either. The problem isn't going away.
So... we can whine about how unfair it is that character systems are complicated and ignore the problem, or we can hold our noses and pick a least-bad option. The latter is probably more productive (though inertia suggests the former is most likely).
Posted Mar 26, 2009 16:23 UTC (Thu)
by dlau (guest, #4540)
[Link]
Posted Mar 25, 2009 19:50 UTC (Wed)
by szh (guest, #23558)
[Link] (1 responses)
Posted Mar 26, 2009 0:52 UTC (Thu)
by tbrownaw (guest, #45457)
[Link]
Our incoming ftp server at work once got a file named "C:\something_or_other.zip". Which was perfectly fine, until someone tried to open it in Windows using the samba share. It actually did show up, but with a completely garbled name.
I also accidentally generated a file where the name had a leading '\r' (carriage return). That was a lot of fun to track down and fix, it looked perfectly normal in 'ls' unless you noticed that it wasn't in proper alphabetical order and the rest of the row was one character out of alignment.
Posted Mar 25, 2009 20:50 UTC (Wed)
by clugstj (subscriber, #4020)
[Link] (12 responses)
Most of the things he wants to force on everyone are already available by convention. What is the benefit of disallowing other usages? If you want to imagine that all your filenames are UTF-8, go ahead; who's stopping you? The UNIX kernel contains as little policy as possible, which keeps it simpler than it would otherwise be. Yes, this is a double-edged sword, but doing the things he suggests is not an automatic win.
Posted Mar 25, 2009 23:34 UTC (Wed)
by dwheeler (guest, #1216)
[Link]
Sure, almost all files already follow these conventions. Except when they don't. And when they don't, millions of programs subtly stop working.
Everyone who does "find . blah | stuff" is writing bad code, because filenames can contain newlines deep in the directory tree. If we get rid of the nonsense, then it's easy to write correct programs; today, it takes herculean effort, and few people do so.
It's a double-edged sword, but users get cut by both sides.
I have yet to find a real use case for including control characters in filenames, for example, but plenty of reasons why it shouldn't ever happen.
Posted Mar 26, 2009 8:18 UTC (Thu)
by khim (subscriber, #9252)
[Link] (3 responses)
What's the benefit of all these ACLs, traditional Unix permissions,
capabilities, POSIX ACLs, memory protection, etc.? You can just use
conventions for that, and if someone violates a convention he or she can
be fired. That is your proposal in a nutshell, and it just does not work.
Posted Mar 26, 2009 13:38 UTC (Thu)
by clugstj (subscriber, #4020)
[Link] (2 responses)
Shell scripts are where this is the biggest problem. I do shell scripting for a living and don't see this issue as being anywhere near as big a problem as Mr. Wheeler thinks it is.
Also, I'm completely confused by your title. I suggest conventions and then you suggest, perhaps facetiously, FAT (which is not a convention, but enforcement of a very stupidly limited set of possible filenames).
Posted Mar 26, 2009 14:07 UTC (Thu)
by khim (subscriber, #9252)
[Link] (1 responses)
The number of correct scripts is not the important metric. The number of
bad scripts is. And it's MUCH higher than warranted: I've fixed tons of
scripts that failed on names with spaces, files with a dash in the first
position, etc. If such files were excluded from the start, life would be
much easier.

I propose FAT as a way to get rid of these pesky ACLs. It's one of the
few filesystems today without any form of access control (except the
read-only flag). We can extend it to allow all forms of filenames; it's
not hard. Or we can just run all programs with UID==0, which gives us the
same flexibility. Somehow no one wants to go in this direction, though.
Posted Mar 29, 2009 21:44 UTC (Sun)
by clugstj (subscriber, #4020)
[Link]
Wow, childish personal attacks. How droll.
"Number of correct scripts is not important metric. Number of bad scripts is"
I would think that the percentage of each would (possibly) be a useful metric. But what is the damage from these "bad scripts"? If you are writing shell scripts that MUST be absolutely bullet-proof against bad input, perhaps because they run setuid-root, then you are already making a much worse mistake than the possible bugs in the script.
Still don't understand the FAT reference. Sorry, maybe I'm just slow.
Posted Mar 26, 2009 9:51 UTC (Thu)
by epa (subscriber, #39769)
[Link] (6 responses)
Such bugs are made more insidious by the fact that 'by convention', they cannot ever be triggered. But for someone trying to make a working exploit, or widen a small security hole into a larger one, convention is no barrier.
If you want certainty that your code works correctly, 100% of the time, no ifs and no buts, rather than just waving your hands and hoping that everyone else in the world makes filenames that follow the same convention as you, then the assumptions you make need to be guaranteed true.
Posted Mar 27, 2009 19:23 UTC (Fri)
by drag (guest, #31333)
[Link] (5 responses)
Yeah.

All I want is for the system to cancel out malicious filename characters and things that obviously make little sense: stuff like control characters, newlines, etc.

As for the encoding stuff... meh. Filenames being treated as a string of bytes mostly makes sense, except in a few special cases.
Posted Mar 28, 2009 11:45 UTC (Sat)
by epa (subscriber, #39769)
[Link] (4 responses)
If Unix really did treat filenames as merely 'a string of bytes', with no implied character set or encoding, and displayed them to the user as a hex dump or something, then it would be truly encoding-agnostic and would have no difficulties with arbitrary byte values in filenames. Of course, it would also have been a total failure that nobody uses. For a filesystem to be useful, it needs to have some amount of meaning (or 'policy' if you will) attached to the filenames it stores. The question is how much: is the current situation of 'ASCII for characters below 128, and above that you're on your own' the best one?
Posted Mar 28, 2009 16:53 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (3 responses)
I'd be surprised if the /majority/ of programs other than shell scripts aren't like this. Even in the majority of GUI software, what's needed isn't a revision of the kernel API (in fact that will barely help) but only a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display. Such a function is nearly inevitable anyway - to deal with dozens of other issues unrelated to Wheeler's thesis. And such functions exist today (I can't say if they're bug free of course)
Posted Mar 29, 2009 14:43 UTC (Sun)
by epa (subscriber, #39769)
[Link] (2 responses)
Imagine removing the character encoding headers from the http protocol. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere.
Posted Mar 29, 2009 21:58 UTC (Sun)
by clugstj (subscriber, #4020)
[Link] (1 responses)
Posted Mar 29, 2009 22:37 UTC (Sun)
by epa (subscriber, #39769)
[Link]
Posted Mar 25, 2009 21:10 UTC (Wed)
by renox (guest, #23785)
[Link] (3 responses)
Posted Mar 25, 2009 21:53 UTC (Wed)
by alecs1 (guest, #46699)
[Link] (1 responses)
Keep them coming :)
Posted Mar 26, 2009 18:30 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
Posted Mar 25, 2009 23:37 UTC (Wed)
by dwheeler (guest, #1216)
[Link]
Perhaps, but even systems which have objects get burned. As noted earlier, the Python developers have had a hard time.... they've switched to Unicode as their main text representation, but Unix filenames aren't text... they are essentially binary blobs! If filenames were always UTF-8, there'd be no problems. Similarly, perl programs get trashed if filenames begin with '>'.
Posted Mar 25, 2009 21:11 UTC (Wed)
by proski (subscriber, #104)
[Link]
Posted Mar 26, 2009 2:17 UTC (Thu)
by k8to (guest, #15413)
[Link] (22 responses)
This proposal is a whole lot of complexity in kernel code and the API.
Posted Mar 26, 2009 2:22 UTC (Thu)
by k8to (guest, #15413)
[Link] (11 responses)
Sure, some find implementations don't have it. Fix them.
Posted Mar 26, 2009 2:29 UTC (Thu)
by k8to (guest, #15413)
[Link] (9 responses)
As evidence for my position, here are some real-world filenames that my software needed to create to correctly archive some digital music history of the personal computer as an instrument.
These files are handled by a combination of Python and shell scripts, and one piece of C code (wrapping a library which knew how to read certain binary formats). All of these pieces can handle newlines, tabs, spaces, control characters, leading dashes, and so on. I'm not really that smart. It wasn't much work.
If shell scripts are 5-second hack jobs, then they will always fail in some cases: strange filenames, permission problems, etc. If you take a few minutes to apply the correct safeguards, then things work fine.
Posted Mar 26, 2009 2:31 UTC (Thu)
by k8to (guest, #15413)
[Link]
Posted Mar 26, 2009 2:39 UTC (Thu)
by foom (subscriber, #14868)
[Link] (3 responses)
I find it very hard to believe that your software *needed* to create unintelligible filenames. And if it
Posted Mar 26, 2009 2:48 UTC (Thu)
by k8to (guest, #15413)
[Link] (2 responses)
Posted Mar 26, 2009 3:40 UTC (Thu)
by foom (subscriber, #14868)
[Link]
If, in some alternative universe, it was already impossible to create those filenames, I have little
doubt you could still have created working software which didn't require the impossible.
Sorry I come off as unreasonable to you. *hugs*
Posted Mar 26, 2009 8:41 UTC (Thu)
by khim (subscriber, #9252)
[Link]
I don't have a clue. And I don't need to know anything about your
project to know you are lying. Any project can be implemented with
exactly two filenames: "0" and "1". You'll need infinite depth of
directory structure to do so, true, but thankfully there is no practical
limit in Linux. Is it feasible? Probably not. Is it possible? Of
course. And if we start from the position that your software does
not need these filenames but your current design needs them,
suddenly you have a much weaker argument: you are reducing the complexity
of your software by increasing the complexity of everyone else's. Is it a
good trade-off? Maybe yes, maybe no. But it's a weak argument at best, no
matter what your project is and what it needs to do.
Posted Mar 26, 2009 10:01 UTC (Thu)
by epa (subscriber, #39769)
[Link] (1 responses)
I know this is a matter of taste, and merely trying to impose one person's tastes on everyone is not a reason to change the kernel. But on the other hand, the marginal extra disk space saving (ten bytes?) from being able to put arbitrary binary stuff in filenames without encoding does not outweigh the many good reasons that Wheeler gave for changing.
Posted Mar 26, 2009 13:54 UTC (Thu)
by clugstj (subscriber, #4020)
[Link]
Put quotes around it?
Posted Mar 26, 2009 10:27 UTC (Thu)
by mjj29 (guest, #49653)
[Link]
Posted Mar 26, 2009 21:22 UTC (Thu)
by explodingferret (guest, #57530)
[Link]
Are you able to make the source of your shell scripts available? I'm sure I can find something in them that is breakable. :-)
Posted Mar 26, 2009 8:30 UTC (Thu)
by khim (subscriber, #9252)
[Link]
And you cannot even claim that "we already solved this problem, so it's
old code vs new code": a lot of programs just don't work with the
current approach (especially scripts). You need to write and fix literally
millions of lines of code vs a few thousand in the kernel.

Sorry, but you are advocating the more complex solution while preaching
simplicity.
Posted Mar 26, 2009 2:49 UTC (Thu)
by njs (subscriber, #40338)
[Link] (1 responses)
(My opinion on this issue is strongly influenced by writing the filesystem interaction code for a VCS. Users in different locales may want to work on the same project, but they write the same filenames differently, and some charsets may not be able to even represent filenames created in other locales, and...)
There are arguments for the current system, but "simplicity" is really not one of them.
Posted Mar 27, 2009 0:46 UTC (Fri)
by nix (subscriber, #2304)
[Link]
(As in, I looked at XEmacs/MULE's code and my brain dribbled out of my ears.)
Posted Mar 26, 2009 5:34 UTC (Thu)
by flewellyn (subscriber, #5047)
[Link] (3 responses)
Posted Mar 26, 2009 13:49 UTC (Thu)
by clugstj (subscriber, #4020)
[Link] (2 responses)
Posted Mar 26, 2009 13:54 UTC (Thu)
by epa (subscriber, #39769)
[Link]
Posted Mar 26, 2009 16:02 UTC (Thu)
by flewellyn (subscriber, #5047)
[Link]
But a major point of Wheeler's argument is that existing programs, filesystems, and indeed operating systems already assume that these restrictions are the case, as a matter of convention, but do not necessarily do anything to ensure that they are enforced. Existing software already rejects or fails to properly handle filenames which would violate these conventions, and the vast majority of existing files are named according to these conventions; at the very least, filenames with leading dashes, tab or newline characters, or shell control characters are very rare, and probably accidental or malicious. So the entire point is that the changes required to existing software would be minimal, and existing software which could break on filenames that don't obey these restrictions when they're not enforced by the OS, would no longer have a problem.
Posted Mar 26, 2009 8:24 UTC (Thu)
by khim (subscriber, #9252)
[Link]
Oh so very true. It also removes a whole bunch of code from other places. Even more
important: it removes the need to write and fix a bunch of code.
Posted Mar 26, 2009 9:56 UTC (Thu)
by epa (subscriber, #39769)
[Link] (2 responses)
If you've ever tried such an exercise, you would not believe that allowing control characters in the middle of filenames and leaving userspace to deal with the resulting mess could ever be called 'simplicity'.
Wheeler's suggestion would greatly simplify a lot of code, or if you prefer, fix many hidden bugs and security holes in code that is currently buggy.
Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace. It is hardly adding significant complexity.
Posted Mar 28, 2009 17:03 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
Show me the money. Five lines, plus one per syscall. Not a lot of work to support such a broad and sweeping claim. Write those lines carefully, we wouldn't want you to be hand-waving and have missed 99.9% of the complexity of the problem...
Posted Mar 29, 2009 15:03 UTC (Sun)
by epa (subscriber, #39769)
[Link]
for (const char *c = filename; *c; c++)
    if ((unsigned char)*c < 32)
        return -EINVAL;
Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):
for (const char *c = filename; *c; c++)
    if ((unsigned char)*c < 32 || strchr("*?[]<>|;&`'\"", *c))
        return -EINVAL;
To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. The code would be something like
/* First pad the filename with 4 extra NUL bytes at the end. Then walk
   the string, checking each byte sequence against the table of valid
   UTF-8 patterns and rejecting anything malformed. */
For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.
Posted Mar 26, 2009 2:44 UTC (Thu)
by explodingferret (guest, #57530)
[Link] (1 responses)
Here are some problems I noticed in your article:
1) "These restrictions only apply to Windows - Linux, for example, allows
2) You state that changing IFS and banning newlines and tabs in filenames would make things like 'cat $file' safer, but you should also state that shell glob characters would also need to be removed (namely *?[]).
3) You state (or at least imply) that there is no way to reliably use filenames from find, but there is a POSIX compliant and known portable method:
find . -type f -exec somecommand {} \;
For xargs fans, on all but files with newlines, you can do
4) Your setting of IFS to a value of tab and newline is overly complicated. Simply use IFS=`printf \\n\\t`. It is only trailing newlines that are removed. If the different behaviour this causes with "$*" is not desired, one can set IFS=`printf \\t\\n\\t`. I know of no tool or POSIX restriction that says characters may not be repeated in IFS.
Otherwise great article! It really would be so nice to use line-separated commands in `` and not have to worry about things breaking. And although most of the thoughts expressed here are well known to me, the idea of getting the kernel to check the validity of UTF-8 filenames is fantastic!
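For the record, the point-4 setup looks like this in practice (a sketch; as point 2 notes, glob characters in names still require `set -f`):

```shell
IFS=`printf '\n\t'`   # split unquoted expansions only on newline and tab
set -f                # disable globbing so * ? [ ] in names survive
for f in `find . -type f`; do
    printf '%s\n' "$f"    # names containing spaces arrive intact
done
```

Only the trailing newline of the command substitution is stripped, so the tab survives as the final IFS character, exactly as the comment describes.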
Posted Mar 28, 2009 19:50 UTC (Sat)
by dwheeler (guest, #1216)
[Link]
Posted Mar 26, 2009 4:52 UTC (Thu)
by eru (subscriber, #2753)
[Link]
In general, the only shell metacharacters that could be banned without causing interoperability problems are those that are special also in That Other OS.
Posted Mar 26, 2009 9:56 UTC (Thu)
by ldo (guest, #40946)
[Link] (5 responses)
Posted Mar 27, 2009 0:56 UTC (Fri)
by nix (subscriber, #2304)
[Link] (4 responses)
And if you remove the prohibition on slashes, how do you distinguish
between a file called foo/bar and a file called bar in a subdirectory foo?
These limitations are there because the semantics of the filesystem itself
Posted Mar 29, 2009 10:30 UTC (Sun)
by ldo (guest, #40946)
[Link] (3 responses)
nix wrote: "Um, if you remove the prohibition on nulls, how do you end the filename? This isn't Pascal."

Nothing to do with Pascal. C is perfectly capable of dealing with arbitrary data bytes, otherwise large parts of both kernel and userland code wouldn't work.

"And if you remove the prohibition on slashes, how do you distinguish between a file called foo/bar and a file called bar in a subdirectory foo?"

Simple. The kernel-level filesystem calls will not take a full pathname. Instead, they will take a parent directory ID and the name of an item within that directory. Other OSes, like VMS and old MacOS, were doing this sort of thing decades ago. Full pathname parsing becomes a function of the userland runtime. The kernel no longer cares what the pathname separator, or even what the pathname syntax, might be.
Posted Mar 29, 2009 13:54 UTC (Sun)
by nix (subscriber, #2304)
[Link] (2 responses)
I'm sure users would love not being able to type in pathnames anymore,
Good luck getting anyone to do it.
Posted Mar 29, 2009 19:47 UTC (Sun)
by ldo (guest, #40946)
[Link] (1 responses)
nix wrote: "What you're describing is not POSIX anymore."

Nothing to do with POSIX. POSIX is a userland API; it doesn't dictate how the kernel should work.
Posted Mar 29, 2009 22:32 UTC (Sun)
by nix (subscriber, #2304)
[Link]
So whatever you're describing, userspace cannot any longer use standard
If you want VMS, you know where to find it.
Posted Mar 26, 2009 10:45 UTC (Thu)
by kerolasa (guest, #56089)
[Link]
http://en.wikipedia.org/wiki/Internationalized_domain_name
That would mean there are encoded and unencoded versions of filenames: two representations matching the same inode. The version you want to see would depend on an environment variable or perhaps a command-line option. To me this sounds like a libc hack, and the people making changes there are conservative (thank god they are; who would want an unstable libc anyway?). Even though the safe mode sounds like a good idea, I don't expect to see such a thing for the next couple of years. Well, we'll see.
Posted Mar 26, 2009 12:23 UTC (Thu)
by jpetso (subscriber, #36230)
[Link] (1 responses)
Why don't we instead fix the mechanism that transports strings in the bash?
I'm all for reasonable defaults and constraining unnecessary stuff, but
Posted Mar 26, 2009 12:26 UTC (Thu)
by jpetso (subscriber, #36230)
[Link]
Posted Mar 26, 2009 13:17 UTC (Thu)
by barryn (subscriber, #5996)
[Link] (10 responses)
I was originally going to argue that leading spaces are necessary since people still have data
# find / -name ' *'
Posted Mar 26, 2009 19:21 UTC (Thu)
by njs (subscriber, #40338)
[Link] (7 responses)
Are you sure? I've seen this in real-world uses too, but I thought that all the common systems were fixed to do human-style (non-ASCIIbetical) sorting a few years ago. I don't have any proprietary systems around to test, but I'll be *really* amused if the OS X Finder is missing this usability feature of GNU ls:
Posted Mar 26, 2009 21:23 UTC (Thu)
by foom (subscriber, #14868)
[Link] (3 responses)
Finder does however sort numbers like this, which GNU ls does not: 1 2 10
I don't really see what the point of the "human-style" sorting is if it can't even sort numbers. That
seems kind of basic to me.
Posted Mar 28, 2009 1:21 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
(By default, despite comments elsewhere in this thread, ls sorts
ASCIIbetically, so " 2" comes before "1".)
Posted Mar 28, 2009 1:57 UTC (Sat)
by foom (subscriber, #14868)
[Link] (1 responses)
Huh, never knew that, interesting! Never would have found that from the man page, which says
"-v sort by version". That seems a remarkably poor description of what it actually does.
> (By default, despite comments elsewhere in this thread, ls sorts
ASCIIbetically, so " 2" comes before "1".)
Well, not exactly: GNU ls has a default sort which depends on the locale's collation settings, and
most systems default to a locale like en_US.UTF-8, so most people have it sorting in a
case/accent-insensitive manner by default on their systems.
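The difference is easy to see with GNU coreutils (a sketch):

```shell
cd "$(mktemp -d)"
touch 1 2 10
LC_ALL=C ls    # byte order: 1 10 2
ls -v          # GNU "version sort": 1 2 10
```

In byte order "10" sorts before "2" because '1' < '2'; `-v` compares embedded digit runs numerically, which is the Finder-style behavior foom is asking for.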
Posted Mar 28, 2009 20:36 UTC (Sat)
by nix (subscriber, #2304)
[Link]
(And you're right on the collation sort thing: I spoke carelessly.)
Posted Mar 27, 2009 4:41 UTC (Fri)
by barryn (subscriber, #5996)
[Link] (2 responses)
Posted Mar 27, 2009 15:45 UTC (Fri)
by foom (subscriber, #14868)
[Link]
Well, OSX's "ls" is actually just doing a traditional strcmp sort, not anything fancy (note that it puts
all uppercase characters before all lowercase).
But the Finder's sort routine is fancy. It seems to be a sort order based on Unicode TR10.
Posted Nov 15, 2009 0:32 UTC (Sun)
by yuhong (guest, #57183)
[Link]
Posted Mar 27, 2009 5:36 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
The way I see it, if a program can correctly work with filenames containing spaces, it can work with a filename containing leading spaces.
It's most important to eliminate newlines and control characters in filenames. The second most important consideration is specifying UTF-8 as the preferred filename encoding. Let's not get caught up in all sorts of other wishes that will just encourage endless debate and prevent these very important changes from getting made at all.
Posted Mar 28, 2009 9:21 UTC (Sat)
by explodingferret (guest, #57530)
[Link]
perl also has problems with leading spaces in filenames, unless you use the right kind of open command (perldoc -f open).
Posted Mar 26, 2009 14:29 UTC (Thu)
by kenjennings (guest, #57559)
[Link] (2 responses)
Posted Mar 28, 2009 1:01 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Mar 28, 2009 1:06 UTC (Sat)
by pr1268 (guest, #24648)
[Link]
People should not be using filenames as data storage. How about metadata storage? In my PC troubleshooting days, I came across a Windows XP box with a folder of videos of adult content whose file names were lengthy and explicit descriptions of the activities portrayed in the videos. Just reading the directory listing alone conjured up many vivid and disturbing thoughts. Fortunately this wasn't Windows Vista—its Explorer even creates video thumbnails. :-o
Posted Mar 26, 2009 18:37 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
(For the record, I support all the proposed restrictions on filenames except for a ban on shell meta-characters.)
Posted Mar 28, 2009 22:21 UTC (Sat)
by man_ls (guest, #15091)
[Link]
Bojan has only posted once, and his message contains the words "not sure". I would say that this debate attracts a different subset of (opinionated) people.
Posted Mar 26, 2009 19:34 UTC (Thu)
by az (guest, #46701)
[Link] (1 responses)
Filenames need to be usable for describing the contents of files. You sure as hell don't need newlines and tabs for that, but you certainly should be able to use the same punctuation you can use in a sentence - in English, that's !@#$%&()-:;"',.? but in other languages more characters may be needed (but they would be no different from any other non-ascii utf-8 character). I think the requirement that a filename can't start or end with certain characters is acceptable - you don't expect a sentence to start with most of them - but inside the string being forbidden from using them would be very constraining.
Posted Mar 26, 2009 19:36 UTC (Thu)
by az (guest, #46701)
[Link]
Posted Mar 26, 2009 23:04 UTC (Thu)
by zooko (guest, #2589)
[Link] (5 responses)
"When reading in a filename, what we really want is to get the unicode of that filename without risk of corruption due to false decoding. ... Unfortunately this isn't possible except on Windows and Macintosh."
http://allmydata.org/pipermail/tahoe-dev/2009-March/00137...
Hopefully Linux folks will follow D. Wheeler's lead on this and make it so that some future version of Tahoe can reliably get filenames from Linux just as it currently can from Windows and Mac OS X.
Posted Mar 26, 2009 23:35 UTC (Thu)
by zooko (guest, #2589)
[Link] (4 responses)
Posted Mar 28, 2009 16:00 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (3 responses)
There, I said it.
It really doesn't exist. You, up in Win32 land, are forbidden from creating certain filenames, but everybody else running on the same NT kernel and sharing a filesystem with you is allowed to continue doing as they please, and so the APIs you're using /explicitly/ don't promise what you're relying on.
The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too.
In both cases you will /usually/ not see sequences that are crazy and undisplayable or not legal in some (non-filename) context, but you might, and when you do the OS vendor will say "I never promised otherwise". So you need defensive coding.
Posted Mar 28, 2009 16:41 UTC (Sat)
by zooko (guest, #2589)
[Link] (2 responses)
"The filenames you get from NT will be sequences of 16-bit code units. They might be Unicode. The filenames you get from Linux will be sequences of 8-bit code units. They might be Unicode (in this case UTF-8) too."
I don't think this is true. The bytes in the filenames in NT are defined to be UTF-16 encodings of characters. The bytes in the filenames in Mac are defined to be UTF-8 encodings. The bytes in the filenames in Linux are not defined to be any particular encoding. It isn't just a definitional issue -- the result is that reading a filename from the Windows or Mac filesystem can't cause you to lose information -- the filename you get in your application is guaranteed to be the same as the filename that is stored in the filesystem. On the other hand, when you read a filename from the filesystem in Linux, then you need to decide how to attempt to decode it, and there is no way to guarantee that your attempt won't corrupt the filename.
Please correct me if I'm wrong, because I'm intending to make the Tahoe p2p disk-sharing app depend on this guarantee from Windows (and from Mac), and to make it painfully work around the lack of this guarantee in Linux.
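The round-trip gap zooko describes can be made concrete in Python, whose os module exposes both views of a Linux directory. This is a minimal sketch, not Tahoe's actual code, and the filename bytes are invented for illustration: asking for filenames as bytes is lossless, while the str view survives only thanks to the surrogateescape handler of PEP 383.

```python
import os
import tempfile

# Sketch of the Linux situation described above: a filename is just a
# byte string, so only the bytes view round-trips with certainty.
# The name below is invented; 0xE9 is Latin-1 'e acute' and is not
# valid UTF-8.
with tempfile.TemporaryDirectory() as d:
    raw = b'caf\xe9'
    open(os.path.join(os.fsencode(d), raw), 'w').close()

    # Asking for bytes gives back exactly what is stored:
    assert raw in os.listdir(os.fsencode(d))

    # Asking for str forces a decode; Python survives via the
    # surrogateescape handler (PEP 383), which keeps the name
    # re-encodable even though it is not a clean Unicode string:
    [text] = os.listdir(d)
    assert os.fsencode(text) == raw
```

On Windows and Mac OS X the str view is the primary one; on Linux, as the thread notes, any decode is a guess.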
Posted Mar 28, 2009 17:45 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link]
* I am saying that you can't guarantee that the filenames Windows gives you are all legal UTF-16 Unicode strings. Windows makes no such promise. Non-Win32 programs (including Win32 programs which also use native low-level APIs) may create files which don't obey the convention, and filenames on disk or from a network filesystem are not checked to see if they are valid UTF-16.
* I am NOT saying that there are people running Windows whose filenames are all in SJIS or ISO-8859-8 or even Windows codepage 1252. That would be silly because those encodings (and indeed practically all legacy encodings) are 8-bit and all filenames in Windows are 16-bit. When a Windows filename "means something" at all, the meaning will be encoded as UTF-16, or perhaps if you're really unlucky, UCS-2.
So if your problem is "People keep running my program with crazy locale settings and legacy encodings of filenames" well you have my sympathy, and yes you will need to handle this for Linux (even if only by writing a FAQ entry telling them to switch to UTF-8) and might get away without on Windows.
But if the problem is "My program blindly assumes filenames are legal Unicode strings" then you're in a bad way, stop doing that because it's a bug at least on Linux and Windows, and IMO most likely on Mac OS X too (though their documentation claims otherwise).
Posted Mar 28, 2009 18:54 UTC (Sat)
by foom (subscriber, #14868)
[Link]
That's not actually true. The Windows APIs take arrays of 16-bit "things". Those are supposed to be
a) The set of invalid sequences in UTF-16 is a lot smaller than in UTF-8.
Posted Mar 27, 2009 7:10 UTC (Fri)
by janpla (guest, #11093)
[Link]
The important thing in any theoretical framework is that it is orthogonal and logically complete; because then all it takes is clarity of mind to figure out the correct way to do things. A basic, and in my opinion very sound, principle in UNIX is that the system provides functionality, not policy. This means that unfortunately you can potentially do incredibly stupid things, but it also means that you are not excluded from doing incredibly clever things either.
UNIX is an operating system for adults who take responsibility and at least try to think before they jump. If you want a Fisher-Price interface where nothing can harm you, there are other options available.
Posted Mar 27, 2009 22:56 UTC (Fri)
by spitzak (guest, #4593)
[Link] (7 responses)
An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.
The problem is the stupid Python guys who believe in magic fairy land where all UTF-8 is valid. This is also causing havoc with using Python3 for URLs and HTML. No, I'm sorry, if a file contains UTF-8, it is going to have invalid sequences. They need to get their heads out of their *** and do something so that invalid UTF-8 is preserved ALL THE TIME and never throws an exception, unless you specifically call "throw_exception_if_not_valid_utf8()".
Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution. This substitution does not really fix things because it does not do a round trip clean conversion. Supporting round-trip means your system cannot name invalid UTF-16 file names, and if you think those don't exist you are really living in a fantasy world!
I think therefore the escape character can easily be the UTF-8 encoding of 0xDCxx. This will not conflict with the above because all the escaped characters do not have the high bit set. This will survive a translation to UTF-16 and thus provides a way to put the exact same filenames on Windows UTF-16 filesystems.
His proposed rules for disallowed bytes seem pretty reasonable, though I would not disallow any printing characters in the interior of the filename; backslash escaping works pretty well in there.
Posted Mar 28, 2009 3:40 UTC (Sat)
by njs (subscriber, #40338)
[Link] (5 responses)
Posted Mar 30, 2009 16:08 UTC (Mon)
by spitzak (guest, #4593)
[Link] (4 responses)
The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoder's error; a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ASCII characters, which are unchanged and thus cannot cause a security hole).
The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.
Posted Mar 31, 2009 4:49 UTC (Tue)
by njs (subscriber, #40338)
[Link] (3 responses)
So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".
> The first two references are about programs failing to recognize overlong encodings as being invalid.
Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.
> But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).
I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.
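The overlong-encoding check being argued about is easy to demonstrate. b'\xc0\xaf' is an over-long two-byte spelling of '/' (U+002F); a conforming UTF-8 decoder must reject it, and the security bugs cited above came from decoders that skipped exactly this check, letting '/' sneak past filters that only looked for the byte 0x2F. A quick sketch:

```python
# b'\xc0\xaf' is an over-long two-byte encoding of '/' (U+002F).
# A strict, conforming decoder (like Python's) rejects it:
overlong = b'\xc0\xaf'
try:
    overlong.decode('utf-8')
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# The canonical one-byte spelling decodes normally:
assert b'/'.decode('utf-8') == '/'
```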
> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.
Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.
Posted Mar 31, 2009 17:59 UTC (Tue)
by spitzak (guest, #4593)
[Link] (2 responses)
An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!
I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.
I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.
Posted Apr 1, 2009 5:12 UTC (Wed)
by njs (subscriber, #40338)
[Link] (1 responses)
Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".
> An overlong encoding consists of a leading byte with the high bit set. This is an error.
All characters with codepoint >= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.
(You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)
Posted Apr 1, 2009 16:38 UTC (Wed)
by spitzak (guest, #4593)
[Link]
Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed denial of service. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.
And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.
Posted Apr 15, 2009 10:38 UTC (Wed)
by epa (subscriber, #39769)
[Link]
And indeed, the Python developers are living in a magic fairy land where filenames are sanely encoded and are always human-readable text; but wouldn't it be better to change things so that this situation is no longer wishful thinking, but part of the ordinary things userspace can rely on? That is what Wheeler is proposing.
Posted Mar 28, 2009 11:04 UTC (Sat)
by magnus (subscriber, #34778)
[Link] (1 responses)
There ought to be a different way of passing file references between programs, and from programs to the kernel, so that conversion from text to file reference is only ever needed for hand-written file names.
Posted Mar 31, 2009 5:14 UTC (Tue)
by njs (subscriber, #40338)
[Link]
At last, a hope of progress
Wheeler: Fixing Unix/Linux/POSIX Filenames
Maybe if Linux folks keep identifying and fixing legacy Unix usability issues, people will start referring to Linux as 'the way Unix should have been' or 'Unix done right'.
I thought that was Plan 9...
having named files in a directory that are actually links to the directory
and its parent necessarily gives rise to such a hazard unless you put
special-case magic in globbing to exclude them from wildcard matches, in
which case it doesn't really matter what names are picked for them as long as
they're standard. Specifying a wildcard that doesn't catch them is no harder
with the current names; the problem is remembering to do it.
rm: cannot remove directory `..'
of mostly sanity checks and fantastically elaborate code for portably and
securely recursing down directory trees, even those longer than PATH_MAX.
ones that it's best to, ahem, admire from a distance: e.g. the less said
about NEED_REWIND and the need for rm running everywhere to work around a
MacOS X bug, the better).
> with a ".")? "rm -rf .*" will have very undesirable results.
> (that begin with a ".")?
one or two characters after the dot, but I've not found any program using
such yet...
Some of my programs use such "bad" filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them.
Can you give an example?
If you are truly concerned about portability, then work on the problem which arises because Microsoft Windows [FAT and NTFS] allows a filename consisting of a US customary calendar date, i.e. "03/25/09" as an eight-character filename.
It's also possible for an iso9660 CD-ROM to have filenames containing the / character, or at least, I possess such a disc. This shows that in general there is a need for Linux to sanitize filenames coming from alien filesystems.
I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell (inspired by shoop but not derived from it). dot-prepended names are used to signify private fields, and dash-prepended ones, *specifically because they are so hard to use* and thus unlikely to be desirable field names, are used by the inside of the object model as field metadata:
(I pondered a \n-prepended filename because it's even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging. There's no danger of user code interpreting these names as options, because user code accesses files in this directory only via a shell-function API.)
The semantics of Unix filesystems have been fixed de facto for many years...
Hardwired filename encodings are a good idea only if you can guarantee that this encoding has been the standard for the lifetime of the filesystem. You can't assume that for any existing filesystem: thus you have to decide what to do if filenames are not representable in the encoding scheme chosen. This also conflicts with 'no control characters' in that a good bunch of Unicode characters >127 can be considered 'control characters' of a sort, and there's no guarantee that more won't be added. How to exclude control characters which may be added in the future?
You also can't sensibly exclude shell metacharacters, because you don't know what they are, because they're shell-dependent, and some shells (like zsh) have globbing schemes so complex that ruling out all filenames that might be interpretable as a glob is wholly impractical.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
Wheeler: Fixing Unix/Linux/POSIX Filenames
David, thanks for responding.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Such a key-value storage will have trouble with "/" in the key, since it's
the directory separator. So if you truly need arbitrary keys, you already
have to do some encoding anyway - so why not encode to something more
convenient? If you don't need arbitrary encoding, then let's find some
reasonable limits that stop the worst of the bleeding. Also, there's no
need to have all those weird filenames merged with other stuff in the same
directory; you could create a single directory with "." as the first
character in the name, and create the key-value store in that
subdirectory.
I claim mental block: this solution became obvious to me a day or so back.
(Rather, since I already use . as a metacharacter to mean 'private',
use .. to mean 'extra-private: metadata'. Yes, this too is bizarre, but at
least it's not dash-prepended.)
As far as "forever" goes, the program "convmv" does mass file renames for
encoding; you can use it to convert a given filesystem from whatever
encoding you've been using to UTF-8 (problem solved).
Yes, but this only works if you can mandate a no-encoding transparent view
of filenames! As soon as you start to automatically encode them, this sort
of transcoding is impossible.
In fact, there's already a mechanism in the Linux kernel that might do
this job: getfattr(1)/setfattr(1). If it were implemented this
way, I'd suggest that by default directories would "prevent bad filenames"
(e.g., control chars and leading "-"); you could then use "setfattr" on
directories to permit badness. New directories could then inherit the
state of their parent. I would make those "trusted extended attributes" -
you'd have to be CAP_SYS_ADMIN (typically superuser) to be able to create
such directories.
It depends how harsh the limits are. I'd say that 'no control characters'
is certainly reasonable to have only the superuser lift. Perhaps a less
harsh constraint to impose is that regular users cannot set this attribute
on directories readable by 'other', and that chmodding a directory after
the fact strips this attribute off it. Now users cannot dump landmines in
that directory for users outside their group (root is assumed to know what
he's doing).
NT (Windows kernel) doesn't care about filenames any more than Linux
NT (the kernel API in Windows NT, 2000, XP and etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit.
Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun.
using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.
Indeed. Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers.
>> Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?
> Yes. This is what the POSIX subsystems for NT do
If that is done, the only processing done on the filename before CreateFile
calls NtCreateFile with the name is that \\.\ is replaced with \??\, which is
an alias of \DosDevices\.
Yep, NT has supported both files and disks larger than 2GB since the first
version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of
the GetDiskFreeSpace and SetFilePointer APIs (look them up using MSDN or
Google), both of which have existed since NT 3.1. Which one was so much more
error-prone that the versions of Windows released in 1996 had to cap the
result to 2GB, even though older versions of NT supported returning more than
2GB using it, and why?
I never proposed radix-65. Radix-65 (26 upper case, 26 lower case, 10 digits, dot, hyphen, underscore) is what the POSIX standard ALREADY says is all you can depend on; nothing else is portable by that spec.
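For reference, that 65-character set is the POSIX "portable filename character set" (letters, digits, period, underscore, hyphen), with the additional guidance that hyphen should not be the first character. A quick sketch of a conformance check (the regex is mine, not text from the standard):

```python
import re
import string

# 26 + 26 letters plus 10 digits plus '.', '_', '-' is 65 characters,
# hence "radix-65":
assert len(string.ascii_letters + string.digits + '._-') == 65

# Illustrative portability check (the pattern is an assumption, not
# standard wording); the first character class excludes '-':
PORTABLE = re.compile(r'[A-Za-z0-9._][A-Za-z0-9._-]*\Z')

assert PORTABLE.match('my_file-1.txt')
assert not PORTABLE.match('-rf')            # leading hyphen
assert not PORTABLE.match('caf\xe9.txt')    # non-ASCII byte
```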
Wol
Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place).
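A userspace sketch of the check being proposed (the function name and exact policy are illustrative; a kernel version would operate on the raw bytes at path-lookup time):

```python
# Illustrative version of the validation described above: accept a
# candidate name only if it is well-formed UTF-8 and free of control
# characters. The name is_sane_filename and the policy details are
# assumptions, not part of any standard.
def is_sane_filename(raw: bytes) -> bool:
    if not raw or raw in (b'.', b'..'):
        return False
    try:
        text = raw.decode('utf-8')       # must be valid UTF-8
    except UnicodeDecodeError:
        return False
    if any(c < ' ' or c == '\x7f' for c in text):
        return False                     # no control characters
    return '/' not in text               # no path separator

assert is_sane_filename('r\xe9sum\xe9.txt'.encode('utf-8'))
assert not is_sane_filename(b'evil\nname')   # embedded newline
assert not is_sane_filename(b'caf\xe9')      # bare Latin-1 byte
```

As the comment says, this is cheap next to the cost of the open() or creat() it would guard.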
Meanwhile application developers get no benefit for many years because of compatibility considerations.
Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code.
Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless?
This is true in principle but in thirty years of Unix, essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem.
Hey, cool! A use for TALPA!
TALPA?
- add a command 'access_bad_filenames' which creates a shell with the capability.
- /bin/ls also needs the capability, but should not display bad filenames
unless an additional option is passed.
If not, it should be easy enough to fix the SHELLS in this case.
Three decades of unhappy experience says otherwise. Nobody has a reasonable proposal to fix all the shells, all the scripting languages and all the user applications so that they don't make unsafe assumptions about filenames (e.g. assuming a filename can never begin with - or never contain the \n character).
2. Prevent files that contain control characters (newline included)
3. Make the shells easy to use in the face of filenames with spaces, semi-colons, colons, quotes, punctuation, etc.
[The shell] could also recognise the null character as an argument separator as in 'find -print0'. It could even set some environment variable to tell tools like find that this is supported so that they can use it by default when not outputting to the console.
While on that subject, the shell could enforce that substitutions that resolve to the arguments for other commands are not allowed to spill over (e.g. VAR='myfile; rm -rf /'; ls $VAR).
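The spill-over in that example comes from the shell re-parsing the substituted text. Languages that can pass an argv directly never re-parse it; when a shell command string is unavoidable, quoting makes the substitution inert. A Python sketch (no command is actually run here):

```python
import shlex

# The hazardous value from the example above:
evil = 'myfile; rm -rf /'

# Passing an argv list keeps the value as a single argument, with no
# shell around to re-parse it:
argv = ['ls', '--', evil]

# When a command *string* is unavoidable, quoting neutralizes the
# metacharacters; splitting it back yields the intended argv:
cmd = 'ls -- ' + shlex.quote(evil)
assert shlex.split(cmd) == argv
```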
arguments began and ended at one point.
problem?
projects are public, after all.
It could also recognise the null character as an argument
separator as in 'find -print0'.
A few weeks ago I wanted to process my .ogg files which contain all
kinds of characters that are treated as meta-characters by the shell
or other programs I use in shell scripts. I eventually ended up
writing a new shell dumbsh
that uses NUL as argument separator, and feeding it from find, with
some intermediate processing in awk (which is quite flexible about
meta-characters).
an 'appliance' (for this purpose, a set of software which is the raison
d'etre of the hardware for which it is bought: this could be a tiny
embedded system or a giant bank database or simulation box). In this case,
they can dictate whatever shell they damn well like.
unpleasant, ksh was too buggy (thanks, Linux, for pdksh, with its broken
propagation of variables out of loops-with-redirection), and there was no
hope of getting the clients' systems people to install Perl: but they were
perfectly happy to install a recent zsh: fewer dependencies and no scary
modules (well, actually zsh *does* have a module system but they didn't
realise that!)
propagation of variables out of loops-with-redirection)"
Was ksh93 tried?
At the time it wasn't free enough either.
dot files in Windows and
looking at an attacked system's disk image (for fun, I have no life).
mrshiny wrote:
rick@linuxmafia.com
u'\N{latin small letter a with acute}' and u'a\N{combining acute accent}' refer to the same file. Also, more obviously, 'a' and 'A'.
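Those two spellings are distinct code-point sequences that Unicode normalization identifies, which is exactly why a normalizing filesystem treats them as one name. A quick sketch:

```python
import unicodedata

# Precomposed a-acute vs. 'a' followed by a combining acute accent:
pre = '\N{LATIN SMALL LETTER A WITH ACUTE}'     # U+00E1
dec = 'a\N{COMBINING ACUTE ACCENT}'             # U+0061 U+0301

assert pre != dec                               # different code points
assert unicodedata.normalize('NFC', dec) == pre # same text after NFC
assert unicodedata.normalize('NFD', pre) == dec # ...and after NFD
```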
right? Case-insensitivity in filesystems is thus an astoundingly awful
idea.
We're talking about the possible outcomes. You're telling us we shouldn't discuss the possible problems and solutions because we don't know the problems yet? That's bunk.
UTF-8 (and even the change to be case insensitive!) didn't bother most programs.
downcasing is *context-dependent* and to a certain extent a matter of
controversy and thus taste (this wasn't always true, but successive waves
of largely-failed spelling reforms have introduced a nice steaming heap of
uncertainty into this part of the written language).
Conventions are great! Let's go back to FAT!
Most of the things he wants to force on everyone are already
available by convention. What is the benefit of disallowing other
usages?
Wow, it does not work? Nope. Apparently UNIX is completely broken.
Nope. UNIX is not broken. Your head, on the other hand, is. And ACLs are so complicated and a drain on performance as to be nearly useless - which is why they are not used much.
Traditional Unix permissions are used on most systems - and they ARE ACLs too. They are quite limited but often adequate - that's why other forms are not used much. Still, they are deficient in many situations and other forms are used more and more.
Shell scripts are where this is the biggest problem. I do shell
scripting for a living and don't see this issue as being anywhere near as
big a problem as Mr. Wheeler thinks it is.
Also, I'm completely confused by your title. I suggest
conventions and then you suggest, perhaps facetiously, FAT (which is not a
convention, but enforcement of a very stupidly limited set of possible
filenames).
'By convention' filenames do not contain control characters. The problem is that you cannot rely on convention when writing robust, secure software. Either you put in endless sanity checks which cruft up your code and are liable to be forgotten, or you end up with subtle bugs that are tickled by the existence of files called '>foo' or '|/bin/sh' or countless other variations.
If you want to imagine that all your filenames are UTF-8, go ahead, who's stopping you!
You could equally well say that disk quotas are not needed; if you want to limit yourself to use 100 megabytes of space, who's stopping you? Indeed what is the point of file permissions - if you want to pretend that all your files are read-only, who's stopping you? And why should the kernel forbid hard links to directories - surely it should be up to the user to decide whether their filesystem is a tree or a general DAG, and the kernel should not enforce this policy.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display
Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever.
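For concreteness, here is the ambiguity in bytes. The character 'é' is the single byte 0xE9 in Latin-1 but the pair 0xC3 0xA9 in UTF-8; a display function handed raw filename bytes has no way to know which was meant. A sketch assuming iconv is installed:

```shell
#!/bin/sh
# Same character, two different byte sequences depending on encoding.
printf '\351' | od -An -tx1                                  # Latin-1 'é': e9
printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1   # re-encoded: c3 a9
printf '\303\251' | od -An -tx1                              # native UTF-8 'é': c3 a9
```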
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.
Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible in all cases to return the correct filename to the user, matching the one the user originally chose. If you pick a known encoding everywhere (UTF-8 being the obvious choice), then the problem goes away.
This doesn't represent a security problem.
Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate from the issue of how to encode characters that don't fit in ASCII.
a filename issue or a shell issue?
a filename issue or a shell issue?
accelerated indirect rendering and combinations
kernel mode setting
"better than POSIX"
restrictions on filenames
Microsoft shell competes with Unix shell
a filename issue or a shell issue?
a filename issue or a shell issue?
One of the reasons git replaced many shell scripts with C code was support for weird file names; C is better at handling them. In the absence of such issues, many commands would have remained shell scripts, which are easier to improve.
Case in point
Simplicity is better than complexity.
UNWANT.
Simplicity is better than complexity.
The scripting implementation is thus trivially solved.
Simplicity is better than complexity.
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦*
d_ .it d_ .it d_ .it d_ .it d_1151.it d_1152.it d_1153.it d_1154.it ¦¦¦¦¯¯Ì_.it
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦* |xxd
0000000: 645f 2020 2020 2e69 740a 645f 2020 202e d_ .it.d_ .
0000010: 6974 0a64 5f20 202e 6974 0a64 5f20 2e69 it.d_ .it.d_ .i
0000020: 740a 645f 3131 3531 2e69 740a 645f 3131 t.d_1151.it.d_11
0000030: 3532 2e69 740a 645f 3131 3533 2e69 740a 52.it.d_1153.it.
0000040: 645f 3131 3534 2e69 740a a6a6 a6a6 afaf d_1154.it.......
0000050: cc5f 2e69 740a ._.it.
Simplicity is better than complexity.
Simplicity is better than complexity.
did, I'd claim it needs to be fixed.
Simplicity is better than complexity.
I claim that I'd find software that creates filenames like that on my disk to be irritating. So I'd
certainly prefer if no software actually did so, and probably wouldn't mind if it was impossible to do
so.
Simplicity is better than complexity.
Do you know the difference between the two words "need" and "want"?
You have no clue about my software or the project, but you claim to know what is correct and incorrect.
Simplicity is better than complexity.
Simplicity is better than complexity.
Simplicity is better than complexity.
Simplicity is better than complexity.
DANGER! DANGER! DANGER! HYPOCRISY LEVEL IS OVER 9000!!!
Simplicity is better than
complexity.
plus this one.
As for find, GNU find already has -print0, and xargs is compatible.
equals
hypocrite.
Simplicity is better than complexity.
Simplicity is better than complexity.
ears, following which I was simple.)
Simplicity is better than complexity.
Simplicity is better than complexity.
Simplicity is better than complexity.
Simplicity is better than complexity.
Is kernel the whole world?
Simplicity is better than complexity.
This proposal is a whole lot of complexity in kernel code and
the API.
Simplicity is better than complexity.
Simplicity is better than complexity.
Simplicity is better than complexity.
if ((unsigned char)*c < 32) return EINVAL;
if ((unsigned char)*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
if (filename[0] == '-') return EINVAL;
int is_cont(unsigned char c) { return 128 <= c && c < 192; }
const unsigned char *p = (const unsigned char *)filename;
while (*p) {
    if (*p < 128) ++p;   /* ASCII byte */
    else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
    else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2])) p += 3;
    else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
             && is_cont(p[3])) p += 4;
    /* the 5- and 6-byte forms below predate RFC 3629, which caps UTF-8 at 4 bytes */
    else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
             && is_cont(p[3]) && is_cont(p[4])) p += 5;
    else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
             && is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
    else return EINVAL;
}
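For contrast with a check like the one above: current kernels accept such bytes without complaint. A throwaway shell sketch (the file name is invented), showing that an invalid sequence-start byte is stored verbatim:

```shell
#!/bin/sh
# 0xC3 (octal \303) opens a 2-byte UTF-8 sequence, but '(' is not a
# continuation byte, so this name is not valid UTF-8. creat() still succeeds.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch "$(printf 'bad_\303(')"
ls | wc -l          # the file is there, invalid byte and all
```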
Wheeler: Fixing Unix/Linux/POSIX Filenames
freenode, so I have to deal with a lot of issues to do with quoting,
word splitting, etc.
use of " * : < > ? \ / | even in NTFS." -- is "/" supposed to be in that list?
or for more complex cases:
find . -type f -exec sh -c 'if true; then somecommand "$1"; fi' -- {} \;
find . -type f | sed -e 's/./\\&/g' | xargs somecommand
This is a feature of xargs and is specified by POSIX. It disables various quoting problems with xargs that you don't mention.
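The reason -print0 and -0 matter: a newline-delimited pipe cannot distinguish one filename containing a newline from two filenames, while a NUL-delimited one can, since NUL is the only byte that cannot appear in a pathname. A sketch with invented file names:

```shell
#!/bin/sh
dir=$(mktemp -d) && cd "$dir" || exit 1
touch 'a b' "$(printf 'c\nd')"                 # two files, one with a newline
find . -type f | wc -l                         # 3 lines: the newline splits a name
find . -type f -print0 | tr -cd '\0' | wc -c   # 2 NULs: one per file
```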
Thanks for your comments! In particular, you're absolutely right about swapping the order of \t and \n in IFS - that makes it MUCH simpler. I prefer IFS=`printf '\n\t'` because then it's immediately obvious that \n and \t are the new values. I've put that into the document, with credit.
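The reason the order matters is that command substitution strips trailing newlines but not trailing tabs, so '\n\t' survives intact while '\t\n' collapses to a lone tab. A two-line demonstration:

```shell
#!/bin/sh
a=$(printf '\t\n')   # trailing newline stripped: only the tab remains (length 1)
b=$(printf '\n\t')   # trailing tab kept: newline then tab (length 2)
printf '%s %s\n' "${#a}" "${#b}"   # prints "1 2"
```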
Wheeler: Fixing Unix/Linux/POSIX Filenames
I don't think you could ban "()". They frequently appear in names in Windows-originated directories, apparently because some common programs generate file names containing "(1)", "(2)", ... to avoid collisions or indicate file versions.
Parentheses
Not A System Problem
Not A System Problem
This isn't Pascal.
between a file called foo/bar and a file called bar in a subdirectory foo?
depends on them.
Re: Not A System Problem
Re: Not A System Problem
need rewriting, for essentially zero gain (ooh, you can't have nulls in
filenames: that's why UTF-8 is *defined* to avoid nulls in filenames).
too.
Re: Not A System Problem
Re: Not A System Problem
characters, syscalls like open() must change. Library calls like fopen()
have to change, because they too accept a \0-terminated string, with /s
separating path components. Every single call in every library that
accepts pathnames has to change. Probably the very notion of a string has
to change to something non-\0-terminated.
POSIX calls: in fact, it can't any longer use ANSI C calls! I suspect that
such a system would be almost unusable with C, simply because you couldn't
use C string literals for anything.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
spaces, but cutting down on the number of "regular" characters like ":
cope with them, and amongst the obvious usefulness of parentheses, those
characters often occur in music tracks that I'm ripping from my CDs. It
would be a shame to lose the ability of naming them with their actual name.
Like, "All glob expansions are automatically enclosed in strings", "If it's
a string then don't f*cking interpret it as an option", and maybe even
"Here's an array of return values" instead of "If the viewer is a program
then split by newlines, if the viewer is a user then make a table". Type
safety ftw?
when there's an actual *sensible* use case then that use case should not be
made impossible just because the implementation is crappy.
Oops, LWN swallows less-than & Co. even in plain-text mode... whatever,
imagine exclamation/question marks, parentheses etc. on the second line.
Plus some suffixed text that says I disagree that we disallow those
characters just because we don't cope with them.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Leading spaces are common, actually
sort. (For instance, a program might create a menu at run time in lexicographic order based on
the contents of a directory, or you may want to force a file to appear near the beginning of a
listing.) This is especially common in the Mac world.
from Mac OS 9.x and earlier systems, but it turns out that this practice is far more common on
modern Mac OS X than I expected:
/Applications/Microsoft Office 2004/Clipart/Animations/ Animations Clip Package
/Applications/Microsoft Office 2004/Clipart/Clip Art/ Clip Art Clip Package
/Applications/Microsoft Office 2004/Clipart/Photos/ Photos Clip Package
/Applications/Microsoft Office 2004/Office/Entourage First Run/Entourage Script Menu Items/
About This Menu...
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/ Basic/
Default.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Ambient/
Ambient Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Classical/
Classical Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Dance/
Dance Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Hip Hop/
Hip Hop Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Jazz/ Jazz
Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Pop/ Pop
Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Rock/ Rock
Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Master/Stadium
Rock/ Stadium Rock Basic.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Band
Instruments/ No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Basic Track/
No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Bass/ No
Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Drums/ No
Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Effects/ No
Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Guitars/ No
Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Podcasting/
No Effects.cst
/Library/Application Support/GarageBand/Instrument Library/Track Settings/Real/Vocals/ No
Effects.cst
/Library/Scripts/Mail Scripts/Rule Actions/ Help With Rule Actions.scpt
There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic sort.
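This is easy to reproduce with a byte-wise (C locale) sort, since space is 0x20 and sorts before every digit and letter:

```shell
#!/bin/sh
printf '%s\n' 'a' ' b' 'c' | LC_ALL=C sort | head -n 1   # prints " b"
```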
Leading spaces are common, actually
~/t$ touch "a" " b" "c"
~/t$ ls -l
total 0
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 a
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 b
-rw-r--r-- 1 njs njs 0 2009-03-26 12:16 c
Finder does not ignore spaces. I'm quite glad, because I use the space-prefix trick rather regularly.
I am occasionally annoyed at how GNU ls sorts "A B" between "AA" and "AC" instead of before them:
that's certainly not how my brain sorts.
Leading spaces are common, actually
Leading spaces are common, actually
ASCIIbetically, so " 2" comes before "1".)
> Sorting numerically in GNU ls is done by 'ls -v'.
Leading spaces are common, actually
Leading spaces are common, actually
was designed to sort version numbers, and because the expected use of
ls -v was sorting a directory full of version-named directories in version
order.
Behavior of ls in Mac OS X 10.5.6 build 9G55
$ pwd
/Library/Application Support/GarageBand/Instrument Library/Track Settings
$ ls -l Master | head
total 0
drwxrwxrwx 3 root admin 102 May 3 2008  Basic
drwxrwxrwx 6 root admin 204 May 3 2008 Ambient
drwxrwxrwx 6 root admin 204 May 3 2008 Classical
drwxrwxrwx 11 root admin 374 May 3 2008 Dance
drwxrwxrwx 5 root admin 170 May 3 2008 Hip Hop
drwxrwxrwx 5 root admin 170 May 3 2008 Jazz
drwxrwxrwx 7 root admin 238 May 3 2008 Pop
drwxrwxrwx 7 root admin 238 May 3 2008 Rock
drwxrwxrwx 5 root admin 170 May 3 2008 Stadium Rock
$
And this matches the Finder's behavior. BTW, if the Finder behaved any other way, it would be
more difficult to properly recover broken Mac OS 9.x or earlier installations using OS X -- Classic
Mac OS loads files in /System Folder/Extensions in lexicographic order, and the load order
matters, and the leading space trick is used very frequently there. Mac OS X 10.5 can dual-boot
with Mac OS 9.x, so this still matters for some users.
>Behavior of ls in Mac OS X 10.5.6 build 9G55
Leading spaces are common, actually
Leading spaces are common, actually
order, and the load order matters, and the leading space trick is used very
frequently there. "
Yep, look at what they had to do about this when Apple introduced HFS+ in Mac
OS 8.1:
http://developer.apple.com/legacy/mac/library/technotes/t...
http://developer.apple.com/legacy/mac/library/technotes/t...
The users on a filesystem I administer use six or seven levels of leading space to sort their common jobs-in-progress directory. I've long since given up on getting them to move to a hierarchical setup.
Leading spaces are common, actually
Leading spaces are common, actually
Wheeler: Fixing Unix/Linux/POSIX Filenames
Having been working with computers since 1979 and subject to the various limitations of dozens of file systems, I automatically exercise self-restraint and never put any of those characters into filenames.
People should not be using filenames as data storage.
Wheeler: Fixing Unix/Linux/POSIX Filenames
is 'people should only be storing data with certain restrictions in
filenames', but that's a circular argument.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Meta-discussion
Hmmm, I'm not so sure. I feel strongly about ext4 losing data, but I don't have a strong opinion about this issue. Really. Not for lack of sensitivity to the problem -- I've had an administrator at work erase a whole directory of files because of a leading space (so that 'rm -rf /dir/file' became 'rm -rf /dir/ file'). But there are advantages and disadvantages, and I cannot pick a side.
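That accident is plain word splitting: once a space separates /dir/ from file on the command line, the shell delivers two arguments, and rm removes both. A sketch (the /dir path is illustrative, not from a real system):

```shell
#!/bin/sh
# Simulate how 'rm -rf /dir/ file' is parsed: two words, not one path.
set -- /dir/ file
echo "$#"          # prints 2: rm would receive /dir/ and file separately
```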
Meta-discussion
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
of Tahoe, which wanted to take advantage of the reliably-encoded filenames of a future version of
Linux, would have to have some way to reliably distinguish between old-fashioned-linux (where
you get a string of bytes and some "suggested" encoding which may or may not correctly decode
those bytes), and new-fashioned-linux (where, like on Windows and Mac, you get unicode
filenames).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames
UTF-16, but none of the APIs will check that. So, you can easily create invalid surrogate pair
sequences. Now, it's a *lot* easier to ignore this issue on windows than on linux, because:
b) Nobody creates those by accident. It won't happen just because you set your LOCALE wrong.
c) the windows Unicode APIs are all 16-bit unicode, so they never try decoding the surrogate pair
sequences anyways
d) Even UTF-16->UTF-32 decoders often decode a lone surrogate in UTF-16 into a lone surrogate value in UTF-32 (even though they're theoretically not supposed to do that).
Wheeler: Fixing Unix/Linux/POSIX Filenames
Bad understanding of UTF-8
An "invalid" UTF-8 string can contain only some extraneous bytes in the range 0x80-0xff. These high-order bytes do not cause any problems with any programs.
Bad understanding of UTF-8
Bad understanding of UTF-8
Bad understanding of UTF-8
Bad understanding of UTF-8
Bad understanding of UTF-8
Bad understanding of UTF-8
Bad understanding of UTF-8
Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution.
They cannot 'leave the data in UTF-8' because it is not in UTF-8 to start with! If it contains invalid bytes then by definition it's not UTF-8. It is just a string of arbitrary bytes and certainly, yes, the application can treat it as such. That does make life difficult when you want to display the filename to the user or otherwise treat it as human-readable text.
Wheeler: Fixing Unix/Linux/POSIX Filenames
Wheeler: Fixing Unix/Linux/POSIX Filenames