It is an open question to which extent this commit should be credited to the designers of sh, autoconf, libtool, make, and/or Solaris ld.
Quote of the week
Posted Feb 28, 2013 19:57 UTC (Thu) by nix (subscriber, #2304)
Posted Mar 1, 2013 1:11 UTC (Fri) by dashesy (subscriber, #74652)
Posted Mar 1, 2013 18:03 UTC (Fri) by nix (subscriber, #2304)
Posted Mar 2, 2013 19:02 UTC (Sat) by cesarb (subscriber, #6266)
Posted Mar 2, 2013 20:31 UTC (Sat) by nix (subscriber, #2304)
Posted Mar 2, 2013 21:27 UTC (Sat) by madscientist (subscriber, #16861)
However, this article is discussing command lines, and here make is actually reasonable, considering the alternatives. Make does not care about any character in a command script except $. If you want a literal $ you have to double it. However, every other character, including backslash (except in line-continuation context), quotes, whitespace, *, &, (), <>, etc., is given, unmolested, to the command processor (e.g., the shell).
Although it would perhaps have been nice to avoid the need for "$$" somehow, imagine how bad it would be if make ALSO used backslash for quoting, or interpreted other characters besides "$". At least as it is the real complexity in these quoting statements is not introduced, or much exacerbated, by make.
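As a rough illustration of the "$$" rule described above, here is a minimal Python sketch that prepares a single-line shell command for use in a make recipe. The function name escape_for_make_recipe is hypothetical, and the sketch assumes the command contains no newlines and no make variables that are meant to be expanded.

```python
def escape_for_make_recipe(shell_command: str) -> str:
    """Double every '$' so that make hands a literal '$' to the shell.

    Assumption: the command is a single line and every '$' in it is meant
    for the shell, not for make.
    """
    return shell_command.replace("$", "$$")


if __name__ == "__main__":
    # Quotes, backslashes and other shell metacharacters are left alone,
    # because make passes them through untouched.
    print(escape_for_make_recipe("echo \"$HOME\" | awk '{print $1}'"))
```

Only "$" is touched; any quoting the shell itself needs still has to be written by hand.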
Posted Mar 3, 2013 0:59 UTC (Sun) by mathstuf (subscriber, #69389)
Posted Mar 3, 2013 3:32 UTC (Sun) by madscientist (subscriber, #16861)
$ cat Makefile
FILE := ;\n\tgcc
$(FILE): ; @echo '$@'
Posted Mar 3, 2013 23:48 UTC (Sun) by quotemstr (subscriber, #45331)
Posted Mar 4, 2013 17:49 UTC (Mon) by mathstuf (subscriber, #69389)
In reality, the thing I was interested in was something more along the lines of:
FILES := $(wildcard *.c)
all: $(patsubst %.c,%.o,$(FILES))
	gcc -c '$<' -o '$@'
In the face of a file named ";\n\tgcc.c". Which breaks pretty badly with "make: gcc.o: Command not found".
Posted Mar 6, 2013 22:31 UTC (Wed) by Jandar (subscriber, #85683)
Posted Mar 7, 2013 16:53 UTC (Thu) by mathstuf (subscriber, #69389)
For encoding, on the other hand, enforcing UTF-8 should be an option when mounting (with a default of warning on non-UTF-8 filenames). I can see the use case for non-UTF-8 filenames (much as I think that it's silly, since I don't think there's a per-directory LC_FILENAME, or even a per-file one if you have quite the mix, but whatever).
Posted Mar 7, 2013 17:41 UTC (Thu) by Jandar (subscriber, #85683)
Posted Mar 7, 2013 18:01 UTC (Thu) by mathstuf (subscriber, #69389)
How does Windows handle filenames which are NTFS-valid but not Windows-valid? Or is Windows-valid defined as NTFS-valid these days (which I also would presume is a superset of FAT-valid)?
Posted Mar 7, 2013 19:29 UTC (Thu) by quotemstr (subscriber, #45331)
Alias files with invalid names to "INVALID_FILE_NAME~1" or something like that. These names don't have to be stable.
> How does Windows handle filenames which are NTFS-valid but not Windows-valid? Or is Windows-valid defined as NTFS-valid these days (which I also would presume is a superset of FAT-valid)?
You mean NT-valid? As far as NT is concerned, filenames are sequences of UCS-2 codepoints. There's no validation or normalization. Win32 wraps the NT filesystem API, and although Win32-invalid files are enumerated, the Win32 API cannot enumerate these files. User-mode programs can, however, just call the NT APIs.
Posted Mar 8, 2013 19:13 UTC (Fri) by JanC_ (guest, #34940)
I'm pretty sure that the Windows APIs don't work with those features very well though...
Posted Mar 18, 2013 15:15 UTC (Mon) by nye (guest, #51576)
I have in the past had files on NTFS filesystems with colons in the name. I don't recall whether the name of the file was correctly listed in that case, though I *think* it was. However, a lot of accesses would result in the process blocking indefinitely and needing a reboot to kill, i.e. the same thing as being stuck in uninterruptible sleep on Unix.
This was back on Windows XP mind, so maybe more recent versions behave less badly, but I wouldn't bet on it.
If you are using a version of Windows with SFU/SUA/whatever it's called now then you can do what you like with those files from there, because the POSIX subsystem is built directly on the Windows native API which supports it just fine. OTOH Cygwin of course is layered on top of win32, so inherits all of its problems.
In the unlikely event that anyone is interested, I went into this with a little more detail a few years back: http://blog.steamsprocket.org.uk/2010/02/26/posix-file-se...
 Created by telling Amarok to organise music files based on the tags, and lacking the foresight to look up characters disallowed by win32 and munge them.
 Actually I'm fairly sure Cygwin does use the native API in some places to provide certain features, but in the main it sits on top of win32.
Posted Mar 7, 2013 18:06 UTC (Thu) by viro (subscriber, #7872)
If you want VFS to enforce anything of that kind, feel free to fork the kernel and maintain the damn thing yourself. Any patches of that kind will be NAKed.
Posted Mar 7, 2013 19:22 UTC (Thu) by quotemstr (subscriber, #45331)
You haven't addressed the reason the kernel should not care. The kernel is in an excellent position to reduce the likelihood real users will be harmed. Why *not* impose these restrictions? You're just saying that the kernel does not today impose any structure on filenames, so it never should, no matter the advantages gained by doing so. You're arguing by assertion. Perhaps you're accustomed to winning arguments this way, but it doesn't make your position any stronger.
> Any patches of that kind will be NAKed.
That's unfortunate. Maybe libc can provide the same benefits --- unless a special environment variable is set, hide these files from directory enumerations, and delete them transparently if rmdir(2) would otherwise fail due to these hidden files being left in a directory.
Posted Mar 7, 2013 19:55 UTC (Thu) by jimparis (subscriber, #38647)
No discussion about bad characters in filenames would be complete without a link to http://www.dwheeler.com/essays/fixing-unix-linux-filename.... It covers lots of options, and talks about why a libc-level fix isn't necessarily going to work.
Posted Mar 7, 2013 20:55 UTC (Thu) by viro (subscriber, #7872)
Posted Mar 8, 2013 21:36 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Most of my editors support recoding just fine, but I'm still encountering filesystems with rubbish in filenames. And they're much more complex to deal with.
Posted Mar 7, 2013 20:19 UTC (Thu) by viro (subscriber, #7872)
Yes, I have. The set you want to filter out is a function of encoding (which the kernel has no notion of), the set of userland code likely to run on that system (ditto; different interpreted languages have different sets that need to be quoted) and your preferences.
Policy like that doesn't belong in the kernel. And drop the demagoguery, please. You demand assistance in imposing your personal preferences on everything and accuse those who have the gall to refuse of arguing by assertion? Nice... BTW, what's the difference between "user" and "real user"? Just curious...
Posted Mar 7, 2013 20:46 UTC (Thu) by quotemstr (subscriber, #45331)
Not today, no, but it could. Even absent explicit knowledge of encoding, the kernel can get very far under the assumption that a byte below 0x20 is a forbidden control character. This scheme works even for Shift-JIS. Additionally, forbidding leading 0x2D bytes still allows the full expression of characters in almost all encodings, even Shift-JIS.
The scheme I'm proposing is compatible with any character set that is itself compatible with the kernel's use of 0x2F as a directory separator.
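The byte-level rule proposed above can be sketched in a few lines of Python. This is only an illustration of the two checks named in the comment (no byte below 0x20, no leading 0x2D); the function name is_acceptable_name is hypothetical, and nothing here reflects actual kernel code.

```python
def is_acceptable_name(name: bytes) -> bool:
    """Apply the encoding-agnostic checks discussed above to one filename.

    Rejects names that start with 0x2D ('-') or contain any byte below 0x20
    (control characters, including newline and tab).
    """
    if name.startswith(b"\x2d"):
        return False
    return all(byte >= 0x20 for byte in name)


if __name__ == "__main__":
    print(is_acceptable_name(b"report.txt"))
    print(is_acceptable_name(b"-rf"))
    print(is_acceptable_name(b"evil\nname"))
    # A Shift-JIS encoded name passes, since its bytes are all 0x20 or above.
    print(is_acceptable_name("テスト.txt".encode("shift_jis")))
```

The check never decodes the name, which is why it does not need to know which encoding is in use.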
> imposing your personal preferences
This issue has nothing to do with personal preference. I wasn't on the POSIX committee. I didn't write the Bourne shell. The way we interpret filenames is not my "personal preference": it's a reality. I'm trying to make real users safer. You're the one making systems less robust because you don't want to think about encodings.
> And drop the demagoguery, please.
Demagoguery? You're the one calling legitimate technical options "personal preferences", calling mitigation approaches "impositions" (is ASLR also an imposition of personal preferences?), and generally not touching the core point, which is that a component in a position to significantly improve the robustness of millions of lines of code without at the same time causing significant adverse side effects, well, should do so.
You need to either identify why the problem we're trying to ameliorate isn't actually a problem, describe why the solution we're discussing doesn't actually address the problem, or describe why the cost of this solution is too high.
Charitably, I can take your comment as suggesting that the cost of teaching the kernel about filename encodings is too high. I don't agree. The kernel doesn't need to know about particular encodings because all commonly-used encodings have enough in common that a byte-by-byte coding-agnostic analysis will suffice.
Posted Mar 7, 2013 21:03 UTC (Thu) by viro (subscriber, #7872)
Posted Mar 7, 2013 21:23 UTC (Thu) by quotemstr (subscriber, #45331)
Sure, but even if you restrict the filtering to characters below 0x20 (because characters between 0x20 and 0x2D are used all over the place) and to leading 0x2D, you still eliminate a big class of problems, particularly in contexts other than terminal display. 0x9B isn't special to the shell, but 0xA is. There are lots of shell idioms that break only when filenames contain newlines. Can't we just make these idioms work all the time?
> And then there's SQL/HTML/make/etc. - any number of languages with their own needs wrt quoting...
So? The kernel can't solve all problems, but it can go a long way toward eliminating a very well-known subset of "all problems relating to metacharacter injection".
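To make the newline problem concrete, here is a small Python sketch comparing a newline-delimited file list with a NUL-delimited one (the format produced by find's -print0). The helper names are hypothetical and the byte streams are built by hand rather than read from a real command.

```python
from typing import List


def split_newline_delimited(data: bytes) -> List[bytes]:
    """Parse a file list the way many shell idioms do: one name per line."""
    return [name for name in data.split(b"\n") if name]


def split_nul_delimited(data: bytes) -> List[bytes]:
    """Parse a NUL-delimited file list; NUL cannot appear in a filename."""
    return [name for name in data.split(b"\0") if name]


if __name__ == "__main__":
    names = [b"a.txt", b"evil\nname.txt"]
    newline_stream = b"".join(name + b"\n" for name in names)
    nul_stream = b"".join(name + b"\0" for name in names)

    # The newline-based parse sees three names instead of two.
    print(split_newline_delimited(newline_stream))
    # The NUL-based parse recovers the original two names.
    print(split_nul_delimited(nul_stream))
```

The newline-based parse is only correct if no filename contains a newline, which is exactly the guarantee being argued over here.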
Posted Mar 7, 2013 22:21 UTC (Thu) by viro (subscriber, #7872)
Unless you can guarantee that your script never runs on older and/or non-Linux kernels, you _still_ need to quote properly. And if you can guarantee that, the usual objections re -print0 and its ilk being non-portable do not apply.
PS: the verb "impose" had been brought into that thread by yourself. The context had been about the kernel being in excellent position to impose the restriction you asked for. Check it yourself. And yes, it smells like demagoguery, especially since you proceeded to complain indignantly that I was treating something like an imposition, etc.
Posted Mar 8, 2013 21:39 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
How about removing all this ACL crap? Everything should just be world-writable. After all, we all know that all editors are vulnerable and all kernels have locally exploitable rootholes.
Posted Mar 7, 2013 20:03 UTC (Thu) by sfeam (subscriber, #2841)
Even if the option only pertained to file creation on an already mounted volume - how would that work? I get sent tarballs or zip data archives from people using SJIS encoding for example. Yeah it's a pain, but refusing to unpack the thing would only make matters worse.
Posted Mar 7, 2013 22:23 UTC (Thu) by raven667 (subscriber, #5198)
No, that sounds silly. It would be enforced when files are created/(re)named and when those names are read before giving the names to userspace.
> And what if it does encounter a non-UTF name - refuse to mount? mangle it? How does the user recover from this? Would corruption of a file name, intentional or otherwise, make the volume unmountable in the future?
To begin with it should probably just log a warning about the invalid filename (without having an attacker break the logging system 8-) and return to userspace as normal. After a few years it could be made mandatory with an option to turn it back to a warning or to turn validation off.
The kernel probably shouldn't return filenames that fail validation or allow them to be created and return an error instead. Worst case an invalid file could be renamed into lost+found with its inode number but that might be too invasive.
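The warn-first, enforce-later policy described above could look roughly like the following Python sketch. It is a userspace illustration only; the function name check_filename and its enforce parameter are hypothetical, and the logging call stands in for whatever the kernel would actually do.

```python
import logging


def check_filename(name: bytes, enforce: bool = False) -> bool:
    """Return True if the name may be used, False if it must be rejected.

    In warning mode (enforce=False) an invalid name is logged but allowed.
    In enforcing mode it is rejected.
    """
    try:
        name.decode("utf-8")  # strict decoding raises on invalid UTF-8
        return True
    except UnicodeDecodeError:
        if enforce:
            return False
        # %r escapes control and non-ASCII bytes, so a hostile name
        # cannot inject extra lines into the log.
        logging.warning("non-UTF-8 filename: %r", name)
        return True


if __name__ == "__main__":
    print(check_filename("naïve.txt".encode("utf-8")))
    print(check_filename(b"\x83\x65", enforce=False))
    print(check_filename(b"\x83\x65", enforce=True))
```

Switching the default of enforce from False to True corresponds to the "after a few years" step in the comment.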
> I get sent tarballs or zip data archives from people using SJIS encoding for example.
That seems to be a thorny problem. Those filenames are going to be broken anyway unless your whole environment is operating in SJIS mode, right? Or do the cases you care about already have encodings that overlap with UTF-8, the way ASCII does? If your environment is already dealing with multiple legacy encodings, then it may be able to normalize the encoding to UTF-8 before using names in syscalls. It seems like it'll take a long time before every utility that uses open() can translate between whatever encoding the app/env/term is using and the system encoding. At the least, this will only break for people still using legacy encodings; that might be enough reason to scrap the idea.
Posted Mar 7, 2013 22:57 UTC (Thu) by viro (subscriber, #7872)
And yes, it's a goddamn mess. To make life really interesting, there's a bunch of fun with TeX - there had been several cyrillic font families, with different encodings used, so an old paper sitting around may be nasty. So can old documentation, for that matter - transliterating from Alt to KOI or UTF isn't a problem, but then you might want to rip your eyes out trying to make sense of what used to be a diagram describing a hardware register, with some well-intentioned twit having done boxes and arrows in that kind of pseudo-ASCII-graphics back in '93 or so...
Posted Mar 7, 2013 23:02 UTC (Thu) by quotemstr (subscriber, #45331)
Posted Mar 8, 2013 0:41 UTC (Fri) by mgedmin (subscriber, #34497)
I assume UTF-8-capable terminal emulators are able to cope.
Posted Mar 8, 2013 21:44 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Just add a freaking UTF-8 filter (optional) and synthetic names. Besides, the kernel already DOES support encodings (for mounting of FAT filesystems). And Linux ALREADY has a notion of NLS: http://lxr.free-electrons.com/source/include/linux/nls.h
Posted Mar 7, 2013 23:39 UTC (Thu) by sfeam (subscriber, #2841)
The filenames are broken, yes, although they are at least guaranteed to be unique. Fortunately the file contents are usually binary, so text encoding is not relevant. So long as I can figure out which file is which, I can rename it to a UTF-8 encoded equivalent. E.g. filter the output of `tar tzvf` through an SJIS to UTF8 conversion step. But knowing how to rename the files would not help if the system were refusing to unpack the files onto disk in the first place.
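The rename step described above can be sketched in Python, assuming the names really are Shift-JIS. The function name sjis_name_to_utf8 is hypothetical; archives produced on Windows may need Python's "cp932" codec instead, which is a guess that depends on where the archive came from.

```python
def sjis_name_to_utf8(raw_name: bytes) -> bytes:
    """Re-encode one filename from Shift-JIS bytes to UTF-8 bytes.

    Raises UnicodeDecodeError if the bytes are not valid Shift-JIS.
    """
    return raw_name.decode("shift_jis").encode("utf-8")


if __name__ == "__main__":
    raw = "テスト.dat".encode("shift_jis")   # bytes as they might sit on disk
    print(sjis_name_to_utf8(raw).decode("utf-8"))
```

On a POSIX system, os.rename() accepts bytes paths, so the raw name and the converted name can be passed to it directly once the file has been unpacked.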
Posted Mar 8, 2013 15:48 UTC (Fri) by Jandar (subscriber, #85683)
The environment has to be *agnostic* about the meaning of individual bytes apart from '/' and '\0'. Any interpretation of the bytes composing a filename should be restricted to the end-user interface, be it ls in an xterm or a GUI.
Posted Mar 8, 2013 16:43 UTC (Fri) by raven667 (subscriber, #5198)
One could implement a standard encoding and a whitelist of acceptable characters, but that would be difficult and enormous and would still have problems with existing software. The other option would be for all existing software to filter the characters it can't handle and to not have bugs, which is an impossible dream. Maybe, instead of something in the kernel, something in glibc or a library suitable for LD_PRELOAD could at least be a proof of concept that validating characters in filenames has merit or, alternatively, is unworkable.
Posted Mar 11, 2013 12:07 UTC (Mon) by dgm (subscriber, #49227)
Correct. And _each_ environment has the responsibility to filter what it cannot interpret. The kernel has no business in that, because there are _many_ possible environments, each with its own limitations.
Posted Mar 11, 2013 14:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
Just imagine the cool double and triple-quoting necessary to deal with complex paths in sed invoked from $(eval) in Makefiles! Nobody should be deprived of this. No, we must preserve this insanity at all costs.
Guys, simply normalizing everything to UTF-8 is not going to harm any sane filesystem uses. And for insane users a mount-time option can be added.
I'm actually tempted to write a patch and submit it - Linux already has all the required NLS pieces.
Posted Mar 11, 2013 15:20 UTC (Mon) by daglwn (subscriber, #65432)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds