LWN: Comments on "Wheeler: Fixing Unix/Linux/POSIX Filenames" https://lwn.net/Articles/325304/ This is a special feed containing comments posted to the individual LWN article titled "Wheeler: Fixing Unix/Linux/POSIX Filenames". en-us Tue, 11 Nov 2025 04:50:07 +0000 Tue, 11 Nov 2025 04:50:07 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/362018/ https://lwn.net/Articles/362018/ nix <div class="FormattedComment"> ksh93 was too sodding hard to require because building it was a nightmare. <br> At the time it wasn't free enough either.<br> <p> </div> Sun, 15 Nov 2009 13:15:57 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/362006/ https://lwn.net/Articles/362006/ yuhong <div class="FormattedComment"> "ksh was too buggy (thanks, Linux, for pdksh, with its broken <br> propagation of variables out of loops-with-redirection)"<br> Was ksh93 tried?<br> </div> Sun, 15 Nov 2009 01:06:04 +0000 Leading spaces are common, actually https://lwn.net/Articles/362000/ https://lwn.net/Articles/362000/ yuhong <div class="FormattedComment"> "Classic Mac OS loads files in /System Folder/Extensions in lexicographic <br> order, and the load order matters, and the leading space trick is used very <br> frequently there. 
"<br> Yep, look at what they had to do about this when Apple introduced HFS+ in Mac <br> OS 8.1:<br> <a rel="nofollow" href="http://developer.apple.com/legacy/mac/library/technotes/tn/tn1121.html#HFSPlu">http://developer.apple.com/legacy/mac/library/technotes/t...</a><br> s<br> <a rel="nofollow" href="http://developer.apple.com/legacy/mac/library/technotes/tn/tn1123.html">http://developer.apple.com/legacy/mac/library/technotes/t...</a><br> <p> </div> Sun, 15 Nov 2009 00:32:25 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/361998/ https://lwn.net/Articles/361998/ yuhong <div class="FormattedComment"> Another trick you can use with CreateFile is to start the filename with \\.\. <br> If that is done, the only processing done on the filename before CreateFile <br> calls NtCreateFile with the name is that \\.\ is replace with \??\, which is <br> an alias of \DosDevices\.<br> </div> Sun, 15 Nov 2009 00:06:30 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/361997/ https://lwn.net/Articles/361997/ yuhong <div class="FormattedComment"> "files that are more than 2GB long"<br> Yep, NT had supported both files and disks larger than 2GB from the first <br> version (NT 3.1) using the NTFS filesystem. Exercise: compare the design of <br> the GetDiskFreeSpace and SetFilePointer APIs (look them up using MSDN or <br> Google), both of which has existed since NT 3.1. 
Which one was so much more <br> error-prone that the versions of Windows released in 1996 had to cap the <br> result to 2GB, even though older versions of NT supported returning more than <br> 2GB using it, and why?<br> </div> Sat, 14 Nov 2009 23:58:04 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/328525/ https://lwn.net/Articles/328525/ epa <blockquote>Because a whole lot of stupid people thought that "wide characters" are the solution and put them into certain systems we have to live with it and interoperate. The most popular solution is to translate invalid bytes in UTF-8 into 0xDCxx. This can be used as a stopgap until they finally realize that leaving the data in UTF-8 is the real solution.</blockquote>They cannot 'leave the data in UTF-8' because it is not in UTF-8 to start with! If it contains invalid bytes then by definition it's not UTF-8. It is just a string of arbitrary bytes and certainly, yes, the application can treat it as such. That does make life difficult when you want to display the filename to the user or otherwise treat it as human-readable text. <p> And indeed, the Python developers are living in a magic fairy land where filenames are sanely encoded and are always human-readable text, but wouldn't it be better to change things so that this situation is no longer wishful thinking, but part of the ordinary things userspace can rely on? That is what Wheeler is proposing. Wed, 15 Apr 2009 10:38:01 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/327173/ https://lwn.net/Articles/327173/ anton <blockquote>It could also recognise the null character as an argument separator as in 'find -print0'.</blockquote> A few weeks ago I wanted to process my .ogg files which contain all kinds of characters that are treated as meta-characters by the shell or other programs I use in shell scripts. 
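The NUL-separator technique quoted above ('find -print0') can be illustrated in a few lines. The sketch below uses Python and made-up hostile filenames; it shows why NUL is the only safe separator: it is the one byte that can never occur in a filename.

```python
import os
import tempfile

# Made-up hostile names: a space, a quote, and an embedded newline.
d = tempfile.mkdtemp()
for name in ["a b.ogg", "it's.ogg", "new\nline.ogg"]:
    open(os.path.join(d, name), "w").close()

# A newline-separated listing is ambiguous: the name containing "\n"
# splits into two bogus entries.
newline_stream = b"\n".join(os.fsencode(n) for n in os.listdir(d))

# A NUL-separated stream (the framing `find -print0` emits) round-trips
# exactly, because NUL cannot appear inside a filename.
nul_stream = b"\0".join(os.fsencode(n) for n in os.listdir(d))

print(len(newline_stream.split(b"\n")))  # 4 (one name was split in two)
print(len(nul_stream.split(b"\0")))      # 3 (all names recovered intact)
```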
I eventually ended up writing a new shell <a href="http://www.complang.tuwien.ac.at/forth/programs/dumbsh.fs">dumbsh</a> that uses NUL as argument separator, and feeding it from find, with some intermediate processing in awk (which is quite flexible about meta-characters). Fri, 03 Apr 2009 18:49:33 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326932/ https://lwn.net/Articles/326932/ forthy <p>It is actually not that bad. As collating sequence, ß=ss (i.e. Mass and Maß sort to the same bin). Except for Austrian telephone books, where ß follows ss, but comes before st (though St. follows Sankt ;-).</p> <p>However, there's a huge mess in the CJK part of UCS: short and long forms of the same character (sometimes even a special variant for the Japanese character). This should never have happened; the different forms of the same character should be encoded in fonts, not in UCS. So far, not even Mac OS X normalizes these characters, but it is obvious that a mainland China file called "&#20013;&#22269;" and a Taiwan file called "&#20013;&#22283;" not only mean the same, but they also refer to the same word, and can be interchanged at will (see for example the Chinese wikipedia entry: the lemma is the short form, the headline is the long form). And it is not easy to access long and short forms with usual input methods (mainland China: Pinyin, Canton: Cantonese Pinyin (gives traditional characters, but you need to know Cantonese), etc.).</p> Thu, 02 Apr 2009 15:54:06 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326695/ https://lwn.net/Articles/326695/ spitzak <div class="FormattedComment"> A program that treats bytes with the high bit set as "this may be a piece of a UTF-8 character", and puts all those bytes into a single class such as "may be a part of an identifier", can safely handle UTF-8 strings (including invalid ones) as bytes. 
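A short demonstration of the property this relies on (Python for brevity): every byte of a multi-byte UTF-8 sequence has the high bit set, so an ASCII delimiter such as '/' can never be a fragment of a multi-byte character, and byte-oriented splitting is safe even without decoding.

```python
s = "naïve/日本".encode("utf-8")

# Every byte of the non-ASCII characters has the high bit set...
assert all(b >= 0x80 for b in "ï日本".encode("utf-8"))

# ...so splitting the raw bytes on an ASCII byte never cuts a
# multi-byte character in half.
left, right = s.split(b"/")
assert left.decode("utf-8") == "naïve"
assert right.decode("utf-8") == "日本"
```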
This is FAR better than trying to detect and handle errors, in particular because it is a hundred times simpler and thus more reliable and less likely to have bugs.<br> <p> Do NOT throw exceptions on bad strings. This turns a possible security error into a guaranteed DoS. Working around it (as I have had to do countless times due to stupid string-drawing routines that refuse to draw a string with an error in it) means I have to write my *own* UTF-8 parser, just to remove the errors, before displaying it or using it. I hope you can see how forcing programmers to use their own code to parse the strings rather than providing reusable routines is a bad idea.<br> <p> And I don't want exceptions thrown when I compare two strings for equality. That way lies madness. It is unfortunate that too much of this stuff is being designed by people who never use it or they (and you) would not make such trivial design errors.<br> <p> <p> </div> Wed, 01 Apr 2009 16:38:38 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326538/ https://lwn.net/Articles/326538/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.</font><br> <p> Okay, fair enough. I agree, all ASCII characters are valid UTF-8. I was objecting to your claim that bytes with the high bits set "do not cause any problems with any programs".<br> <p> <font class="QuotedText">&gt; An overlong encoding consists of a leading byte with the high bit set. This is an error.</font><br> <p> All characters with codepoint &gt;= 128 are encoded in UTF-8 as a string of bytes with the high bit set (including on the leading byte). Having the high bit set is *certainly* not an error. I can't tell what you're saying in general, but it's just not true that the only time strings need to be interpreted as text is for display. 
In many, many cases text needs to be processed as text, and it's often impossible and rarely practical to write algorithms in such a way that they do something sensible with invalid encodings. Those serious security bugs I pointed out up above are examples of what happens when you try.<br> <p> (You're right that invalid strings usually shouldn't be silently transmuted to valid strings; they should usually signal a hard error.)<br> </div> Wed, 01 Apr 2009 05:12:40 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326499/ https://lwn.net/Articles/326499/ nix <div class="FormattedComment"> I've contributed fixes now and then, but I just read a lot. :) The <br> projects are public, after all.<br> <p> <p> </div> Tue, 31 Mar 2009 19:28:56 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326477/ https://lwn.net/Articles/326477/ spitzak <div class="FormattedComment"> I am sure that "errors in UTF-8 only contain bytes with the high bit set", which is what I thought you were asking.<br> <p> An overlong encoding consists of a leading byte with the high bit set. This is an error. That may be followed by any byte. If it is another leading byte then it might start another UTF-8 character, or it might be an error. If it is a continuation byte then it is an error. If it is an ASCII character then it is not an error. As before, EVERY ERROR BYTE has the high bit set!<br> <p> I might have misunderstood your question. You said "are you sure" in response to me saying that all error bytes have the high bit set. The reason I was confirming that all error bytes have the high bit set is that if they are mapped to a 128-long range of Unicode then the adjacent 128-long range makes a good candidate for "quoting" characters that are not allowed in filenames.<br> <p> I do believe there are some serious mistakes in a lot of modern software. UTF-8 should NOT be converted until the very last moment when it is converted to "display form" for drawing on the screen. 
This is the only reliable way of preserving identity of invalid strings. People who think invalid strings will not occur or that it is acceptable for them to compare equal or silently be changed to other invalid strings or with valid strings are living in a fantasy land.<br> <p> </div> Tue, 31 Mar 2009 17:59:20 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326395/ https://lwn.net/Articles/326395/ mjthayer <div class="FormattedComment"> I was wondering now whether to ask about this on the Bash mailing lists. Just out of interest, are you involved with the development of Bash/the GNU tools in any way? You seem well informed about them.<br> </div> Tue, 31 Mar 2009 07:47:30 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326380/ https://lwn.net/Articles/326380/ njs <div class="FormattedComment"> We have that -- that's what file descriptors are. It would be nice if programs passed them back and forth more often, but my guess is that they mostly get used where they should, and to make their use more ubiquitous you'd need to radically re-architect a lot of stuff. (If one wanted to be provocative, one could claim that the whole goal of EROS/Coyotos is to figure out what that re-architecting looks like.)<br> <p> </div> Tue, 31 Mar 2009 05:14:50 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326376/ https://lwn.net/Articles/326376/ njs <div class="FormattedComment"> I think you're overcomplicating things -- I wouldn't implement UTF-8 requirements at the VFS level (it just doesn't make sense, since there manifestly exist filesystems where you don't know the encoding, both from pre-existing Linux installs and with "foreign" filesystems). I'd make it a filesystem feature -- a flag in the ext2/3/4 header that's set at mkfs time, say. That removes all the issues about translating invalid filenames -- if that flag is set and a filename is invalid, then *your filesystem is corrupt*. 
fsck can check for such corruption if it feels like it.<br> <p> Then you just get distros to set that flag on the root filesystem by default, add a few bits of API for programs who want to know "is this filesystem utf8-only?" or "how does this filesystem normalize names?" (which would be really useful calls anyway), and away you go.<br> <p> (It's unfortunate that the Win32 designers screwed this up, but that's hardly an argument to perpetuate their mistake.)<br> <p> </div> Tue, 31 Mar 2009 05:00:40 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326375/ https://lwn.net/Articles/326375/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Yes I am sure.</font><br> <p> So -- just checking we're on the same page here -- what you're saying is that you're sure that those three security bugs I found in 5 minutes of googling were "not problems in any program".<br> <p> <font class="QuotedText">&gt; The first two references are about programs failing to recognize overlong encodings as being invalid.</font><br> <p> Right -- if invalid codings are interpreted differently in different parts of a system, then that creates bugs and security holes.<br> <p> <font class="QuotedText">&gt; But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoders error, a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).</font><br> <p> I'm sorry -- I cannot make out a word of this. The bug in the first two links is that the invalid sequences are over-long (but like all the bugs mentioned here, involve only bytes with the high bits set -- do you know how UTF-8 works?). 
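For concreteness, the over-long sequences at issue are easy to exhibit; the sketch below uses Python's built-in UTF-8 codec, which does perform the check. 0xC0 0xAF is an over-long (and therefore invalid) encoding of '/': a decoder that accepts it lets a slash sneak past path-sanitizing filters.

```python
overlong_slash = b"\xc0\xaf"  # over-long encoding of "/" (U+002F)

# A strict decoder must reject this sequence rather than decode it to "/".
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)  # True
```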
The decoder should have an explicit check for such sequences and throw an error if they are encountered, but this check was left out.<br> <p> <font class="QuotedText">&gt; The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled.</font><br> <p> Errrr... quite so. I wasn't sure how useful this was to start with, but when you say in so many words that the proper solution to XSS security holes is to stop sanitizing web form inputs and instead convert all web browsers so that they *don't interpret unicode* then... maybe it's time I step out of this thread. Best of luck to you.<br> <p> </div> Tue, 31 Mar 2009 04:49:07 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326312/ https://lwn.net/Articles/326312/ rickmoen mrshiny wrote: <p><em>You can pry my spaces from my filenames out of my cold dead fingers.</em> <p>ObMenInBlack: "Your offer is acceptable." <p>(I remember having to write AppleScript to recurse through directories cleaning up files created on network shares by MacOS-using munchkins who put space characters at the <em>ends</em> of filenames, in order for them to become valid filenames when seen by MS-Windows-using employees looking at the same network shares. The converse problem was files, from MS-Windows users, with names containing colon, which is a reserved character in MacOS file namespace. What a pain in the tochis.) <p>Rick Moen<br> rick@linuxmafia.com Mon, 30 Mar 2009 19:36:47 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326227/ https://lwn.net/Articles/326227/ Hawke <div class="FormattedComment"> I don't think any DOS applications use backslash for their option marker. Some use dash, and most use slash. 
But I'm pretty sure that practically none, if any, use backslash.<br> </div> Mon, 30 Mar 2009 16:41:21 +0000 Bad understanding of UTF-8 https://lwn.net/Articles/326202/ https://lwn.net/Articles/326202/ spitzak <div class="FormattedComment"> Yes I am sure.<br> <p> The first two references are about programs failing to recognize overlong encodings as being invalid. But those invalid sequences start with a byte with the high bit set (following bytes may not have it set, but the fact that decoders consider them part of the first byte is the decoder's error; a fixed decoder would consider it a one-byte error with the high bit set, followed by normal ascii characters which are unchanged and thus cannot cause a security hole).<br> <p> The last one is EXACTLY the bug I am trying to fix: stupid people who somehow believe that throwing errors or replacing with non-unique strings is how invalid UTF-8 should be handled. The bug is that it maps more than one different string to the same one. The proper solution is to stop translating UTF-8 into something else and treat it as a stream of bytes. Nothing should care that it is UTF-8 except stuff that draws it on the screen.<br> <p> <p> </div> Mon, 30 Mar 2009 16:08:23 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326190/ https://lwn.net/Articles/326190/ foom <font class="QuotedText">&gt;&gt; Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? <br>&gt; Yes. This is what the POSIX subsystems for NT do </font> <p> You can actually do this through the Win32 API: see the FILE_FLAG_POSIX_SEMANTICS flag for <a href="http://msdn.microsoft.com/en-us/library/aa363858(VS.85).aspx">CreateFile</a>. However, MS realized this was a security problem, so as of WinXP, this option will in normal circumstances do absolutely nothing. 
You now have to explicitly enable case-sensitive support on the system for either the "Native" or Win32 APIs to allow it. <p> (the SFU installer asks if you want to do this, but even SFU has no special dispensation) Mon, 30 Mar 2009 15:13:36 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326161/ https://lwn.net/Articles/326161/ nye <div class="FormattedComment"> <font class="QuotedText">&gt;Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory?</font><br> <p> Yes. This is what the POSIX subsystems for NT do; they're implemented on top of the native API, as is the Win32 API. Note that Cygwin doesn't count here as it's a compatibility layer on top of the Win32 API rather than its own separate subsystem.<br> <p> Unfortunately the Win32 API *does* enforce things like file naming conventions, so it's impossible (at least without major voodoo) to write Win32 applications which handle things like a colon in a file name, and since different subsystems are isolated, that means that no normal Windows software is going to be able to do it.<br> <p> (I learnt all this when I copied my music collection to an NTFS filesystem, and discovered that bits of it were inaccessible to Windows without SFU/SUA, which is unavailable for the version of Windows I was using.)<br> <p> <a href="http://en.wikipedia.org/wiki/Native_API">http://en.wikipedia.org/wiki/Native_API</a><br> </div> Mon, 30 Mar 2009 10:55:30 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326129/ https://lwn.net/Articles/326129/ njs <div class="FormattedComment"> You cannot, in general, convert a filename to text. 
That's the fundamental problem that any of the proposals would solve.<br> <p> </div> Mon, 30 Mar 2009 00:07:22 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326127/ https://lwn.net/Articles/326127/ epa <blockquote>No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.</blockquote>Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible to in all cases return the correct filename to the user, matching the original one chosen by the user. If you pick a known encoding everywhere (UTF-8 being the obvious choice) then the problem goes away. <blockquote>This doesn't represent a security problem.</blockquote>Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate to the issue of how to encode characters that don't fit in ASCII. Sun, 29 Mar 2009 22:37:31 +0000 Re: Not A System Problem https://lwn.net/Articles/326126/ https://lwn.net/Articles/326126/ nix <div class="FormattedComment"> You don't get it. In order to permit / and \0 as valid filename <br> characters, syscalls like open() must change. Library calls like fopen() <br> have to change, because they too accept a \0-terminated string, with /s <br> separating path components. Every single call in every library that <br> accepts pathnames has to change. Probably the very notion of a string has <br> to change to something non-\0-terminated.<br> <p> So whatever you're describing, userspace cannot any longer use standard <br> POSIX calls: in fact, it can't any longer use ANSI C calls! 
I suspect that <br> such a system would be almost unusable with C, simply because you couldn't <br> use C string literals for anything.<br> <p> If you want VMS, you know where to find it.<br> <p> </div> Sun, 29 Mar 2009 22:32:28 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326123/ https://lwn.net/Articles/326123/ foom <div class="FormattedComment"> Eh...but OSX *does* run 40+ years of UNIX programs. It's pretty clear that the change to require <br> UTF-8 (and even the change to be case insensitive!) didn't bother most programs.<br> <p> </div> Sun, 29 Mar 2009 22:07:27 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326122/ https://lwn.net/Articles/326122/ clugstj <div class="FormattedComment"> No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be. This doesn't represent a security problem.<br> </div> Sun, 29 Mar 2009 21:58:46 +0000 Conventions are great! Let's go back to FAT! https://lwn.net/Articles/326121/ https://lwn.net/Articles/326121/ clugstj <div class="FormattedComment"> "UNIX is not broken. Your head, on the other hand, is"<br> <p> Wow, childish personal attacks. How droll.<br> <p> "Number of correct scripts is not important metric. Number of bad scripts is"<br> <p> I would think that the percentage of each would (possibly) be a useful metric. But, what is the damage from these "bad scripts"? If you are writing shell scripts that MUST be absolutely bullet-proof from bad input, perhaps because they run setuid-root, then you are already making a much worse mistake than the possible bugs in the script.<br> <p> Still don't understand the FAT reference. 
Sorry, maybe I'm just slow.<br> </div> Sun, 29 Mar 2009 21:44:21 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326120/ https://lwn.net/Articles/326120/ clugstj <div class="FormattedComment"> OS X is trivial to handle. It only has to continue to work in a compatible way with the previous Mac OS - which wasn't UNIX. So using it as an example of how to "fix" these problems is not a good idea if you care about supporting 40+ years of UNIX programs - which is why this is difficult to change.<br> </div> Sun, 29 Mar 2009 21:30:44 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326119/ https://lwn.net/Articles/326119/ clugstj <div class="FormattedComment"> I'm sorry, but when you said that any of these propositions is better than the current situation, I HAD to disagree. In what way is the current situation so bad that any proposal is better than the current situation?<br> </div> Sun, 29 Mar 2009 21:27:08 +0000 Re: Not A System Problem https://lwn.net/Articles/326109/ https://lwn.net/Articles/326109/ ldo <P>nix wrote:</P> <BLOCKQUOTE><FONT STYLE="color : #0000FF"><P>What you're describing is not POSIX anymore.</P></FONT></BLOCKQUOTE> <P>Nothing to do with POSIX. POSIX is a userland API, it doesn’t dictate how the kernel should work.</P> Sun, 29 Mar 2009 19:47:30 +0000 Simplicity is better than complexity. https://lwn.net/Articles/326092/ https://lwn.net/Articles/326092/ epa <div class="FormattedComment"> To check for control characters (using unsigned char, so that bytes above 127 are not misread as negative)<br> <p> for (const unsigned char *c = (const unsigned char *)filename; *c; c++)<br> if (*c &lt; 32) return EINVAL;<br> <p> Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):<br> <p> for (const unsigned char *c = (const unsigned char *)filename; *c; c++)<br> if (*c &lt; 32 || *c == '&lt;' || *c == '&gt;' || *c == '|') return EINVAL;<br> if (filename[0] == '-') return EINVAL;<br> <p> To check valid UTF-8 is a little more complex, but not much. 
You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See &lt;<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">http://www.cl.cam.ac.uk/~mgk25/unicode.html</a>&gt; for a list of valid byte sequences. The code would be something like<br> <p> /* First pad the filename with 5 extra NUL bytes at the end. Then, */<br> int is_cont(unsigned char c) { return 128 &lt;= c &amp;&amp; c &lt; 192; }<br> const unsigned char *p = (const unsigned char *)filename;<br> while (*p) {<br> if (*p &lt; 128) ++p;<br> else if (192 &lt;= *p &amp;&amp; *p &lt; 224 &amp;&amp; is_cont(p[1])) p += 2;<br> else if (224 &lt;= *p &amp;&amp; *p &lt; 240 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])) p += 3;<br> else if (240 &lt;= *p &amp;&amp; *p &lt; 248 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3])) p += 4;<br> else if (248 &lt;= *p &amp;&amp; *p &lt; 252 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3]) &amp;&amp; is_cont(p[4])) p += 5;<br> else if (252 &lt;= *p &amp;&amp; *p &lt; 254 &amp;&amp; is_cont(p[1]) &amp;&amp; is_cont(p[2])<br> &amp;&amp; is_cont(p[3]) &amp;&amp; is_cont(p[4]) &amp;&amp; is_cont(p[5])) p += 6;<br> else return EINVAL;<br> }<br> <p> For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.<br> </div> Sun, 29 Mar 2009 15:03:37 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326091/ https://lwn.net/Articles/326091/ epa <blockquote>a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display</blockquote>Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever. 
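That ambiguity is easy to make concrete (Python sketch; the byte values are purely illustrative): the same two bytes are one CJK character under Shift-JIS, two Latin-1 code points under Latin-1, and not decodable as UTF-8 at all.

```python
raw = b"\x93\xfa"  # some filename's bytes; the encoding is unknown

as_sjis = raw.decode("shift_jis")   # one character: 日
as_latin1 = raw.decode("latin-1")   # two characters of mojibake

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                 # not valid UTF-8 at all

print(as_sjis, len(as_latin1), utf8_ok)
```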
<p> Imagine removing the character encoding headers from the http protocol. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere. Sun, 29 Mar 2009 14:43:25 +0000 NT (Windows kernel) doesn't care about filenames any more than Linux https://lwn.net/Articles/326090/ https://lwn.net/Articles/326090/ epa <blockquote> NT (the kernel API in Windows NT, 2000, XP, etc.) doesn't care about filename encodings. The only thing that makes NT's attitude to such things different from that of Linux's is that NT's arbitrary sequences of non-zero code units used for filenames use 16-bit code units, and in Linux obviously they're 8-bit. <p> Everything else you see, such as case-insensitivity, bans on certain characters or sequences of characters, is implemented in other layers of the OS or even in language runtimes, not the kernel. Low-level programmers, just as on Unix, can call a file anything they like.</blockquote>Does that mean if you code against the NT API directly, you can create files foo and FOO in the same directory? I expect that opens up all sorts of juicy security holes - many of them theoretical, since a typical NT system has just one user and there is not much need for privilege escalation - but still it sounds fun. <blockquote>using UTF-8 and blindly trusting that everything you work with is actually legal and meaningful display-safe UTF-8 are quite different things.</blockquote>Indeed. 
Hence the benefit of enforcing this at the OS level: it gets rid of the need for sanity checks that slow down the good programmers and were never written anyway by the bad programmers. Sun, 29 Mar 2009 14:36:20 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326087/ https://lwn.net/Articles/326087/ epa Yes, validate every filename that comes from user space to check it is valid UTF-8 and does not have control characters. This is not in fact an expensive operation (especially not compared to the cost of opening or creating a file in the first place). <p> Every non-Unix OS already forbids control characters in filenames so there would not be much extra checking to do in filesystems like smbfs or ntfs. (Except out of paranoia to detect disk corruption, which is probably a good thing to do anyway.) As you point out, there remains the question of network filesystems like NFS, where the server could legitimately return filenames containing arbitrary byte sequences. And there would have to be some policy decision about what to do. But I would rather have one single place to deal with the mess rather than leave it to 101 different bits of code in user space. (Python 3.0 pretends that invalid-UTF-8 filenames do not exist when returning a directory listing; other programs will show them but may or may not escape control characters when displaying to the terminal; goodness knows what different Java implementations do.) <p> I would favour silently discarding filenames that contain control characters from the directory listing, and for those in some legacy encoding like Latin-1 or Shift-JIS, translating them to UTF-8. (The legacy encoding would be specified with a mount parameter. Again, this is a bit awkward but a hundred times less complicated than leaving every userspace program to do its own peculiar thing.) 
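The translation step such a mount parameter implies can be sketched in a few lines (Python; 'latin-1' stands in for whatever legacy codec the mount specifies):

```python
def to_utf8(on_disk_name: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Re-encode a legacy-encoded on-disk name as UTF-8."""
    return on_disk_name.decode(legacy_encoding).encode("utf-8")

# é is 0xE9 in Latin-1 but 0xC3 0xA9 in UTF-8; ASCII passes through unchanged.
assert to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
assert to_utf8(b"plain") == b"plain"
```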
<blockquote>Meanwhile application developers get no benefit for many years because of compatibility considerations.</blockquote>Not really true. The benefit in closing existing security holes is immediate. In writing new code, you can note that there may be corner-case bugs on systems that permit control characters in filenames, but for 90% of the user base they do not exist. That is 90% better than the current situation, where everyone just writes code assuming that filenames are sane, but no system enforces it. By analogy, consider that many classic UNIX utilities had fixed limits on line length. If I write a shell script that uses sort(1), I just write it for GNU sort and other modern implementations. I might note that people on older systems may encounter interesting effects using my script with large input data, but I don't have to wait for every last Xenix system to be switched off before I can get the benefit in new code. <p> <blockquote>Personally I think the issue to look at is spaces. Spaces are legal. They are undoubtedly going to remain legal. But they are inconvenient. How can we tweak our basic Unix processes (including the shell and many old tools) so that spaces are harmless ?</blockquote> This is true in principle but in thirty years of Unix, essentially no progress has been made on this. Nobody bothers to fix the shell or utilities such as make(1) to cope with arbitrary characters, despite much wishing that they would. Nobody bothers to write shell scripts that cope with all legal filenames, mostly because it is all but impossible. Instead, people who care about bug-free code end up rewriting shell scripts in other languages such as C (for example, some of the git utilities), people who think life is too short are happy to distribute software that misbehaves or has security holes, and many others just don't realize there is a problem. <p> OS X is something of a special case because of case insensitivity. 
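The composition wrinkle behind OS X's special status (HFS+ famously stores names in a variant of Unicode's decomposed form) can be seen directly with Python's unicodedata module; an illustrative sketch:

```python
import unicodedata

precomposed = "caf\u00e9"  # 'é' as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' + combining accent

# Identical to a human, but different code point sequences -- so a plain
# byte comparison of the two filenames would treat them as distinct.
assert precomposed != decomposed
assert len(precomposed) == 4 and len(decomposed) == 5
assert unicodedata.normalize("NFC", decomposed) == precomposed
```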
If you don't want case insensitivity then you do not need to worry about Unicode composition; just a simple byte sequence check that you have valid UTF-8. But OS X is a useful example in another way: a case-insensitive filesystem is a much bigger break with Unix tradition than what's proposed here, and yet the world did not come to an end, and it was trivial for most Unix software to adapt. Sun, 29 Mar 2009 14:31:15 +0000 Re: Not A System Problem https://lwn.net/Articles/326088/ https://lwn.net/Articles/326088/ nix <div class="FormattedComment"> What you're describing is not POSIX anymore. Every single POSIX app would <br> need rewriting, for essentially zero gain (ooh, you can't have nulls in <br> filenames: that's why UTF-8 is *defined* to avoid nulls in filenames).<br> <p> I'm sure users would love not being able to type in pathnames anymore, <br> too.<br> <p> Good luck getting anyone to do it.<br> <p> </div> Sun, 29 Mar 2009 13:54:54 +0000 Re: Not A System Problem https://lwn.net/Articles/326082/ https://lwn.net/Articles/326082/ ldo <P>nix wrote:</P> <BLOCKQUOTE><FONT STYLE="color : #0000C0"><P>Um, if you remove the prohibition on nulls, how do you end the filename? This isn't Pascal.</P></FONT></BLOCKQUOTE> <P>Nothing to do with Pascal. C is perfectly capable of dealing with arbitrary data bytes, otherwise large parts of both kernel and userland code wouldn’t work.</P> <BLOCKQUOTE><FONT STYLE="color : #0000C0"><P>And if you remove the prohibition on slashes, how do you distinguish between a file called foo/bar and a file called bar in a subdirectory foo?</P></FONT></BLOCKQUOTE> <P>Simple. The kernel-level filesystem calls will not take a full pathname. Instead, they will take a parent directory ID and the name of an item within that directory. Other OSes, like VMS and old MacOS, were doing this sort of thing decades ago.</P> <P>Full pathname parsing becomes a function of the userland runtime. 
The kernel no longer cares what the pathname separator, or even what the pathname syntax, might be.</P> Sun, 29 Mar 2009 10:30:45 +0000 At last, a hope of progress https://lwn.net/Articles/326059/ https://lwn.net/Articles/326059/ mikachu <div class="FormattedComment"> On days when I'm feeling paranoid I always say ./* instead of just *, especially when talking to /bin/rm. On the other hand, touch -- -i in directories where you have important files is a nice trick too.<br> </div> Sun, 29 Mar 2009 00:01:58 +0000 Meta-discussion https://lwn.net/Articles/326057/ https://lwn.net/Articles/326057/ man_ls Hmmm, I'm not so sure. I feel strongly about ext4 losing data, but I don't have a strong opinion about this issue. Really. Not for lack of sensitivity to the problem -- I've had an administrator at work erase a whole directory of files because of a leading space (so that 'rm -rf /dir/file' became 'rm -rf /dir/ file'). But there are advantages and disadvantages, and I cannot pick a side. <p> Bojan has only posted once, and his message contains the words "not sure". I would say that this debate attracts a different subset of (opinionated) people. Sat, 28 Mar 2009 22:21:03 +0000 Leading spaces are common, actually https://lwn.net/Articles/326052/ https://lwn.net/Articles/326052/ nix <div class="FormattedComment"> It's called 'sort by version' because the function it calls (strverscmp()) <br> was designed to sort version numbers, and because the expected use of <br> ls -v was sorting a directory full of version-named directories in version <br> order.<br> <p> (And you're right on the collation sort thing: I spoke carelessly.)<br> <p> </div> Sat, 28 Mar 2009 20:36:45 +0000 Wheeler: Fixing Unix/Linux/POSIX Filenames https://lwn.net/Articles/326049/ https://lwn.net/Articles/326049/ dwheeler Thanks for your comments! In particular, you're absolutely right about swapping the order of \t and \n in IFS - that makes it MUCH simpler. 
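For readers who haven't seen the tricks above: once the shell expands a glob, a filename beginning with a dash is indistinguishable from an option, which is exactly what `touch -- -i` exploits defensively (a file named `-i` turns a stray `rm *` into an interactive `rm -i`). A small demonstration in a throwaway directory:

```shell
cd "$(mktemp -d)"
touch -- -rf important.txt

# Dangerous: the glob expands to "-rf important.txt", and rm parses
# "-rf" as options rather than as a filename to delete.
# rm *

# Safe: every expansion starts with "./", which can never be mistaken
# for an option, so both files (including the one named "-rf") go away.
rm ./*
```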
I prefer IFS=`printf '\n\t'` because then it's immediately obvious that \n and \t are the new values. I've put that into the document, with credit. Sat, 28 Mar 2009 19:50:37 +0000
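The setting can be demonstrated directly. Note why the `\n` must come first: command substitution strips trailing newlines, so `IFS=\`printf '\t\n'\`` would leave IFS holding only a tab, while a trailing tab survives intact:

```shell
IFS=`printf '\n\t'`   # IFS is now newline + tab; space no longer separates

list="one file.txt
another file.txt"

# With the default IFS this loop would split each name at the space;
# with newline+tab, whole lines survive word splitting intact.
for f in $list; do
  printf '<%s>\n' "$f"
done
# prints:
# <one file.txt>
# <another file.txt>
```

The usual caveat applies: this tames spaces, but filenames containing embedded newlines or tabs still split, which is part of why the article argues for banning them outright.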