Wheeler: Fixing Unix/Linux/POSIX Filenames

Posted Mar 28, 2009 11:45 UTC (Sat) by epa (subscriber, #39769)
In reply to: Wheeler: Fixing Unix/Linux/POSIX Filenames by drag
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

Of course no existing software treats filenames purely as a string of bytes - that is just rhetoric. At the very least, filenames are treated as ASCII-encoded text and displayed to the user as such. Naturally, this breaks down when a filename contains control characters.

If Unix really did treat filenames as merely 'a string of bytes', with no implied character set or encoding, and displayed them to the user as a hex dump or something, then it would be truly encoding-agnostic and would have no difficulties with arbitrary byte values in filenames. Of course, it would also have been a total failure that nobody would use. For a filesystem to be useful, it needs to have some amount of meaning (or 'policy' if you will) attached to the filenames it stores. The question is how much: is the current situation of 'ASCII for bytes below 128, and above that you're on your own' the best one?


Posted Mar 28, 2009 16:53 UTC (Sat) by tialaramex (subscriber, #21167)

The two major pieces of in-house software I develop both treat filenames purely as a string of bytes. The names chosen happen to be meaningful to the programmers, but they are of no importance to the program or its users.

I'd be surprised if the /majority/ of programs other than shell scripts aren't like this. Even in the majority of GUI software, what's needed isn't a revision of the kernel API (in fact that will barely help) but only a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display. Such a function is nearly inevitable anyway, to deal with dozens of other issues unrelated to Wheeler's thesis. And such functions exist today (I can't say whether they're bug-free, of course).
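As a rough illustration of the kind of display function meant here, a minimal Python sketch follows. The name `display_name` and the default of UTF-8 are assumptions made for the example, not something any particular toolkit is known to use, and a Python `bytes` object stands in for the zero-terminated array.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Return a printable approximation of a filename given as raw bytes.

    Hypothetical helper: the encoding is a guess supplied by the caller.
    """
    # Bytes that are invalid in the assumed encoding become U+FFFD
    # instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    # Non-printable characters (newline, escape, ...) are shown as
    # backslash escapes so they cannot affect the terminal or layout.
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


print(display_name(b"caf\xc3\xa9"))  # valid UTF-8 is shown as-is
print(display_name(b"a\nb"))         # the newline is escaped
print(display_name(b"\xff"))         # an invalid byte becomes U+FFFD
```

The result is always displayable, but it is lossy: it is meant for showing to a user, not for passing back to the filesystem.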

Posted Mar 29, 2009 14:43 UTC (Sun) by epa (subscriber, #39769)

> a function which takes a zero-terminated byte array representing a filename and returns a string suitable for display
Currently it is impossible to reliably write such a function, because you don't know whether the byte array is encoded in Latin-1, Shift-JIS, UTF-8 or whatever.
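To make the ambiguity concrete, here is a small Python sketch. The byte string is an invented example; the three codec names are standard Python codecs. The same five bytes decode without error under all three encodings, and each gives a different name.

```python
# One filename as stored on disk: just bytes, with no encoding attached.
raw = b"caf\xc3\xa9"

# Decode the same bytes under three plausible encodings.
readings = {enc: raw.decode(enc) for enc in ("utf-8", "latin-1", "shift_jis")}

for enc, text in readings.items():
    print(f"{enc:10} {text!r}")
```

None of the decodings fails, so the bytes alone give a program nothing to choose between them.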

Imagine removing the character encoding headers from HTTP. There would then be no reliable way to take the content of a page and display it to the user - just a panoply of hacks and rules of thumb that differed from one browser to another. This is the situation we have now with filenames, which are *names* and intended for human consumption just as much as the content of a typical web page. The two choices are (a) add headers to the protocol saying what encoding is in use (or in the case of filenames, an extra parameter in all FS calls), or (b) mandate a single encoding everywhere.
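For comparison, this is roughly what option (a) looks like on the HTTP side, as a Python sketch. The function name `decode_body` is hypothetical; it uses the standard library's `email.message.Message` only to parse the `charset` parameter out of a Content-Type value, and it refuses to guess when none is declared.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str) -> str:
    """Decode body using the charset declared in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None if no charset parameter
    if charset is None:
        # This is the position filenames are in: bytes with no label.
        raise ValueError("no charset declared; cannot decode reliably")
    return body.decode(charset)


print(decode_body(b"caf\xe9", "text/plain; charset=ISO-8859-1"))
print(decode_body(b"caf\xc3\xa9", "text/plain; charset=utf-8"))
```

With the label present, two different byte sequences both come out as the name the author intended. A filename carries no such label.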

Posted Mar 29, 2009 21:58 UTC (Sun) by clugstj (subscriber, #4020)

No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be. This doesn't represent a security problem.

Posted Mar 29, 2009 22:37 UTC (Sun) by epa (subscriber, #39769)

> No, it is very possible to write such a function. The character encoding issue only prevents you from assuring that the string matches what the file's creator thought it should be.
Well, yeah. If you allow the function to return the wrong answer, then it is easy to write. But it is not possible to return the correct filename in all cases, that is, the one originally chosen by the user. If you pick a known encoding everywhere (UTF-8 being the obvious choice) then the problem goes away.
> This doesn't represent a security problem.
Correct (at least none that I can think of). The security issue is with special characters and control characters in filenames, and is separate from the issue of how to encode characters that don't fit in ASCII.
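The control-character side can be checked without knowing the encoding at all, which is one way to see that the two issues are separate. A minimal Python sketch, where `control_byte_offsets` is a hypothetical helper name and "control" is taken to mean the ASCII range 0x00-0x1F plus 0x7F:

```python
def control_byte_offsets(name: bytes) -> list[int]:
    """Return the offsets of ASCII control bytes (0x00-0x1F, 0x7F) in name."""
    return [i for i, b in enumerate(name) if b < 0x20 or b == 0x7F]


print(control_byte_offsets(b"report.txt"))
print(control_byte_offsets(b"a\nb\x1b[2J"))       # newline and escape
print(control_byte_offsets("café".encode("utf-8")))  # non-ASCII, no controls
```

Bytes of 128 and above are never flagged here, so a policy against control characters can be stated independently of any decision about UTF-8.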
