|
|
Log in / Subscribe / Register

Simplicity is better than complexity.

Simplicity is better than complexity.

Posted Mar 26, 2009 9:56 UTC (Thu) by epa (subscriber, #39769)
In reply to: Simplicity is better than complexity. by k8to
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

Have you ever written a reasonably complex but still non-buggy shell script - one which accepts arbitrary filenames (as is the Unix way) and doesn't do random buggy things?

If you've ever tried such an exercise, you would not believe that allowing control characters in the middle of filenames and leaving userspace to deal with the resulting mess could ever be called 'simplicity'.

Wheeler's suggestion would greatly simplify a lot of code, or if you prefer, fix many hidden bugs and security holes in code that is currently buggy.

Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace. It is hardly adding significant complexity.


to post comments

Simplicity is better than complexity.

Posted Mar 28, 2009 17:03 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (1 responses)

“Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace.”

Show me the money. Five lines, plus one per syscall. Not a lot of work to support such a broad and sweeping claim. Write those lines carefully, we wouldn't want you to be hand-waving and have missed 99.9% of the complexity of the problem...

Simplicity is better than complexity.

Posted Mar 29, 2009 15:03 UTC (Sun) by epa (subscriber, #39769) [Link]

To check for control characters

for (const char *c = filename; *c; c++)
if (*c < 32) return EINVAL;

Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):

for (const char *c = filename; *c; c++)
if (*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
if (filename[0] == '-') return EINVAL;

To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. The code would be something like

/* First pad the filename with 4 extra NUL bytes at the end. Then, */
int is_cont(char c) { return 128 <= c && c < 192 }
const char *p = filename;
while (*p) {
if (*p < 128) ++c;
else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2]) p += 3;
else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3])) p += 4;
else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4])) p += 5;
else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
else return EINVAL;
}

For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds