Simplicity is better than complexity.
Simplicity is better than complexity.
Posted Mar 29, 2009 15:03 UTC (Sun) by epa (subscriber, #39769)In reply to: Simplicity is better than complexity. by tialaramex
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames
for (const char *c = filename; *c; c++)
if (*c < 32) return EINVAL;
Adding a fixed list of 'bad characters' (please excuse lack of indentation, the LWN comment form eats it):
for (const char *c = filename; *c; c++)
if (*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
if (filename[0] == '-') return EINVAL;
To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. The code would be something like
/* First pad the filename with 4 extra NUL bytes at the end. Then, */
int is_cont(char c) { return 128 <= c && c < 192 }
const char *p = filename;
while (*p) {
if (*p < 128) ++c;
else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2]) p += 3;
else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3])) p += 4;
else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4])) p += 5;
else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
&& is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
else return EINVAL;
}
For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.
