
Simplicity is better than complexity.

Posted Mar 26, 2009 2:17 UTC (Thu) by k8to (guest, #15413)
Parent article: Wheeler: Fixing Unix/Linux/POSIX Filenames

Simplicity is better than complexity.

This proposal is a whole lot of complexity in kernel code and the API.
UNWANT.



Simplicity is better than complexity.

Posted Mar 26, 2009 2:22 UTC (Thu) by k8to (guest, #15413) [Link] (11 responses)

As for find, GNU find already has -print0 and xargs is compatible (xargs -0).
The scripting implementation is thus trivially solved.

Sure, some find implementations don't have it. Fix them.
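For a program on the receiving end of find -print0, a minimal C sketch of the same idea might look like the following. It assumes POSIX.1-2008 getdelim() is available; the function name for_each_nul_name and the callback shape are hypothetical, chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Reads NUL-terminated records (as written by find -print0) from `in`
 * and calls visit(name, len) for each one, if visit is not NULL.
 * Returns the number of records read, or -1 on a read error.
 * Hypothetical helper, not part of any library. */
long for_each_nul_name(FILE *in, void (*visit)(const char *name, size_t len))
{
    char *buf = NULL;
    size_t cap = 0;
    ssize_t n;
    long count = 0;

    /* getdelim() returns the record length including the delimiter. */
    while ((n = getdelim(&buf, &cap, '\0', in)) != -1) {
        size_t len = (size_t)n;
        if (len > 0 && buf[len - 1] == '\0')
            len--;                 /* drop the trailing NUL delimiter */
        if (visit)
            visit(buf, len);       /* name may contain newlines, tabs, spaces */
        count++;
    }

    int failed = ferror(in);
    free(buf);
    return failed ? -1 : count;
}
```

Because NUL is the one byte (besides '/') that cannot appear in a filename component, splitting on it is unambiguous even when names contain newlines.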

Simplicity is better than complexity.

Posted Mar 26, 2009 2:29 UTC (Thu) by k8to (guest, #15413) [Link] (9 responses)

As evidence for my position, here are some real-world filenames that my software needed to create to correctly archive some digital music history of the personal computer as an instrument.

jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦*
d_    .it  d_   .it  d_  .it  d_ .it  d_1151.it  d_1152.it  d_1153.it  d_1154.it  ¦¦¦¦¯¯Ì_.it
jrodman@calufrax:/opt/kmods/mods/artists/Karl> ls d_* ¦* |xxd
0000000: 645f 2020 2020 2e69 740a 645f 2020 202e  d_    .it.d_   .
0000010: 6974 0a64 5f20 202e 6974 0a64 5f20 2e69  it.d_  .it.d_ .i
0000020: 740a 645f 3131 3531 2e69 740a 645f 3131  t.d_1151.it.d_11
0000030: 3532 2e69 740a 645f 3131 3533 2e69 740a  52.it.d_1153.it.
0000040: 645f 3131 3534 2e69 740a a6a6 a6a6 afaf  d_1154.it.......
0000050: cc5f 2e69 740a                           ._.it.

These files are handled by a combination of Python and shell scripts, and one piece of C code (wrapping a library which knew how to read certain binary formats). All of these pieces can handle newlines, tabs, spaces, control characters, leading dashes, and so on. I'm not really that smart. It wasn't much work. If shell scripts are 5-second hack jobs, then they will always fail in some cases: strange filenames, permissions problems, etc. If you take a few minutes to apply correct safeguards, then things work fine.
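As one illustration of such a safeguard (a common technique, not necessarily the one used in the software described here), a name that starts with a dash can be prefixed with "./" before it is passed to another command, so it cannot be parsed as an option. The helper name dash_safe is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Writes into `out` a form of `name` that a command cannot mistake for
 * an option: names starting with '-' get a "./" prefix, others are
 * copied unchanged. Returns 0 on success, -1 if `out` is too small.
 * Hypothetical helper for illustration only. */
int dash_safe(const char *name, char *out, size_t cap)
{
    const char *prefix = (name[0] == '-') ? "./" : "";
    int n = snprintf(out, cap, "%s%s", prefix, name);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

A name that begins with '-' is necessarily relative, so adding "./" refers to the same file.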

Simplicity is better than complexity.

Posted Mar 26, 2009 2:31 UTC (Thu) by k8to (guest, #15413) [Link]

Hah, the html markup I wasn't familiar with (needed to prevent the forum from mangling the listing) led me to mangle my post. Case in point, bad input, bad output. Shrug, learn, move on.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:39 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)

> needed to create

I find it very hard to believe that your software *needed* to create unintelligible filenames. And if it
did, I'd claim it needs to be fixed.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:48 UTC (Thu) by k8to (guest, #15413) [Link] (2 responses)

If someone else cares about the design constraints that led to this necessity, let me know. Foom: I've watched you refuse to converse in a reasonable way across many threads. You have no clue about my software or the project, but you claim to know what is correct and incorrect. Shut up.

Simplicity is better than complexity.

Posted Mar 26, 2009 3:40 UTC (Thu) by foom (subscriber, #14868) [Link]

I claim that I'd find software that creates filenames like that on my disk to be irritating. So I'd certainly prefer if no software actually did so, and probably wouldn't mind if it was impossible to do so.

If, in some alternative universe, it was already impossible to create those filenames, I have little doubt you could still have created working software which didn't require the impossible.

Sorry I come off as unreasonable to you. *hugs*

Do you know the difference between two words: "need" and "want"?

Posted Mar 26, 2009 8:41 UTC (Thu) by khim (subscriber, #9252) [Link]

> you have no clue about my software or the project but you claim to know what is correct and incorrect.

I don't have a clue. And I don't need to know anything about your project to know you are lying. Any project can be implemented with exactly two filenames: "0" and "1". You'll need infinite depth of directory structure to do so, true, but thankfully there are no practical limitations in Linux. Is it feasible? Probably not. Is it possible? Of course. And if we start from the position that your software does not need these filenames but your current design needs them, suddenly you have a much weaker argument: you are reducing the complexity of your software by increasing the complexity of everyone else's software. Is it a good trade-off? Maybe yes, maybe no. But it's a weak argument at best, no matter what your project is and what it needs to do.

Simplicity is better than complexity.

Posted Mar 26, 2009 10:01 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Surely base64-encoding the filenames would not be too much hardship? Personally, I would be inclined to do that anyway, because trying to use shell commands to manipulate a file called ¦¦¦¦¯¯Ì_.it will be painful.

I know this is a matter of taste, and merely trying to impose one person's tastes on everyone is not a reason to change the kernel. But on the other hand, the marginal extra disk space saving (ten bytes?) from being able to put arbitrary binary stuff in filenames without encoding does not outweigh the many good reasons that Wheeler gave for changing.
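To make the encoding suggestion concrete, here is a minimal sketch of encoding arbitrary filename bytes into a safe alphabet. It assumes the URL-safe base64 alphabet of RFC 4648 without padding, because standard base64 uses '/', which cannot appear in a filename; that choice and the function name b64url_encode are assumptions, not something proposed in the thread.

```c
#include <stddef.h>

/* Encodes `len` bytes from `src` as unpadded base64url (RFC 4648,
 * section 5) into `dst`, NUL-terminated. Returns 0 on success, -1 if
 * `dstcap` is too small. Hypothetical helper for illustration only. */
int b64url_encode(const unsigned char *src, size_t len,
                  char *dst, size_t dstcap)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    size_t need = (len * 4 + 2) / 3;   /* output chars without padding */
    size_t o = 0;

    if (dstcap < need + 1)
        return -1;

    for (size_t i = 0; i < len; i += 3) {
        /* Pack up to three input bytes into a 24-bit group. */
        unsigned long v = (unsigned long)src[i] << 16;
        if (i + 1 < len) v |= (unsigned long)src[i + 1] << 8;
        if (i + 2 < len) v |= src[i + 2];

        dst[o++] = tbl[(v >> 18) & 63];
        dst[o++] = tbl[(v >> 12) & 63];
        if (i + 1 < len) dst[o++] = tbl[(v >> 6) & 63];
        if (i + 2 < len) dst[o++] = tbl[v & 63];
    }
    dst[o] = '\0';
    return 0;
}
```

One caveat of this alphabet: the encoded name can itself begin with '-', so a caller would probably want to add a fixed prefix to every stored name. The encoded form is also about a third longer than the raw bytes.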

Simplicity is better than complexity.

Posted Mar 26, 2009 13:54 UTC (Thu) by clugstj (subscriber, #4020) [Link]

"manipulate a file called ¦¦¦¦¯¯Ì_"

Put quotes around it?

Simplicity is better than complexity.

Posted Mar 26, 2009 10:27 UTC (Thu) by mjj29 (guest, #49653) [Link]

Would there be any problem with uuencoding those filenames?

Simplicity is better than complexity.

Posted Mar 26, 2009 21:22 UTC (Thu) by explodingferret (guest, #57530) [Link]

If we can do ANYTHING to prevent you writing programs which try to store information in filenames this way, I'm all for it.

Are you able to make the source of your shell scripts available? I'm sure I can find something in them that is breakable. :-)

DANGER! DANGER! DANGER! HYPOCRISY LEVEL IS OVER 9000!!!

Posted Mar 26, 2009 8:30 UTC (Thu) by khim (subscriber, #9252) [Link]

This argument:

> Simplicity is better than complexity.

plus this one:

> As for the find, gnu find already has -print0 and xargs is compatible.

equals hypocrisy.

And you cannot even claim that "we already solved this problem, so it's old code vs. new code". A lot of programs just don't work with the current approach (especially scripts). You need to write and fix literally millions of lines of code vs. a few thousand in the kernel.

Sorry, but you are advocating a more complex solution while preaching simplicity.

Simplicity is better than complexity.

Posted Mar 26, 2009 2:49 UTC (Thu) by njs (subscriber, #40338) [Link] (1 responses)

Have you ever written a program where it was important to handle filename character sets Right? Were you able to do so?

(My opinion on this issue is strongly influenced by writing the filesystem interaction code for a VCS. Users in different locales may want to work on the same project, but they write the same filenames differently, and some charsets may not be able to even represent filenames created in other locales, and...)

There are arguments for the current system, but "simplicity" is really not one of them.

Simplicity is better than complexity.

Posted Mar 27, 2009 0:46 UTC (Fri) by nix (subscriber, #2304) [Link]

But handling multiple charsets is so simple!

(As in, I looked at XEmacs/MULE's code and my brain dribbled out of my
ears, following which I was simple.)

Simplicity is better than complexity.

Posted Mar 26, 2009 5:34 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (3 responses)

Simplicity is precisely the goal of this proposal. And how complex, really, is it to check a filename and return an error if it contains a disallowed character? Really, it isn't.

Simplicity is better than complexity.

Posted Mar 26, 2009 13:49 UTC (Thu) by clugstj (subscriber, #4020) [Link] (2 responses)

Yes, it is simple to reject a filename. Now every program in the world has to be changed to handle this new error case. Not sure how this is simplicity.

Simplicity is better than complexity.

Posted Mar 26, 2009 13:54 UTC (Thu) by epa (subscriber, #39769) [Link]

You are right, it is an error case to handle; but then creating or renaming a file is already allowed to fail for all sorts of reasons, so all programs already have to check it succeeded and handle errors gracefully. Besides, EINVAL is already returned for bad characters if the filesystem happens to be one that disallows them (like FAT or indeed most other non-Unix-native filesystems), so apps already have to handle that error case too.
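As a rough sketch of what that existing error handling looks like in C, the following creates a file and maps the failure cases to messages. The function names try_create and explain_create_error and the message strings are hypothetical; whether a given filesystem reports EINVAL for a bad name depends on the filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Tries to create `path` exclusively. Returns 0 on success or the
 * errno value from open() on failure. Hypothetical helper. */
int try_create(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/* Maps a result from try_create() to a human-readable message.
 * EINVAL is just one more case next to those a careful program
 * already handles. */
const char *explain_create_error(int err)
{
    switch (err) {
    case 0:            return "created";
    case EINVAL:       return "name rejected by the filesystem";
    case EEXIST:       return "file already exists";
    case ENAMETOOLONG: return "name too long";
    default:           return strerror(err);
    }
}
```

A program that already reports the default case to the user would handle a rejected name without any change, only with a less specific message.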

Simplicity is better than complexity.

Posted Mar 26, 2009 16:02 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

But a major point of Wheeler's argument is that existing programs, filesystems, and indeed operating systems already assume that these restrictions are the case, as a matter of convention, but do not necessarily do anything to ensure that they are enforced. Existing software already rejects or fails to properly handle filenames which would violate these conventions, and the vast majority of existing files are named according to these conventions; at the very least, filenames with leading dashes, tab or newline characters, or shell control characters are very rare, and probably accidental or malicious.

So the entire point is that the changes required to existing software would be minimal, and existing software which could break on filenames that don't obey these restrictions when they're not enforced by the OS, would no longer have a problem.

Is the kernel the whole world?

Posted Mar 26, 2009 8:24 UTC (Thu) by khim (subscriber, #9252) [Link]

> Simplicity is better than complexity.

Oh so very true.

> This proposal is a whole lot of complexity in kernel code and the API.

It also removes a whole bunch of code from other places. Even more important: it removes the need to write and fix a bunch of code.

Simplicity is better than complexity.

Posted Mar 26, 2009 9:56 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

Have you ever written a reasonably complex but still non-buggy shell script - one which accepts arbitrary filenames (as is the Unix way) and doesn't do random buggy things?

If you've ever tried such an exercise, you would not believe that allowing control characters in the middle of filenames and leaving userspace to deal with the resulting mess could ever be called 'simplicity'.

Wheeler's suggestion would greatly simplify a lot of code, or if you prefer, fix many hidden bugs and security holes in code that is currently buggy.

Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace. It is hardly adding significant complexity.

Simplicity is better than complexity.

Posted Mar 28, 2009 17:03 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (1 responses)

“Simply checking filenames for bad characters takes about five lines of code in the kernel plus one line for each syscall that accepts a filename from userspace.”

Show me the money. Five lines, plus one per syscall. Not a lot of work to support such a broad and sweeping claim. Write those lines carefully, we wouldn't want you to be hand-waving and have missed 99.9% of the complexity of the problem...

Simplicity is better than complexity.

Posted Mar 29, 2009 15:03 UTC (Sun) by epa (subscriber, #39769) [Link]

To check for control characters:

    /* unsigned char: where plain char is signed, bytes >= 128 would
       otherwise compare as negative and be rejected too */
    for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
        if (*c < 32) return EINVAL;

Adding a fixed list of 'bad characters':

    for (const unsigned char *c = (const unsigned char *)filename; *c; c++)
        if (*c < 32 || *c == '<' || *c == '>' || *c == '|') return EINVAL;
    if (filename[0] == '-') return EINVAL;

To check valid UTF-8 is a little more complex, but not much. You do not need to check that assigned Unicode characters are being used, or worry about combining characters, upper and lower case, etc. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html> for a list of valid byte sequences. The code would be something like

    /* First pad the filename with 4 extra NUL bytes at the end. Then, */
    int is_cont(unsigned char c) { return 128 <= c && c < 192; }

    /* unsigned char, so that bytes >= 128 compare as expected */
    const unsigned char *p = (const unsigned char *)filename;
    while (*p) {
        if (*p < 128) ++p;
        else if (192 <= *p && *p < 224 && is_cont(p[1])) p += 2;
        else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2])) p += 3;
        else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
                 && is_cont(p[3])) p += 4;
        /* 5- and 6-byte forms from the original UTF-8 definition;
           RFC 3629 no longer permits them */
        else if (248 <= *p && *p < 252 && is_cont(p[1]) && is_cont(p[2])
                 && is_cont(p[3]) && is_cont(p[4])) p += 5;
        else if (252 <= *p && *p < 254 && is_cont(p[1]) && is_cont(p[2])
                 && is_cont(p[3]) && is_cont(p[4]) && is_cont(p[5])) p += 6;
        else return EINVAL;
    }

For a self-contained system, that takes care of it. Put some code like the above into a function and call it at each place a filename is taken from user space. Coping with 'foreign' filesystems (e.g. NFS servers) returning non-UTF-8 filenames is a bit more complex.
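A sketch of what such a single function might look like, written as ordinary user-space C rather than kernel code. The name check_filename is hypothetical. As a labelled assumption, it accepts UTF-8 sequences of at most four bytes (the RFC 3629 limit), which is stricter than the byte ranges listed above, and like the code above it does not reject overlong encodings or surrogates.

```c
#include <errno.h>
#include <stddef.h>

/* True for a UTF-8 continuation byte (10xxxxxx). */
int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Returns 0 if `name` has no leading dash, no control characters
 * (bytes below 32), and consists of well-formed 1- to 4-byte UTF-8
 * sequences; returns EINVAL otherwise. Hypothetical helper: does not
 * check for overlong encodings or surrogate code points. */
int check_filename(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    if (p[0] == '-')
        return EINVAL;

    while (*p) {
        if (*p < 32)
            return EINVAL;                       /* control character */
        else if (*p < 128)
            p += 1;                              /* plain ASCII */
        else if (192 <= *p && *p < 224 && is_cont(p[1]))
            p += 2;
        else if (224 <= *p && *p < 240 && is_cont(p[1]) && is_cont(p[2]))
            p += 3;
        else if (240 <= *p && *p < 248 && is_cont(p[1]) && is_cont(p[2])
                 && is_cont(p[3]))
            p += 4;
        else
            return EINVAL;                       /* not valid UTF-8 */
    }
    return 0;
}
```

Because && stops at the first failing test and a NUL byte is never a continuation byte, the checks never read past the string's terminator. Note that a name made of raw bytes such as 0xa6 and 0xaf, like the one listed earlier in this thread, would fail the UTF-8 part of this check.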

