
Filesystems and case-insensitivity

Posted Nov 28, 2018 21:06 UTC (Wed) by smurf (subscriber, #17840)
In reply to: Filesystems and case-insensitivity by perennialmind
Parent article: Filesystems and case-insensitivity

It's easy to encode invalid byte sequences so that they survive a round trip through Unicode / UTF-8 – you mis-appropriate the surrogates. The actual higher-level semantics of that, though, are fraught with corner cases you *really* don't want to deal with.
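One existing instance of this trick is Python's `surrogateescape` error handler (PEP 383), which maps each undecodable byte to a lone low surrogate. The sketch below is only an illustration of the round trip and of one of the corner cases; the sample file name is made up.

```python
# 0xff can never appear in well-formed UTF-8, so this name is invalid.
raw = b"caf\xff.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DC00 + 0xNN instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# The corner case: the intermediate string is not valid Unicode text,
# so a strict encoder refuses to serialize it.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The bytes survive, but any code that hands `name` to a strict encoder (a database, a JSON serializer, a terminal) has to decide what to do with the stray surrogate.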

Basically IMHO there are two sane choices – (a) the current situation: the kernel does not attach any semantics to any bytes other than '/' and '\0' (thus there is no chance for case insensitivity beyond ASCII), or (b) you use clean and preferably pre-normalized UTF-8 on the userspace/kernel boundary, outlaw anything nonconforming, and do everything else in userspace. Anything else is a recipe for long-term disaster.
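As a rough userspace sketch of what option (b) would have to check at the boundary: the function name `is_acceptable_name` is hypothetical, and the choice of NFC is an assumption, since the comment only says "pre-normalized" without naming a form.

```python
import unicodedata

def is_acceptable_name(raw: bytes) -> bool:
    """Hypothetical boundary check: strict UTF-8, already NFC, no '/' or NUL."""
    if not raw or b"/" in raw or b"\0" in raw:
        return False
    try:
        # Strict decoding rejects invalid sequences and encoded surrogates.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Assumption: NFC is the required normalization form.
    return unicodedata.normalize("NFC", text) == text
```

Under these assumptions a precomposed "café" passes, while the same name spelled with a combining acute accent is rejected rather than silently rewritten.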


Filesystems and case-insensitivity

Posted Nov 28, 2018 22:28 UTC (Wed) by perennialmind (guest, #45817) [Link] (13 responses)

Newline, tab, and bel codepoints are perfectly valid UTF-8 plain text, but I'd prefer to push that out to userspace as well. I don't much care whether curl -O gives me filenames with spaces or %20s, but I do object if I see files with newlines in the names. I don't mind if I'm left with sneaky left-to-right, right-to-left marks or explicitly red hearts. I see the need for parentheses and question marks...

... but not control characters. To me, a natural language filename would comprise user-perceived characters and the one true space character (U+0020). Flexibility beyond that does more harm than good. Leave those footguns to the bytestring paths. 😉
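A minimal sketch of such a policy, using Unicode general categories: control characters (category Cc) are rejected, and U+0020 is the only separator allowed. The name `is_plain_name` is hypothetical, and the sketch does not attempt grapheme ("user-perceived character") segmentation. Format characters such as the directional marks fall in category Cf and are deliberately let through, as are emoji.

```python
import unicodedata

def is_plain_name(name: str) -> bool:
    """Hypothetical policy check: no controls, no separator except U+0020."""
    for ch in name:
        cat = unicodedata.category(ch)
        if cat == "Cc":
            # C0/C1 controls: newline, tab, bel, backspace, ...
            return False
        if cat in ("Zs", "Zl", "Zp") and ch != " ":
            # Any space or line/paragraph separator other than U+0020.
            return False
    return True
```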

Filesystems and case-insensitivity

Posted Nov 29, 2018 13:41 UTC (Thu) by utoddl (guest, #1232) [Link] (3 responses)

I was with you until you got to spaces. It's only wishing of course, but I wish spaces in file names would go away. Personal peeve.

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:04 UTC (Fri) by jezuch (subscriber, #52988) [Link] (2 responses)

Spaces in file names are wonderful. You can name your files like a human being, not a slave of poorly written shell scripts with broken quoting :) (I have a pet peeve too)

Filesystems and case-insensitivity

Posted Dec 3, 2018 12:30 UTC (Mon) by ale2018 (subscriber, #128727) [Link] (1 responses)

Ah, poorly written shell scripts, eh? Because you obviously think that being a slave to over-complicated command lines is fine? A good percentage of my command lines start with find . -name whatever | xargs... Yes, I know I can write -print0 and -0, I do that when I write shell scripts.

When I find a filename with spaces I just move it away.
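Why -print0 and -0 matter can be sketched without running find at all. The snippet below only models the delimiter problem: it builds a newline-separated and a NUL-separated listing of the same three made-up names and parses each back. Real xargs without -0 additionally splits on blanks and treats quotes and backslashes specially, which this sketch does not model.

```python
# Three sample names: plain, embedded newline, embedded space.
names = [b"aaabd", b"aaabc\nd", b"with space"]

# What a newline-terminated listing (find -print) would look like.
newline_listing = b"".join(n + b"\n" for n in names)
# What a NUL-terminated listing (find -print0) would look like.
nul_listing = b"".join(n + b"\0" for n in names)

# Drop the empty element after the final terminator.
parsed_by_newline = newline_listing.split(b"\n")[:-1]
parsed_by_nul = nul_listing.split(b"\0")[:-1]
```

The newline parse yields four entries because the second name is cut in two; the NUL parse returns the original three, since '\0' is one of the two bytes a file name can never contain.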

For the record, the normalization step and control characters were never taken care of. For example:

    ~$ touch aaabd $(printf 'aaabc\bd')  "$(printf 'aaabc\nd')"
    ~$ ls -lt | head -5
    total 3686968
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabc
    d
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd

Control characters were never forbidden. Consider that human beings are sometimes uncertain about the name they're typing and type a backspace (\b) in it. So, why isn't that beautiful too? Perhaps, users should have a clue. In the words of the Ancient Philosophy, rubbish in, rubbish out.

Filesystems and case-insensitivity

Posted Dec 3, 2018 20:24 UTC (Mon) by flussence (guest, #85566) [Link]

ls took care of that a few years ago…

    ~/test $ ls
    'aaabc'$'\b''d'  'aaabc'$'\n''d'   aaabd
    ~/test $ ls --version
    ls (GNU coreutils) 8.30
    Packaged by Gentoo (8.30 (p01))

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:09 UTC (Fri) by jezuch (subscriber, #52988) [Link] (3 responses)

I guess the concept of control characters should have been retired a long time ago. I also think that it was a huge mistake to bring them to UTF-8 along with the rest of ASCII. But I'm pretty sure someone will explain to me that they are in fact critical and there are further control characters in the Unicode spec anyway :)

Filesystems and case-insensitivity

Posted Nov 30, 2018 16:25 UTC (Fri) by perennialmind (guest, #45817) [Link] (2 responses)

You mean end-of-string delimiters, end-of-line delimiters, tabs, and the codes needed for controlling a terminal such as escape and erase? Setting aside hurdles to adoption, one can imagine hoisting those into markup. Perhaps there's even a spec for plainer-than-plain-text for when such markup exists (i.e. HTML). If so, it might be perfect for filenames.

ASCII compatibility was the selling point for UTF-8. Beyond the above, even the oddballs are still in use. Take for example "group separator" which stands in for FNC1 in barcodes.

Somebody else will have to defend the C1 block though.

Filesystems and case-insensitivity

Posted Dec 1, 2018 11:24 UTC (Sat) by jezuch (subscriber, #52988) [Link] (1 responses)

I mean all the bytes below 0x20. This is not text, they have no place in a *character* encoding. Apart from that I'm totally fine with ASCII compatibility, even though it's a typically American, culturally insensitive invention ;)

Filesystems and case-insensitivity

Posted Dec 6, 2018 10:16 UTC (Thu) by Wol (subscriber, #4433) [Link]

I believe there are two control characters RS1 and RS2? Basically standing for "Repeat String"? Which were used on a system I worked on, and actually were a damn good fix for "how many characters does a tab stand for?". So most lines in my FORTRAN source code would have been physically stored on disk as "<RS1><6>code..."

And if you had a lot of spaces it saved a fair few bytes over tab-encoding, plus being completely unambiguous.

Cheers,
Wol

Filesystems and case-insensitivity

Posted Dec 4, 2018 8:13 UTC (Tue) by pr1268 (guest, #24648) [Link] (4 responses)

one true space character (U+0020)

Um, there's more than one space: ' ' and ' '. One is \u0020 (good ol' ASCII 0x20) and the other is \u00a0.

I was personally burned by the second "space" above appearing in an Excel spreadsheet (to the exclusion of the "one true space character" you mentioned). >:-(

Filesystems and case-insensitivity

Posted Dec 4, 2018 10:24 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Don't worry – Unicode has a bunch more spaces, including zero-width ones.

On second thought: do worry.
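A small sketch for finding them, assuming Python's `unicodedata` tables: category Zs catches the visible-width spaces such as U+00A0, but the zero-width space U+200B is classified as a format character (Cf), so it has to be listed explicitly. The helper name `find_odd_spaces` and the short `ZERO_WIDTH` set are illustrative, not exhaustive.

```python
import unicodedata

# Illustrative, not exhaustive: zero-width characters that are not in Zs.
ZERO_WIDTH = {"\u200b", "\ufeff"}

def find_odd_spaces(text: str):
    """Return (index, code point, category) for space-like chars other than U+0020."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if (cat == "Zs" and ch != " ") or ch in ZERO_WIDTH:
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits
```

For example, `find_odd_spaces("a b\u00a0c\u200bd")` reports the no-break space at index 3 and the zero-width space at index 5, and ignores the ordinary space at index 1.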

Filesystems and case-insensitivity

Posted Dec 4, 2018 13:27 UTC (Tue) by hummassa (guest, #307) [Link] (2 responses)

The initial argument still holds: there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Filesystems and case-insensitivity

Posted Dec 12, 2018 23:45 UTC (Wed) by pr1268 (guest, #24648) [Link] (1 responses)

there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Agreed, but try telling that to those fools who auto-generated the spreadsheet with \u00a0 spaces. </angry rant>

Filesystems and case-insensitivity

Posted Dec 13, 2018 10:33 UTC (Thu) by james (guest, #1325) [Link]

Would it calm your anger to point out that LibreOffice can search using regexps?

Filesystems and case-insensitivity

Posted Nov 29, 2018 4:29 UTC (Thu) by raven667 (subscriber, #5198) [Link]

This was basically my thinking too. I'm an amateur when it comes to filesystem/VFS design or the kernel/user interface, but it seems you could limit the scope of what the kernel tries to do to something reliably implementable, such as storing an encoding hint and validating the encoding on read/write. That would be useful for a reference implementation in userspace, which would handle normalization, locale, and all the wooly corner cases that will probably require frequent patching. I'm not sure that you could outlaw anything at the kernel interface except for strings that are not valid for the hinted encoding, or a very small blacklist of control characters. I don't know what the state of the art is in userspace, but it seems that a lot of these challenges have already been faced by web browsers, database engines and others. It would make sense to me to re-use as much of that experience, or even implementations, as possible to build a consensus, conventions and a reference implementation for the libcs and others to use.

