Filesystems and case-insensitivity

Posted Nov 28, 2018 19:09 UTC (Wed) by perennialmind (guest, #45817)
Parent article: Filesystems and case-insensitivity

Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding.

Please no – this is the best opportunity yet to outlaw pernicious byte sequences! Once you decide to accept and present a set of path components as text, why would you want to allow mixing in random binary garbage? Once you've taken the step of blocking a new Makefile when there's a makefile, you clear the way to refusing to accept linebreaks, escape characters, and all the other control characters. Windows already blocks those, so it's a portability win.

The last time I read anything on the topic was an old LWN article⁽¹⁾ on a proposal by David Wheeler⁽²⁾. Back then it was clear that there would need to be a way to opt in to such screening. Maybe that happened when I wasn't looking? If not, this looks like the perfect time.

Filesystems and case-insensitivity

Posted Nov 28, 2018 20:20 UTC (Wed) by saffroy (guest, #43999) [Link] (10 responses)

I agree that allowing invalid byte sequences seems dangerous.

However, I wouldn't go as far as forbidding valid characters; that would be a different feature. Blocking sequences that are invalid for the selected encoding is sufficient.
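
To make "invalid for the selected encoding" concrete, here is a minimal userspace sketch, assuming the selected encoding is UTF-8. The function names `is_valid_utf8` and `accept_name` are hypothetical and do not come from any kernel patch; the point is only that a strict decode rejects malformed sequences while still accepting every valid character, control characters included.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def accept_name(name: bytes) -> bool:
    """Hypothetical policy: today's rules plus encoding validation only."""
    # An empty name, '/' and NUL are already unusable in a path component.
    if not name or b"/" in name or b"\0" in name:
        return False
    return is_valid_utf8(name)
```

Under this policy a name such as `b"a\nb"` is still accepted, since a newline is valid UTF-8; screening control characters would be the separate feature mentioned above.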

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:07 UTC (Wed) by perennialmind (guest, #45817) [Link] (1 responses)

If I understand you correctly, I agree. Adding filename-as-natural-language-text semantics rounds off one of the many sharp edges in shell scripting. Adding bumpers around the rest is a separate task.

If the semantics really are to be changed – if a differentiable set of path elements are to contain text and only text – that's a useful feature from an application developer's perspective. If it's to be a new kind of hard-to-predict weirdness, that's less useful.

I'd prefer it if text-only filenames were limited to printable graphemes only. That might be too high a bar. I would hope that control characters (C0,C1) would be disallowed. I don't consider \x7F (DELETE) or \x09 (TAB) to be valid in a natural-language name.
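
A rough sketch of that C0/C1 screening, assuming the name has already been decoded to text. The helper `has_control_chars` is a made-up name; it relies on the Unicode general category "Cc", which covers the C0 block, DELETE, and the C1 block.

```python
import unicodedata


def has_control_chars(name: str) -> bool:
    """True if name contains any C0/C1 control character or DEL."""
    # General category "Cc" is U+0000-U+001F, U+007F and U+0080-U+009F.
    return any(unicodedata.category(ch) == "Cc" for ch in name)
```

This would reject TAB and DELETE while leaving ordinary punctuation, accented letters, and symbols alone. A "printable graphemes only" rule would need considerably more than this check.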

Filesystems and case-insensitivity

Posted Dec 6, 2018 10:02 UTC (Thu) by Wol (subscriber, #4433) [Link]

I worked with a system that had filenames of the form <space><backspace>NNNN

Bearing in mind users weren't supposed to go anywhere near them, it was a pretty good way of stopping people scanning the filesystem and messing about with them. I agree for user visible files, it's a good idea, but not all files are meant to be user visible and some of them can't be hidden.

Cheers,
Wol

Filesystems and case-insensitivity

Posted Nov 28, 2018 22:10 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (7 responses)

I've yet to hear any good justification for allowing a formfeed in a filename.

Filesystems and case-insensitivity

Posted Nov 29, 2018 6:39 UTC (Thu) by lkundrak (subscriber, #43452) [Link] (1 responses)

Is "there could already be files with a form feed in their name" a good justification?

Filesystems and case-insensitivity

Posted Nov 29, 2018 21:04 UTC (Thu) by perennialmind (guest, #45817) [Link]

That would be a potential problem for a mount option, but not when setting a per-directory attribute, since the directory must be empty. The same applies to name collisions, though. Imposing new semantics on an existing filesystem would be problematic. Maybe with a superblock change? Meh. The per-directory attribute just seems cleaner overall.

Filesystems and case-insensitivity

Posted Nov 29, 2018 9:11 UTC (Thu) by dgm (subscriber, #49227) [Link] (4 responses)

> I've yet to hear any good justification for allowing a formfeed in a filename.

You cannot have formfeeds in a file name. You can have bytes with the decimal value 12, but reading it as a formfeed or something else is completely up to you.

Filesystems and case-insensitivity

Posted Nov 29, 2018 10:46 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

What's the difference in practice then?

Seriously, the idiocy with free-form filenames should be fixed.

Filesystems and case-insensitivity

Posted Nov 29, 2018 12:43 UTC (Thu) by hkario (subscriber, #94864) [Link] (2 responses)

Because that file could have been created on a system working in CP437, where it would be shown as ♀.

Just because a byte string is an invalid sequence in one encoding doesn't mean it's an invalid sequence in all encodings.
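
A small illustration of that point, using a made-up name `raw` and a hypothetical helper `try_decode`. It uses a high byte rather than byte 12, because Python's cp437 codec, as far as I can tell, maps the low bytes to control characters rather than to the glyphs a CP437 display would show.

```python
raw = b"caf\xe9"  # hypothetical filename bytes written by a legacy system


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if data is invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE9 opens a multi-byte UTF-8 sequence that never completes, so this fails.
as_utf8 = try_decode(raw, "utf-8")
# The same bytes are a perfectly good Latin-1 string.
as_latin1 = try_decode(raw, "latin-1")
# They also decode under CP437, but the last byte maps to a different character.
as_cp437 = try_decode(raw, "cp437")
```

The bytes on disk never change; only the interpretation does, which is why a validity rule has to name the encoding it is validating against.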

Filesystems and case-insensitivity

Posted Nov 29, 2018 18:28 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

UTF-8 won. We should start giving it its well-earned victory parade. Don't you think that, after we're finished banning non-UTF-8 encodings, and after we ban illegal or bizarre code sequences, and after we start normalizing filenames into consistent form, we'll end up in a better place? If in the process of doing that we can't access files called ♀ from old volumes without special compatibility mount options, so be it.

Filesystems and case-insensitivity

Posted Nov 29, 2018 20:47 UTC (Thu) by perennialmind (guest, #45817) [Link]

That's why I thought it was a brilliant choice to require that directories be empty in order to switch on the "text mode dentries" attribute. You sidestep the "reinterpret" problem in trade for save/copy errors that are easy to surface to end users. I'm not sure how overlayfs union mounts would work though.

Filesystems and case-insensitivity

Posted Nov 28, 2018 21:06 UTC (Wed) by smurf (subscriber, #17840) [Link] (15 responses)

It's easy to encode invalid byte sequences so that they survive a round trip through Unicode / UTF-8 – you mis-appropriate the surrogates. The actual higher-level semantics of that, though, are fraught with corner cases you *really* don't want to deal with.
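
One existing instance of this trick is Python's `surrogateescape` error handler (PEP 383), shown here as an illustration rather than as what any filesystem does. The name `raw` is made up for the example.

```python
raw = b"report-\xff.txt"  # hypothetical name with a byte that is invalid in UTF-8

# Each undecodable byte 0xNN is mapped to the lone low surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# One of the corner cases: the intermediate string is not valid Unicode text,
# so a strict encode (for display, logging, or transmission) fails.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip works, but every consumer of `text` that is not aware of the convention can trip over the lone surrogate, which is the kind of corner case warned about above.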

Basically IMHO there are two sane choices – (a) the current situation: the kernel does not attach any semantics to any bytes other than '/' and '\0' (thus there is no chance for case insensitivity beyond ASCII), or (b) you use clean and preferably pre-normalized UTF-8 on the userspace/kernel boundary, outlaw anything nonconforming, and do everything else in userspace. Anything else is a recipe for long-term disaster.
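
As a sketch of why option (b) needs normalization before names can be compared at all: the same visible name can arrive as different byte sequences. The function `lookup_key` below is hypothetical; it assumes NFD plus Unicode case folding as the comparison rule, which is one reasonable choice and not necessarily what any proposed kernel code uses.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def lookup_key(name: str) -> str:
    """Hypothetical comparison key: decompose, case-fold, decompose again."""
    folded = unicodedata.normalize("NFD", name).casefold()
    # Case folding can produce sequences that are no longer normalized.
    return unicodedata.normalize("NFD", folded)
```

With this key, `composed` and `decomposed` collide even though their UTF-8 encodings differ, and so do `Makefile` and `makefile`.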

Filesystems and case-insensitivity

Posted Nov 28, 2018 22:28 UTC (Wed) by perennialmind (guest, #45817) [Link] (13 responses)

Newline, tab, and bel codepoints are perfectly valid UTF-8 plain text, but I'd prefer to push that out to userspace as well. I don't much care whether curl -O gives me filenames with spaces or %20s, but I do object if I see files with newlines in the names. I don't mind if I'm left with sneaky left-to-right, right-to-left marks or explicitly red hearts. I see the need for parentheses and question marks...

... but not control characters. To me, a natural language filename would comprise user-perceived characters and the one true space character (U+0020). Flexibility beyond that does more harm than good. Leave those footguns to the bytestring paths. 😉

Filesystems and case-insensitivity

Posted Nov 29, 2018 13:41 UTC (Thu) by utoddl (guest, #1232) [Link] (3 responses)

I was with you until you got to spaces. It's only wishing of course, but I wish spaces in file names would go away. Personal peeve.

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:04 UTC (Fri) by jezuch (subscriber, #52988) [Link] (2 responses)

Spaces in file names are wonderful. You can name your files like a human being, not a slave of poorly written shell scripts with broken quoting :) (I have a pet peeve too)

Filesystems and case-insensitivity

Posted Dec 3, 2018 12:30 UTC (Mon) by ale2018 (guest, #128727) [Link] (1 responses)

Ah, poorly written shell scripts, eh? Because you obviously think that being a slave to over-complicated command lines is fine? A good percentage of my command lines start with find . -name whatever | xargs... Yes, I know I can write -print0 and -0, I do that when I write shell scripts.

When I find a filename with spaces I just move it away.
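
For anyone wondering why -print0 and -0 matter, here is a small model of the two stream formats. The names and both helper functions are invented for the example; the only assumption is that NUL cannot occur inside a filename while newline can.

```python
def split_lines(data: bytes) -> list:
    """Naive parsing: one name per line, as find prints by default."""
    return [n for n in data.split(b"\n") if n]


def split_nul(data: bytes) -> list:
    """Parsing of NUL-terminated names, as produced by find -print0."""
    return [n for n in data.split(b"\0") if n]


# Hypothetical directory contents, including a name with an embedded newline.
names = [b"./aaabd", b"./aaabc\nd", b"./my file"]
newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)

from_lines = split_lines(newline_stream)  # the newline name is split in two
from_nul = split_nul(nul_stream)          # the original names come back intact
```

Plain xargs is stricter still than `split_lines`, since by default it also splits on spaces, so `./my file` would become two arguments as well.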

For the record, the normalization step and control characters were never taken care of. For example:

    ~$ touch aaabd $(printf 'aaabc\bd')  "$(printf 'aaabc\nd')"
    ~$ ls -lt | head -5
    total 3686968
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabc
    d
    -rw-r--r--  1 ale ale                 0 Dec  3 13:21 aaabd

Control characters were never forbidden. Consider that human beings are sometimes uncertain about the name they're typing and type a backspace (\b) in it. So, why isn't that beautiful too? Perhaps, users should have a clue. In the words of the Ancient Philosophy, rubbish in, rubbish out.

Filesystems and case-insensitivity

Posted Dec 3, 2018 20:24 UTC (Mon) by flussence (guest, #85566) [Link]

ls took care of that a few years ago…

    ~/test $ ls
    'aaabc'$'\b''d'  'aaabc'$'\n''d'   aaabd
    ~/test $ ls --version
    ls (GNU coreutils) 8.30
    Packaged by Gentoo (8.30 (p01))

Filesystems and case-insensitivity

Posted Nov 30, 2018 9:09 UTC (Fri) by jezuch (subscriber, #52988) [Link] (3 responses)

I guess the concept of control characters should have been retired a long time ago. I also think that it was a huge mistake to bring them to UTF-8 along with the rest of ASCII. But I'm pretty sure someone will explain to me that they are in fact critical and there are further control characters in the Unicode spec anyway :)

Filesystems and case-insensitivity

Posted Nov 30, 2018 16:25 UTC (Fri) by perennialmind (guest, #45817) [Link] (2 responses)

You mean end-of-string delimiters, end-of-line delimiters, tabs, and the codes needed for controlling a terminal such as escape and erase? Setting aside hurdles to adoption, one can imagine hoisting those into markup. Perhaps there's even a spec for plainer-than-plain-text for when such markup exists (i.e. HTML). If so, it might be perfect for filenames.

ASCII compatibility was the selling point for UTF-8. Beyond the above, even the oddballs are still in use. Take for example "group separator" which stands in for FNC1 in barcodes.

Somebody else will have to defend the C1 block though.

Filesystems and case-insensitivity

Posted Dec 1, 2018 11:24 UTC (Sat) by jezuch (subscriber, #52988) [Link] (1 responses)

I mean all the bytes below 0x20. This is not text; they have no place in a *character* encoding. Apart from that I'm totally fine with ASCII compatibility, even though it's a typically American, culturally insensitive invention ;)

Filesystems and case-insensitivity

Posted Dec 6, 2018 10:16 UTC (Thu) by Wol (subscriber, #4433) [Link]

I believe there are two control characters RS1 and RS2? Basically standing for "Repeat String"? Which were used on a system I worked on, and actually were a damn good fix for "how many characters does a tab stand for?". So most lines in my FORTRAN source code would have been physically stored on disk as "<RS1><6>code..."

And if you had a lot of spaces it saved a fair few bytes over tab-encoding, plus being completely unambiguous.

Cheers,
Wol

Filesystems and case-insensitivity

Posted Dec 4, 2018 8:13 UTC (Tue) by pr1268 (guest, #24648) [Link] (4 responses)

> one true space character (U+0020)

Um, there's more than one space: ' ' and ' '. One is \u0020 (good ol' ASCII 0x20) and the other is \u00a0.

I was personally burned by the second "space" above appearing in an Excel spreadsheet (to the exclusion of the "one true space character" you mentioned). >:-(

Filesystems and case-insensitivity

Posted Dec 4, 2018 10:24 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Don't worry – Unicode has a bunch more spaces, including zero-width ones.

On second thought: do worry.
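
A quick way to see how many there are in a given name, as a hedged sketch: `suspicious_spaces` is a made-up helper that flags anything in the Unicode space-separator category other than U+0020, plus two zero-width characters that I chose to add by hand because they live in a different category.

```python
import unicodedata


def suspicious_spaces(name: str) -> list:
    """List (index, code point, Unicode name) for space-like characters other than U+0020."""
    hits = []
    for i, ch in enumerate(name):
        if ch == " ":
            continue
        # "Zs" is the space-separator category. U+200B and U+FEFF are
        # zero-width and sit in category "Cf", so they are listed explicitly.
        if unicodedata.category(ch) == "Zs" or ch in ("\u200b", "\ufeff"):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits
```

Running it over a name with a no-break space reports the offender by position and name, which is about the only way to tell it apart from a regular space in most terminals.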

Filesystems and case-insensitivity

Posted Dec 4, 2018 13:27 UTC (Tue) by hummassa (subscriber, #307) [Link] (2 responses)

The initial argument still holds: there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Filesystems and case-insensitivity

Posted Dec 12, 2018 23:45 UTC (Wed) by pr1268 (guest, #24648) [Link] (1 responses)

> there is no reasonable rationale for those other space characters (including U+00a0) in file names.

Agreed, but try telling that to those fools who auto-generated the spreadsheet with \u00a0 spaces. </angry rant>

Filesystems and case-insensitivity

Posted Dec 13, 2018 10:33 UTC (Thu) by james (subscriber, #1325) [Link]

Would it calm your anger to point out that LibreOffice can search using regexps?

Filesystems and case-insensitivity

Posted Nov 29, 2018 4:29 UTC (Thu) by raven667 (subscriber, #5198) [Link]

This was basically my thinking too. I'm an amateur when it comes to filesystem/VFS design or the kernel/user interface, but it seems you could limit the scope of what the kernel tries to do to something reliably implementable, such as storing an encoding hint and validating the encoding on read/write. That would be useful for a reference implementation in userspace that handles normalization, locale, and all the woolly corner cases that will probably require frequent patching. I'm not sure that you could outlaw anything at the kernel interface except for strings that are not valid for the hinted encoding, or a very small blacklist of control characters. I don't know what the state of the art is in userspace, but a lot of these challenges have already been faced by web browsers, database engines, and others. It would make sense to me to reuse as much of that experience, or even those implementations, as possible to build consensus, conventions, and a reference implementation for the libcs and others to use.