Filesystems and case-insensitivity
Posted Nov 28, 2018 19:09 UTC (Wed) by perennialmind (guest, #45817)
Parent article: Filesystems and case-insensitivity
Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding.
Please no – this is the best opportunity yet to outlaw pernicious byte sequences! Once you decide to accept and present a set of path components as text, why would you want to allow mixing in random binary garbage? Once you've taken the step of blocking a new Makefile when there's a makefile, you clear the way to refusing to accept linebreaks, escape characters, and all the other control characters. Windows already blocks those, so it's a portability win.
The last time I read anything on the topic was an old LWN article(1) on a proposal by David Wheeler(2). Back then it was clear that there would need to be a way to opt in to such screening. Maybe that happened when I wasn't looking? If not, this looks like the perfect time.
Posted Nov 28, 2018 20:20 UTC (Wed)
by saffroy (guest, #43999)
[Link] (10 responses)
However, I wouldn't go as far as forbidding valid characters; that would be a different feature. Blocking sequences that are invalid for the selected encoding is sufficient.
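A minimal sketch of that narrower check, written in Python rather than kernel code and assuming the selected encoding is UTF-8. The name `is_valid_name` is made up for illustration; it is not part of any proposal discussed here.

```python
def is_valid_name(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if raw decodes cleanly under the given encoding."""
    try:
        raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A newline is a valid character, so this check alone does not reject it.
print(is_valid_name(b"line\nbreak"))
# A lone 0xE9 byte starts a multi-byte sequence that never completes.
print(is_valid_name(b"caf\xe9"))
```

Under this rule only undecodable byte sequences are refused; control characters still pass, which is the distinction being drawn.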
Posted Nov 28, 2018 21:07 UTC (Wed)
by perennialmind (guest, #45817)
[Link] (1 responses)
If the semantics really are to be changed – if a differentiable set of path elements are to contain text and only text – that's a useful feature from an application developer's perspective. If it's to be a new kind of hard-to-predict weirdness, that's less useful.
I'd prefer it if text-only filenames were limited to printable graphemes only. That might be too high a bar. I would hope that control characters (C0,C1) would be disallowed. I don't consider \x7F (DELETE) or \x09 (TAB) to be valid in a natural-language name.
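As a rough illustration of the control-character rule (not the stricter "printable graphemes only" one), here is a Python sketch. It assumes that "control characters (C0,C1)" maps onto Unicode general category `Cc`, which covers U+0000 to U+001F, U+007F, and U+0080 to U+009F. The function name is hypothetical.

```python
import unicodedata


def has_control_chars(name: str) -> bool:
    """True if any code point is in category Cc (C0 controls, DELETE, C1 controls)."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


print(has_control_chars("report\t2018.txt"))  # TAB is a C0 control
print(has_control_chars("report 2018.txt"))   # plain space is category Zs, not Cc
```

Format characters such as the bidirectional marks are category `Cf`, so this check lets them through; rejecting them would be a separate decision.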
Posted Dec 6, 2018 10:02 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Bearing in mind users weren't supposed to go anywhere near them, it was a pretty good way of stopping people scanning the filesystem and messing about with them. I agree for user visible files, it's a good idea, but not all files are meant to be user visible and some of them can't be hidden.
Cheers,
Posted Nov 28, 2018 22:10 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (7 responses)
Posted Nov 29, 2018 6:39 UTC (Thu)
by lkundrak (subscriber, #43452)
[Link] (1 responses)
Posted Nov 29, 2018 21:04 UTC (Thu)
by perennialmind (guest, #45817)
[Link]
Posted Nov 29, 2018 9:11 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (4 responses)
You cannot have formfeeds in a file name. You can have bytes with the decimal value 12, but reading it as a formfeed or something else is completely up to you.
Posted Nov 29, 2018 10:46 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Seriously, the idiocy with free-form filenames should be fixed.
Posted Nov 29, 2018 12:43 UTC (Thu)
by hkario (subscriber, #94864)
[Link] (2 responses)
Just because a byte string is an invalid sequence in one encoding doesn't mean it's an invalid sequence in all encodings.
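A small Python sketch of that point, using three encodings picked only for illustration. The helper `valid_encodings` is a made-up name.

```python
def valid_encodings(raw: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings under which raw decodes without error."""
    accepted = []
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        accepted.append(encoding)
    return accepted


# The byte 0xE9 is 'é' in Latin-1, but is neither ASCII nor complete UTF-8.
print(valid_encodings(b"caf\xe9", ["ascii", "utf-8", "latin-1"]))
```

Latin-1 assigns a character to every byte value, so any byte string decodes under it; validity only means something relative to the encoding you have chosen.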
Posted Nov 29, 2018 18:28 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
Posted Nov 29, 2018 20:47 UTC (Thu)
by perennialmind (guest, #45817)
[Link]
Posted Nov 28, 2018 21:06 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (15 responses)
Basically IMHO there are two sane choices – (a) the current situation: the kernel does not attach any semantics to any bytes other than '/' and '\0' (thus there is no chance for case insensitivity beyond ASCII), or (b) you use clean and preferably pre-normalized UTF-8 on the userspace/kernel boundary, outlaw anything nonconforming, and do everything else in userspace. Anything else is a recipe for long-term disaster.
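To make option (b) concrete, here is a userspace-style sketch in Python. The comment does not say which normalization form is meant; NFC is assumed here purely for illustration, and `accept_name` is a hypothetical name.

```python
import unicodedata


def accept_name(raw: bytes) -> str:
    """Decode raw as strict UTF-8 and require NFC form; raise ValueError otherwise."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("name is not valid UTF-8") from exc
    # Assumption: "pre-normalized" is read as NFC.
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("name is not NFC-normalized")
    return text


# Precomposed U+00E9 is already in NFC, so it is accepted.
print(accept_name("caf\u00e9".encode("utf-8")))
```

The same visible name spelled as "e" followed by combining U+0301 would be refused rather than silently rewritten, which keeps the boundary check free of surprises.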
Posted Nov 28, 2018 22:28 UTC (Wed)
by perennialmind (guest, #45817)
[Link] (13 responses)
Newline, tab, and bel codepoints are perfectly valid UTF-8 plain text, but I'd prefer to push that out to userspace as well. I don't much care whether
... but not control characters. To me, a natural language filename would comprise user-perceived characters and the one true space character (U+0020). Flexibility beyond that does more harm than good. Leave those footguns to the bytestring paths. 😉
Posted Nov 29, 2018 13:41 UTC (Thu)
by utoddl (guest, #1232)
[Link] (3 responses)
Posted Nov 30, 2018 9:04 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Posted Dec 3, 2018 12:30 UTC (Mon)
by ale2018 (guest, #128727)
[Link] (1 responses)
Ah, poorly written shell scripts, eh? Because you obviously think that being a slave to over-complicated command lines is fine? A good percentage of my command lines start with When I find a filename with spaces I just move it away.
For the record, the normalization step and control characters were never taken care of. For example:
Posted Dec 3, 2018 20:24 UTC (Mon)
by flussence (guest, #85566)
[Link]
Posted Nov 30, 2018 9:09 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (3 responses)
Posted Nov 30, 2018 16:25 UTC (Fri)
by perennialmind (guest, #45817)
[Link] (2 responses)
You mean end-of-string delimiters, end-of-line delimiters, tabs, and the codes needed for controlling a terminal such as escape and erase? Setting aside hurdles to adoption, one can imagine hoisting those into markup. Perhaps there's even a spec for plainer-than-plain-text for when such markup exists (i.e. HTML). If so, it might be perfect for filenames.
ASCII compatibility was the selling point for UTF-8. Beyond the above, even the oddballs are still in use. Take for example "group separator" which stands in for FNC1 in barcodes.
Somebody else will have to defend the C1 block though.
Posted Dec 1, 2018 11:24 UTC (Sat)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Dec 6, 2018 10:16 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
And if you had a lot of spaces it saved a fair few bytes over tab-encoding, plus being completely unambiguous.
Cheers,
Posted Dec 4, 2018 8:13 UTC (Tue)
by pr1268 (guest, #24648)
[Link] (4 responses)
Um, there's more than one space: ' ' and ' '. One is \u0020 (good ol' ASCII 0x20) and the other is \u00a0. I was personally burned by the second "space" above appearing in an Excel spreadsheet (to the exclusion of the "one true space character" you mentioned). >:-(
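For anyone bitten by the same thing, a small Python sketch that tells the two apart and folds the look-alikes back to U+0020. Treating every code point in Unicode category `Zs` (space separators) as a "space" is an assumption of this sketch, and `fold_spaces` is a made-up name.

```python
import unicodedata


def fold_spaces(text: str) -> str:
    """Replace every space-separator code point (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


space, nbsp = "\u0020", "\u00a0"
print(space == nbsp)                        # False: different code points
print(fold_spaces("annual\u00a0report"))    # annual report
```

Tabs and newlines are control characters rather than `Zs`, so this leaves them alone.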
Posted Dec 4, 2018 10:24 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (3 responses)
On second thought: do worry.
Posted Dec 4, 2018 13:27 UTC (Tue)
by hummassa (subscriber, #307)
[Link] (2 responses)
Posted Dec 12, 2018 23:45 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (1 responses)
Agreed, but try telling that to those fools who auto-generated the spreadsheet with \u00a0 spaces. </angry rant>
Posted Dec 13, 2018 10:33 UTC (Thu)
by james (subscriber, #1325)
[Link]
Posted Nov 29, 2018 4:29 UTC (Thu)
by raven667 (subscriber, #5198)
[Link]
Wol
curl -O gives me filenames with spaces or %20s, but I do object if I see files with newlines in the names. I don't mind if I'm left with sneaky left-to-right, right-to-left marks or explicitly red hearts. I see the need for parentheses and question marks...
find . -name whatever | xargs... Yes, I know I can write -print0 and -0, I do that when I write shell scripts.
~$ touch aaabd $(printf 'aaabc\bd') "$(printf 'aaabc\nd')"
~$ ls -lt | head -5
total 3686968
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabd
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabc
d
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabd
Control characters were never forbidden. Consider that human beings are sometimes uncertain about the name they're typing and type a backspace (\b) in it. So, why isn't that beautiful too? Perhaps users should have a clue. In the words of the Ancient Philosophy, rubbish in, rubbish out.
ls took care of that a few years ago…
~/test $ ls
'aaabc'$'\b''d' 'aaabc'$'\n''d' aaabd
~/test $ ls --version
ls (GNU coreutils) 8.30
Packaged by Gentoo (8.30 (p01))
Wol
one true space character (U+0020)
there is no reasonable rationale for those other space characters (including U+00a0) in file names.
Would it calm your anger to point out that LibreOffice can search using regexps?
