Filesystems and case-insensitivity
A recurring topic in filesystem-developer circles is the handling of case-insensitive file names. Filesystems for other operating systems handle them but, by and large, Linux filesystems do not. In the Kernel Summit track of the 2018 Linux Plumbers Conference (LPC), Gabriel Krisman Bertazi described his plans for making Linux filesystems encoding-aware as part of an effort to make ext4, and possibly other filesystems, interoperable with case-insensitivity in Android, Windows, and macOS.
Case-insensitive file names for Linux have been discussed for a long time. The oldest reference he could find was from 2002, but it has come up at several Linux Storage, Filesystem, and Memory-Management Summits (LSFMM), including in 2016 and in Krisman's presentation this year. It has languished so long without a real solution because the problem has many corner cases and it is "tricky to get it right".
![Gabriel Krisman Bertazi](https://static.lwn.net/images/2018/lpc-krisman-sm.jpg)
An attendee asked about XFS and its handling of case-insensitive file names. Krisman said that when an XFS filesystem is created, it can be configured to handle them. It is ASCII-only, though a proposal from SGI in 2014 would have added full UTF-8 support for XFS and extended the case-handling to Unicode file names.
The traditional Unix approach is that file names are opaque byte sequences that cannot contain "/" characters. He is proposing to add encoding awareness to filesystems, but, he asked, what are the advantages of doing so? For one thing, Windows and macOS have encoding-aware filesystems; it is a feature that Linux lacks. There are "real world use cases" as well: porting from the Windows world, dealing with the case-insensitive tree that Android exposes, and, in general, providing better support for exported filesystems. Android has a user-space hack for case handling, but it is slow and has many race conditions. An encoding-aware filesystem is a better way to expose this functionality to users, he said.
Unicode can represent the "same" string in multiple different ways, via composition for example, but that is confusing. Multiple files with the same-appearing name in a directory, as he showed in his slides [PDF], will be difficult to deal with. That means some kind of normalization will need to be applied. Beyond that, "case" is really only defined in terms of an encoding—it is meaningless for a byte sequence. That is why he implemented encoding awareness before tackling case insensitivity.
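The composition problem can be reproduced in user space. The following is a minimal Python sketch, not code from the patch set; the `normalized` helper and the sample name are chosen here purely for illustration. It shows two strings that render identically, differ as byte sequences, and compare equal only after normalization.

```python
import unicodedata

# Two spellings of the same visible name.
composed = "caf\u00e9"     # 'é' as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)

# As opaque byte sequences the two names are different files.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
same_bytes = composed_bytes == decomposed_bytes


def normalized(name: str, form: str = "NFD") -> str:
    """Hypothetical helper: return the name in the given normalization form."""
    return unicodedata.normalize(form, name)


# After normalization both spellings collapse to the same code points.
same_after_nfd = normalized(composed) == normalized(decomposed)
```

Here `same_bytes` is `False` while `same_after_nfd` is `True`, which is the situation that lets two same-looking names coexist in a directory when names are treated as bytes.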
The kernel has a Native Language Support (NLS) subsystem but it has multiple limitations. It has trouble dealing with invalid character sequences—in some situations it returns zero, in others something else. It can't deal with multi-byte sequences or code points; for example, to_upper() and to_lower() return a single byte. There is no support for dealing with the evolution of encodings, which is not really a problem for UTF-8 except for unmapped code points—case folding for unmapped points is not stable, he said. In addition, NLS is missing support for normalization and has only partly implemented case folding; the latter is "almost ASCII only".
Start with NLS
So he has been proposing improvements to NLS as part of his encoding and case-insensitive support patch set that has been posted to the ext4 mailing list. It provides a new load_nls_version() function that allows the caller to define the encoding and version that it wants to use. It has a flags argument that allows filesystems to specify the normalization type, case-fold type, and permissiveness mode they want. That version and behavior information would be stored in the superblock of the filesystem.
Krisman's changes would add support for multi-byte characters by adding a new API for comparisons, normalization, and case folding. It will support UTF-8 NFKD normalization that is based on code from the 2014 SGI patch set. It uses a decoding trie and the mechanism is extendable to other normalization types. For example, if support for the Apple filesystem was needed, NFD normalization could be added. The changes he is making are all backward compatible with existing NLS tables and users, Krisman said.
He currently has patches for the kernel, e2fsprogs, and xfstests out for review. This effort is quite different from what he presented at LSFMM back in May.
There was some discussion among attendees about the changes. The original file name will be preserved when it is created, Krisman said, so that makes the filesystem "case preserving" like NTFS. Concern was expressed about containers sharing a filesystem with encoded file names, but having different user-space encodings. That is not a use case that is envisioned, he said; root filesystems will not normally be encoding aware. The most common use cases, Ted Ts'o said, are USB sticks with a FAT filesystem that does case folding or users of other operating systems accessing the filesystem through Samba. A storage appliance will be able to create a case-folding filesystem and Samba can turn off its expensive user-space case-handling solution.
Another use case that Krisman brought up was for SteamOS, which would have a separate partition for game data that would be encoding aware. Ts'o said that there are some inherent assumptions in this work. The primary users will be like the SteamOS or Samba appliance examples and that "all the world is UTF-8". It would be hugely complicated to support different directories with different encodings, he said. He invited those present to point out any problems they see with those assumptions.
James Bottomley asked if the user-space side had been consulted on these choices. He noted that European distributions typically use single-byte encodings and that the Chinese hate UTF-8 because all characters become four bytes in size. Ts'o said that the problem is essentially being handed off to the distributions. POSIX does not have a way for filesystems to communicate the encoding of their file names; if that existed, glibc could handle the differences.
There is no good solution for that problem, Ts'o continued. There will be information in the superblock, which should be exposed via statfs(). That will take some time to happen, so perhaps a sysfs field could be used in the interim.
Krisman said that his implementation tries to make good use of the directory entry (dentry) cache. Equivalent names do not create multiple dentries, there is just one per file. The d_hash() and d_compare() routines needed to be made encoding aware. For now, negative dentries (asserting the absence of a given file name) are not cached; it will require some work to carefully invalidate negative dentries during file creation.
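The idea of one cache entry per file, found through an encoding-aware hash and comparison, can be illustrated with a toy model. This is only a Python sketch under stated assumptions: `ToyDirectory` and `fold` are hypothetical names, the folding rule (NFKD followed by Python's `str.casefold()`) is an approximation chosen for the example, and nothing here reflects the kernel's actual `d_hash()` or `d_compare()` code.

```python
import unicodedata


def fold(name: str) -> str:
    """Hypothetical lookup key: NFKD-normalize, then case-fold."""
    return unicodedata.normalize("NFKD", name).casefold()


class ToyDirectory:
    """Case-preserving, case-insensitive directory model.

    The dict hashes and compares the folded key, so every equivalent
    spelling lands on the same single entry, while the name given at
    creation time is what gets stored and reported back.
    """

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}  # folded key -> name as created

    def create(self, name: str) -> None:
        key = fold(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name: str):
        """Return the stored name for any equivalent spelling, else None."""
        return self._entries.get(fold(name))
```

With this model, creating `Level3.map` and then looking up `level3.MAP` returns `Level3.map`, and a later attempt to create `LEVEL3.MAP` fails because the entry already exists.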
On to case-insensitivity
Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character. A per-directory inode attribute can be set to turn on case-insensitivity, but that is only allowed on empty directories to avoid name collisions. Case-insensitivity is trivial to implement once the encoding support is available, he said; it is effectively just a special case of encoding.
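The restriction to empty directories is easier to see with a small experiment. The Python sketch below is an illustration only; `fold` and `folding_collisions` are hypothetical helpers, and the folding rule (NFKD plus `str.casefold()`) is an assumption made for the example. It reports which existing names would become indistinguishable if case-insensitivity were switched on for a populated directory.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Hypothetical lookup key: NFKD-normalize, then case-fold."""
    return unicodedata.normalize("NFKD", name).casefold()


def folding_collisions(names):
    """Group names that map to the same folded key.

    Returns a list of sorted groups, one per key shared by two or
    more of the given names.
    """
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For a directory holding `Makefile`, `makefile`, and `README`, the function returns `[["Makefile", "makefile"]]`; an empty directory trivially has no such groups, which is why the attribute can be set there safely.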
There are some limitations of the current implementation, starting with the lack of negative dentries in the cache. Directory encryption is not supported since the lookup is based on the hash of the name, but the same hash cannot be generated from two names that normalize to the same name. He proposed storing the file using the hash of the normalization, but was not sure if that would solve the problem.
Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding. There may be some user-space breakage due to normalization or case folding of file names that will need to be handled as well.
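The proposed fallback for invalid sequences might look roughly like the following in user space. This is a hedged Python sketch, not the patch set's logic: `lookup_key` is a hypothetical helper, and strict UTF-8 decoding plus NFKD and `str.casefold()` are assumptions standing in for whatever the filesystem would actually do.

```python
import unicodedata


def lookup_key(raw: bytes) -> bytes:
    """Return the key used to compare two file names.

    Valid UTF-8 names are normalized and case-folded. Names that are
    not valid UTF-8 fall back to the traditional behavior and are
    compared as opaque byte sequences.
    """
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid input
    except UnicodeDecodeError:
        return raw  # fallback: the bytes themselves are the key
    folded = unicodedata.normalize("NFKD", text).casefold()
    return folded.encode("utf-8")
```

Under this sketch `b"FOO"` and `b"foo"` share a key, while `b"\xff\xfeFOO"` and `b"\xff\xfefoo"` stay distinct because neither decodes as UTF-8.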
The current implementation is for the ext4 filesystem, but the main part is the NLS changes. The ext4-specific changes give other filesystems a roadmap to adding encoding-awareness and case-insensitivity, Krisman said. Ts'o noted that there is no active NLS maintainer currently, so he will take Krisman's changes through the ext4 tree. He will try to test other users of NLS, but explicitly is not volunteering to take on NLS maintenance going forward.
Boaz Harrosh pointed out that Linus Torvalds called negative dentries important for performance reasons. He wondered if there were plans to add them for encoding-aware filesystems. Krisman said that invalidating negative dentries needs careful thought and code but that it should be doable. The path for file renames is particularly tricky. Bottomley asked why negative dentries needed to be handled differently than positive ones. The problem is that many people want case-preserving filesystems, so looking up FOO when foo exists should generate a negative dentry for FOO but that will interfere with case-insensitive lookups for Foo or even foo.
The reaction to this proposal was much more positive than to Krisman's earlier attempt. It would seem that we will soon have the ability to handle case-insensitive ext4 filesystems and the potential is there to add it for others.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for
assistance in traveling to Vancouver for LPC.]
Index entries for this article | |
---|---|
Kernel | Filesystems/Case-independent lookups |
Conference | Linux Plumbers Conference/2018 |
Posted Nov 28, 2018 14:07 UTC (Wed)
by sorokin (guest, #88478)
[Link] (9 responses)
The problem I see is that filesystems serve two purposes:
For the first usage I see the merit of having a case-insensitive filesystem. It depends on personal preference though.
For the second usage, case-insensitivity is a downside. When a program looks up one of its internal files or resources, case-insensitive comparison is both unnecessary and potentially incorrect. When I scan a directory with readdir/statat I don't want statat to be case-insensitive.
Posted Nov 28, 2018 14:28 UTC (Wed)
by bandrami (guest, #94229)
[Link] (3 responses)
Posted Nov 28, 2018 14:43 UTC (Wed)
by sorokin (guest, #88478)
[Link] (1 responses)
Unfortunately it does. See "pathname" parameter: int fstatat(int dirfd, const char *pathname, struct stat *statbuf, int flags);
It stats the file "pathname" relative to directory "dirfd". Normally when readdir returns DT_UNKNOWN one has to statat the filename relative to the directory to figure out the real d_type.
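For readers who have not used this pattern, here is a rough Python equivalent, assuming a Linux system where `os.stat()` accepts `dir_fd` (which maps to `fstatat()`). The `entry_types` function is a hypothetical helper written for this illustration. Each name returned by the directory listing is passed back to the kernel as a path relative to the directory, so the name lookup runs a second time.

```python
import os
import stat


def entry_types(path: str) -> dict:
    """Classify each entry in a directory by stat-ing it relative to the directory fd."""
    result = {}
    dir_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for name in os.listdir(dir_fd):
            # Equivalent to fstatat(dir_fd, name, &st, AT_SYMLINK_NOFOLLOW):
            # the kernel looks the name up again inside the directory.
            st = os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
            if stat.S_ISDIR(st.st_mode):
                result[name] = "dir"
            elif stat.S_ISREG(st.st_mode):
                result[name] = "file"
            else:
                result[name] = "other"
    finally:
        os.close(dir_fd)
    return result
```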
Posted Nov 28, 2018 15:42 UTC (Wed)
by bandrami (guest, #94229)
[Link]
Posted Nov 28, 2018 21:05 UTC (Wed)
by madscientist (subscriber, #16861)
[Link]
Posted Nov 28, 2018 16:09 UTC (Wed)
by sorokin (guest, #88478)
[Link] (1 responses)
For example I can imagine that save file dialog can ask the following question: "You are trying to create file Foo.txt while foo.txt already exists. Do you want to create another file that differs only in
Correspondingly, an open-file dialog would first look for an exact match and, if no file is found, search for the file case-insensitively. I would like to have a convenience feature like this even now.
My point is that this should be done only in limited number of user facing dialogs. Doing this for most existing system calls would be inefficient and can be incorrect if filenames are used as opaque keys.
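The dialog behavior described above could be sketched like this in Python. It is only an illustration of the idea; `resolve` is a hypothetical function, and treating an ambiguous case-insensitive match as "not found" is a design choice made here, not something stated in the comment.

```python
def resolve(names, wanted):
    """Pick the file a user-facing dialog should open.

    An exact match always wins. Otherwise fall back to a
    case-insensitive search, and give up (return None) when the
    fallback finds no match or more than one candidate.
    """
    if wanted in names:
        return wanted
    folded = wanted.casefold()
    matches = [name for name in names if name.casefold() == folded]
    return matches[0] if len(matches) == 1 else None
```

For the listing `["Foo.txt", "bar"]` a request for `foo.txt` resolves to `Foo.txt`, while for `["Foo.txt", "FOO.txt"]` the same request returns `None` and the dialog would have to ask the user.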
Posted Nov 28, 2018 20:34 UTC (Wed)
by saffroy (guest, #43999)
[Link]
Posted Nov 28, 2018 16:43 UTC (Wed)
by smcv (subscriber, #53363)
[Link] (2 responses)
For the SteamOS use case, it's desirable that this lookup can be case-insensitive: game developers typically do most of their testing and development on Windows, where opening "level3.MAP" will successfully find a file named "Level3.map". If the obvious port of that game fails to work on Linux, it makes Linux gaming look bad, and makes porting games to Linux less appealing.
Emulations of case-insensitive environments, like Wine and Samba, also need to match the case-insensitive behaviour of the environment they're emulating.
Posted Nov 28, 2018 20:28 UTC (Wed)
by HenrikH (subscriber, #31152)
[Link]
Posted Nov 28, 2018 21:12 UTC (Wed)
by madscientist (subscriber, #16861)
[Link]
Posted Nov 28, 2018 14:41 UTC (Wed)
by mgedmin (subscriber, #34497)
[Link] (10 responses)
Am I living in a bubble? What are the European distributions that don't use UTF-8 by default in 2018?
Posted Nov 28, 2018 16:14 UTC (Wed)
by niner (subscriber, #26151)
[Link] (8 responses)
The prevalent GBK encoding uses 2 bytes for such characters, so we're talking about a ~ 50 % increase in storage size. For text. I really wonder who cares about that in 2018. And even more I wonder, who'd care about the storage requirements for file names.
Posted Nov 28, 2018 16:58 UTC (Wed)
by rahulsundaram (subscriber, #21946)
[Link] (1 responses)
Careful there. This was not a comment from the developer working on text encoding support but from James Bottomley.
Posted Nov 28, 2018 17:02 UTC (Wed)
by niner (subscriber, #26151)
[Link]
Posted Nov 28, 2018 19:40 UTC (Wed)
by roc (subscriber, #30627)
[Link]
For example, for Chinese Web pages UTF8 is a win over UTF16 because the majority of the text of a typical Chinese HTML document is actually ASCII.
Posted Nov 28, 2018 19:41 UTC (Wed)
by roc (subscriber, #30627)
[Link] (4 responses)
Posted Nov 29, 2018 0:36 UTC (Thu)
by willy (subscriber, #9762)
[Link] (3 responses)
But I don't think UTF-8 per se is controversial in China. More so in Russia where it is an evil tool of US oppression.
Posted Nov 29, 2018 2:04 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
UTF-8 has finally solved the problem with the veritable zoo of commonly used Russian encodings (KOI-8, Win-1251, GOST, GOST-ALT, ISO).
Posted Nov 29, 2018 11:04 UTC (Thu)
by andrewsh (subscriber, #71043)
[Link] (1 responses)
Posted Nov 29, 2018 11:07 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
ISO was sometimes used on the Internet. It was rare but it existed.
Posted Nov 28, 2018 20:30 UTC (Wed)
by HenrikH (subscriber, #31152)
[Link]
Posted Nov 28, 2018 14:59 UTC (Wed)
by gioele (subscriber, #61675)
[Link] (3 responses)
> Supporting case-insensitive file names requires the encoding-awareness changes in order to define what case folding means for a given character.
"Case" is properly defined only in terms of locale, not of encoding. Knowing the encoding (say, UTF-8+NFD vs UTF-16+NFKC) is necessary, but not sufficient. The user locale is needed as well.
In English "istanBUL" matches case-insensitively "Istanbul", in Turkish it does not. (In Turkish the uppercase version of "i" is "İ".)
What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).
Maybe we should come up with another technical name for these locale-independent imprecise implementations of case insensitiveness?
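The Turkish example can be checked against a locale-independent implementation. The snippet below uses Python's built-in `str.casefold()`, `str.lower()`, and `str.upper()`, which apply Unicode's default (non-Turkic) case mappings; it is offered as an illustration of locale-independent behavior, not as what the kernel patches do.

```python
# Default Unicode case folding treats 'I' and 'i' as the same letter.
english_match = "istanBUL".casefold() == "Istanbul".casefold()

dotted_capital = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I

# Lowercasing İ without a locale gives 'i' plus a combining dot above,
# which is two code points and not equal to a plain 'i'.
lower_of_dotted = dotted_capital.lower()

# Uppercasing ı without a locale gives a plain 'I'.
upper_of_dotless = dotless_small.upper()

# Consequently the Turkish spelling does not fold to the ASCII one.
turkish_match = "\u0130stanbul".casefold() == "istanbul".casefold()
```

Here `english_match` is `True` and `turkish_match` is `False`: the default rules give the English answer regardless of the user's locale, which is the imprecision the comment describes.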
Posted Nov 28, 2018 16:03 UTC (Wed)
by anselm (subscriber, #2796)
[Link]
Here's an interesting post by James Bennett (Django core developer) on the topic of “case”: Truths programmers should know about case
Posted Dec 2, 2018 16:42 UTC (Sun)
by epa (subscriber, #39769)
[Link] (1 responses)
I am not asking whether they are the same in all uses. I know that in Turkish i and ı are different letters. What I'm suggesting is that for making a case-insensitive filesystem lookup -- where you have already waved goodbye to a strict 1-1 mapping between byte sequences and directory entries -- it surely doesn't matter that much to gloss over the distinction and treat all these four characters the same. Similarly I would consider it a feature, not a bug, if accented characters could be preserved in filenames, but ignored when matching. There are pairs of words in German that differ only in accent, but it's very unlikely an accent would be the only difference between two human-written document names.
Now, you may with some justice argue that loose matching like this belongs in user space, not the kernel. But in the end it's not my preferences or anyone else's that matter. What matters is to efficiently implement the existing (de facto or de jure) standards. What behaviour is Samba required to support with the Turkish uppercase and lowercase letters? The kernel should provide the semantics that Samba needs so it doesn't have to laboriously scan the whole directory to match a filename.
Posted Dec 2, 2018 17:19 UTC (Sun)
by gioele (subscriber, #61675)
[Link]
Sure they could. But doing it is hard (and computationally expensive).
This is what I meant with
> What the developers could do is a kind of case-insensitive look-up that also clusters together "similar" letters. Defining which characters are similar opens, however, another can of worms (see `confusables.txt` from Unicode or all the discussions around IDNA and its Nameprep algorithm).
Posted Nov 28, 2018 15:34 UTC (Wed)
by bokr (guest, #58369)
[Link] (1 responses)
It appears that globbing is case-sensitive but a complete name is
BTW, I like a default case-insensitive search like you get from emacs,
Posted Nov 28, 2018 17:31 UTC (Wed)
by mina86 (guest, #68442)
[Link]
Posted Nov 28, 2018 16:11 UTC (Wed)
by willy (subscriber, #9762)
[Link] (1 responses)
There are Extended blocks in U+20000 space which will use 4 bytes, but my understanding is that those are rare characters (the most common 27,000 characters are below FFFF).
The language groups who were worst affected by UTF-8 were Cyrillic and Greek who now need two bytes for every letter. But I don't see what better choice there was.
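The sizes mentioned in this thread are easy to verify. This short Python check uses only the standard `str.encode()` method and the built-in `utf-8` and `gbk` codecs; the sample characters are picked arbitrarily for illustration.

```python
samples = {
    "Latin A": "A",
    "Cyrillic Zhe": "\u0416",        # Ж
    "Greek Omega": "\u03a9",         # Ω
    "CJK zhong": "\u4e2d",           # 中, in the Basic Multilingual Plane
    "CJK Extension B": "\U00020000", # first code point of the U+20000 block
}

# Number of bytes each character needs in UTF-8.
utf8_sizes = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

# The same common CJK character in GBK, for comparison.
gbk_size = len("\u4e2d".encode("gbk"))
```

The resulting UTF-8 sizes are 1, 2, 2, 3, and 4 bytes respectively, and the GBK size is 2 bytes, matching the roughly 50% growth for common Chinese text and the doubling for Cyrillic and Greek described above.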
Posted Nov 29, 2018 11:17 UTC (Thu)
by andrewsh (subscriber, #71043)
[Link]
There wasn’t, since before UTF-8 there were at least three popular one-byte encodings with totally incompatible character layouts. Historically, DOS systems had three encodings for Russian, of which two were almost never used since one of them wasn’t compatible with line/block graphics characters (so no Norton Commander for its users), and another one was developed after the third one gained widespread usage. Of that remaining one, which Microsoft branded as CP866, there were multiple versions differing in certain characters missing or present (e.g. ё vs ± or Ў vs ÷).
Next, Windows came with this wonderful distinction between ‘ANSI’, ‘OEM’ and ‘Unicode’. ‘OEM’, of course, was CP866 for Russian-localised Windows but something else in other versions, and ‘ANSI’ was CP1251 (again, in Russian Windows only). So historically a lot of documents have been created in those two encodings. Most importantly, most ZIP archives had file names encoded in CP866, but some of them in CP1251. Since ‘ANSI’ was the default in Windows XP and that is still being used in lots of places, such documents are still being produced as we speak™.
Oh, and the third encoding, which an absolute minority still uses, is KOI-8R/U. Only die-hard Unixoids use it these days, because it allows them to strip bit 7 and still be able to read text in Russian; that was the design decision when the charset was developed (they mapped most letters to their phonetic equivalents in inverse case but with bit 7 set: A → а, B → б, C → ц, but, for example, Q → я and V → ж). That encoding was traditionally used on Unix and Linux systems.
Posted Nov 28, 2018 17:24 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
Meh? Above, it was said that Android's userspace hack is … bad. So why should doing it in glibc be any different?
Posted Nov 28, 2018 17:38 UTC (Wed)
by dgm (subscriber, #49227)
[Link] (5 responses)
Does it? Why can't you use an attribute to store the case-folded name? Or use the case-folded name as the file name and add a name-as-provided attribute? This way you can move case interpretation to user space, where it belongs.
What am I missing?
Posted Nov 29, 2018 12:15 UTC (Thu)
by alonz (subscriber, #815)
[Link] (4 responses)
This actually looks like one of the sanest proposals – I really wonder why the existing user-space solutions are not using this scheme.
Posted Nov 29, 2018 12:43 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Nov 29, 2018 15:11 UTC (Thu)
by alonz (subscriber, #815)
[Link]
(I may be tempted to write a PoC of this idea and see how it performs, just for curiosity's sake)
Posted Nov 29, 2018 15:57 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
Posted Nov 29, 2018 18:34 UTC (Thu)
by smurf (subscriber, #17840)
[Link]
A non-malicious source of the same problem is somebody renaming the file with an old non-extended-filename-compatible tool (or libc).
Posted Nov 28, 2018 17:54 UTC (Wed)
by tnoo (subscriber, #20427)
[Link] (10 responses)
Posted Nov 28, 2018 20:42 UTC (Wed)
by saffroy (guest, #43999)
[Link] (2 responses)
Posted Nov 29, 2018 17:05 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Posted Nov 29, 2018 17:29 UTC (Thu)
by bfields (subscriber, #19510)
[Link]
Posted Nov 28, 2018 21:00 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (3 responses)
Posted Nov 29, 2018 9:06 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (2 responses)
Yes, you can. You only need to use the canonical representation of the file name (the case-folded one) and check that file name.
Posted Nov 29, 2018 12:20 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Nov 29, 2018 15:23 UTC (Thu)
by dgm (subscriber, #49227)
[Link]
Posted Nov 29, 2018 1:02 UTC (Thu)
by bfields (subscriber, #19510)
[Link] (2 responses)
Git, like most applications, already has to be prepared to work on case-insensitive filesystems.
Posted Nov 29, 2018 4:06 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Nov 29, 2018 15:31 UTC (Thu)
by bfields (subscriber, #19510)
[Link]
File and directory names are a problem too, of course. If a directory you're storing in git includes both foo and FOO, then you'll have a problem when you try to check it out on a case-insensitive filesystem.
I don't think that's really fixable; some people actually do have such content which they need to track in git, others can't deal with it, it's up to the user to decide what they care about.
But git works on case-insensitive filesystems if the stuff you put into it does.
Posted Nov 28, 2018 19:09 UTC (Wed)
by perennialmind (guest, #45817)
[Link] (27 responses)
Please no – this is the best opportunity yet to outlaw pernicious byte sequences! Once you decide to accept and present a set of path components as text, why would you want to allow mixing in random binary garbage? Once you've taken the step of blocking a new Makefile when there's a makefile, you clear the way to refusing to accept linebreaks, escape characters, and all the other control characters. Windows already blocks those, so it's a portability win.
The last time I read anything on the topic was an old LWN article(1) on a proposal by David Wheeler(2). Back then it was clear that there would need to be a way to opt-in to such screening. Maybe that happened when I wasn't looking? If not, this looks like the perfect time.
Posted Nov 28, 2018 20:20 UTC (Wed)
by saffroy (guest, #43999)
[Link] (10 responses)
However, I wouldn't go as far as forbidding valid characters, that would be a different feature; blocking sequences that are invalid for the selected encoding is sufficient.
Posted Nov 28, 2018 21:07 UTC (Wed)
by perennialmind (guest, #45817)
[Link] (1 responses)
If the semantics really are to be changed – if a differentiable set of path elements are to contain text and only text – that's a useful feature from an application developer's perspective. If it's to be a new kind of hard-to-predict weirdness, that's less useful.
I'd prefer it if text-only filenames were limited to printable graphemes only. That might be too high a bar. I would hope that control characters (C0,C1) would be disallowed. I don't consider \x7F (DELETE) or \x09 (TAB) to be valid in a natural-language name.
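A screening rule along these lines could be prototyped as follows. This Python sketch is one possible reading of the preference stated above, not an existing kernel or filesystem policy; `is_text_only_name` is a hypothetical function, and it relies on the Unicode general category `Cc`, which covers the C0 controls, DELETE (U+007F), and the C1 controls.

```python
import unicodedata


def is_text_only_name(raw: bytes) -> bool:
    """Return True if a path component is valid UTF-8 without control characters."""
    if not raw:
        return False  # an empty name is never a valid path component
    try:
        text = raw.decode("utf-8")  # strict: rejects invalid byte sequences
    except UnicodeDecodeError:
        return False
    for ch in text:
        if ch == "/":
            return False  # path separator cannot appear in a component
        if unicodedata.category(ch) == "Cc":
            return False  # C0, DEL, or C1 control character
    return True
```

Under this rule `report.txt` and `naïve.txt` pass, while names containing a newline, a tab, the C1 character U+0085, or bytes that are not valid UTF-8 are rejected.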
Posted Dec 6, 2018 10:02 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Bearing in mind users weren't supposed to go anywhere near them, it was a pretty good way of stopping people scanning the filesystem and messing about with them. I agree for user visible files, it's a good idea, but not all files are meant to be user visible and some of them can't be hidden.
Cheers,
Posted Nov 28, 2018 22:10 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (7 responses)
Posted Nov 29, 2018 6:39 UTC (Thu)
by lkundrak (subscriber, #43452)
[Link] (1 responses)
Posted Nov 29, 2018 21:04 UTC (Thu)
by perennialmind (guest, #45817)
[Link]
Posted Nov 29, 2018 9:11 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (4 responses)
You cannot have formfeeds in a file name. You can have bytes with the decimal value 12, but reading it as a formfeed or something else is completely up to you.
Posted Nov 29, 2018 10:46 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Seriously, the idiocy with free-form filenames should be fixed.
Posted Nov 29, 2018 12:43 UTC (Thu)
by hkario (subscriber, #94864)
[Link] (2 responses)
Just because a byte string is an invalid sequence in one encoding doesn't mean it's an invalid sequence in all encodings.
Posted Nov 29, 2018 18:28 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
Posted Nov 29, 2018 20:47 UTC (Thu)
by perennialmind (guest, #45817)
[Link]
Posted Nov 28, 2018 21:06 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (15 responses)
Basically IMHO there are two sane choices – (a) the current situation: the kernel does not attach any semantics to any bytes other than '/' and '\0' (thus there is no chance for case insensitivity beyond ASCII), or (b) you use clean and preferably pre-normalized UTF-8 on the userspace/kernel boundary, outlaw anything nonconforming, and do everything else in userspace. Anything else is a recipe for long-term disaster.
Posted Nov 28, 2018 22:28 UTC (Wed)
by perennialmind (guest, #45817)
[Link] (13 responses)
Newline, tab, and bel codepoints are perfectly valid UTF-8 plain text, but I'd prefer to push that out to userspace as well. I don't much care whether
... but not control characters. To me, a natural language filename would comprise user-perceived characters and the one true space character (U+0020). Flexibility beyond that does more harm than good. Leave those footguns to the bytestring paths. 😉
Posted Nov 29, 2018 13:41 UTC (Thu)
by utoddl (guest, #1232)
[Link] (3 responses)
Posted Nov 30, 2018 9:04 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Posted Dec 3, 2018 12:30 UTC (Mon)
by ale2018 (guest, #128727)
[Link] (1 responses)
Ah, poorly written shell scripts, eh? Because you obviously think that being slave of over-complicated command lines is fine? A good percentage of my command lines start with When I find a filename with spaces I just move it away.
For the record, the normalization step and control characters were never taken care of. For example:
Posted Dec 3, 2018 20:24 UTC (Mon)
by flussence (guest, #85566)
[Link]
Posted Nov 30, 2018 9:09 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (3 responses)
Posted Nov 30, 2018 16:25 UTC (Fri)
by perennialmind (guest, #45817)
[Link] (2 responses)
You mean end-of-string delimiters, end-of-line delimiters, tabs, and the codes needed for controlling a terminal such as escape and erase? Setting aside hurdles to adoption, one can imagine hoisting those into markup. Perhaps there's even a spec for plainer-than-plain-text for when such markup exists (i.e. HTML). If so, it might be perfect for filenames.
ASCII compatibility was the selling point for UTF-8. Beyond the above, even the oddballs are still in use. Take for example "group separator" which stands in for FNC1 in barcodes.
Somebody else will have to defend the C1 block though.
Posted Dec 1, 2018 11:24 UTC (Sat)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Dec 6, 2018 10:16 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
And if you had a lot of spaces it saved a fair few bytes over tab-encoding, plus being completely unambiguous.
Cheers,
Posted Dec 4, 2018 8:13 UTC (Tue)
by pr1268 (guest, #24648)
[Link] (4 responses)
Um, there's more than one space: ' ' and ' '. One is \u0020 (good ol' ASCII 0x20) and the other is \u00a0. I was personally burned by the second "space" above appearing in an Excel spreadsheet (to the exclusion of the "one true space character" you mentioned). >:-(
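The two spaces interact with normalization in a way that is worth seeing concretely. The Python snippet below uses the standard `unicodedata.normalize()` function with made-up file names; it shows that canonical normalization (NFD) keeps the no-break space distinct, while the compatibility form discussed in the article (NFKD) maps it to U+0020.

```python
import unicodedata

plain = "my file.txt"       # ordinary space, U+0020
nbsp = "my\u00a0file.txt"   # no-break space, U+00A0

# As raw strings the two names differ.
equal_raw = plain == nbsp

# Canonical decomposition leaves U+00A0 untouched.
equal_nfd = unicodedata.normalize("NFD", plain) == unicodedata.normalize("NFD", nbsp)

# Compatibility decomposition replaces U+00A0 with U+0020.
equal_nfkd = unicodedata.normalize("NFKD", plain) == unicodedata.normalize("NFKD", nbsp)
```

Here `equal_raw` and `equal_nfd` are `False` and `equal_nfkd` is `True`, so a filesystem comparing names under NFKD would treat the two spellings as the same file.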
Posted Dec 4, 2018 10:24 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (3 responses)
On second thought: do worry.
Posted Dec 4, 2018 13:27 UTC (Tue)
by hummassa (subscriber, #307)
[Link] (2 responses)
Posted Dec 12, 2018 23:45 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (1 responses)
Agreed, but try telling that to those fools who auto-generated the spreadsheet with \u00a0 spaces. </angry rant>
Posted Dec 13, 2018 10:33 UTC (Thu)
by james (subscriber, #1325)
[Link]
Posted Nov 29, 2018 4:29 UTC (Thu)
by raven667 (subscriber, #5198)
[Link]
Posted Nov 28, 2018 20:54 UTC (Wed)
by saffroy (guest, #43999)
[Link]
Posted Nov 29, 2018 8:30 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (11 responses)
Anyway, hope this gets fixed. Transition to UTF-8 was awful for *x filesystems, I sure hope there won't be a v2 with wide encoding problems added to the mix whenever UTF-8 gets deprecated in favour of something better.
Posted Nov 29, 2018 9:15 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (8 responses)
Posted Nov 29, 2018 11:38 UTC (Thu)
by eru (subscriber, #2753)
[Link] (7 responses)
I would hope that is never. UTF-8 can represent all characters now in practical use. The main risk is designing emojis going totally out of hand, and they insist each of them should have a UNICODE code point... oh wait...
Posted Nov 29, 2018 12:41 UTC (Thu)
by chithanh (guest, #52801)
[Link] (6 responses)
That is not correct. In particular, Unicode (and by extension UTF-8) is deficient regarding some characters in African languages, due to the Unicode consortium's policy regarding precomposed characters vs. combining diacritics. They don't want to introduce new equivalences.
Posted May 29, 2019 23:00 UTC (Wed)
by Serentty (guest, #132335)
[Link] (5 responses)
Posted May 30, 2019 14:18 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (4 responses)
(How many primitives would you need for Chinese?)
On the other hand, in that case we wouldn't all use UTF-8 by now – simply because that would require twice the storage space for Chinese text, more or less. Nowadays that doesn't really matter, but at the time it was a problem.
Posted May 30, 2019 14:54 UTC (Thu)
by excors (subscriber, #95769)
[Link] (2 responses)
Maybe the 64K limit could have lasted for many more years if they had made some different design choices early on, but given the goal of being a universal standard for all text, it seems inevitable the limit would be broken eventually. It's better to have broken it earlier than later.
Posted May 31, 2019 15:06 UTC (Fri)
by smurf (subscriber, #17840)
[Link]
Seems that quite a few Chinese people with interesting names (i.e. using archaic characters) suddenly couldn't get an official document any more because, surprise, their name wasn't in the "official" charset …
Posted May 31, 2019 18:37 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 6, 2019 5:09 UTC (Thu)
by Serentty (guest, #132335)
[Link]
Posted Nov 29, 2018 12:29 UTC (Thu)
by smurf (subscriber, #17840)
[Link] (1 responses)
Before UTF-8, there never was an encoding that could represent "all non-English languages". At most it could store one other language, or ten (Windows and its brain-dead decision to use 16-bit characters), and that is a subset of Unicode/utf-8.
> whenever UTF-8 gets deprecated
It won't be. There's no reason at all to do that, and several billion reasons not to.
Posted Nov 29, 2018 12:43 UTC (Thu)
by eru (subscriber, #2753)
[Link]
To be fair, that was the UNICODE spec at the time. Similarly, Java originally used 16-bit characters (and a
Posted Nov 29, 2018 12:16 UTC (Thu)
by skissane (subscriber, #38675)
[Link] (1 responses)
Maybe it should? Or, maybe, at least, Linux kernel should add some API to report the filesystem path name encoding. If Linux does it, maybe it could be added to POSIX by Austin Group.
Or, something I'd like even better – it must always be UTF-8, and filesystem has to translate to/from if anything else. But, that's probably going to cause backwards compatibility issues for some people, whereas just reporting filesystem encoding to user space won't.
Posted Dec 6, 2018 12:34 UTC (Thu)
by rleigh (guest, #14622)
[Link]
Case sensitivity is selectable. You can force it to only store valid UTF-8, and you can choose whether the UTF-8 is normalised or not. Or you can allow it to store anything. This gives you full backward compatibility with historical UNIX behaviour should you want it, or you can restrict it to case-sensitive, normalised UTF-8. Having this selectable on a per-filesystem basis gives you a great deal of flexibility, and it defaults to something sensible (the above are the defaults).
Posted Nov 29, 2018 12:21 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (3 responses)
The only way I know of to deal with these kinds of issues is to specify a collation when creating the filesystem. Windows does this (and many other things) based on installation language, which causes all kinds of funky issues on large installations where you could have multiple users with different languages.
Posted Nov 29, 2018 14:23 UTC (Thu)
by willy (subscriber, #9762)
[Link] (2 responses)
Posted Nov 29, 2018 14:26 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Nov 30, 2018 17:55 UTC (Fri)
by k8to (guest, #15413)
[Link]
This leads to a problem where a user or process in one locale should get different results from the kernel than another. This traditionally was viewed as a rathole and I've seen many situations where osx behaves in bizarre ways due to this sort of thing.
The proposal here seems to be to push the rules into the filesystem or directory, which effectively means having locale behavior independent of the user / process, which means we will get a fun matrix of file name locale vs user locale. I'm not a fan.
Posted Nov 29, 2018 17:28 UTC (Thu)
by cesarb (subscriber, #6266)
[Link]
I wonder how many of these will be security vulnerabilities. We already had one in git not too long ago (CVE-2014-9390 git: arbitrary command execution vulnerability on case-insensitive file systems).
Posted Nov 30, 2018 17:38 UTC (Fri)
by ScottMinster (subscriber, #67541)
[Link] (5 responses)
But leaving that aside for the moment, what is the gain? I've been working on case sensitive Linux file systems for many years, and never felt the need to have "level1.MAP" really load "level1.map". I've occasionally had to deal with writing software that expects extensions in lower case and been given files with those extensions in upper case, and that is annoying, but it's more due to laziness on the file creator's part of not using the standard extension case.
The only two justifications I can come up with for wanting case insensitivity are to avoid problems with unexpectedly cased files and to avoid user confusion (i.e., "Document1.doc" and "document1.doc" being different files). Those are worthy goals, but it seems like there are so many tricky problems with case insensitivity that it is hardly worth the trouble. It's relatively easy in English, but as many people point out, there are problems in many other languages.
But even in English, it could cause problems. Can you "mv makefile Makefile" in a case insensitive filesystem, or would you get the error "'makefile' and 'Makefile' are the same file"? Also, as another poster pointed out, globbing in various programs would likely have to change to be consistent.
And once you enable it, you can never really disable it without inevitably breaking programs.
So why do systems like Windows and MacOS do it? Did they underestimate the difficulty and are stuck with that decision?
I could see how it could be useful for things like Samba servers, but given all the complicated edge cases, it doesn't seem like it's a good idea for general use. Though once it's working for Samba, some distribution will likely turn it on everywhere, to try to be more user friendly.
Posted Nov 30, 2018 18:36 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link]
And MS-DOS was written with the (probably unconscious) assumption that natural-language text was in ASCII, where case-insensitivity is cheap to implement.
Posted Dec 7, 2018 11:09 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (3 responses)
Because you're a programmer used to thinking in byte strings.
I *still* have problems because users insist on knowing whether email addresses have capital letters or not (they are case-insensitive, for historical reasons, because a lot of the early systems mangled case).
So the short answer is, YOU may not feel the gain, but a lot of other people WILL.
(Along the same lines, I remember being sent a second copy of some newsletter because "some people said they couldn't read the attachment". Ie pretty much all Windows systems, because the sender had somehow lost the extension and those systems didn't recognise the file "newsletter" as a pdf. Of course, my gentoo system didn't give a monkeys :-)
Cheers,
Posted Dec 7, 2018 11:45 UTC (Fri)
by gioele (subscriber, #61675)
[Link] (2 responses)
These users are right: the local-part of an email address _is_ case sensitive. Only the domain is case insensitive.
RFC 5321, section 2.4 [1]:
> The local-part of a mailbox MUST BE treated as case sensitive. Therefore, SMTP implementations MUST take care to preserve the case of mailbox local-parts. In particular, for some hosts, the user "smith" is different from the user "Smith". However, exploiting the case sensitivity of mailbox local-parts impedes interoperability and is discouraged. Mailbox domains follow normal DNS rules and are hence not case sensitive.
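As a rough illustration of the rule quoted above (a sketch only; the `normalize_address` helper is a hypothetical name, not anything defined by the RFC), a program that compares addresses could fold the case of the domain and leave the local-part untouched:

```python
def normalize_address(address: str) -> str:
    """Lowercase only the domain of an email address.

    The local-part is returned unchanged, because the receiving host
    is allowed to treat "smith" and "Smith" as different mailboxes.
    """
    # Split on the last "@", since a quoted local-part may itself contain "@".
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % (address,))
    return local + "@" + domain.lower()
```

With this, "Smith@Example.COM" and "Smith@example.com" compare equal after normalization, while "smith@example.com" stays distinct from both.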
Posted Dec 7, 2018 16:31 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Dec 8, 2018 23:37 UTC (Sat)
by zlynx (guest, #2285)
[Link]
> lowercase_local:
Posted Dec 2, 2018 8:00 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
As a coincidence I recently got hurt by this NTFS "feature": *per-directory* case-sensitivity
In other words: filenames are case sensitive in some directories but not in other directories next or below them on the very same filesystem.
Posted Dec 10, 2018 15:04 UTC (Mon)
by davez (guest, #104707)
[Link] (1 responses)
http://drewthaler.blogspot.com/2007/12/case-against-insen...
Also, consider the fact that Apple's iOS filesystem is case sensitive, *not* case insensitive.
Posted Dec 10, 2018 16:07 UTC (Mon)
by corbet (editor, #1)
[Link]
Filesystems and case-insensitivity
1. They are a place where users store their files.
2. They are a place where programs store some internal data. Kind of (key, value) storage with hierarchical key.
letter case."
Doing this for most existing system calls would be inefficient
Well, it really depends on the use case (pun intended). Once I added case-insensitivity support to a proprietary filesystem specifically to improve performance, with great success.
Consider a case-sensitive folder with 10.000 files (this is not rare at all), shared over Samba. Every time a Samba client requests creation of a new file, and because the client requires case insensitivity, Samba has to scan the entire folder to check if the new name collides with an existing name. Yes, that's for every new file.
If the filesystem is actually case-insensitive, Samba can skip these scans, which is a huge performance boost.
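To make the cost concrete, here is a minimal sketch of the kind of check a file server has to do on top of a case-sensitive filesystem. It is not Samba's actual code; the `find_case_collision` name and the use of Python's `str.casefold` as the folding rule are assumptions made for illustration.

```python
import os


def find_case_collision(directory, new_name):
    """Return an existing entry that equals new_name ignoring case, or None.

    On a case-sensitive filesystem there is no index to consult, so the
    whole directory has to be read and every name folded and compared.
    """
    wanted = new_name.casefold()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.casefold() == wanted:
                return entry.name
    return None
```

The loop runs once per file creation, so the work grows with the size of the directory. A filesystem that folds case itself can answer the same question with its normal directory lookup.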
Uhm, no.
NAME TYPE FSTYPE MOUNTPOINT
sda disk
├─sda1 part vfat /boot
├─sda2 part ext4 /
├─sda3 part
└─sda4 part ext4
[16:01 ~/bs]$ ls -ld /boot/ef*
ls: cannot access '/boot/ef*': No such file or directory
[16:02 ~/bs]$ ls -ld /boot/E*
drwxr-xr-x 7 root root 4096 May 6 2018 /boot/EFI
[16:04 ~/bs]$ ls -ld /boot/V*
ls: cannot access '/boot/V*': No such file or directory
[16:06 ~/bs]$ ls -ld /boot/v*
-rwxr-xr-x 1 root root 5838720 Nov 23 10:05 /boot/vmlinuz-linux
[16:07 ~/bs]$ ls -ld /boot/VMLINUZ-LINUX
-rwxr-xr-x 1 root root 5838720 Nov 23 10:05 /boot/VMLINUZ-LINUX
case-insensitive.
whereas query-replace using the same regex works case-sensitively unless you say otherwise.
glibc?
> use the case-folded name as the file name and add a name-as-provided attribute
Since kernel support for case insensitivity is currently even less of a thing, I still wonder...
% getfattr --name=user.name-as-provided malware.sh
# file: malware.sh
user.name-as-provided="Safe.pdf"
Consider hash values 0xabcdef and 0xABCDEF: well, they are the same value. :) Hash-based names in hex actually don't care about case.
Besides, see my other comment about when and why it is needed.
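The point about hex-encoded hash names can be checked directly. This is a small sketch (the `same_hash_name` helper is a made-up name) that compares two names by the bytes they encode rather than by their spelling:

```python
def same_hash_name(a: str, b: str) -> bool:
    """Compare two hex-encoded hash names by the bytes they represent.

    bytes.fromhex accepts both upper- and lower-case digits, so the
    letter case of the name does not affect the result.
    """
    return bytes.fromhex(a) == bytes.fromhex(b)
```

Under this comparison "abcdef" and "ABCDEF" are the same name, so a store that only ever writes hex digits cannot create two entries that differ only in case.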
What's an example of an application that does that?
- use the filesystem as case preserving (and give up case insensitivity)
- use the filesystem as case-insensitive (and forget about case preservation)
- use xattrs
And all that, without touching a single line of kernel code.
Git has a "packed-refs" file these days, so the problem *should* be solved, but I haven't checked.
Another problem area is how to deal with invalid byte sequences. He proposes falling back to the previous behavior, just treating the names as sequences of bytes, when a sequence is invalid for the encoding.
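The fallback described in that quote can be sketched roughly as follows. This only illustrates the idea and is not the kernel implementation: the `lookup_key` name, the choice of NFD, and the use of Python's `str.casefold` are all assumptions.

```python
import unicodedata


def lookup_key(name: bytes) -> bytes:
    """Return the key used to compare file names.

    Valid UTF-8 names are normalized and case-folded, so equivalent
    spellings map to the same key. Names that are not valid UTF-8 are
    returned unchanged and compared as opaque byte sequences.
    """
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return name  # fall back to the traditional byte-string behavior
    folded = unicodedata.normalize("NFD", text).casefold()
    # Case folding can undo normalization, so normalize once more.
    return unicodedata.normalize("NFD", folded).encode("utf-8")
```

Here b"Makefile" and b"makefile" produce the same key, while a name such as b"\xff\xfeA" is not valid UTF-8 and therefore only matches itself, byte for byte.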
Wol
curl -O gives me filenames with spaces or %20s, but I do object if I see files with newlines in the names. I don't mind if I'm left with sneaky left-to-right, right-to-left marks or explicitly red hearts. I see the need for parentheses and question marks...
find . -name whatever | xargs ... Yes, I know I can write -print0 and -0, I do that when I write shell scripts.
~$ touch aaabd $(printf 'aaabc\bd') "$(printf 'aaabc\nd')"
~$ ls -lt | head -5
total 3686968
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabd
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabc
d
-rw-r--r-- 1 ale ale 0 Dec 3 13:21 aaabd
Control characters were never forbidden. Consider that human beings are sometimes uncertain about the name they're typing and type a backspace (\b) in it. So, why isn't that beautiful too? Perhaps, users should have a clue. In the words of the Ancient Philosophy, rubbish in, rubbish out.
ls took care of that a few years ago…
~/test $ ls
'aaabc'$'\b''d' 'aaabc'$'\n''d' aaabd
~/test $ ls --version
ls (GNU coreutils) 8.30
Packaged by Gentoo (8.30 (p01))
Wol
one true space character (U+0020)
there is no reasonable rationale for those other space characters (including U+00a0) in file names.
Would it calm your anger to point out that LibreOffice can search using regexps?
The question of negative cache entries is interesting, and I confirm it is very important for performance.
I didn't look at the dentry cache in a very long time, though I suppose that looking up a dcache entry by name is essentially: compute a hash of the (name, directory ino) tuple, then look it up in a hash table, by comparing the requested tuple with the cached tuples in the bucket.
Then, I suppose it is sufficient to have a per-filesystem hash function, which can compute a hash over the normalized name. And a per-filesystem comparison function can then compare the requested tuple with the cached tuples.
Probably that would work, right?
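A toy model of that idea, under the same assumptions the comment makes about how lookups work (it is not based on the real dentry cache code, and `NameCache`, `MISS` and the `normalize` hook are invented for the example):

```python
MISS = object()  # returned when nothing is cached for the name


class NameCache:
    """Hash table keyed on (directory inode, normalized name)."""

    def __init__(self, normalize):
        # normalize plays the role of the per-filesystem hook:
        # identity for case-sensitive, str.casefold for case-insensitive.
        self._normalize = normalize
        self._buckets = {}

    def _bucket(self, dir_ino, name):
        key = hash((dir_ino, self._normalize(name)))
        return self._buckets.setdefault(key, [])

    def insert(self, dir_ino, name, inode):
        """Cache an entry; inode=None records a negative entry."""
        self._bucket(dir_ino, name).append((dir_ino, name, inode))

    def lookup(self, dir_ino, name):
        wanted = self._normalize(name)
        for cached_dir, cached_name, inode in self._bucket(dir_ino, name):
            # Per-filesystem comparison: compare the normalized forms.
            if cached_dir == dir_ino and self._normalize(cached_name) == wanted:
                return inode
        return MISS
```

With `NameCache(str.casefold)`, a negative entry inserted for "missing" is also found when looking up "MISSING", which is the behavior a case-insensitive filesystem would need from negative entries. What the sketch leaves out is invalidation: creating "Missing" would have to drop the negative entry cached under every other spelling.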
*when* (not if) the next transition happens
(Windows and its brain-dead decision to use 16-bit characters), and that is a subset of Unicode/utf-8.
char type is still 16 bits wide there). Now Java internally encodes strings as UTF-16 in order to support the expansion of UNICODE.
Example for ZFS:
% zfs get all rpool/ROOT/default
NAME PROPERTY VALUE SOURCE
rpool/ROOT/default type filesystem -
rpool/ROOT/default creation Sun Jun 12 10:46 2016 -
…
rpool/ROOT/default utf8only on -
rpool/ROOT/default normalization formD -
rpool/ROOT/default casesensitivity sensitive -
…
Wol
> driver = redirect
> data = ${lc:${local_part}}
Filesystems^H directories and case-insensitivity
https://github.com/vector-of-bool/vscode-cmake-tools/issu...
For the curious, Linus has also sounded off on the idea; he is not impressed either.